The Pillar
Why your AI agent shouldn't touch raw SQL
Raw SQL handles, text-to-SQL, and ad-hoc tools all leave your AI agent one prompt away from a data incident. Here's why OrmAI exists, what threats it neutralizes, and what shape the right safety layer takes.
The promise that broke things
For two decades the database was a server you talked to in SQL. The shape of safety was simple: SQL was authored by humans, reviewed, parameterized, and shipped behind APIs. The application was a trust boundary; the database trusted the application.
Then we put an LLM inside the application. Suddenly the entity composing queries is non-deterministic, prompt-influenceable, and operating on behalf of users whose intent is conveyed in natural language. Every assumption that made SQL safe inside an app is now wrong.
Three bad defaults you'll be tempted to ship
Almost every team building an agent that needs to read data lands on one of three patterns. All three are wrong in production.
1. Hand the agent a raw SQL handle
The shortest path: expose execute_sql(query: str) as a tool. The agent writes SQL, you run it. This works in a demo. It explodes the moment a user is malicious, naïve, or just lucky.
- SQL injection at the prompt layer. The model can be jailbroken into producing DROP TABLE. Even without jailbreaks, a benign user asking "show all users" produces a query that pages through a million rows.
- No tenant boundary. Multi-tenant SaaS depends on every query being filtered by tenant_id. The model has no innate concept of tenants and will happily emit cross-tenant joins.
- No auditability. What did the agent run? Against what user? With what justification? Crickets.
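The failure mode is mechanical, not exotic. Here is a minimal sketch of the raw-handle pattern, using a hypothetical execute_sql tool backed by an in-memory SQLite database for illustration:

```python
import sqlite3

# Minimal sketch of the raw-handle anti-pattern (tool name hypothetical):
# whatever string the model emits is executed verbatim.
def execute_sql(conn: sqlite3.Connection, query: str):
    return conn.executescript(query)  # runs multiple statements as-is

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, tenant_id INTEGER)")

# A prompt-injected "answer" from the model is just another string:
model_output = "SELECT * FROM users; DROP TABLE users;"
execute_sql(conn, model_output)

# Nothing in the tool surface could refuse it; the table is gone.
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
```

The point is not that the model is hostile. It is that the tool's type, a single string, can express every destructive operation the database supports.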
2. Use text-to-SQL with "guardrails"
Slightly nicer: prompt the model to generate SQL with examples of "safe" queries, then validate before executing. The validation is the problem.
You can't statically prove a SQL string is safe. Whitelisting tables doesn't catch column-level PII leaks. Whitelisting verbs (SELECT only) doesn't catch SELECT * FROM users WHERE 1=1. Parsing the query and inspecting joins helps — until the model invents a CTE shape your linter doesn't understand. You end up writing a SQL interpreter in your validator. Our comparison page covers this in detail.
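A sketch of why verb allowlisting fails; the validator below is hypothetical but representative of the pattern:

```python
# A naive "guardrail": allow only statements that start with SELECT.
# Hypothetical validator; real ones are fancier, but share the flaw.
def is_safe(sql: str) -> bool:
    return sql.lstrip().upper().startswith("SELECT")

# Passes the check, still dumps every row of every tenant:
leak = "SELECT * FROM users WHERE 1=1"

# Rejected, though it's a legitimate read. So you special-case WITH,
# then subqueries, then UNION... and end up writing a SQL interpreter.
cte = "WITH recent AS (SELECT * FROM orders) SELECT count(*) FROM recent"
```

Every rule you add creates both false negatives (unsafe queries that pass) and false positives (safe queries that fail), and the model will find both.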
3. Hand-roll a bespoke tool per use case
The pragmatic move: skip generic SQL, write narrow tools like get_orders_for_customer(customer_id). This is safer. It's also a hand-built artisanal mess by month three.
Each tool re-derives tenant filtering. Each one has its own pagination contract. Audit logs are written in three different formats — when they're written at all. The day you need PII redaction or a query budget, you're touching every tool in the codebase. Comparison vs. hand-rolled tools →
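By way of illustration, here are two hand-rolled tools (all names hypothetical) that have already diverged on tenant handling and pagination:

```python
# Tool A, written in week one: offset pagination, tenant filter inline.
def get_orders_for_customer(customer_id, tenant_id, page=0, page_size=25):
    return {"sql": "SELECT id, total FROM orders "
                   "WHERE tenant_id = ? AND customer_id = ? LIMIT ? OFFSET ?",
            "params": (tenant_id, customer_id, page_size, page * page_size)}

# Tool B, written in month three: cursor pagination, different argument
# names, a different default limit, and no audit log in either.
def list_invoices(tenant, cursor=None, limit=100):
    where = "tenant = ?" + (" AND id > ?" if cursor is not None else "")
    params = (tenant,) if cursor is None else (tenant, cursor)
    return {"sql": f"SELECT id, amount FROM invoices WHERE {where} LIMIT {limit}",
            "params": params}
```

Neither tool is wrong in isolation. The problem is that every cross-cutting concern lives in N copies, each drifting independently.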
What the right shape looks like
The pattern that works draws from two old ideas — capability-based security and policy as code. The agent gets a small, typed surface of operations (read, get, aggregate, create, update, delete). Each operation is compiled against a declarative policy that controls:
- Which models are visible at all.
- Which fields are visible per model — and how (full, masked, hashed, denied).
- Which writes are permitted, against which models, with what approval requirements.
- What it costs — row caps, statement timeouts, join-depth, max-rows-per-update.
- What gets logged — every call, sanitized, with the policy decision attached.
- What gets auto-scoped — tenant ID injected from request context, never trusted from the model.
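Concretely, such a policy might be a single declarative object. Every name below is invented for illustration, not OrmAI's actual schema:

```python
# Hypothetical policy shape (all names invented for illustration): one
# declarative object the safety layer compiles every tool call against.
POLICY = {
    "models": {
        "Order": {
            "fields": {"id": "full", "total": "full", "status": "full"},
            "operations": ["read", "aggregate"],
            # Auto-scoped: injected from request context, never model-supplied.
            "scope": {"tenant_id": "ctx.tenant_id"},
            "budget": {"max_rows": 50, "timeout_ms": 500, "max_join_depth": 2},
        },
        "Customer": {
            "fields": {"id": "full", "email": "masked"},  # PII masked on read
            "operations": ["read"],
        },
    },
    "audit": {"log_every_call": True, "sanitize_params": True},
}

def field_mode(model: str, field: str) -> str:
    """Deny-by-default lookup: unknown models and fields are invisible."""
    return POLICY["models"].get(model, {}).get("fields", {}).get(field, "denied")
```

The deny-by-default lookup is the important design choice: a model or field the policy never mentions simply does not exist from the agent's point of view.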
This is what OrmAI is. It is not an ORM, not an LLM framework, not a database driver. It is the layer that sits between the agent's tool call and your existing ORM, compiling each call against your policy, and refusing or transforming anything that doesn't match.
Concrete: the same task, three ways
"Show me last month's pending orders for the current tenant, masking customer email."
# The model writes:
sql = """
SELECT o.id, o.total, c.email
FROM orders o JOIN customers c ON c.id=o.customer_id
WHERE o.status='pending'
AND o.created_at > NOW() - INTERVAL '30 days'
"""
# Where's the tenant filter? Where's the email mask?
# Hope. That's the answer.

# The model calls:
toolset.execute("db.query", {
"model": "Order",
"where": {
"status": "pending",
"created_at__gt": "now()-30d",
},
"include": ["customer.email"],
"limit": 50,
}, ctx=ctx)
# Policy auto-injects tenant_id, masks email,
# caps rows, writes the audit row.
The Spider benchmark: this isn't theoretical
We took the entire Spider benchmark — 1,034 natural-language queries across 200 databases — and replayed it twice. Once with a strong text-to-SQL model. Once through OrmAI's tools.
- Text-to-SQL: 23 unsafe operations (cross-table joins ignoring scope, unbounded scans, structurally destructive queries).
- OrmAI: 0 unsafe operations. Equivalent answer quality on the safe queries.
The difference isn't model quality. It's that OrmAI's tool surface cannot express the unsafe operations. Compile-time impossibility beats runtime hope. Full methodology →
Objections
"But I lose the flexibility of arbitrary SQL"
You lose the ability to express queries you cannot prove safe. In practice, OrmAI's tool surface covers what production agents actually do: filtered reads, joined fetches, aggregates, paginated lists, and gated writes. For anything genuinely bespoke, you generate a domain tool — code-reviewed, parameterized, named — and the policy applies to it too. Custom tools guide →
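A sketch of what generating a domain tool could look like. The registry and decorator are hypothetical, shown only to illustrate that a named, code-reviewed tool flows through the same policy machinery as the generic operations:

```python
# Hypothetical registry: a bespoke, code-reviewed tool is registered by
# name so the same policy machinery can wrap it.
REGISTRY = {}

def domain_tool(name):
    def wrap(fn):
        REGISTRY[name] = fn   # the policy layer would attach checks here
        return fn
    return wrap

@domain_tool("reports.refunds_last_quarter")
def refunds_last_quarter(ctx, *, limit=50):
    # Parameterized and tenant-scoped from ctx; capped by the same
    # row budget as the generic operations.
    return {"model": "Refund",
            "where": {"tenant_id": ctx["tenant_id"], "period": "last_quarter"},
            "limit": min(limit, 50)}
```

Because the tool has a name, the audit log can say exactly what ran, and the policy can gate it like any other operation.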
"This must add latency"
Policy evaluation happens in-process. We measure it in microseconds. In return you avoid the worst class of slow query — the unbounded scan — because budgets reject it before the database sees it. Most workloads get faster, not slower.
"My agent only reads, so this doesn't apply"
Reads leak. The most common production incident in agent systems is not destructive writes — it's a customer-facing chatbot displaying another tenant's data because nobody enforced scoping. PII redaction and tenant scoping are read-side problems.
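The read-side fix is the same scoping rule, sketched here with invented names: the tenant filter comes from request context and overrides anything the model supplied.

```python
# Invented names, illustrative only: read-side scoping merges the
# model's filters with a tenant filter taken from request context.
def scoped_where(model_where: dict, ctx: dict) -> dict:
    merged = dict(model_where)
    merged["tenant_id"] = ctx["tenant_id"]  # never trusted from the model
    return merged

# Even a cross-tenant attempt collapses back to the caller's tenant:
w = scoped_where({"status": "pending", "tenant_id": "someone-else"},
                 {"tenant_id": "t-42"})
```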
What this unlocks
Once your agent's data access is policy-enforced, things you couldn't safely ship become routine:
- Customer-facing analytics chatbots with real database access, not curated views.
- Internal copilots that operate on the operational DB, not just a redacted warehouse mirror.
- Multi-tenant agent products where each customer's data is structurally isolated.
- Write-capable agents with human-in-the-loop only for the cases that actually need it.
- Compliance reviews that take a week instead of a quarter, because the audit log and policy file are the entire artifact.