Why text-to-SQL fails in production
It works in demos. It works on a single user's database. It does not work as the safety story for a multi-tenant SaaS agent. Here's the failure-mode catalog.
Text-to-SQL is one of the cleanest demo categories in AI: ask a question, get a query, get an answer. It looks like the future. In production, it has a specific set of failure modes that are difficult to engineer around. This article catalogs them.
The five failure modes
If you ship text-to-SQL in production, expect each of these eventually.
1. The validator is always behind
Every text-to-SQL system in production has a layer that inspects the generated SQL and decides whether to run it. That layer has to encode every rule you care about: tenant scoping, allowed verbs, forbidden tables, column-level visibility, function blacklists, depth limits.
The model can compose SQL in shapes your validator doesn’t anticipate:
- Subqueries that defer the unsafe predicate.
- CTEs that read your “internal” table to compute a value used in an “external” table query.
- Set operations that union safe rows with unsafe rows.
- DB-specific functions that read filesystem paths or do network I/O.
- New SQL syntax shipped in last Tuesday’s database upgrade.
Every miss is a potential incident. Every patch makes the validator more complex, which makes the next miss more likely. This loop never converges to “safe.”
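The loop is easy to reproduce. Below is a deliberately naive blocklist validator of the kind described above, as a sketch (the table names and the regex are invented for illustration): it catches the obvious query shape, and the same read restructured as a CTE with a join slips past it.

```python
import re

# Hypothetical blocklist rule: reject any query that reads a table
# prefixed "internal_" via a FROM clause.
BLOCKED = re.compile(r"\bFROM\s+internal_\w+", re.IGNORECASE)

def naive_validate(sql: str) -> bool:
    """Return True if the query is allowed to run."""
    return BLOCKED.search(sql) is None

# The obvious shape is caught:
naive_validate("SELECT secret FROM internal_keys")   # False

# The same read, restructured as a CTE plus a join, slips past,
# because "JOIN internal_keys" never matches "FROM internal_":
cte = ("WITH k AS (SELECT secret FROM public_rows "
       "JOIN internal_keys USING (id)) SELECT secret FROM k")
naive_validate(cte)                                  # True
```

Patching the regex to also match JOIN closes this one gap and leaves every other compositional shape (set operations, lateral joins, new syntax) open, which is the non-convergence the paragraph above describes.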
2. Tenant scoping leaks via prompt drift
The standard advice is: include tenant_id = X in the prompt’s system instructions and trust the model to write it.
This works most of the time. It fails when:
- The model drops the filter under load (a known empirical regression with long contexts).
- The user phrases their question in a way that pulls the model toward a global aggregate (“across all our customers”).
- A prompt-injection attack convinces the model that scoping is no longer required.
- The schema description is updated but the example queries aren’t.
In our work with customers, this is the single most common production incident. An audit run ten months after launch finds that ~0.5% of queries shipped without a tenant filter. That number is small enough to feel rare and large enough to be a major leak.
3. The fallback path is “show the user the SQL”
When the model produces SQL that doesn’t run, the standard recovery is “tell the user the query and ask them to clarify.” But your users are not SQL-fluent. They get a rejection they can’t interpret. They re-phrase the question. The model produces different SQL. It runs. They don’t know if it actually answered their question.
This degrades user trust in a way that’s slow to detect. Users gradually stop using the feature.
4. Audit trails are SQL strings
The audit log of a text-to-SQL system is a list of SQL queries. To answer “did the agent ever expose customer email?” you have to parse every query, follow joins, evaluate subqueries, and infer column visibility. This is a SQL interpreter you don’t want to write.
OrmAI logs structured tool calls. “Did the agent ever read customer.email?” is one query against a column. The same question against a SQL log is an afternoon of regex.
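To make the contrast concrete, here is a sketch of what structured audit rows enable. The field names are illustrative, not OrmAI's actual log schema: the point is that once each tool call records the columns the policy resolved, the compliance question becomes a membership check instead of SQL parsing.

```python
# Hypothetical structured audit rows, one per tool call, recording the
# columns the policy actually resolved (field names are illustrative).
audit_log = [
    {"tool": "db.query", "model": "customer",
     "columns": ["id", "name"], "tenant": "t1"},
    {"tool": "db.query", "model": "customer",
     "columns": ["id", "email"], "tenant": "t2"},
]

# "Did the agent ever read customer.email?" is one membership check:
hits = [row for row in audit_log
        if row["model"] == "customer" and "email" in row["columns"]]
len(hits)  # 1
```

The equivalent question against a log of raw SQL strings requires parsing each query, resolving aliases, and following joins before you can even name the columns involved.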
5. The performance cliff
The model can write expensive queries. Set a statement_timeout and you’ll catch some of them after they’ve consumed resources. Set a planner-level rejection (max scan rows) and you’ll catch them before. Most text-to-SQL setups have only the timeout. The result: occasional 30-second queries that lock out concurrent users.
This is fixable, but it requires the validator to inspect query plans. Now your validator is doing static analysis, plan inspection, and syntax checking. It’s a database-aware tool inside your application.
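As a sketch of what plan inspection looks like: Postgres can emit its plan as JSON via EXPLAIN (FORMAT JSON), and a pre-admission check can walk that tree and reject on the planner's row estimates before the query consumes anything. The threshold below is invented for illustration.

```python
def max_estimated_rows(plan: dict) -> int:
    """Walk a Postgres EXPLAIN (FORMAT JSON) plan tree and return the
    largest row estimate at any node."""
    rows = plan.get("Plan Rows", 0)
    for child in plan.get("Plans", []):
        rows = max(rows, max_estimated_rows(child))
    return rows

MAX_SCAN_ROWS = 1_000_000  # illustrative threshold, tune per workload

def admit(explain_json: list) -> bool:
    """Reject before execution if any node expects to touch too many rows."""
    return max_estimated_rows(explain_json[0]["Plan"]) <= MAX_SCAN_ROWS

# A full-table scan over 48M rows is rejected before it runs:
sample = [{"Plan": {"Node Type": "Seq Scan", "Plan Rows": 48_000_000}}]
admit(sample)  # False
```

Planner estimates can be wrong, so this belongs alongside a statement_timeout, not instead of one; the estimate catches the predictable disasters cheaply.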
When text-to-SQL is fine
Three cases. Pretty much only three.
- Read-only analytical exploration on a redacted warehouse with no PII and no tenants.
- Internal tooling for SQL-fluent users who read the SQL before running it.
- A single developer’s local database, where the developer is the only user.
In every other case — multi-tenant SaaS, customer-facing agents, write-capable agents, regulated data — text-to-SQL accumulates risk faster than it accumulates value.
What “guardrails” can and can’t do
There’s a class of products that sells “guardrails” for LLMs — input/output filters, jailbreak detectors, output classifiers. These are useful for content safety. They are largely orthogonal to data safety.
A guardrail can detect “the model output contains a credit card number.” It cannot detect “the SQL the model wrote will return rows that don’t belong to the current user.” That requires understanding the schema, the policy, the query plan, and the request context — which is what a policy engine does, and what a guardrail doesn’t.
If you’re spending money on guardrails for content safety, keep doing it. Don’t expect them to fill the data-safety gap.
The structural alternative
Make the unsafe operations unrepresentable at the tool layer. The agent can call db.query(model, where, ...) but cannot author SQL. The policy injects tenant scoping, redacts fields, caps rows, and writes the audit row. There is no validator to keep up with model creativity because there’s no SQL the model authored.
This is what OrmAI is. It’s not the only such layer — you could build something equivalent yourself. The point is that the shape of the answer is “constrain the interface,” not “validate the output.”
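A minimal sketch of what "constrain the interface" means in code, assuming a hypothetical build_query helper (this is not OrmAI's actual API): the caller names a model and filters from an allowlist, and tenant scoping, column visibility, and the row cap are applied in code rather than requested from the model.

```python
# Sketch of a query tool that makes unsafe SQL unrepresentable.
# ALLOWED and build_query are illustrative names, not a real API.
ALLOWED = {"order": {"id", "status", "total", "created_at"}}

def build_query(model, where, tenant_id, limit=100):
    if model not in ALLOWED:
        raise ValueError(f"unknown model: {model}")
    for col in where:
        if col not in ALLOWED[model]:
            raise ValueError(f"unknown column: {col}")
    where = {**where, "tenant_id": tenant_id}   # scoping injected, never optional
    cols = ", ".join(sorted(ALLOWED[model]))    # column allowlist, not redaction
    clause = " AND ".join(f"{c} = %s" for c in where)
    sql = f"SELECT {cols} FROM {model} WHERE {clause} LIMIT %s"
    return sql, [*where.values(), min(limit, 100)]  # row cap enforced in code

sql, params = build_query("order", {"status": "paid"}, tenant_id="t_42")
```

There is no string the model can emit that reaches the database without the tenant predicate, because the predicate is added by code the model never sees.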
A note on hybrid approaches
Some teams use text-to-SQL inside a fenced subset: a read-only role with row-level security configured, no PII in the addressable tables, and a validator. This is much better than naked text-to-SQL. It’s still validator-as-defense, which means it’s still asymptotic.
If your data sensitivity makes asymptotic safety acceptable (read-only analyst tooling on a redacted warehouse), this is fine. If it doesn’t, you want the structural alternative.
What to do today if you’re running text-to-SQL
Three immediate actions, regardless of whether you migrate to OrmAI:
- Audit the queries shipped in the last 30 days for tenant-filter omissions. You will find some. Fix the ones you can; report the ones you can’t.
- Inventory your validator rules. Every rule that’s “we instruct the model to” should become “we enforce in code.”
- Set up structured logs for SQL queries with parsed metadata (tables touched, columns selected, row counts). This is the bare minimum audit trail.
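For the first action, a crude token scan over the logged SQL is enough to produce a review queue. It is not a parser, so treat hits as candidates for review rather than verdicts, and note that the log format here is assumed (one query string per entry).

```python
import re

# Flag logged queries that never mention the tenant column at all.
# A token check misses aliased or schema-qualified references; it is a
# first pass, not an audit.
TENANT = re.compile(r"\btenant_id\b", re.IGNORECASE)

def flag_unscoped(queries):
    return [q for q in queries if TENANT.search(q) is None]

log = [
    "SELECT id, total FROM orders WHERE tenant_id = 't1'",
    "SELECT count(*) FROM orders",   # global aggregate: flagged for review
]
flagged = flag_unscoped(log)
```

Anything this scan flags either gets a fix or an incident report, per the first bullet above.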
Once these are in place, you can consider migrating. We’ve helped half a dozen teams do it; the typical timeline is one to three weeks for a single agent surface.