Michel Faure
Six days, six seconds: a CI test against semantic-layer drift on an AI agent

Comic strip — Françoise asks for a business number, Michel builds an AI agent that confidently answers zero, and discovers six days later that his semantic layer was lying to the model.

The morning I turned the screen

Early April, my Rembrandt bot already knew how to navigate the ERP. Eighteen tools wired in, multi-turn up to three rounds: it could find a student by name, list the unpaid invoices for a workshop, open a course record. When you asked it "count me the active students at Maisons-Laffitte", it delivered. When you asked it "what's the outstanding amount per workshop for the current year", it floundered, recycled name-search tools, and ended up redirecting to an admin page nobody opened. The bot couldn't answer compound analytical questions, and I knew it.

Friday April 18th, ten thirty. Françoise pivots on her chair from her three-screen cockpit, the time-clock spreadsheet on her left, Sage on her right, and calls out over the partition: « Michel, sur ceux qui sont en CCF cette année, il en reste combien à encaisser d'ici juin ? » ("Michel, of the students on a CCF training plan this year, how much is left to collect before June?") I don't have the tool in the bot. I know it before she's finished her sentence. I open the Supabase SQL Editor tab on my machine and type the query by hand: join inscriptions × echeances_inscription × contacts, filter on payment mode, sum montant_prevu minus montant_paye on open instalments. Twenty seconds. I turn the screen. She squints, reads the number, jots it on her sticky note, and drops: « Bon allez, c'est ça. » ("Right, that's it.") She pivots back to Sage. I close the tab without a word.

The trigger

Sunday April 20th in the evening, I stumble on the Databricks announcement for Genie Agent Mode. I skim it. One sentence does it: plan iteratively, run multiple SQL queries, learn from each result, deliver comprehensive reports. I close the tab knowing I'll code that the following weekend.

That was the right shape: a semantic layer that describes the tables to the model, a planner that writes the SQL, a validator that filters it before execution, a commenter that renders the answer in French to the user. Nothing original, except that with Claude Code I could lay it down cleanly in fifteen days for my context. I wrote ADR-0020 the next Monday, and off we went.

The build

Phase 1 laid down the semantic layer in TypeScript, not YAML. Seven whitelisted tables, one per file, typed against Database['public']['Tables'], columns in business language, canonical metrics, declared joins. TS typing buys two things YAML doesn't: safe refactoring when the schema moves, a compile-time error when the contract drifts off a column name. A single registry consumed by the pipeline.

// lib/analytics/semantic/tables/echeances_inscription.ts — pre-fix state, April 26
columns: {
  statut: {
    type: 'text',
    description:
      "Payment status: `encaisse` (cash received), `a_payer`, `en_retard`, `annule`",
    refAdr: ['ADR-0015'],
  },
},
metrics: {
  ca_encaisse: {
    formula: "SUM(montant_paye) FILTER (WHERE statut = 'encaisse')",
    description: 'Cash revenue actually received (ADR-0015 cash model).',
  },
  reste_a_encaisser: {
    formula:
      "SUM(montant_prevu - COALESCE(montant_paye,0)) " +
      "FILTER (WHERE statut IN ('a_payer','en_retard'))",
    description: 'Open receivables.',
  },
},
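The compile-time guarantee is easy to see in miniature. A sketch, with the Database type mocked inline as an assumption; in the real project it comes from the Supabase type generator, so the keys track the live schema:

```typescript
// Minimal sketch: Database is mocked here; in the project it is generated
// from the live schema, so these keys follow the real columns.
type Database = {
  public: {
    Tables: {
      echeances_inscription: {
        Row: {
          id: number
          statut: string
          montant_prevu: number
          montant_paye: number | null
        }
      }
    }
  }
}

type Row = Database['public']['Tables']['echeances_inscription']['Row']

// Keys are constrained to real column names: rename a column in the schema,
// regenerate the types, and a stale key here fails the build.
const columns: Partial<Record<keyof Row, { type: string; description: string }>> = {
  statut: { type: 'text', description: 'Payment status' },
  // `staut: …` would be a compile-time error, not a silent drift
}

export default columns
```

Note that this catches column *names* drifting, not column *values*: the statut enum trap below lives entirely inside a description string, out of the type checker's reach.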

Phase 2 locked the database. A Postgres agent_readonly role with strict SELECT on the seven tables, an application-side SQL validator (lib/analytics/sql-validator.ts) on top of node-sql-parser. Two belts. The validator refuses DML, anything off-whitelist, and requires the tenantFilter via the site_filter JWT claim. Twenty tests out of twenty green.
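The validator's two refusals plus the tenant-filter requirement fit in a few lines. A sketch over a pre-parsed shape — the ParsedQuery type here is hypothetical; the real lib/analytics/sql-validator.ts walks the node-sql-parser AST:

```typescript
// Hypothetical pre-parsed shape; the real validator walks node-sql-parser's AST.
type ParsedQuery = {
  type: string           // 'select', 'update', 'delete', ...
  tables: string[]       // every table the statement touches
  whereColumns: string[] // columns referenced in the WHERE clause
}

// The seven whitelisted tables (three shown)
const WHITELIST = new Set(['contacts', 'inscriptions', 'echeances_inscription'])

// Returns null when the query is accepted, else the reason for refusal.
export function validate(q: ParsedQuery): string | null {
  if (q.type !== 'select') return `DML refused: ${q.type}`
  for (const t of q.tables) {
    if (!WHITELIST.has(t)) return `table off-whitelist: ${t}`
  }
  if (!q.whereColumns.includes('site')) {
    return 'missing tenantFilter on site (site_filter JWT claim)'
  }
  return null
}
```

The Postgres agent_readonly role is the second belt: even a query this check wrongly accepts cannot write or read outside the seven tables.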

I could have stopped there. I wanted to measure.

Phase 3 routed the whole thing: Sonnet 4.6 for the plan in tool-use, Haiku 4.5 for the post-execution comment. Haiku bills output five times less than Sonnet; on standard French, p50 moves from fifteen to twelve seconds.

At that stage I had the feeling of clean work. That's exactly the stage at which I laid a trap I wouldn't see for six days.

The silent trap

Smoke test of the ten eval-set questions, April 26th early afternoon. Question number eight, "how much is left to collect per workshop for the 2025-2026 year". Sonnet plans, the validator accepts, the agent_query_run RPC comes back green, Haiku writes the comment in correct French. No exception, no Sentry warning. Tick, question nine.

What I didn't look at in the moment, because nothing pushed me to, was the value of result_row_count in agent_runs for that specific run.

-- generated by Sonnet 4.6, validated by node-sql-parser, executed by agent_readonly
SELECT c.atelier,
       SUM(e.montant_prevu - COALESCE(e.montant_paye, 0))
         FILTER (WHERE e.statut IN ('a_payer', 'en_retard')) AS reste_a_encaisser
FROM echeances_inscription e
JOIN contacts c ON c.id = e.contact_id
WHERE c.site = ANY($1::text[])           -- site_filter, JWT claim
  AND c.statut <> 'liste_rouge'
GROUP BY c.atelier
ORDER BY reste_a_encaisser DESC NULLS LAST
LIMIT 1000;

And the Haiku comment, rendered to the user, rationalising the absence:

For the 2025-2026 year, the outstanding amount per workshop comes out at zero across all sites. This may indicate that the year's direct debits are up to date, or that open instalments are recorded under a different status. For a finer view, see /finance/cash.

The SQL is correct against the contract. The RPC confirms it. And the contract is wrong.

The query, by hand

The doubt came in the evening, cold, rereading the ten runs in /admin/rembrandt/analytics-runs. Three out of ten questions had result_row_count at zero, on numbers I knew the order of magnitude of. I opened psql, typed the shortest query in the world.

rembrandt=> SELECT statut, COUNT(*) FROM echeances_inscription
            GROUP BY statut ORDER BY 2 DESC;

  statut   | count
-----------+-------
 preleve   |  1630
 planifie  |   158
 annule    |     1
(3 rows)

Three statuses, one thousand seven hundred and eighty-nine rows total, and not one value in common with the four I had declared in the semantic layer. No encaisse. No a_payer. No en_retard.

The semantic layer documented encaisse | a_payer | en_retard | annule. The database held preleve | planifie | annule. The three canonical metrics ca_encaisse, reste_a_encaisser, nb_echeances_en_retard were all filtering on values that didn't exist. Sonnet was doing its job, the validator was doing its job, Postgres was doing its job, and the answer rendered to the user was rigorously zero, presented in clean French.

The origin of the drift is ridiculous. Phase 1 of the semantic layer had been built on docs/agent-analytique/eval-set-v1.md, a document I had written myself as a statement of conceptual intent. The Postgres migration, written weeks earlier under a different rationale (Stripe workflow, direct debit, scheduling), had recorded preleve | planifie | annule. I wrote the semantic layer looking at the documentation instead of querying the database.

The rule

Sculley et al. published a paper in 2015 that became canonical, Hidden Technical Debt in Machine Learning Systems. Their notion of configuration debt applies here: a system accrues debt in the layer that describes it just as much as in the code that runs it. The semantic layer of a SQL agent is exactly that layer.

A semantic layer is a second database. It has its schema, its constraints, and like any database it drifts if you don't audit it. What the Genie pattern does not eliminate is schema risk. It just shifts it onto the translation layer it introduces, and it makes the error silent because the SQL produced stays valid.

The trap wasn't in Genie. The trap was in the picture I had built of my own data.

What you can copy

Seed the enums from the database, not from the documentation. A script that reads the database at TS-module generation time, and the contract sticks to the schema with no human in the loop. The documentation stays a writing guide, not a source.

// scripts/sync-semantic-enums.ts — run in pre-commit or in CI
import { admin } from '@/lib/supabase-admin'
import { mkdirSync, writeFileSync } from 'node:fs'

const targets = [
  ['echeances_inscription', 'statut'],
  ['inscriptions', 'statut'],
  ['contacts', 'statut'],
] as const

// writeFileSync does not create directories
mkdirSync('lib/analytics/semantic/generated', { recursive: true })

for (const [table, col] of targets) {
  const { data, error } = await admin.from(table).select(col)
  if (error) throw error
  // Distinct non-null values actually present in the column
  const values = [...new Set(data?.map((r) => r[col]).filter(Boolean))]
  const out = `export const ${table}_${col}_enum = ${JSON.stringify(values)} as const\n`
  writeFileSync(`lib/analytics/semantic/generated/${table}.${col}.ts`, out)
}

Test consistency in CI. The test fails if the layer declares a status the database no longer carries, or vice versa. Six days of drift collapse into six seconds.

// __tests__/semantic-drift.test.ts
import { describe, it, expect } from 'vitest'
import { semanticTables } from '@/lib/analytics/semantic'
import { admin } from '@/lib/supabase-admin'

describe('semantic layer drift', () => {
  for (const table of semanticTables) {
    for (const [col, def] of Object.entries(table.columns)) {
      if (!def.enum) continue
      it(`${table.name}.${col} matches DB`, async () => {
        const { data } = await admin.from(table.name).select(col)
        const real = new Set(data?.map((r) => r[col]).filter(Boolean))
        // Every value in the database must be declared in the layer…
        for (const v of real) expect(def.enum).toContain(v)
        // …and every declared value must still exist in the database
        for (const v of def.enum) expect(real.has(v)).toBe(true)
      })
    }
  }
})

Surface agent_runs.result_row_count = 0 in an admin page with a rolling seven-day filter. The table is already there; it just needs to be read. A daily share-of-zero-rows graph, and the drift becomes visible at a glance.
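The feed for that graph is one query on agent_runs. A sketch — the created_at column name is an assumption, result_row_count comes from the article:

```sql
-- Daily share of zero-row runs over the last seven days (created_at assumed)
SELECT date_trunc('day', created_at)::date        AS day,
       ROUND(AVG((result_row_count = 0)::int) * 100, 1) AS zero_row_pct,
       COUNT(*)                                   AS runs
FROM agent_runs
WHERE created_at > now() - interval '7 days'
GROUP BY 1
ORDER BY 1;
```

A healthy agent sits at a low, stable zero_row_pct; the April 26th drift would have shown up as three zero-row runs out of ten on day one.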

If you maintain a semantic layer in TS on Postgres, the test above takes under an hour to wire in and tells you immediately where you're lying to your agent. On Rembrandt that signal didn't exist before that Friday.


Companion code: rembrandt-samples/semantic-layer-drift/ — enum sync script, Vitest drift test, and agent_runs schema with the zero-row canary index, MIT, copy-pastable.
