Bala Paranj

Two Problems, Two Tools: Why AI-Assisted Scanning and Configuration Verification Solve Different Things

There's growing confusion in cloud security about what AI-assisted tools can do. Some of the confusion comes from inflated claims about AI-powered vulnerability discovery. Some comes from genuine uncertainty about where different tools fit. But most of it comes from treating security as one problem when it's actually two. The two problems require fundamentally different approaches.

Before evaluating any tool, separate the problems.

Two classes of security problems

Class 1: Pattern-Recognizable Problems

SQL injection is a vulnerability regardless of the operator. Unsanitized user input concatenated into a SQL query is dangerous in every application, every deployment, every organization. The operator's intent doesn't change the verdict. Nobody intends for their application to be injectable.

The same applies to XSS, buffer overflows, command injection, insecure deserialization, and most of the OWASP Top 10 for web applications. These are universal patterns. The vulnerability exists because of how the code works, not because of what the operator intended. A function that passes user input to eval() is unsafe whether the application is a healthcare portal or a recipe blog.
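
The universality is mechanical enough to show in a few lines. A minimal Python sketch of the injectable shape, and its equally universal fix:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "x' OR '1'='1"  # attacker-controlled value

# Class 1 pattern: user input concatenated into SQL is injectable in
# every application, regardless of what the operator intended.
query = f"SELECT * FROM users WHERE name = '{user_input}'"  # vulnerable
print(conn.execute(query).fetchall())  # the OR clause matches every row

# The safe shape is just as universal: parameterized queries.
print(conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall())
```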

Key property: The verdict is independent of the operator's intent. The pattern generalizes across deployments. If you've seen one SQL injection, you can recognize the next one.

Class 2: Intent-Dependent Problems

A public S3 bucket is correct for a static website hosting CSS files. The same configuration is a HIPAA violation when the bucket contains patient records. The bucket settings are identical. The verdict depends entirely on what the operator declared about the data inside.

A Bedrock agent with lambda:InvokeFunction on Resource: * is appropriate for an internal developer tool that needs to orchestrate arbitrary workflows. The same permission is catastrophic when the agent serves customer-facing queries and one of those Lambda functions reads from a PHI-tagged bucket. The IAM policy is identical. The verdict depends on the interaction between the policy, the agent's purpose, and the data classification of the resources the tool chain can reach.

Key property: The verdict depends on the operator's declared intent. The pattern does not generalize across deployments. Every organization's tags, policies, and trust relationships are unique. Seeing a thousand examples of "this configuration is unsafe" doesn't tell you whether the next configuration is unsafe — because the next operator's intent is different.

Why this distinction matters

Mixing the two classes leads to the wrong tool for the problem, and the wrong expectations for the results.

Using an AI-trained scanner on Class 2 problems produces the IDOR false positive: the scanner flags a critical IDOR on a public endpoint because it doesn't know the endpoint is supposed to be open. It's applying pattern recognition to a problem where the pattern doesn't generalize, because the verdict depends on intent that only this operator knows.

Using a configuration consistency checker on Class 1 problems is equally wrong. It can't find SQL injection because SQL injection isn't a configuration contradiction — it's a code defect that exists regardless of what the operator declared.

The tools aren't interchangeable. They solve different problem classes.

What solves each class

Class 1: Pattern recognition

AI-assisted vulnerability discovery — LLM-powered pen testing, fuzzing, SAST, agentic security scanners — excels at Class 1 problems. The scanner learns patterns from training data: "this code shape leads to injection," "this API behavior indicates a broken access control." The patterns generalize because the vulnerabilities are universal.

The growing critique that AI-assisted scanning produces inflated findings is valid within Class 1. Some findings are in low-tier targets, some bugs aren't exploitable, some results are source-assisted. The community is right to demand validation. But the approach itself is sound: pattern recognition works when patterns generalize.

Where the approach breaks down is at the boundary between Class 1 and Class 2. The IDOR example sits on that boundary. Some IDORs are universal (accessing another user's private data is always wrong). Some are intent-dependent (accessing a public record through a predictable ID is by design). When the scanner can't distinguish the two, it produces noise. The fix isn't better AI. It's recognizing that the problem has crossed into Class 2, where the operator's intent determines the verdict.

Why more training doesn't fix Class 2

The reason is structural. In Class 1, the verdict is a function of the code: verdict = f(code). The function is the same for every deployment. Train on enough examples and the model approximates it well.

In Class 2, intent is a variable: verdict = f(configuration, intent). The configuration is visible. It's the IAM policy, the S3 tag, the Bedrock agent setup. But intent is supplied by the operator, and its value changes for every scenario. A public bucket is correct when intent is "serve static assets." The same public bucket is a violation when intent is "store patient records." The configuration is identical. The intent is different. The verdict flips.
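
A toy illustration of verdict = f(configuration, intent), with hypothetical names; the point is that the configuration argument is identical and only the intent argument flips the verdict:

```python
def verdict(configuration: dict, intent: str) -> str:
    # Same function for every deployment; intent is the variable.
    if configuration["public_access"] and intent == "store_patient_records":
        return "VIOLATION"
    return "OK"

config = {"public_access": True}                  # identical configuration
print(verdict(config, "serve_static_assets"))     # OK
print(verdict(config, "store_patient_records"))   # VIOLATION
```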

Because intent is a variable in that function, it cannot be baked into a model during training. It is unknown before training — no dataset contains this operator's future decisions. It is unknown during training — the model learns patterns from other operators' configurations, none of which carry this operator's intent. And it is unknown after training — when the model encounters a new deployment, the intent variable has a value it has never seen because that value was created when this specific operator tagged this specific bucket and configured this specific agent. The variable doesn't exist yet when the model is trained. It comes into existence when the operator makes a deployment decision.

An AI scanner sees the configuration. It does not see the intent. It's solving an equation with a variable that has no value at any stage of the model's lifecycle. No amount of training data fills that gap because the gap isn't a data problem — it's a timing problem. The value is created after the model is deployed, by an operator the model has never observed, for a purpose the model cannot infer from the configuration alone.

This is why the false-positive IDOR is unfixable by training: the scanner sees the endpoint and the response. It cannot see whether the operator intended that data to be public. That intent didn't exist when the model was trained. It came into existence when this operator designed this application. The model will encounter this value for the first time at inference — with no prior example to generalize from.

Class 2: Constraint satisfaction

Configuration consistency checking asks: "Given what the operator declared — what's sensitive, what's public, who can reach what — do those declarations contradict each other?"

This is a constraint satisfaction problem. The operator's declarations are constraints: "this bucket contains PHI," "this agent serves public queries," "this knowledge base indexes this bucket." The tool checks whether those constraints are simultaneously satisfiable without creating a forbidden state. That's a satisfiability query — the exact problem SMT solvers like Z3 have been solving since 2007 for flight software, CPU verification, and compiler correctness.
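
A minimal sketch of that query using Z3's Python bindings (pip install z3-solver). The fact names are hypothetical; the structure is the point: operator declarations become constraints, and the forbidden state is a formula the solver checks for satisfiability:

```python
from z3 import Bools, And, Solver

# Facts extracted from THIS operator's configuration (hypothetical names).
phi_bucket, public_agent, kb_indexes_bucket = Bools(
    "phi_bucket public_agent kb_indexes_bucket")

# The forbidden state, as defined by the control author: a public-facing
# agent whose knowledge base indexes a PHI-tagged bucket.
forbidden = And(phi_bucket, public_agent, kb_indexes_bucket)

s = Solver()
s.add(phi_bucket == True, public_agent == True, kb_indexes_bucket == True)
s.add(forbidden)   # ask: can the forbidden state hold under these facts?

print(s.check())   # sat -> the declarations compose into the forbidden state
                   # retag the bucket (phi_bucket == False) and it prints unsat
```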

The solver doesn't need training data. It doesn't need to generalize across organizations. It reads THIS operator's assertions and checks whether THEY contradict each other. The logic is the same for every deployment. The facts are different. That's what constraint solvers are built for — same engine, different inputs, deterministic answers.

A trained model answering the same question would need thousands of labeled examples of "this configuration is safe" and "this configuration is unsafe." It would produce a probability, not a proof. It would struggle with novel compositions it hadn't seen in training. And crucially, it would fail on intent — because every organization's tags, policies, and classifications are unique. You can show a model a thousand IDOR findings and it still can't determine whether endpoint /api/records/123 is supposed to be open, because that depends on what THIS operator intended for THIS application, and that intent isn't in any training corpus. It's in the tags, policies, and configurations that only this operator controls.

The right tool for each class

| Property | Class 1 (universal patterns) | Class 2 (intent-dependent) |
|---|---|---|
| Example | SQL injection, XSS, buffer overflow | PHI bucket indexed by public RAG pipeline |
| Verdict depends on operator intent? | No — always a vulnerability | Yes — depends on declared data classification |
| Pattern generalizes? | Yes — one SQLi resembles the next | No — every operator's intent is different |
| Training data helps? | Yes — more examples improve detection | No — each deployment's intent is unique |
| Right approach | Pattern recognition (AI/ML) | Constraint satisfaction (SMT solvers) |
| Right tool | SAST, DAST, fuzzing, AI scanners | Configuration consistency verification |

What the scanner sees vs. what consistency verification sees

Now that the classes are separated, the gap becomes visible.

Consider a Bedrock agent in AWS. A component-level scanner checks each resource individually:

```
Bedrock encryption:     ✅ PASS
Bedrock VPC:            ✅ PASS
Bedrock model access:   ✅ PASS
S3 encryption:          ✅ PASS
S3 public access:       ✅ PASS
Lambda encryption:      ✅ PASS

6 checks. 6 passes. COMPLIANT.
```

These are Class 1 checks applied to configuration: "is encryption on?" is a universal pattern. The answer doesn't depend on the operator's intent. Encryption should be on.

But the agent's execution role has lambda:InvokeFunction on Resource: *. It can invoke any Lambda in the account. One of those Lambda functions reads from a bucket tagged data_classification: phi. The knowledge base indexes the same PHI bucket for RAG retrieval. These are Class 2 problems — the verdict depends on the interaction between the agent's permissions, the Lambda's access, and the bucket's data classification. All three are intentional configurations. The contradiction is in their composition.

Consistency verification extracts facts from all three services, checks whether they compose into a forbidden state, and reports: "Your customer-facing chatbot can return patient records through its Lambda tool chain." Five individual findings compose into three CRITICAL compound chains. The scanner's 6/6 PASS and the consistency checker's 3 CRITICAL chains are both correct — they're answering different questions about different problem classes.
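
A sketch of what that join looks like, using hypothetical fact shapes rather than Stave's actual schema. Every fact below passes its own component-level check; the risk only appears when the facts are composed:

```python
facts = {
    "agent_role_invoke_resource": "*",             # from the IAM policy
    "agent_facing": "customer",                    # from the agent config
    "lambda_reads_bucket": "patient-data",         # from the Lambda config
    "kb_indexed_buckets": ["patient-data"],        # from the knowledge base
    "bucket_tags": {"patient-data": {"data_classification": "phi"}},
}

def compound_phi_chain(f: dict) -> bool:
    # Each condition is benign alone; the conjunction is the finding.
    overpermissioned = f["agent_role_invoke_resource"] == "*"
    bucket = f["lambda_reads_bucket"]
    phi = f["bucket_tags"].get(bucket, {}).get("data_classification") == "phi"
    kb_indexes_phi = bucket in f["kb_indexed_buckets"]
    return overpermissioned and phi and kb_indexes_phi and f["agent_facing"] == "customer"

if compound_phi_chain(facts):
    print("CRITICAL: customer-facing agent can reach PHI via its Lambda tool chain")
```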

What consistency verification provides

Compound detection across services. Individual checks can't find cross-service compositions. When the agent's role is overpermissioned AND the Lambda reads PHI AND the knowledge base indexes PHI, the compound fires as a single chain with a description naming the specific attack path.

Intent-aware evaluation. The IDOR problem doesn't exist here because the operator's intent is encoded in the configuration. Tags declare sensitivity. Policies declare access. The tool doesn't guess whether data is sensitive — the operator tagged it. "Your PHI-tagged bucket is indexed by your public-facing knowledge base" isn't a guess. Both the tag and the configuration are explicit operator decisions. The finding resolves into a concrete decision (remove the tag or change the data source), not a triage exercise.

Mathematical proof. Z3 returns sat (the forbidden state is reachable) or unsat (mathematically impossible). Not a confidence score. Not a probability. A proof that can be verified by running the same query again, or by a different solver. For compliance evidence, proofs beat confidence scores.

Traceability. Every step in the compound chain traces back to a specific property in a specific configuration file through a deterministic identifier. One grep returns the observation file, the property path, and the timestamp. The team doesn't search for the root cause — the tool names it.
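
A sketch of what that buys you in practice. The file layout and field names here are hypothetical, not Stave's actual format; the point is that a finding's identifier resolves deterministically to a file, a property path, and a timestamp:

```python
import json
import pathlib

def trace(finding_id: str, snapshot_dir: str = "observations"):
    # Walk the snapshot and return the observation the finding came from.
    for path in pathlib.Path(snapshot_dir).glob("*.json"):
        for obs in json.loads(path.read_text()).get("observations", []):
            if obs.get("id") == finding_id:
                return str(path), obs["property_path"], obs["timestamp"]
    return None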

Air-gapped operation. No credentials, no API calls, no network access. Static snapshot analysis on a laptop. For organizations under GDPR, HIPAA, or FedRAMP, this isn't a feature — it's a prerequisite.

What consistency verification does NOT do

It doesn't find SQL injection. It doesn't discover XSS. It doesn't fuzz API endpoints. It doesn't scan dependencies for CVEs. It doesn't evaluate models for prompt injection. These are Class 1 problems. They need Class 1 tools.

Consistency verification catches what sits between the Class 1 tools: the compound configuration errors that exist in the interaction between services, where every individual check passes and the aggregate state is unsafe.

What's missing: your intent

There's a prerequisite that consistency verification cannot work without: the tool cannot read the security engineer's mind. It reads configurations. If the engineer's intent isn't expressed in the configuration, the tool has nothing to reason about.

If your PHI buckets aren't tagged data_classification: phi, the compound chain "knowledge base indexes PHI data" cannot fire. No tag, no finding. The engineer has to declare what's sensitive before the tool can check whether the declaration is violated.

Tags are intent declarations. data_classification: phi is not metadata for a dashboard. It's a machine-readable statement: "this bucket contains patient records." Without it, the tool treats the bucket like any other bucket. The compound detection that joins "this knowledge base indexes this bucket" with "this bucket contains PHI" requires both declarations to exist.

Policy absence is an intent signal. When a Bedrock agent has no guardrail configured, the tool reads that absence as a fact: "guardrail is not present." The configuration is the expressed state — the tool reads what's there and what's missing.

Chain definitions encode security judgment. When a control author writes a compound chain that says "if the agent's role is overpermissioned AND the knowledge base indexes PHI AND there's no guardrail, that's CRITICAL" — they're encoding a security team's judgment about which interactions matter. The engine doesn't decide what's dangerous. The control author decides.

This is similar to how you write a unit test: you know the expected output for a given input, and you encode that human judgment in the test.
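
In code, the analogy looks like this (illustrative names, not Stave's actual API): the control author fixes the expected verdict for a known set of facts, exactly as a test fixes the expected output for an input:

```python
def evaluate_chain(facts: dict) -> str:
    # Encodes the control author's judgment: a customer-facing knowledge
    # base indexing a PHI-tagged bucket is CRITICAL.
    tags = facts["bucket_tags"].get(facts["kb_indexes"], {})
    if tags.get("data_classification") == "phi" and facts["agent_facing"] == "customer":
        return "CRITICAL"
    return "PASS"

def test_public_kb_must_not_index_phi():
    facts = {
        "kb_indexes": "patient-data",
        "bucket_tags": {"patient-data": {"data_classification": "phi"}},
        "agent_facing": "customer",
    }
    assert evaluate_chain(facts) == "CRITICAL"
```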

Tagging discipline is the prerequisite. An organization that doesn't tag its sensitive data, doesn't classify its environments, and doesn't label its IAM roles by purpose will get fewer findings — not because their infrastructure is safer, but because they haven't expressed enough intent for the tool to reason about.

That's the correct design. The alternative is guessing. Guessing produces inflated findings, false-positive IDORs, and the critical-severity noise that the security community is rightly pushing back on. Consistency verification doesn't guess. It requires the engineer to say what matters, and then checks whether the infrastructure respects what they said.

How to think about your tooling stack

| Question | Problem class | Right tool |
|---|---|---|
| "Does my code have bugs?" | Class 1 — universal | SAST, DAST, fuzzing, AI-assisted pen testing |
| "Does my model have vulnerabilities?" | Class 1 — universal | Model security scanners, red teaming |
| "Are my components configured correctly?" | Class 1 — universal | CSPM, CIS benchmarks, component scanners |
| "Do my configurations contradict each other?" | Class 2 — intent-dependent | Compound risk / consistency verification |

The first three rows are mature. The fourth is new. It doesn't replace the others — it catches what they structurally cannot.

The critical mindset the security community applies to AI-assisted scanner findings — "is this finding real? is the bug exploitable? is the target meaningful?" — is right. Apply it to consistency verification too. But ask the Class 2 version: "does this configuration really say what the tool claims?" (every fact is verified against the raw configuration file). "Is the compound really a risk?" (the chain description names specific services and data classifications from the operator's own declarations). "Can I verify it independently?" (the solver is open source — run the query yourself).

The findings aren't inflated because they aren't heuristic. They're logical consequences of the operator's own declarations. The operator said "this is PHI." The operator said "this agent can invoke any Lambda." The tool says "those two facts compose into a breach path." That's not an opinion. That's arithmetic on expressed intent.


The consistency verification approach described in this article is implemented in Stave, an open-source static analysis tool for cloud infrastructure configurations. Stave evaluates 2,650+ controls across 74 AWS service domains, exports facts to nine independent reasoning engines — including SMT solvers used to verify flight software — and detects compound risk chains across services. 32 AI agent identity controls and 5 AI-specific compound chains cover Bedrock, SageMaker, and Lambda. All analysis runs on air-gapped snapshots with no cloud credentials required.
