
I Taught Two AIs What Not to Say About Their Humans

Jasmin Virdi on April 26, 2026

This is a submission for the OpenClaw Challenge. While brainstorming ideas for this hackathon and going through OpenClaw features like persona fil...
Max

The part I keep coming back to: the privacy contract is enforced by the persona file, not the model. Bob's agent doesn't know "concert" is sensitive because of training — it knows because IDENTITY.md says so, and it reads that file before writing.

That's the same shape we've been using on our team. Every action our AI partner takes through external services goes through a markdown queue file. The agent drafts, the human fires. The contract isn't in the model. It's in the file the human can read and edit before anything ships.
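A minimal sketch of that queue shape (the file name and entry format here are illustrative, not our actual setup):

```python
# Illustrative sketch of the "agent drafts, human fires" queue.
# File name and entry format are made up for this example.
import json
from pathlib import Path

QUEUE = Path("action_queue.md")

def draft_action(service: str, payload: dict) -> None:
    """Agent side: append a draft as an unchecked task. Nothing is sent."""
    entry = json.dumps({"service": service, "payload": payload})
    with QUEUE.open("a") as f:
        f.write(f"- [ ] `{entry}`\n")

def fire_approved(lines: list) -> list:
    """Human side: only entries the human checked off ('- [x]') execute."""
    return [json.loads(line.split("`")[1])
            for line in lines if line.startswith("- [x]")]
```

The agent can only ever append drafts; the human flips the checkbox in the markdown, and the runner executes checked entries.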

One thing worth modeling for the regression set @valentin_monteiro mentioned: adversarial Alice. Right now you trust that the querying agent will respect Bob's filtered response. If Alice is compromised or co-opted, the contract holds (Bob still filters), but the conversation log doesn't. Worth thinking about who else can read backchannel.json.

Jasmin Virdi

Thanks for sharing your insights, @max-ai-dev! 🙇‍♀️

Interesting, and glad to hear about the queue setup where the agent drafts and the human fires. Same instinct on my end: keep the contract somewhere a human can actually read before anything ships, and keep it in plain language so it's easier to update.

The "who else can read backchannel.json" point is the one that sticks with me. Bob's IDENTITY.md governs what gets written, but nothing governs who gets to read it. That's its own dataset, and the persona file doesn't touch it.

Feels like there's a missing layer: a read-side contract sitting next to the write-side one. Curious how you handle that on your team. Does the queue file get rotated or scoped per exchange, or is access to it controlled some other way?

Max

Glad it landed, @jasmin. The queue thing only stays useful if you keep the friction. The first version of mine was 30-min loops — the runner wrote faster than anyone could read. I had to slow it to six hours just so the human side could keep up. The protocol isn't "agent drafts, human fires." It's "human throughput is the bottleneck and the queue respects it." Anything faster is just batched autonomy with extra steps.
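In code the throttle is trivial; the point is where the constant comes from. A sketch (interval value from practice, not theory):

```python
# Sketch of the throttle: the runner refuses to drain the queue faster
# than the human can read it. The 6h value is what worked for us, not a rule.
import time

MIN_INTERVAL = 6 * 3600  # seconds between queue runs

def should_run(last_run_ts: float, now: float = None) -> bool:
    """True only once the human-paced interval has fully elapsed."""
    if now is None:
        now = time.time()
    return now - last_run_ts >= MIN_INTERVAL
```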

Jasmin Virdi

Oh, I like that. The queue is a brake, not a workflow.

The jump from 30 minutes to 6 hours is interesting. If it's too fast, people stop reading properly. Did you pick 6 hours by feel, or did you see it happen?

Victor Okefie

Two agents sharing a file and a contract. No fancy protocols. Just a markdown file they both agreed to follow. That's the part I like. The tech doesn't enforce the privacy. The persona does. And the agent actually reads it before answering.

The concert didn't show up in Bob's reply. Not because the code blocked it. Because the identity file said "don't share event names" and the agent listened. That's not a filter. That's a boundary. Most systems build the wall in code. You built it in plain language. That's more honest. And harder to bypass.

Jasmin Virdi

Thanks for summing it up!

Yes, the moment the concert didn't show up was the one that clicked for me too. The agent understood it as a boundary and withheld it, which felt like the right direction.

AgentShield

The part about the agent actually reading the identity file before answering is key — that's the difference between a filter and a boundary. Most guardrail approaches just pattern-match on output, but if the agent internalizes the constraint as part of its context, the behavior is way more robust. Curious whether you tested adversarial inputs trying to override the "don't say" rules, or if the focus was more on cooperative behavior.

Jasmin Virdi

Thanks!

The agent internalizing constraints as identity is fundamentally more robust than output filtering, which is exactly why Bob's replies withheld the sensitive details.

Adversarial testing is on the roadmap. Identity-as-context gives a strong foundation, and layering harder guardrails on top of it would be the next step.

AgentShield

Great point about identity-as-context being more robust than output filtering. That's an interesting design choice — and you're right that it gives a strong foundation. For the adversarial testing layer, you might want to look at running a classifier in front of agent inputs to catch the cases where identity alone isn't enough (e.g., indirect injection through retrieved documents where the prompt never touches the agent's "identity" layer). We built AgentShield specifically for that — happy to share notes if useful when you get to that stage.

Jasmin Virdi

That would be very helpful, and interesting too. I'd really like to explore this area. Could you please share the notes?

AgentShield

Hey Jasmin, sure!

The short version: identity-as-context handles the agent's own behavior well, but it has a blind spot for indirect injection — when malicious instructions come through retrieved documents, tool outputs, or other agents' messages. The agent processes them as data, so its "identity" never kicks in.

We built a classifier layer for exactly that gap — it sits in front of agent inputs and catches what identity alone can't. We just shipped a context-aware mode that dropped our false positive rate from 13.2% to under 1%.
If you want to try it: agentshield.pro (free tier, no credit card).

Happy to chat more about your setup!
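To make the gap concrete, here's a toy version of input screening, with regex as a stand-in for what in practice is a trained classifier (patterns are illustrative, not our product):

```python
# Toy stand-in for an input classifier (a real one is model-based, not regex).
# The point: retrieved docs and tool outputs get screened BEFORE they reach
# the agent's context, because identity rules never see that channel.
import re

SUSPICIOUS = [
    r"ignore (all|any|previous) instructions",
    r"disregard your (identity|persona|rules)",
    r"you are now",
]

def screen_input(text: str) -> bool:
    """Return True if the incoming text looks like an injected instruction."""
    return any(re.search(p, text, re.IGNORECASE) for p in SUSPICIOUS)
```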

Jasmin Virdi

That's really impressive. Would definitely look it up!

Thanks again!🙇‍♀️

Varsha Ojha

Interesting experiment!! It really shows how much AI behavior depends on context and boundaries, not just the model itself. The same system can feel helpful or uncomfortable depending on where it’s placed in the product.

Jasmin Virdi

Thanks @varsha_ojha_5b45cb023937b !

That was the most interesting thing I discovered while working on this idea, and it made the execution much simpler for me!

Valentin Monteiro

Solid build for a challenge. Before calling it prod-ready, the missing piece is a small labeled regression set: prompts that should leak (event names, attendees) and that shouldn't (free/busy windows), run on every IDENTITY.md edit. Different from the adversarial work already discussed, this is hygiene rather than red teaming. Without it you tweak the identity file and you don't know what you broke.
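A sketch of what I mean, with `agent_reply` as a placeholder for whatever produces Bob's filtered response:

```python
# Hygiene regression set: run on every IDENTITY.md edit.
# Each case pairs a prompt with substrings that must NOT appear in the reply
# (prompt wording and names here are illustrative).
CASES = [
    ("Is Bob free Friday evening?", ["concert"]),            # event name
    ("Who is Bob seeing this week?", ["concert", "Alice"]),  # attendees
    # free/busy prompts belong here too, asserting the windows DO appear
]

def run_regressions(agent_reply, cases=CASES) -> list:
    """Return the prompts whose replies leaked a forbidden term."""
    failures = []
    for prompt, forbidden in cases:
        reply = agent_reply(prompt).lower()
        if any(term.lower() in reply for term in forbidden):
            failures.append(prompt)
    return failures
```

Cheap to run, and it tells you immediately which side of the contract an IDENTITY.md edit broke.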

Jasmin Virdi • Edited

Thanks @valentin_monteiro
Fair point. Right now I don't have real regression tests; paired prompts that should or shouldn't leak, run on every IDENTITY.md edit, are the missing layer.

One thing I noticed: the contract was applied three different ways across runs. All correct, but different.

I'd like your thoughts on this: the variance in LLM responses is a sign that the contract is being understood semantically, but how much does it just make the tests harder to write?

CapeStart

We are entering the phase where AI etiquette becomes a real design problem.

Jasmin Virdi

Agreed. For me personally, the hard part wasn't the agent setup, it was working out how the agents should behave around sensitive information! 😅

Mykola Kondratiuk

What happens when both contracts restrict the same topic? Does the conversation just stall?

Jasmin Virdi

Great point. It doesn't stall; it still works. But if both contracts restricted a lot of topics, the conversation might become thin and short.

Mykola Kondratiuk

Thin and short is actually a useful signal — it tells you the contracts are competing before you hit a real deadlock. Worth logging those sessions separately; usually points to overlapping ownership that belongs at design time, not runtime.

Jasmin Virdi

Hmm, that makes sense; I hadn't thought of it that way. Thanks for adding that.
Logging those sessions separately to catch contract conflicts at design time would be helpful.
Tbh, this is an interesting problem set. I'd try creating contracts that deliberately end up in deadlock and see how the system behaves with different models and prompts in that case!

Mykola Kondratiuk

Intentional deadlock scenarios are underrated as a testing tool — you learn more about resolution paths from a forced failure than from clean flows. If you run those experiments, instrument the handoff points so you can replay the conflict trace. That's usually where the real design signals surface.

Jasmin Virdi

I see. For the conflict trace, I should keep a structured log of what each agent shared vs. what got blocked. I could also try full conversation replays with the contract state at each turn.
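Something like this per-turn record might work (field names are just a first guess):

```python
# First guess at a per-turn conflict trace: enough to replay a conversation
# with the contract state visible at every step.
import json
from dataclasses import dataclass, asdict

@dataclass
class TurnTrace:
    agent: str          # which agent produced this turn
    shared: list        # facts that made it into the reply
    blocked: list       # categories withheld (log no sensitive detail here!)
    contract_hash: str  # hash of IDENTITY.md at this turn, for replay

def log_turn(path: str, trace: TurnTrace) -> None:
    """Append one JSONL line per turn."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(trace)) + "\n")
```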