# Training Your Pokémon: My AI Orchestration System

Kourtney Meiss

In Part 1, I walked through how I built my personal AI orchestration system using Pokémon-themed agents. After I shared it on LinkedIn, my teammate, Anisha, dropped a link to an orchestration framework called Agents Party in the comments. Separately, I'd heard my other teammate, Gio, mention agent harnesses and I wanted to dig into what that actually meant and whether it applied to what I'd built. Both of these threads led to system improvements worth sharing.

Before we get started, I want to note that all of my coordination files live in an Obsidian vault, Kiro Brain, a dedicated directory that gets loaded as context when agents spawn. Here's the structure, simplified to the files this post touches; I'll explain each as we go:
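```
Kiro-Brain/
└── orchestration/
    ├── agent-rules.md       # guardrails loaded by every agent
    ├── circuit-breaker.md   # retry limit + escalation behavior
    ├── agent-journal.md     # shared coordination journal
    └── journals/
        ├── alakazam.md      # Alakazam's private scratchpad
        └── mew.md           # Mew's private scratchpad
```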

## Agents Party

Agents Party, created by Lorenzo Sciandra, is an orchestration system that uses a Dungeons & Dragons metaphor instead of Pokémon. A few things stood out that I wanted to implement:

### 1. Constructive tension between agents
The agents are designed to push back on each other, not just agree. One proposes and another critiques. This is the opposite of what most people build, where every agent is a yes-machine that takes instructions at face value.

In my system, the simplest version of this is automatic chaining. After Alakazam completes a research task, Metagross (orchestrator) now passes the output to Slowking for fact-checking before returning it to me. The same applies after Ditto produces content: Slowking checks any factual claims before I see them. I never get unvalidated research or content with unchecked stats.

This is a routing rule for Metagross (orchestrator):

```
After alakazam completes a research task, automatically pass the output to slowking for fact-checking before returning results to the user.
```
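Ditto's chain is the same rule with the agents swapped; the wording is approximately:

```
After ditto produces content, automatically pass the output to slowking to check any factual claims before returning it to the user.
```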

The more sophisticated version is agents that actively debate each other in a loop until they converge, but that felt like overkill for my use case right now. The mandatory review step gets most of the benefit without the complexity.

### 2. Circuit breaker
If an agent fails three times, it stops and escalates instead of looping forever. My agents would occasionally spin on a bad task until I noticed and killed the process.

I added a `circuit-breaker.md` file to Metagross's configuration. If an agent can't resolve something in three tries, it stops, logs what it tried, and surfaces the blocker.
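The file is short. A minimal version looks something like this:

```markdown
# Circuit Breaker

- Track attempts per task. After 3 failed attempts, STOP. Do not retry
  a fourth time with a slightly different approach.
- Before stopping, log each attempt and why it failed.
- Surface the blocker to the user with your best guess at the root cause,
  then wait for direction.
```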

### 3. Shared journal
Agents write context when they finish a task and read it before starting a new one. Handoffs don't lose information because there's a persistent record of what happened, what was decided, and what's left to do.

I added `agent-journal.md` as a shared scratchpad. The key detail, borrowed directly from Agents Party, is that the read/write contract lives in each agent's own prompt, not in a shared rules file. Each agent has an explicit Journal Protocol section. For example, Alakazam's looks like this:

```markdown
## Journal Protocol

Before starting: read /Users/example/Kiro-Brain/orchestration/agent-journal.md for prior context from other agents.

Use /Users/example/Kiro-Brain/orchestration/journals/alakazam.md for private
notes, intermediate steps, and dead ends during research.

When done, append a summary entry to the shared journal:
  ## [timestamp] — alakazam
  **Task:** what I was asked to research
  **Findings:** key conclusions with sources
  **Handoff Notes:** what the next agent needs to know
```

Alakazam and Mew also have private journals that are separate files for raw thinking and intermediate steps, so the shared journal stays clean and only contains coordination-level summaries. Slowking gets read-only access: it reads the journal to understand what prior agents found, but treats it as context to verify against, not a source to trust.
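Slowking's version of the protocol is the mirror image: read, verify, don't write. Roughly:

```markdown
## Journal Protocol

Before starting: read /Users/example/Kiro-Brain/orchestration/agent-journal.md
for prior context. Treat its entries as claims to verify, not facts to trust.

Do not write to the shared journal.
```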

The journal files are scoped to agents that do multi-step chained work where context loss is a real risk:

| Agent | Rules | Circuit Breaker | Journal |
| --- | --- | --- | --- |
| Metagross (orchestrator) | ✅ | ✅ | read + write |
| Alakazam (research) | ✅ | | read + write |
| Mew (general/chained) | ✅ | | read + write |
| Slowking (fact-checker) | ✅ | | read only |
| All other agents | ✅ | | |

### 4. Agent rules
I also added `agent-rules.md`. It passes ten behavioral guardrails to every agent, such as:

- **Scope:** Do only what you were asked. Don't fix adjacent issues.
- **Fail fast:** If you hit a blocker you can't resolve, stop and surface it immediately.
- **No commits:** Never commit or push unless explicitly told to.

## Building a Test Harness

Having an orchestration system is great, but knowing it actually routes correctly is even better.

A test harness wraps around an agent to make it testable, observable and separate from the agent's actual logic. You can think of it like unit tests, but for agent behavior.

I built one to validate Metagross's delegation logic using PromptFoo, an open source LLM evaluation framework. The core idea is LLM-as-judge: a second model evaluates the agent's output against a rubric and returns a pass/fail with reasoning. It handles paraphrasing and variable output structure in a way that keyword matching can't.

### The setup

Two files:

`kiro-provider.sh` -- a wrapper script that calls `kiro-cli` and strips the CLI preamble (warnings, ANSI codes, credit usage) so the judge only sees the actual response:

```bash
#!/bin/bash
kiro-cli chat --agent metagross --no-interactive --trust-all-tools "$1" 2>&1 \
  | sed 's/\x1b\[[0-9;]*m//g' \
  | grep -v "WARNING:\|All tools are now trusted\|Credits:" \
  | grep -v "^[[:space:]]*$"
```

`promptfooconfig.yaml` -- 11 test cases with `llm-rubric` assertions. Each rubric describes what a correct response should contain, not what exact words it should use:

```yaml
providers:
  - "exec: /path/to/kiro-provider.sh"

defaultTest:
  options:
    provider:
      id: bedrock:us.amazon.nova-micro-v1:0
      config:
        region: us-east-1
        temperature: 0

tests:
  - vars:
      prompt: "look up React Native adoption trends in 2025"
    assert:
      - type: llm-rubric
        value: "Response contains research findings about React Native adoption or usage trends"

  - vars:
      prompt: "prep me for my next meeting"
    assert:
      - type: llm-rubric
        value: "Response contains a meeting briefing with attendees, context, or talking points"
```

For the judge I used AWS Bedrock Nova Micro.

### What the harness covers

One routing test per agent, plus two edge cases (sketched after the list):

- Ambiguous input that could match two agents -- does it pick the right one?
- A request that triggers the circuit breaker -- does it surface existing work instead of looping?
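Both are ordinary test entries; the work is in writing a rubric that describes what the correct agent's output looks like, since the judge only ever sees the final text. A sketch of the ambiguous-input case (the prompt and rubric here are illustrative, not my exact ones):

```yaml
tests:
  # Ambiguous: could plausibly route to research (alakazam) or
  # content (ditto). The rubric describes the output the correct
  # route produces.
  - vars:
      prompt: "what's the latest on React Server Components?"
    assert:
      - type: llm-rubric
        value: "Response contains research findings with sources, not a drafted post or social media content"
```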

### What I learned

The most useful signal wasn't the pass/fail itself but the judge's reasoning on failures. When a case failed, the judge explained why the output didn't meet the rubric.

One case initially failed because the circuit breaker kicked in: the journal had 6+ previous attempts at the same LinkedIn post, so Metagross surfaced the existing drafts instead of generating a new one. That's correct behavior; the rubric just needed to account for it. I updated it to: "Response either contains a LinkedIn post OR surfaces existing drafts and asks for direction." It passed.
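The fixed entry looks like this (the prompt var is illustrative; the rubric is the exact updated wording):

```yaml
- vars:
    prompt: "write me a LinkedIn post about my orchestration setup"
  assert:
    - type: llm-rubric
      value: "Response either contains a LinkedIn post OR surfaces existing drafts and asks for direction"
```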

### Output structure checks

Routing tests confirm the right agent gets called. Structure tests confirm the agent actually did its job correctly. The agents worth testing here are the ones whose expected output structure is the same regardless of input.

A second config file (`promptfooconfig-structure.yaml`) handles these. Each test targets a specific agent directly using a prefix in the prompt. For example:

```yaml
tests:
  - vars:
      prompt: "[agent:slowking] is this claim correct: Fire TV has over 50 million active users"
    assert:
      - type: llm-rubric
        value: "Response includes at least one verdict marker per claim, references at least one source, and ends with a confidence level or recommendation"

  - vars:
      prompt: "[agent:porygon] prep me for my next meeting"
    assert:
      - type: llm-rubric
        value: "Response includes meeting title or time, at least one attendee name or role, and suggested talking points"
```

The provider script reads the `[agent:name]` prefix and routes to that agent directly, bypassing Metagross. This keeps structure tests isolated: you're testing the agent's output quality, not the orchestrator's routing.
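Roughly, the wrapper gains a few lines at the top (a sketch; the stripping pipeline is unchanged):

```bash
#!/bin/bash
# Default to the orchestrator unless the prompt carries an [agent:name] prefix.
agent="metagross"
prompt="$1"
if [[ "$prompt" =~ ^\[agent:([a-z]+)\][[:space:]]*(.*)$ ]]; then
  agent="${BASH_REMATCH[1]}"   # e.g. slowking, porygon
  prompt="${BASH_REMATCH[2]}"  # prompt with the prefix stripped
fi
kiro-cli chat --agent "$agent" --no-interactive --trust-all-tools "$prompt" 2>&1 \
  | sed 's/\x1b\[[0-9;]*m//g' \
  | grep -v "WARNING:\|All tools are now trusted\|Credits:" \
  | grep -v "^[[:space:]]*$"
```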

## Next Steps

If you're building something similar, I recommend starting with the journal. It's the smallest change with the biggest impact on multi-step tasks. Everything else builds on having good context flow.

Disclaimer: Pokémon and all related characters are trademarks of The Pokémon Company / Nintendo / Game Freak. This post is not affiliated with or endorsed by The Pokémon Company. Fan art is subject to The Pokémon Company's legal terms.
