This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
I work in IT infrastructure for a fire and EMS communications center. I'm also a CERT member. I'm not an Emergency Operations Manager, but I work close enough to that world to understand what tabletop exercises actually cost in time and coordination. Getting six trained ICS personnel into a room at the same time, playing their roles correctly, and staying in doctrine for a discussion-based exercise that might run two hours is a significant logistical lift. For smaller agencies or training programs with limited staff, it often just doesn't happen.
That's the gap I wanted to close.
The ICS Tabletop Exercise Simulator is a Gemma 4-powered system that lets an Emergency Operations Manager run a fully staffed ICS tabletop exercise without coordinating a room full of people. The model simultaneously portrays six ICS positions: Incident Commander, Safety Officer, Public Information Officer, Operations Section Chief, Planning Section Chief, and Logistics Section Chief. Every response is grounded in NIMS 2017 doctrine, NQS Position Task Books, and ICS position checklists. Nothing is invented. If a behavior or authority isn't in the doctrine, it doesn't appear in the simulation.
This runs entirely through OpenWebUI with a structured workspace system prompt and a RAG knowledge base containing the official FEMA source documents. There's no custom app, no web development, no agent framework. The interface is a chat window. An EOM describes a scenario, and the simulator responds with every relevant position in ICS format, enforcing chain of command, communication protocols, and position-specific decision authorities.
I want to be direct about what this is: a proof of concept built by someone who supports the infrastructure that emergency management runs on, not by an EOM. I did my best to ground everything in doctrine and kept the RAG pipeline pulling from official FEMA documents to keep me honest. But this is a first build, and I'm saying that upfront.
The architecture in one paragraph: A self-hosted server runs OpenWebUI in Docker behind a LiteLLM proxy. The proxy routes inference to the Gemini API for Gemma 4 access. RAG uses ChromaDB for vector storage, bge-m3 for embeddings via local Ollama, and BAAI/bge-reranker-v2-m3 in a TEI container for hybrid search reranking. The knowledge base contains 148 documents converted to clean Markdown: NIMS 2017, NRF 4th Edition, HSEEP 2020, NQS Position Task Books for all six ICS positions, ICS forms, training course manuals, and HSEEP exercise templates.
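To make the inference path concrete, here is a minimal smoke test that calls the LiteLLM proxy's OpenAI-compatible endpoint directly, skipping OpenWebUI. This is a sketch under assumptions: the port, API key, and model alias are placeholders for whatever config.yaml and the proxy deployment actually define.

```python
# Hedged sketch: verify the OpenWebUI -> LiteLLM -> Gemini API path by calling
# the proxy directly. Assumes the proxy listens on localhost:4000 and that
# "gemma-4-26b-a4b-it" is the model alias defined in config.yaml.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-proxy-key")

response = client.chat.completions.create(
    model="gemma-4-26b-a4b-it",
    messages=[
        {"role": "system", "content": open("system-prompt.md").read()},
        {"role": "user", "content": "INJECT: Smoke showing from the third floor "
                                    "of a six-story residential structure."},
    ],
)
print(response.choices[0].message.content)
```

If this returns a six-position response in ICS format, the proxy routing and system prompt are working; OpenWebUI only adds the chat interface and the RAG layer on top.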
The behavior that makes it useful:
The system prompt enforces ICS communication protocols precisely, not approximately. The Safety Officer has unilateral stop-work authority without IC approval, because that's what NIMS says. The Planning Section Chief can communicate directly with section chiefs for information gathering, but cannot issue directives. The PIO holds all public messaging for IC approval before release. The OSC and LSC route all coordination through the IC. These rules are pulled directly from the position task books and encoded as hard constraints in the prompt.
The system also implements a source authority hierarchy. NIMS 2017, NQS Position Task Books, and ICS checklists are Tier 1 (authoritative). Course manuals are Tier 2 (supplementary). HSEEP templates are Tier 3 (reference only, not doctrine). When a PTB and a course manual both cover the same content, the PTB is cited. Exercise templates are never cited as doctrine. This hierarchy shapes how the model retrieves and represents source material.
A facilitator command set is built in. An EOM prefixes a message with // to step out of the simulation:

- // POSITION QUERY: [position] -- [question] queries a single position directly
- // STATUS REPORT returns a one-paragraph status from every position
- // DECISION POINT pauses exercise play for a structured discussion summary
- // UPDATE adds scenario detail without advancing time
- // RESET clears the scenario

The selective response logic means asking the OSC a direct question returns the IC and OSC only, not six responses when three of them have nothing to say.
Where the build genuinely earns its keep:
Rapid scenario iteration. An EOM can run a full six-position inject response in seconds, adjust the scenario, and run it again. What used to require scheduling six people now happens alone at a desk.
Doctrinal friction. The most valuable learning in a tabletop exercise happens when positions conflict: when the SO's stop-work authority collides with the OSC's tactical urgency. The system portrays that friction accurately rather than smoothing it over. In one test, the SO explicitly prevented an interior fire attack citing unverified structural integrity, the OSC escalated the resource gap to the IC, and the IC had to manage both simultaneously. That's the kind of decision-point pressure that makes exercises useful.
Position-specific training. The // POSITION QUERY command lets an EOM ask any position a direct doctrine question mid-exercise. Useful for both exercise facilitation and individual position study.
What already exists in this space:
I checked the market carefully before committing to this. Preppr.ai, EM1, Disaster Tech PRATUS, and Juvare are all serious commercial players in adjacent spaces. ThreatGEN AutoTableTop does AI-automated tabletop exercises, but for cybersecurity only. None of them do what this does: a single model simulating all six ICS positions, grounded in NQS Position Task Books, for solo practice by a single EOM. Preppr explicitly positions against the solo use case ("exercise design isn't a content problem, it's a coordination problem"). That's either a market gap or a market signal that the use case isn't wanted. I think it's the former, especially for smaller agencies and individual training. The honest framing is that this complements team-oriented platforms rather than competing with them.
Demo
The demo shows a structure fire scenario inject triggering a full six-position ICS response, followed by a // DECISION POINT facilitator command pausing exercise play for structured discussion. The simulation runs entirely in OpenWebUI with no custom app or interface, just a chat window and a system prompt.
Code
All configuration files are in the repository:
https://github.com/kkierii/ics-ttx-simulator
The repo contains:
- system-prompt.md -- the full OpenWebUI workspace system prompt, including role definitions, communication protocols, source authority hierarchy, facilitator command handling, response format, and behavioral rules
- config.yaml -- LiteLLM proxy configuration, including the Gemma 4 model entry and embedding/reranker routes
- openwebui-compose.yml -- Docker Compose for OpenWebUI
The system prompt is the primary artifact. It's what took the most iteration and the most doctrine research to get right. The behavior of the simulator lives almost entirely in that one file.
How I Used Gemma 4
I used gemma-4-26b-a4b-it, the 26B Mixture-of-Experts model, accessed via the Gemini API through a LiteLLM proxy.
The model choice wasn't arbitrary. The MoE architecture activates approximately 4B parameters per token while routing through 26B total parameters. For a workload that requires simultaneously holding six distinct role identities with different authorities, communication rules, and knowledge domains, MoE is a better fit than a dense model of equivalent size. A 31B dense model would be slower and more expensive per token with no quality advantage for this specific task. The MoE routing means the model can efficiently specialize per-token, which matters when it's switching between the IC framing incident objectives and the SO assessing stop-work conditions in the same response.
The 26B parameter pool also gives the model enough capacity to maintain doctrinal fidelity across complex multi-position responses. I tested this throughout development by running position-specific queries against the RAG knowledge base and checking results against the source PTBs. The model didn't confuse position authorities. It didn't have OSC making public information decisions. It didn't have LSC tasking Operations. It stayed in lane.
I also chose API deployment over local inference for a specific reason. This is how emergency management agencies and their vendors actually operate. A stack that requires a local GPU capable of running a 26B model puts this out of reach for most small agencies. API deployment, routed through an open-source proxy, means the same system prompt and knowledge base could be moved to a different inference provider or eventually to on-premises deployment as hardware becomes accessible, without changing the application layer.
Now, the parts that didn't go smoothly.
The RAG retrieval ranking problem. Even with the TEI reranker in the stack, course manuals consistently ranked above the authoritative Position Task Books for position-specific queries. The responses were doctrinally correct because the model knows the content, but citations pointed to training course materials rather than PTBs. The reason is semantic. PTBs are written in formal NIMS task language. Course manuals use plain instructional language that maps more naturally to how a question gets phrased. The embedding model scores semantic similarity and the course manuals win on that metric even when the PTBs carry higher authority. I mitigated this with the source authority hierarchy in the system prompt, which influenced the model's citation reasoning but couldn't override the retrieval ranking. The embedding layer runs before the model sees anything. Full resolution would require either a domain-specific embedding model trained on government technical documentation, or a custom reranking approach that weights document metadata. For a prototype this is acceptable. The answers are right. In a production deployment where citation accuracy is a compliance requirement, this is the next thing to solve.
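For what it's worth, the metadata-weighting idea could be sketched roughly like this. This is not in the current build; it assumes each chunk was tagged with a tier field at ingestion and that the reranker's scores are on a comparable scale across documents.

```python
# Hypothetical sketch of authority-weighted reranking -- not implemented in
# this build. Assumes each retrieved chunk carries a "tier" metadata field
# (1 = NIMS/PTBs/checklists, 2 = course manuals, 3 = HSEEP templates) set
# during ingestion.
TIER_WEIGHTS = {1: 1.0, 2: 0.85, 3: 0.7}

def rerank_with_authority(chunks, reranker_scores):
    """Blend semantic reranker scores with the source authority hierarchy."""
    weighted = [
        (chunk, score * TIER_WEIGHTS.get(chunk["metadata"].get("tier", 3), 0.7))
        for chunk, score in zip(chunks, reranker_scores)
    ]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)
```

The weights themselves would need tuning against real queries; the point is only that the authority hierarchy has to act on the ranking itself, not just on how the model cites whatever the ranking returns.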
The document conversion step mattered more than expected. Original documents were PDF, DOCX, and PPTX. OpenWebUI's default extractors produced garbled table text from ICS forms, fragmented bullet content from training slides, and merged columns from multi-column doctrine PDFs. Early testing produced one-sentence responses to substantive position queries despite correct source retrieval. After converting everything to clean Markdown using pymupdf4llm for PDFs, python-pptx for slide decks, and python-docx for Word documents, the same queries returned structured multi-point responses with correct form numbers and doctrine citations. The conversion fixed the core retrieval problem before any model tuning was needed.
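The conversion pass itself is a small script. The library entry points below (pymupdf4llm.to_markdown, python-pptx's Presentation, python-docx's Document) are the real ones; the directory names and how aggressively slides and paragraphs get flattened are my own assumptions about what a minimal version looks like.

```python
# Sketch of the Markdown conversion pass. Directory layout is an assumption;
# the goal is one clean .md file per source document for the RAG knowledge base.
from pathlib import Path

import pymupdf4llm                # PDFs -> Markdown, tables preserved
from docx import Document         # python-docx
from pptx import Presentation     # python-pptx

def to_markdown(path: Path) -> str:
    if path.suffix == ".pdf":
        return pymupdf4llm.to_markdown(str(path))
    if path.suffix == ".docx":
        return "\n\n".join(p.text for p in Document(str(path)).paragraphs if p.text.strip())
    if path.suffix == ".pptx":
        slides = []
        for slide in Presentation(str(path)).slides:
            slides.append("\n".join(s.text for s in slide.shapes if s.has_text_frame))
        return "\n\n---\n\n".join(slides)
    raise ValueError(f"unsupported format: {path.suffix}")

out_dir = Path("knowledge-base")
out_dir.mkdir(exist_ok=True)
for src in Path("source-docs").rglob("*"):
    if src.suffix in {".pdf", ".docx", ".pptx"}:
        (out_dir / f"{src.stem}.md").write_text(to_markdown(src), encoding="utf-8")
```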
The thinking loop. During testing I ran into a consistent issue with the most complex injects, specifically scenarios that require all six positions to respond simultaneously with significant doctrinal load, like a firefighter mayday with a stop-work trigger. The model would enter an extended internal reasoning loop, running self-correction passes against the system prompt rules before generating output. In some cases the reasoning ran long enough to hit timeout limits before the response arrived.
I tried several things. Setting reasoning_effort to 0 in OpenWebUI. Adding a budget_tokens cap in the LiteLLM Gemini provider config. Adding a RESPONSE DISCIPLINE block to the system prompt instructing the model to write immediately without pre-checking. Increasing the OpenWebUI client timeout via AIOHTTP_CLIENT_TIMEOUT. None of them fully resolved it for the hardest injects. The thinking loop is collapsible in OpenWebUI and not visible to the EOM by default, so it doesn't break the interface, but a response that times out is a real problem in a live exercise.
I'm not certain whether this is a model behavior issue, a LiteLLM passthrough issue where the reasoning parameters aren't reaching the Gemini API correctly, or something in my own configuration. It may be all three. Simpler injects complete reliably and cleanly. The issue surfaces specifically at maximum complexity, which in a real exercise would be the moments that matter most.
I'm documenting this because someone else building with Gemma 4 in a similar configuration should know it exists. And because pretending a first build has no rough edges doesn't help anyone.
What this project showed me: a single well-structured system prompt with a properly tiered RAG knowledge base can produce doctrinally accurate, role-specific simulation responses that would be genuinely useful for ICS training. The architecture is sound. The limiting factor right now is inference configuration, not the model's capability. On simpler injects, where the reasoning stays contained, the output quality is exactly what I was hoping for. Phase 2 would add the Finance/Administration Section and subordinate positions; the system prompt architecture was explicitly designed for that expansion.
This was my first attempt at building something in this space. I'm an IT infrastructure person who cares about emergency management. I built something that I think has real value, ran into real problems, documented both honestly, and shipped it anyway. That feels about right.