Three months ago I sat down, sketched the architecture of Spectrion in my head, and started writing code.
From the outside, the first version could be described very simply:
an AI agent for iPhone that does not just answer, but can act
But internally I did not want to build "a chat with tools".
I wanted an environment where an agent could keep working beyond a single message: create a reminder, continue tomorrow, watch a page, create a workflow, raise an alert, verify its own output, hand part of the work to a Mac or CLI runner, keep task state, remember unfinished items, and refuse to execute an action if policy says no.
So the starting point was not:
build a chat
then add functions
then somehow bolt automation on top
It was closer to:
the agent is the runtime
the chat is only one interface to it
iPhone was the first user-facing shell. But the idea was broader from the first day: tools, memory, task board, background jobs, workflows, approvals, policies, watchdog, self-created tools, device mesh, and managed business mode should not be separate islands. They should live inside one agent runtime.
In this article I will walk through why I started from an execution loop rather than a chat loop, what subsystems this required, and why function calling is only a small part of a real agent system.
What I Wanted
I did not want an assistant that says:
here is how you can create a reminder
I wanted an agent that creates the reminder.
Not:
here is how you can search and compare options
But:
search, compare, save the result, and return the conclusion
Not:
I can help you make a plan
But:
keep the task, check that steps were not forgotten,
continue if the work stopped too early,
and remind me when the next step is needed
The important difference is not that the agent can call a tool.
The important difference is that a task can live longer than one message.
For example:
Watch this page.
If a new version appears, explain the changes,
create a task to update the project,
and remind me tomorrow.
For a normal AI chat, this looks like a single request.
At execution level, it is a process:
1. Understand the task
2. Check the page now
3. Store a baseline
4. Create a monitoring loop
5. Wake up on schedule
6. Compare changes
7. Run reasoning when an alert appears
8. Create a task
9. Send a notification
10. Keep the follow-up for tomorrow
That is why I did not treat Spectrion as a mobile chat with functions. I needed a runtime for agentic work.
The shortest version:
not a chat with functions,
but a runtime for tasks that last longer than one message
Why Function Calling Is Not Enough
The basic architecture of an AI chat is:
user message
-> LLM
-> assistant message
With tools:
user message
-> LLM
-> tool call
-> tool result
-> LLM
-> assistant message
For simple requests this is enough.
Examples:
What is the weather tomorrow?
Create a reminder at 10:00.
But once the agent has to actually work, problems appear that cannot be reliably solved with a prompt.
A model can return a tool call as text. A provider can hang. A tool result can be too large. Context can overflow. The user can write a follow-up while the run is still executing. A scheduled task can arrive in the middle of another run. A proactive monitor can raise an alert. A subagent can finish a background task. A workflow can require approval. The model can say "done" while the todo list is still open.
I designed the system as if these cases were not exceptions, but the normal working environment of an agent.
| Problem | Why a prompt is not enough | What the runtime needs |
|---|---|---|
| Tool hangs | The model no longer controls the process | Timeout, cancellation, retry |
| Tool result is too large | Context overflows | Truncation, artifacts, summaries |
| User writes a follow-up during a run | Chat loop does not model competing events | Queues and execution lanes |
| Scheduled task arrives during other work | This is not just another message | Unified event queue |
| Model says "done" but task is not done | Model optimizes the answer, not state | Todo/task board + watchdog |
| Tool is forbidden by policy | Hiding the schema is not enough | Executor-level policy gate |
| Workflow may be dangerous | LLM may miss edge cases | Preflight review / approvals |
| Memory accumulates noise | "Remember everything" is unsafe | Scope, TTL, confidence, rollback |
Function calling is a way to ask the model to choose an action.
It does not answer these questions:
when should a task run again?
what happens if a tool hangs?
who checks whether the agent stopped too early?
where is task state stored?
how do you avoid showing the model a thousand tools at once?
how do you block a tool at execution level, not just in the prompt?
how does a proactive alert enter the reasoning loop?
how do iPhone, Mac, and CLI coordinate work?
The LLM does not answer those questions.
The runtime does.
Agent Runtime as an Operating Loop
In Spectrion, a normal chat turn is only one kind of input.
Different events can enter the same runtime:
manual user message
scheduled task
proactive alert
workflow node
subagent result
nested call
channel message
heartbeat check-in
Each event type has its own settings, but the pipeline is similar:
event input
-> classify event
-> prepare context
-> choose agent / model / tools
-> inject memory, skills, project context
-> stream LLM
-> parse tool calls
-> execute tools with policy, approval, timeout, progress
-> add tool results
-> loop again if needed
-> post-run tail:
watchdog
proactive queue
task board
memory proposals
workflow routing
channel routing
cleanup
The main difference from a normal chat is that an assistant message is not the only output.
The output can be:
created task
updated memory
scheduled workflow
tool artifact
notification
alert
subagent run
pending approval
task board update
audit event
So the agent is not "a model that answers".
It is a system that carries work.
In code, this loop is bounded and controlled.
AgentRuntime keeps ConversationRunState per conversation: streaming text, status, token budget, todo state, abort flag, and rolling summary. sendMessage runs preflight checks, directives, hooks, user-message persistence, the active task envelope, and media/link context, and only then enters the outer todo loop.
The tool loop is limited by clampedMaxToolIterations. If the model keeps calling tools for too long, the runtime stops the loop at a safety cap. Inside each iteration there are watchdog timeouts: a longer one for a normal turn and a shorter mode when queued user input or tool calls already exist.
Tool calls are not simply executed one by one. The runtime groups serial-only tools by name, runs other calls through a task group in parallel, and then restores result order. That lets independent calls run faster without breaking tools that require sequential execution.
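As a sketch of that grouping idea (the type and function names here are illustrative, not the real Spectrion APIs):

```swift
import Foundation

// Illustrative shapes; the real Spectrion types are richer.
struct ToolCall: Sendable { let index: Int; let name: String }
struct ToolResult: Sendable { let index: Int; let output: String }

// Serial-only tools run one by one; everything else fans out through
// a task group, and results are re-sorted into emission order.
func executeBatch(
    _ calls: [ToolCall],
    serialOnly: Set<String>,
    run: @escaping @Sendable (ToolCall) async -> ToolResult
) async -> [ToolResult] {
    var results: [ToolResult] = []

    // Sequential path keeps its original relative order.
    for call in calls where serialOnly.contains(call.name) {
        results.append(await run(call))
    }

    // Independent calls run concurrently.
    await withTaskGroup(of: ToolResult.self) { group in
        for call in calls where !serialOnly.contains(call.name) {
            group.addTask { await run(call) }
        }
        for await result in group { results.append(result) }
    }

    // Restore the order the model emitted the calls in.
    return results.sorted { $0.index < $1.index }
}
```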
What Lives Inside the Runtime
If you only look at the top-level pipeline, it is easy to underestimate how much infrastructure sits around it.
The runtime contains several layers that are not there for a nice demo, but for keeping long tasks from falling apart.
ContextManager and Compaction
ContextManager builds provider-visible context.
It can include:
system prompt
active skills
trusted project context
Memory V2 retrieval block
session working memory
Device Mesh context
recent messages
tool results
active task envelope
The context is not just "cut from the beginning".
There is MessageCompactor, TranscriptSummarizer, rolling summary, proactive compaction threshold, rehydration budget, and protected tail. Old tool results do not always need to be carried in full: MicroCompactor can replace old read-heavy tool results with short summaries while preserving recent results and important anchors.
There is also token-estimator calibration: the runtime compares its estimate with actual provider context usage and adjusts scale by provider/model.
The goal is:
preserve the meaning of old work
-> avoid context-window overflow
-> keep the active objective visible
-> avoid asking the user to repeat the task
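A minimal sketch of the micro-compaction part of this, with invented names and a deliberately crude token estimate:

```swift
// Invented names; the token estimate is deliberately crude.
struct StoredToolResult {
    var tokens: Int
    var summary: String   // short anchor, always kept
    var body: String      // full tool output
}

// Old, oversized results collapse to their summaries; the most
// recent results (the protected tail) are never touched.
func microCompact(_ results: inout [StoredToolResult],
                  protectedTail: Int,
                  threshold: Int) {
    guard results.count > protectedTail else { return }
    for i in 0..<(results.count - protectedTail) where results[i].tokens > threshold {
        results[i].body = results[i].summary
        results[i].tokens = results[i].summary.count / 4  // ~4 chars per token
    }
}
```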
Active Task Envelope
Long tasks need more than a todo list.
There is an ActiveTaskEnvelope: a provider-visible anchor for the active task.
It stores:
original visible user request
latest visible user update
current objective
constraints
exact final markers
pending todos
artifact promises
tool progress
lifecycle
awaiting user question
Lifecycle can be:
active
background
subagent
scheduled
proactive
waiting_for_user
completed
cancelled
If context is compressed, if a queued follow-up arrives, if a task goes into background mode or a subagent, the runtime can recover the active objective from the envelope instead of guessing.
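A rough sketch of what such an envelope can look like; the field names are illustrative, not the exact Spectrion definitions:

```swift
import Foundation

// Field names are illustrative, not the exact Spectrion definitions.
enum TaskLifecycle: String, Codable {
    case active, background, subagent, scheduled, proactive
    case waitingForUser = "waiting_for_user"
    case completed, cancelled
}

struct ActiveTaskEnvelope: Codable {
    var originalRequest: String      // first visible user request
    var latestUserUpdate: String?    // most recent visible follow-up
    var currentObjective: String     // what the run is trying to finish
    var constraints: [String]        // e.g. "do not use the web"
    var pendingTodos: [String]
    var lifecycle: TaskLifecycle
    var awaitingUserQuestion: String?
}

// After compaction or handoff, the runtime can rebuild the
// provider-visible anchor from the envelope instead of guessing.
func anchorBlock(from envelope: ActiveTaskEnvelope) -> String {
    """
    ACTIVE TASK (\(envelope.lifecycle.rawValue))
    Objective: \(envelope.currentObjective)
    Pending: \(envelope.pendingTodos.joined(separator: ", "))
    """
}
```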
Execution Lanes
Another boring but important thing: lanes.
The code has ExecutionLaneManager with lane types:
main
cron
subagent
nested
Each lane has its own concurrency limit. For example, scheduled work should not consume every execution slot, and subagents can work in parallel but not without limit.
Lane tokens have a generation. If a lane is cleared or cancelled, old tokens become stale and cannot accidentally release capacity for a newer generation of tasks.
This is not a flashy product feature, but without details like this, background execution turns into races very quickly.
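A minimal sketch of generation-tagged lane tokens, with invented limits and names:

```swift
enum Lane: String { case main, cron, subagent, nested }

// Generation-tagged tokens: a token minted before a lane reset can
// no longer release capacity for the next generation of tasks.
actor LaneManager {
    struct Token { let lane: Lane; let generation: Int }

    private let limits: [Lane: Int] = [.main: 1, .cron: 2, .subagent: 4, .nested: 2]
    private var inUse: [Lane: Int] = [:]
    private var generation: [Lane: Int] = [:]

    func acquire(_ lane: Lane) -> Token? {
        let used = inUse[lane, default: 0]
        guard used < limits[lane, default: 1] else { return nil }
        inUse[lane] = used + 1
        return Token(lane: lane, generation: generation[lane, default: 0])
    }

    func release(_ token: Token) {
        // Stale tokens from a cleared lane are ignored.
        guard token.generation == generation[token.lane, default: 0] else { return }
        inUse[token.lane, default: 0] = max(0, inUse[token.lane, default: 0] - 1)
    }

    func clear(_ lane: Lane) {
        generation[lane, default: 0] += 1
        inUse[lane] = 0
    }
}
```

The generation check is the whole point: clearing a lane invalidates every outstanding token at once.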
Queued Follow-Ups
A user can write a new message while the agent is still working.
That should not corrupt the current run.
Spectrion has queued follow-ups: the runtime stores the follow-up message, snapshot, delivery order, and can cancel, reorder, or edit queued messages. After the current run finishes, the next input is delivered cleanly.
So a follow-up during execution is not "the user interrupted the stream". It is a separate queued event.
Approvals
Approvals are not just a UI button.
ApprovalManager stores pending requests, keeps the last 100 approval records, waits up to five minutes, and supports persisted auto-approve rules:
always
same arguments
until date
count N times
Together with Business policy this gives two levels:
policy says whether the action is allowed at all
approval says whether it may happen now
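A sketch of how auto-approve matching can look (the enum shape is illustrative; the real persisted rules also track usage counts and survive restarts):

```swift
import Foundation

// Illustrative rule shape; the real persisted format is richer.
enum AutoApproveRule {
    case always
    case sameArguments(hash: String)
    case untilDate(Date)
    case countTimes(remaining: Int)
}

func autoApproves(_ rule: AutoApproveRule, argumentsHash: String, now: Date) -> Bool {
    switch rule {
    case .always:                 return true
    case .sameArguments(let h):   return h == argumentsHash
    case .untilDate(let limit):   return now <= limit
    case .countTimes(let left):   return left > 0  // real code decrements and persists
    }
}
```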
Skills, Project Context, Hooks, and Directives
The runtime can inject skills without stuffing one huge prompt into the model. Enabled skills are listed in a lightweight form, while full instructions are injected only for activated skills, within budget.
Project context is not blindly read from the folder. There is trust state, skipped sources, parse errors, project skills, project agents, and name conflicts. Untrusted project context should not become runtime instruction.
There are hooks:
agentBootstrap
agentBeforeRun
agentAfterRun
sessionCreate
sessionDelete
toolBeforeExecute
toolAfterExecute
memoryFlush
messageReceived
messageSent
preCompact
postCompact
This should be described carefully: plugin hook APIs and registries exist, but not every hook is deeply wired through every LLM/tool loop path yet. It is an extension surface, not a claim that the whole runtime is already middleware-driven end to end.
Directives give fast runtime control from chat:
/model
/think
/elevated
/verbose
/reset
/status
/compact
/skill
/agent
Together, this turns Spectrion from a single LLM call into a controllable execution environment.
Provider Layer, Capability Routing, and Diagnostics
Another layer that is almost invisible from the outside: the runtime should not be tied to one model or one API format.
There is a provider layer for the Spectrion Pro proxy, direct Anthropic/OpenAI providers, Ollama/local models, a custom OpenAI-compatible endpoint, a custom Anthropic-format endpoint, and an Apple Foundation/on-device provider.
This is not only about picking "the smartest model".
Different providers handle these things differently:
tool schemas
streaming
vision
max context
reasoning/thinking settings
prompt caching
multimodal content
fallback
So ToolDefinition can format schemas for OpenAI, Anthropic, and Ollama. OpenAI strict schema is enabled only when it is actually safe: when all properties are required. Otherwise, making a schema strict can make the tool harder for the model to call or break the call entirely.
ProviderRequestBuilder does boring but important work: it cleans system/history messages, prevents the active task envelope from being duplicated, preserves it even in emergency-compaction mode, and builds a provider-visible request where the model can see the correct task state.
There is also ProviderVisibleContextDiagnostics. It breaks visible context into categories:
system
skills
project context
memory
tools
MCP
history
images
rehydration
active envelope
free space
compact threshold
This helps avoid guessing why the model did not see an important instruction. You can inspect what actually went to the provider.
There is also exact request diagnostics: before OpenAI/Anthropic/Proxy HTTP calls, the provider saves a fingerprint, redacted request-body preview, token estimates, tool/image counts, and flags such as context-management. This is not a permanent full raw-request log, but it is much better for debugging provider-visible behavior than looking only at the chat transcript.
The server layer is not just a thin proxy for chat completions either. It has endpoints for completions, media analysis, embeddings, rerank, media encoding, image edit, STT, TTS, voice cloning, video generation, watchdog, and steward. On top of that sit capability resolver, tier/account/provider fallback, retries, SSE sanitization, usage logging, and business/subscription accounting.
The idea is that the model should be a replaceable part of the system. The agentic loop should live above a specific provider API.
ToolCatalog and ToolExecutor
Tools are the obvious part of an agent system. But early on I separated two different problems.
First:
which tools should the model see?
Second:
how do we safely execute the selected tool?
The first is handled by ToolCatalog.
The second is handled by ToolExecutor.
Why You Cannot Put All Tools Into the Prompt
When there are only a few tools, you can pass all schemas to the model.
But if there are tens, hundreds, or thousands of tools, this starts hurting quality.
Context grows. The model chooses worse. Similar tools compete. Tool descriptions take space that should belong to the task, memory, and working state.
So tools are split into layers.
There is a built-in core: native/system tools for web, files, device, calendar, reminders, notes, notifications, memory, knowledge base, workflows, subagents, Shortcuts, Health, maps, weather, PDF, Office, ZIP, cloud files, and other surfaces.
On macOS there are desktop capabilities on top: shell, filesystem, Git, AppleScript, screen capture, browser automation, Docker sandbox, process control, patching, and other tools that are impossible or should live in a different execution environment on a phone.
There is also a Shop / external capability layer: today it has 1495+ tools on top of the built-in core. But this does not mean 1495 schemas are inserted into the prompt at once.
The important part is activation.
The principle is:
the model should not see every available tool,
it should see the relevant tools for the current step
Activation considers:
keywords
categories
explicit activation
user language
project context
policy
negations
For example, if the user says:
do not use the web, check only local files
then web tools should not activate just because the query looks like research.
That is a small detail, but it matters for trust.
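A toy sketch of negation-aware activation; the real matching also weighs categories, user language, project context, and policy, but the shape is roughly this (names hypothetical):

```swift
// Hypothetical names; real activation is richer.
struct ToolEntry { let name: String; let category: String; let keywords: [String] }

func activeTools(for query: String, catalog: [ToolEntry]) -> [ToolEntry] {
    let lowered = query.lowercased()
    // Toy negation detection; real parsing is richer.
    let suppressed: Set<String> = lowered.contains("do not use the web") ? ["web"] : []
    return catalog.filter { tool in
        guard !suppressed.contains(tool.category) else { return false }
        return tool.keywords.contains { lowered.contains($0) }
    }
}
```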
The Executor Matters More Than the Catalog
The catalog decides what to show the model.
Real guarantees live in the executor.
ToolExecutor checks:
whether the tool is allowed by policy
whether approval is required
whether arguments are too large
whether result is too large
whether the task has been cancelled
whether timeout expired
whether the tool can run in this environment
whether an audit event should be recorded
The key principle:
policy does not only hide the tool from the LLM,
policy blocks execution in ToolExecutor
If a tool is forbidden, removing its schema from the prompt is not enough. The model may try to call it from older context, a nested call, a workflow, or a malformed tool call.
The boundary must live at execution level.
Otherwise it is not a security boundary. It is decoration.
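A minimal sketch of the executor-side gate, assuming a manifest-style allow list and a default-deny flag (names invented):

```swift
enum ToolDecision { case allow, requiresApproval, deny(reason: String) }

// Even if the schema was visible to the model earlier, execution is
// re-checked here, at call time, against the current policy.
func gate(toolName: String,
          allowed: Set<String>,
          approvalRequired: Set<String>,
          defaultDeny: Bool) -> ToolDecision {
    if approvalRequired.contains(toolName) { return .requiresApproval }
    if allowed.contains(toolName) { return .allow }
    return defaultDeny ? .deny(reason: "not in allow list") : .allow
}
```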
Artifacts and Native UI
A tool result in Spectrion is not only a string.
A tool can return:
text
summary
image
file
artifact reference
structured payload
There is an artifact contract. It gives stable artifact IDs, attachment references, edit references, storage scope, and validators that prevent the model from accidentally answering with a raw absolute path like /Users/.../file.png.
This sounds small until the agent starts working with images, PDFs, Office files, downloaded archives, generated media, and long-running tool results.
If an artifact is already attached to the message, the model gets guidance not to attach it again. If a file or image needs editing, the model should use an artifact reference rather than inventing a path.
There is also another layer: render_ui.
The agent can return not only text, but a JSON spec for native UI. The A2UI parser supports layout/content/input/data/media/composite components:
vstack, hstack, zstack, scroll, grid
text, image, icon, divider, spacer
button, textfield, toggle, slider, picker, stepper
list, table, chart
map, webview
card, alert, sheet, form, progress
There are limits on depth and node count, so the model cannot generate an infinite UI tree.
The practical value: the agent can return a small working interface inside the answer. For example, an approval form, comparison table, task dashboard, workflow card, or research result panel.
This is another reason Spectrion is not limited to an assistant message. Output can be UI state, artifact, or action.
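A hypothetical render_ui spec might look roughly like this (component names from the list above; the exact schema is internal):

```json
{
  "type": "vstack",
  "children": [
    { "type": "text", "value": "Release 2.4: pricing changed" },
    { "type": "table", "rows": [["Old", "349"], ["New", "289"]] },
    { "type": "button", "label": "Create follow-up task", "action": "create_task" }
  ]
}
```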
Project Workspace, Mutation Journal, and Rollback
Code/workspace tasks have their own project layer.
Workspace does not mean "the agent can read and write anywhere". The project model has capabilities:
browse files
edit files
view git changes
search files
load project context
load project skills
load project agents
There is a path policy that should keep the agent inside the selected workspace root. File tree service handles hidden files, binary files, symlinks, unreadable entries, max depth, max entries, max editable bytes, and .gitignore.
Project context is loaded separately from ordinary memory. It has trust state, skipped sources, parse errors, project skills, project agents, and conflict detection. If project context is not trusted, it should not become runtime instruction.
For mutations there is WorkspaceMutationJournal and WorkspaceChangeTracker. They record file changes with before/after hashes, snapshots, and rollback eligibility. If a change can be rolled back, the runtime knows where the snapshot is. If not, that is explicit too.
This matters for macOS/CLI scenarios. When the agent edits a project, "I changed files" is not enough. You need a trace: what changed, how it is evidenced, and whether it can be rolled back.
Tasks the Agent Should Not Drop
One of the main problems with AI agents: the model can end with a nice answer even though the task is not done.
The user says:
Prepare the feature launch.
Check the copy, gather bugs,
prepare the changelog, write the post,
make the release checklist,
and do not stop until every item is closed.
A normal assistant often turns this into a good list.
But the user did not ask for a list. They asked the agent to carry the work.
That requires task state.
Simplified:
TodoManager
-> pending
-> in_progress
-> completed
And a more project-level layer:
TaskStore
-> status
-> priority
-> dependencies
-> blockers
-> assigned agent
-> claim session
-> claim expiration
After the agent loop, the runtime checks:
are there pending items?
is there in_progress work without result?
were there tool failures?
are there blockers?
did the agent say "done" too early?
If the task is not closed, the runtime can continue instead of stopping.
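In sketch form, with invented names, that check is a plain predicate:

```swift
// Invented names; the real check also consults the task board.
struct RunTail {
    var pendingTodos: Int
    var inProgressWithoutResult: Int
    var toolFailures: Int
}

func shouldContinue(_ tail: RunTail, modelSaidDone: Bool) -> Bool {
    if tail.pendingTodos > 0 { return true }
    if tail.inProgressWithoutResult > 0 { return true }
    // "done" next to failed tools is exactly the case the watchdog exists for
    if modelSaidDone && tail.toolFailures > 0 { return true }
    return false
}
```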
My principle:
the agent should not stop only because
the model produced a nice final message
This changes UX.
The user should feel not "I received text", but "the task is being carried".
Watchdog and Steward
The task board stores state, but there also needs to be a checking layer.
Spectrion has ChatWatchdog and AgentSteward.
Watchdog looks at the moment the agent stops and asks:
is it actually okay to finish this run?
Example:
User:
Check 5 sources and compare options.
Agent:
Checked 3 sources and wrote "done".
Watchdog:
User requirement was 5 sources.
Evidence exists only for 3.
Completion rejected.
Runtime:
Continue with the remaining sources.
Steward is a more general verification layer.
It can run in modes:
completionJudge
toolResultVerify
taskBoardGroom
subagentSupervise
workflowPreflight
workflowPostflight
It checks:
whether the task is actually complete
whether a tool result looks incomplete
whether a workflow is safe before launch
whether approval is required
whether a subagent result is valid
whether the task board should be updated
This is not a magical quality guarantee.
But architecturally it is better than hoping the main agent always realizes what it forgot.
The core idea:
the executor should not be the only judge of its own work
How deeply can the steward challenge the main model?
Not infinitely and not without rules.
It can return verdicts such as:
no_action
continue_now
create_tasks
blocked
For completionJudge, that means: if the main agent says "done", but evidence does not match the original task, the steward can reject completion and ask the runtime to continue.
For workflow/subagent work, it can suggest follow-up tasks, flag a problem, block a dangerous result, or send work back for revision.
But there are limits around this.
ChatWatchdog makes at most two nudges per user request. In steward mode it waits two seconds after the turn stops, and idle checks run after five minutes. Watchdog context is limited to recent messages, the steward request uses a low budget and maxTokens: 10000, and the network timeout is 30 seconds.
AgentSteward also has client-side budget gates:
do not run in Low Power Mode
do not run on bad network
do not run when runtime is busy
do not run when queued user input exists
do not repeat an already applied idempotency key
cool down after rejected actions
limit calls per hour / day
If steward is unavailable, steward mode does not automatically fall back to an older, less governed judge. Skipping the check is safer than running a different uncontrolled loop.
So steward is not "a second model arguing forever with the first".
It is a policy-gated review layer with idempotency, budgets, cooldowns, and strict continuation limits.
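A sketch of the verdict shape and the client-side budget gate (verdict strings from above; the gate fields and limits here are illustrative):

```swift
// Verdict strings from the article; gate fields are illustrative.
enum StewardVerdict: String, Decodable {
    case noAction = "no_action"
    case continueNow = "continue_now"
    case createTasks = "create_tasks"
    case blocked
}

struct StewardBudget {
    var lowPowerMode: Bool
    var badNetwork: Bool
    var runtimeBusy: Bool
    var queuedUserInput: Bool
    var callsThisHour: Int
    var maxCallsPerHour: Int
}

// If any gate fails, the check is skipped rather than degraded.
func canRunSteward(_ b: StewardBudget) -> Bool {
    !(b.lowPowerMode || b.badNetwork || b.runtimeBusy || b.queuedUserInput)
        && b.callsThisHour < b.maxCallsPerHour
}
```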
Proactive Tools: The Agent Can Raise Its Hand
Normal tools are called when the model decides to call them.
I wanted the agent to be able to observe.
That is how proactive tools appeared.
A proactive tool is a scripted tool running in a background polling loop. It can wake up on a schedule and check a page, API, file, metric, or some other source.
If nothing happened, it returns:
null
If something happened, it returns an alert:
{
  "type": "price_drop",
  "oldPrice": 349,
  "newPrice": 289,
  "url": "..."
}
The alert enters ProactiveExecutionQueue, and the runtime handles it as a new input.
The path:
JS monitor wakes up
-> checks source
-> returns alert
-> ProactiveExecutionQueue
-> AgentRuntime handles proactive alert
-> agent reasons about it
-> user gets result / task / notification
Example:
Watch the competitor changelog.
If a new release or pricing change appears,
write a short competitive note and create a task.
A normal chat does not wake itself up.
An agent runtime can.
In code, proactive scripted tools have their own polling manager: max concurrent checks, persisted running flags, auto-reconnect, structured JSON alerts, alert routing, and auto-stop after repeated errors.
Heartbeat: The Agent Exists Outside the Chat
Proactive tools are observers.
There also needs to be a clock.
That is HeartbeatManager.
It periodically performs service work:
scheduled tasks
scheduled workflows
maintenance
morning briefing
periodic check-ins
Evolution cycle
Mesh leadership
Periodic check-in looks like this:
runtime:
check whether anything needs attention
agent:
HEARTBEAT_OK
If nothing is happening, the user sees nothing.
If there are blockers, alerts, follow-ups, or claimable tasks, the runtime can continue work or send a notification.
This matters because an agent should not exist only at the moment the user writes a message.
Some tasks need to live for hours, days, or weeks.
Channels and Native Surfaces
The in-app chat is only one entry point.
There is a ChannelManager and channel implementations:
Telegram bot
Telegram user channel
Slack
Discord
WhatsApp
Email
ChannelType also includes SMS and custom channels, but those are reserved types. In the current ChannelManager.connect, ready implementations exist for Telegram, Telegram User, Slack, Discord, WhatsApp, and Email; SMS/custom should be treated as unsupported until an adapter is added.
A channel has config, credentials, enabled/autoConnect flags, and status. Configs are stored locally and synced through Mesh. Incoming messages arrive as ChannelMessage, and responses go back through ChannelManager.sendResponse with retry.
Heartbeat checks channel connections separately, and auto-connect is performed only by the Mesh execution leader. This matters so two devices do not both reply in the same Telegram/Slack thread.
From the outside:
Telegram / Slack / WhatsApp / Email
-> ChannelManager
-> AgentRuntime
-> same tools / memory / policy / approvals
-> response back to channel
There are also native Apple ecosystem surfaces:
widgets
Live Activities
Share Extension
App Intents / App Shortcuts
Watch companion
macOS menu bar
macOS Services
voice overlay
wake word daemon
camera stream
There are important boundaries here too. WidgetKit, App Intents, WatchConnectivity, Share Extension, and Spotlight are wired as native surfaces. Live Activity UI exists, but starting a Live Activity is intentionally disabled in the manager right now, so the honest phrasing is "the surface is prepared", not "it always runs in production". Wake word is VAD + Apple Speech recognition for a phrase, not a separate embedded hotword model.
These are not separate little assistants. They should lead the user into the same runtime.
Examples:
selected text in any macOS app
-> Services -> Ask Spectrion
-> runtime receives the task
said wake phrase
-> voice command captured
-> runtime handles the task
-> TTS responds
-> wake-word listens again
sent a file through Share Extension
-> attachment enters the app
-> agent can analyze / save / create task
For me this is an important part of the idea: if the agent runtime is real, it can have many inputs and outputs, but state and rules must remain shared.
Server Layer
Part of Spectrion lives outside the app.
The server has routes for:
auth
subscription
chat proxy
webhooks
admin
OAuth
web
OpenAI plan proxy
CLI deploy
plugins
channels
Telegram user
community store
mesh
business
compat
/v1/chat is not just "forward request to LLM". It has subscription/business usage checks, multi-provider routing, retries, SSE streaming, active stream limits, provider fallback paths, account pool, capability resolver, and separate endpoints for capabilities:
chat completions
media analysis
embeddings
rerank
media embeddings
image edit / generation
speech-to-text
text-to-speech
voice clone / delete
watchdog / steward
video generation
/v1/channels stores channel registrations, encrypts credentials with AES-256-GCM, supports server-side polling for Telegram/email, and exposes pending messages to devices.
/v1/mesh handles pairing, device registration, device list, polling fallback, and ack for relay messages.
There is also a community store for skills/tools/MCP, plugin routes, CLI deployment, Telegram userbot, desktop releases, app version/config, and a large Business API.
This matters because Spectrion is not only a local app. The native runtime brings execution closer to the user and device, while the server layer handles provider routing, subscriptions, sync, channels, Mesh relay, store, business control plane, and capability endpoints.
Workflows Inside the Agent Runtime
One tool call is enough for simple actions.
But real tasks often become graphs.
Example:
Every Monday morning, check three sources.
Collect a summary.
If there are important changes, create a task.
If risk is high, notify me.
That is a workflow.
The user-facing surface for this in Spectrion is the manage_workflows tool.
Node types include:
trigger
action
condition
delay
transform
llm
http
script
loop
parallel
notify
end
The important part: workflow should not be a separate automation island.
If a workflow calls a tool, it should go through the same ToolExecutor as the normal agent.
That means the same:
approvals
policies
timeouts
audit
result limits
Business gates
Simplified:
WorkflowEngine
-> node execution
-> ToolExecutor
-> policy / approval / timeout / audit
-> result
-> next node
This lets workflows, proactive monitors, and chat agent live inside one runtime.
There is a parallel node inside workflow graphs, but the WorkflowEngine itself is guarded as a single-run engine. So I would not describe it as "unlimited parallel workflow executions". The honest statement: a graph can have parallel branches within a run, while system-level concurrency is handled through heartbeat, proactive queue, and lanes.
The user can write an ordinary sentence:
Watch this API.
If the status changes, check details,
write a short explanation,
and create a task if a reaction is needed.
The agent can:
search existing tools
-> no matching tool
-> create scripted monitor
-> test it
-> register it
-> build workflow
-> schedule it
-> notify only when something changes
This is not "AI wrote instructions for configuring automation".
This is the agent building automation from the conversation.
Self-Extension: The Agent Can Create Tools
I wanted the agent not to be limited to the tools that shipped with the app.
If the required tool does not exist, the agent can create it.
There is a tool for that: create_tool.
But the boundary matters.
"The agent writes tools for itself" sounds good, but without restrictions it is dangerous.
So self-extension is not just "save JS into a file".
The process looks like:
agent writes tool
-> checks similar tools in catalog / Shop
-> reads sandbox API reference
-> validates syntax
-> runs sandbox/security audit
-> checks built-in name collisions
-> persists ScriptedToolDefinition
-> registers in ToolCatalog
-> activates tool
-> runs auto-test with provided args
-> keeps version history / rollback point
Restrictions:
sandbox
versioning
secret fields
rollback
test runs
policy gates
approval requirements
On iOS, dynamic tools run through a JS sandbox. There is a limited API for HTTP, persistent KV, crypto, sandboxed FS, DB, HTML/image helpers, and other things needed for integrations.
On macOS the surface is broader: scripted tools can be JavaScript, Python, Shell, Ruby, Node.js, Go, Rust, Swift, and Perl. If a runtime is not available locally, Docker fallback is possible.
Proactive mode is also important: a tool can get an interval and instructions, wake up in the background, and return an alert only when something important happens.
create_tool is not only create. In code it has actions for edit, list, delete, test, templates, api_reference, history, rollback, and start/stop/status for proactive tools. So self-extension is a lifecycle, not one-shot file generation.
The broader the surface, the more important policy becomes.
Main principle:
the agent can expand capabilities,
but it must not grant itself new permissions
If a tool needs a new secret, the user must fill it explicitly.
If a tool performs a mutating action, approval may be required.
If a tool is forbidden by policy, the executor must reject it even if the model created it.
Personal Tools and Community Store
Self-extension is not limited to local experiments, but it does not publish anything automatically.
When the agent creates a tool through create_tool, it is a local/project/user capability. It can be used immediately in the current runtime, workflow, or proactive loop, but publishing it externally should be a separate deliberate action.
There is a Community Store / Shop layer.
The server has routes for:
community skills
community tools
community MCP servers
search
install
reviews
downloads
my published items
pending / approved / rejected moderation states
The model:
agent creates personal/project tool
-> user tests it locally
-> tool can be reused in workflows
-> tool can be published to community store separately
-> other users install approved tools from Shop
This boundary matters. The agent can quickly build a missing tool for itself, but community distribution should not happen without a review/publish flow.
Memory V2: Scoped Memory, Retrieval, and Rollback
It is easy to imagine agent memory as one large memory.md.
For a strong agent, that is a poor model.
In Spectrion, memory now consists of several layers:
legacy markdown memory
structured Memory V2 records
MemoryProposal queue
snapshots / rollback
semantic memory
SQLite vector store
conversation recall FTS
session working memory
project context bridge
memory policy
This is not one folder of notes. It is a runtime subsystem.
Structured Memory
MemoryRecord has scope:
user
agent
conversation
project
global
And type:
preference
fact
instruction
decision
projectRule
workflow
correction
summary
Plus metadata:
source
confidence
sensitivity
expiresAt
status
provenance
linkedVectorChunkIds
createdAt
updatedAt
Source is also typed:
manual
automaticExtraction
conversationSummary
document
skill
projectRule
migration
tool
Sensitivity:
publicFact
personal
privateNote
secret
Status:
active
archived
rejected
Why does this matter?
Because "remember this" can mean different things.
Remember that I dislike long answers.
That is a user preference.
Remember that this project cannot change the API without approval.
That is a project rule.
Remember until the end of the release that we use feature flag X.
That is a temporary rule with TTL.
Remember this only for the current conversation.
That is conversation scope.
A flat memory cannot reliably distinguish these cases.
Memory Proposals
Not every memory change should immediately become fact.
There is MemoryProposal with operations:
create
update
archive
delete
And statuses:
pending
approved
rejected
This matters for automatic extraction and business scenarios: the agent can propose a record, but the runtime or user decides whether to apply it.
It is important not to overclaim here. The MemoryProposal pipeline exists, but this does not mean every model response automatically becomes a neat V2 proposal. A visible part of the current proposal flow is tied to legacy MEMORY.md migration and sensitive candidates. Manual memory.save does a dual-write: legacy markdown, a Memory V2 record, and a semantic chunk for search.
Store, Snapshots, and Rollback
MemoryV2Store stores records, proposals, and snapshots. It can:
upsert record
deduplicate equivalent records
archive / delete
scoped reset
reset all
search records
export markdown
submit proposal
approve / reject proposal
sync project context
create snapshot
rollback snapshot
rollback with linked vector restore
produce vector repair report
Snapshots store records/proposals and linked vector chunks when those chunks are included in the snapshot. So rollback covers structured memory state and can restore linked semantic chunks through the vector restore path, but it is not a promise to restore every embedding everywhere.
Before saving, a redactor runs. It is best-effort regex redaction for:
api keys
tokens
passwords
secrets
bearer tokens
/Users/... paths
This is important: the agent should not turn long-term memory into a random secret dump. But this is not a DLP system or a mathematical privacy guarantee. It is a runtime protection layer.
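As a sketch, such a best-effort pass is only a few regex rewrites (the patterns below are simplified examples, not the real rule set):

```swift
import Foundation

// Simplified example patterns, not the real rule set.
func redact(_ text: String) -> String {
    let patterns = [
        "(?i)bearer\\s+[A-Za-z0-9._-]+",  // bearer tokens
        "sk-[A-Za-z0-9]{20,}",            // API-key-looking strings
        "/Users/[^\\s]+"                  // local user paths
    ]
    var result = text
    for pattern in patterns {
        result = result.replacingOccurrences(
            of: pattern, with: "[REDACTED]", options: .regularExpression)
    }
    return result
}
```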
Memory Tool
The user-facing surface is the memory tool.
It supports actions:
save
read
recall
search
list
delete_entry
clear
index
stats
On save, it performs dual-write: legacy markdown memory is saved for compatibility, while a MemoryRecord is created in Memory V2 and a semantic chunk is added for search.
On recall, semantic search is used. If semantic memory is unavailable, there is keyword fallback.
On index, conversation history can be reindexed.
On stats, you can see how many chunks exist in memory, conversations, documents, and skills.
Memory Policy and Conversation Modes
Memory should not always behave the same way.
MemoryPolicy decides on operations:
readPersistentMemory
recallMemory
recallConversationMemory
writeManualMemory
writeAutomaticMemory
indexPersistentMemory
indexConversation
summarizeConversation
flushSemanticMemory
sameConversationRecall
toolEvidenceRecall
deleteMemory
clearMemory
readStats
Conversation memory modes change behavior:
full
auto
hybrid
standard
toolsOnly
off
isolated
For example, in standard/auto, cross-conversation recall can be blocked; in toolsOnly, only same-conversation/tool-evidence recall can be allowed; in off, almost every memory operation is blocked except stats.
So memory is not global magic. It is a policy-controlled part of the runtime.
Retrieval Planner
When runtime prepares a prompt, it does not just insert all memories.
MemoryRetrievalPlanner builds a mixed plan from:
structured records
semantic vector results
The plan filters by:
status
expiry
scope
cross-conversation rules
policy decisions
linked semantic chunks already covered by records
token budget
max items
Scoring for structured records considers:
query overlap
scope priority
confidence
recency
For semantic candidates:
vector score
keyword score
combined score
source priority
recency
MemoryRuntimeContextBuilder then renders selected records into a dedicated Memory V2 block and can include a debug trace: what was included, what was excluded, and why.
This is the difference from simple memory:
good memory is not "insert more",
but select the right things for this task and token budget
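A toy version of that planning step, with made-up weights and a naive greedy selection under a token budget:

```swift
// Made-up weights; the real planner also applies policy and
// cross-conversation rules before anything is scored.
struct MemoryCandidate {
    var queryOverlap: Double    // 0...1
    var scopePriority: Double   // conversation > project > user > global
    var confidence: Double      // 0...1
    var ageDays: Double
}

func score(_ c: MemoryCandidate) -> Double {
    let recency = 1.0 / (1.0 + c.ageDays / 30.0)  // decays over ~a month
    return 0.4 * c.queryOverlap + 0.25 * c.scopePriority
         + 0.2 * c.confidence + 0.15 * recency
}

// Greedy, budget-limited selection: best-scoring items that still fit.
func select(_ candidates: [(MemoryCandidate, tokens: Int)], budget: Int) -> [MemoryCandidate] {
    var remaining = budget
    var picked: [MemoryCandidate] = []
    for (candidate, cost) in candidates.sorted(by: { score($0.0) > score($1.0) }) {
        guard cost <= remaining else { continue }
        picked.append(candidate)
        remaining -= cost
    }
    return picked
}
```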
Semantic Memory and Vector Store
This is a separate runtime layer.
Semantic memory works with chunk sources:
memory
conversation
document
skills
The current runtime indexes persistent memory, conversation transcripts, and documents from knowledge base. Source skill is supported at the model/statistics level so the layer can be used for skill retrieval.
The code has extraction patterns for multiple languages, including Russian, English, German, Spanish, French, Arabic, Hindi, Japanese, Korean, Portuguese, and Chinese.
VectorStore stores chunks in SQLite:
chunks table
chunk_embeddings table
FTS5 index
LSH buckets
ANN graph edges
provider/model/namespace metadata
So semantic memory is not just an array of embeddings in process memory. It is a local index with metadata, search, stats, migration, and repair paths.
Knowledge Base
Next to memory there is KnowledgeBase.
It is not exactly user memory; it is a RAG layer for documents.
It can import:
pdf
rtf / rtfd
txt
md / markdown
swift
js
json
csv
xml
html
py
rb
log
Documents are chunked, indexed in VectorStore, searched with vector search and keyword fallback, and synced through Mesh as metadata/text deltas.
For the agent, memory and knowledge base are different things. Memory is rules, preferences, decisions, and state. Knowledge base is documents that can be searched and cited during work.
Conversation Recall and Session Working Memory
Long conversations get two more layers.
The first is conversation_recall.
It is a tool for searching older messages in the current conversation. Under it there is a SQLite/FTS index, BM25 ranking, an include-tool-details option, and background indexing. It is useful when session working memory keeps only an anchor: the older transcript contains the exact source id, and the original text can be fetched when needed.
The second is session working memory.
It is built from saved transcript and contains:
current state
task specification
structured memory facts
files and functions
workflow
recent files and tool artifacts
evidence snippets
errors and corrections
documentation references
learnings
key results
worklog
span summaries
source anchors
compacted context summary
It has budgets:
max stored tokens
max section tokens
max snippet characters
max structured memory facts
max evidence snippets
If LLM extraction is unavailable, there is a deterministic fallback.
So even if a long transcript no longer fits in context, the runtime keeps working memory and source anchors instead of asking the user to remind it what happened.
Project Context Bridge
Project context can also become Memory V2 records, but only if the manifest is trusted.
MemoryProjectContextBridge converts project rules/instructions into scoped records, and MemoryV2Store.syncProjectContext can archive stale records when a rule disappears from the project.
This matters for code/workspace mode: project rules should not mix with the user's personal preferences.
Short Formula
Memory V2 is not just "long-term memory".
More precisely:
structured memory
+ semantic retrieval
+ session working memory
+ conversation recall
+ project scoped rules
+ policy
+ rollback
Good memory is not remembering everything.
Good memory is remembering the right thing, in the right context, with the right boundaries.
Subagents and Sessions
Some tasks are naturally parallel.
For example:
research competitors
check technical documentation
collect integrations
prepare a draft response
You can do this sequentially, but it is better to delegate.
There are two modes.
Blocking delegation:
delegate_task
-> parent agent sends task
-> child agent works
-> parent waits for result
-> parent continues
Background session:
sessions_spawn
-> child session starts
-> parent does not wait
-> result can be checked later
-> mailbox / status / history / kill
A subagent should not receive the parent's full permission set.
For example, it does not need dangerous tools:
memory writes
scheduling
self config
session management
tool/plugin management
It works in a more limited context.
Important detail: current native subagents are managed child sessions inside the app, backed by hidden conversations and async loops. They are not separate OS processes or containers. Isolation is mainly runtime/tool/context/workspace policy, not process isolation.
On macOS this is especially useful for coding/workspace tasks: a subagent can work in a separate workspace or git worktree, return a diff/artifacts, and the parent decides what to accept.
But for me the key is not "another coding mode". The important part is that subagents are part of the same runtime:
tasks
memory
tools
watchdog
policies
mesh
audit
Device Mesh: One Agent Across Devices
I started with iPhone as the first interface, but I did not want to lock the agent into one device.
Different devices have different strengths.
iPhone
-> close to the user
-> notifications
-> quick decisions
-> personal context
Mac
-> desktop tools
-> filesystem
-> browser automation
-> coding tasks
-> shell / git / docker
CLI / Linux runner
-> can live 24/7
-> good for monitoring
-> background execution
This led to Device Mesh.
The idea:
several devices become one agentic loop
In code this is not just a device list. There is pairing, cryptography, sync deltas, remote tools, handoff, and leader election.
Pairing and transport are built around encrypted Mesh: X25519/Curve25519 key agreement, HKDF-SHA256, AES-256-GCM, Keychain keys, WebSocket relay, short-lived WS token, reconnect/ping, and HTTP polling fallback. The server relay should not understand the payload: it forwards opaque nonce/payload/tag and can buffer offline messages.
Scenario:
CLI on a server watches an API at night.
If an error appears, it creates an incident task.
iPhone receives a push.
If desktop action is needed, the task goes to Mac.
The user answers from the phone.
Runtime continues where the next step is best executed.
iPhone is not necessarily the main executor.
It can be:
notification target
approval device
personal context device
decision interface
Mac can be a desktop executor.
CLI can be a 24/7 runner.
The runtime decides where a concrete step should run.
Leader Election, Handoff, and Conflicts
Mesh has two tasks that are easy to confuse.
The first: choose who does background work.
The second: synchronize changes between devices.
Leader election is based on device priority. Each device type has execution priority and notification priority. Execution leader is chosen among online peers by highest execution priority, with a stable tie-break by device id.
Roughly:
CLI / Linux runner -> high execution priority
Mac -> desktop execution
iPhone -> high notification priority
So proactive tools, scheduled workflows, and auto-connected channels should not start on every device at once. The execution leader starts them.
Notification routing is calculated separately: if several iPhones have the highest notification priority, a notification can be sent to all devices on that level.
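A sketch of the election itself, over a flat peer list with invented field names:

```swift
// Field names invented; priorities come from device type in practice.
struct Peer {
    let id: String
    let executionPriority: Int   // CLI runner high, iPhone low
    let online: Bool
}

func executionLeader(_ peers: [Peer]) -> Peer? {
    peers.filter(\.online).max { a, b in
        if a.executionPriority != b.executionPriority {
            return a.executionPriority < b.executionPriority
        }
        return a.id > b.id   // stable tie-break: lower device id wins
    }
}
```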
Handoff is a separate path.
If a device goes background/offline in the middle of work, runtime can send MeshTaskHandoff to the best online peer. Handoff contains:
handoffId
conversationId
originalMessage
completedIterations
sourceDeviceId
sourceDeviceName
timestamp
The receiver adds a system note to the conversation and continues the agent loop on its own device.
Sync conflicts are not solved by one global CRDT.
In MeshSyncEngine, each delta has an HLC timestamp. When remote deltas arrive, runtime merges HLC, applies changes by entity type, and updates lastSyncHLC. While remote deltas are being applied, isApplyingRemote is enabled so local hooks do not create an echo loop.
Then logic depends on data type:
conversations/messages -> idempotent create by id, update fields when delta arrives
scheduled tasks -> upsert by task id
knowledge base docs -> create/delete by document id
memory -> append remote sections not already present locally
evolution -> take newer/higher version history or newer reset
channel/workflow/tool configs -> apply config delta by id
So this is eventual sync with HLC ordering and entity-specific merge, not "magical merge of any two edits".
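For readers unfamiliar with hybrid logical clocks, here is a minimal sketch of the receive-side merge (not the exact Mesh wire format):

```swift
// Minimal HLC sketch, not the exact Mesh format.
struct HLC {
    var wallMillis: Int64
    var counter: Int

    func isBefore(_ other: HLC) -> Bool {
        (wallMillis, counter) < (other.wallMillis, other.counter)
    }

    // Receiving a remote timestamp advances local time monotonically,
    // so delta ordering survives clock skew between devices.
    mutating func receive(_ remote: HLC, nowMillis: Int64) {
        let maxWall = max(wallMillis, remote.wallMillis, nowMillis)
        if maxWall == wallMillis && maxWall == remote.wallMillis {
            counter = max(counter, remote.counter) + 1
        } else if maxWall == wallMillis {
            counter += 1
        } else if maxWall == remote.wallMillis {
            counter = remote.counter + 1
        } else {
            counter = 0
        }
        wallMillis = maxWall
    }
}
```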
If the user edits a task on iPhone while Mac has already started working on it, the change arrives as a new message/update in conversation sync. The active run on Mac does not have to instantly rewrite an already started tool call, but at the next runtime boundary it can see the new visible task through conversation state, queued follow-up, and active task envelope.
Task board items also have claim/session/expiration so multiple workers do not keep taking the same item forever. But I do not consider this a replacement for a full distributed transaction layer.
The principle:
background execution -> leader election
mid-run continuity -> handoff
state convergence -> HLC + entity-specific merge
dangerous actions -> policy / approval / fail-closed
This is more honest than promising perfect conflict resolution. A real agent system is safer with clear boundaries than hidden magic.
Business Policies: Governance Must Live in the Runtime
The more an agent can do, the more governance matters.
For a personal agent, simple approvals may be enough.
For a company, that is not enough.
Companies need:
roles
departments
managed prompts
tool policies
approval policies
audit
locked UI
provider restrictions
revocation
signed manifests
The main architecture idea is the same as with tools:
policy must be applied not only in the prompt,
but in the executor
If an employee is not allowed to send emails without approval, it is not enough to tell the model:
do not send emails without approval
ToolExecutor must physically block the send action until approval exists.
In personal mode, approval is simple: ApprovalManager creates a pending request, waits up to five minutes, and then marks it expired. It keeps history of the last 100 requests and supports auto-approve rules:
always
same arguments
until date
count N times
You can approve/deny one request, approve/deny all pending requests for an agent, clear pending requests, and inspect approval stats.
In Business mode, manifest policy sits on top.
Managed mode uses a manifest approach:
Organization
-> Workspace
-> Department
-> Member
-> ManagedEnvironmentManifest
-> Runtime policy gate
The manifest contains:
allowed tools
approval-required tools
model policy
memory policy
UI locks
audit disclosure
managed prompt rules
toolPolicy contains:
allowedTools
deniedTools
defaultDecision
approvalRequiredTools
reasonCodes
If a tool is in approvalRequiredTools, the runtime returns requiresApproval. If a tool is not allowed and the default decision is deny, the executor must not run it at all.
Business layer has department packages for Sales, Support, HR, Finance, Operations, Marketing, Executive, Client Workspace, and vertical packages such as beauty/nails studio, fitness/gym, hotel/small property, cafe/restaurant, retail/marketplace, accounting office, and blank managed department.
But runtime entitlement does not come directly from a nice template. It is materialized into a published department profile and signed manifest.
The server model behind this is not decorative. /v1/business has organizations, members, invites, subscriptions, workspaces, departments, department profile versions, manifests, providers, integration setups, proactive review queue, audit events, guardian rules/reports, app clients, and automation items.
Department profile is versioned: draft, publish, rollback. Publish/rollback invalidates old manifests. Manifest endpoint checks org/member/seat/subscription, workspace/department membership, minimum app version, signs the manifest with Ed25519, includes policyHash, TTL, toolPolicy.defaultDecision = deny, approval-required tools, and integration setup refs without secrets.
Fail-closed matters:
if manifest is stale,
signature is invalid,
user is revoked,
or policy failed to load,
mutating actions are blocked
This is especially important for agents.
The stronger the agent, the less you can rely on prompt-level goodwill.
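The fail-closed rule compresses into a short guard chain. A sketch with invented field names; the real check also verifies the Ed25519 signature and policyHash before trusting anything in the manifest:

```swift
import Foundation

// Invented field names; signature verification happens elsewhere.
struct ManifestState {
    var signatureValid: Bool
    var expiresAt: Date     // TTL issued by the server
    var revoked: Bool
}

func mayExecuteMutatingAction(_ manifest: ManifestState?, now: Date = Date()) -> Bool {
    guard let m = manifest else { return false }  // failed to load: closed
    guard m.signatureValid else { return false }  // invalid signature: closed
    guard now < m.expiresAt else { return false } // stale manifest: closed
    guard !m.revoked else { return false }        // revoked member: closed
    return true
}
```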
Business provider surface is also guarded: Spectrion Managed provider is enabled, while company-managed provider is blocked without a pilot flag. So I cannot honestly claim that any company can already plug in an arbitrary LLM provider and immediately use it in production. The code protects this with egress guard, redacted audit, and manifest invalidation when provider policy changes.
Business Store Factory
Another part of the business layer is the Store with ready automation capabilities.
The "1495+ tools" number in Spectrion is not made up. In the current seed it is:
7 curated base tools
62 verticals * 24 tool blueprints = 1488 generated tools
total: 1495 approved Store tools
There are also official skills and MCP profiles for business automation.
Base tools cover typical business flows:
omnichannel intake router
booking grid connector
1C accounting bridge
inventory reorder planner
marketplace order triage
file storage ingest
proactive follow-up watch
The mass catalog is generated by verticals and blueprints. Verticals include beauty/nails, fitness/gym, hotel, cafe/restaurant, retail/marketplace, accounting, legal, healthcare, real estate, education, logistics, auto service, and others. Connector families include Telegram Bot API, WhatsApp Cloud API, Instagram Messaging API, VK, SMTP/OAuth, TravelLine, Bnovo, OPERA Cloud, YCLIENTS, Google Sheet/CSV, 1C Fresh/OData/file exchange, QuickBooks, iiko, r_keeper, Bitrix24, amoCRM, Ozon, Wildberries, Shopify, Google Drive, SharePoint, and SFTP.
Important: this does not mean every tool immediately performs dangerous write actions in an external system without setup.
Business Store is review-first and dry-run-first. Many templates return:
setup_required
dry_run
fetch_sample
provider_review_required
external_write_pending_approval
So Store gives a company a fast start for integrations and workflows, but runtime still has to consider secrets, setup, approvals, audit, and fail-closed behavior.
This is essential for business scenarios: a large capability catalog is useful only if it does not bypass governance.
Approval Pipeline and Review Queue
Two things should be separated honestly.
The code already has:
approval-required tools in manifest
client pending approvals with timeout
auto-approve rules
proactive review queue
guardian rules
business notification channels
audit events
Guardian rules can have response mode:
monitor_only
approval_required
deny
And notification mode:
none
digest
immediate
The proactive review queue is used when an automation/proactive run has prepared an external action but should not execute it immediately. An admin can approve/decline/archive the item, and the external action is not executed automatically as a side effect of review. In the dry-run/review path, it is explicitly recorded that the external action and mutation were not executed.
So business runtime can:
deny an action
require approval
place a proactive item into review queue
write an audit event
send digest/immediate notification
block mutating action on stale/invalid manifest
But it is not a universal BPMN engine for approval chains.
For example:
if approver A does not answer in N minutes,
escalate to approver B,
then C,
then auto-decline
should not be described as a generic ready-made mechanism. The current runtime has a safer base behavior: request expires, mutating action is not executed, event remains in history/audit/review surface.
For me this is an intentional boundary. An agent runtime needs fail-closed and audit first; complex approval graphs can come after.
Evolution: The Agent Can Improve, but Not Grant Itself Permissions
Self-improvement sounds attractive, but it is a dangerous area.
You cannot simply let the model rewrite its prompt, change providers, expand permissions, or enable destructive workflows.
So Evolution is built through signals, proposals, policy gate, and rollback.
Important detail: this is a native runtime mechanism, not a business server that rewrites the app by itself. In code it is typed signals, signal store, proposer/critic, deterministic policy gate, rule store snapshots, and rollback.
The flow:
runtime events
tool failures
watchdog misses
user feedback
subagent results
workflow issues
-> EvolutionSignalProducer
-> EvolutionSignalStore
-> proposals
-> critic / policy gate
-> snapshot
-> limited rule mutation
-> rollback if needed
What can be improved:
tool descriptions
planning rules
validation rules
workflow hints
project rules
runtime guidance
What cannot be done automatically:
grant new permissions
change API keys
bypass approvals
enable forbidden tools
expand destructive actions
The principle is the same:
the agent can learn,
but it must not grant itself new rights
One Full Trace
Suppose the user writes:
Watch the competitor changelog.
If a new release or pricing change appears,
write a short note, create a task, and notify me.
In a normal chat, the answer would be:
Sure, here is how you can configure it...
In an agent runtime this becomes a trace:
1. User message enters runtime
2. Runtime classifies it as long-running monitoring task
3. Context builder injects relevant memory/project rules
4. ToolCatalog activates web/proactive/workflow/task tools
5. Agent checks whether a suitable monitor already exists
6. If not, agent creates scripted monitor
7. Tool is validated, sandboxed, tested, and registered
8. Workflow is built:
trigger -> fetch -> diff -> condition -> llm summary -> task -> notify
9. Approval is requested if needed
10. Workflow is scheduled
11. Baseline is stored
12. Heartbeat wakes workflow later
13. Monitor detects change
14. Alert enters ProactiveExecutionQueue
15. Runtime handles alert as event input
16. Agent summarizes change
17. Task board gets a new task
18. iPhone receives notification
19. Watchdog checks whether required steps are done
That is the difference.
The agent did not just answer. It created a mechanism that continues working after the message ends.
What Turned Out to Be Hardest
The hardest part was not calling an LLM and not adding tools.
The hard part was keeping state and recovering from an imperfect world.
1. The Model Can Be Confidently Wrong
It can say:
Done.
Even though a tool failed.
Or:
I checked all sources.
Even though evidence exists only for some of them.
That is why watchdog, steward, and task board exist.
2. Background Execution Is Not Magic
Especially on mobile.
You cannot just say:
let the agent always run in the background
You have to deal with platform limits, schedules, notifications, device mesh, and moving execution to a more suitable device.
3. A Thousand Tools Are Worse Than Twenty Right Ones
A large capability catalog is useful only when the model sees a relevant subset.
Otherwise quality drops.
That is why activation, suppression, categories, and policy matter more than the raw tool count.
4. Memory Without Boundaries Turns Into Noise
If the agent "remembers everything", it starts dragging stale decisions into new contexts.
You need scope, TTL, confidence, provenance, and rollback.
5. Governance Cannot Be Bolted on Later
If an agent can act, policies must live in the executor from the beginning.
Prompt-level restrictions are not a reliable boundary.
6. Self-Extension Requires Discipline
An agent that can create tools must pass through sandbox, tests, versioning, secrets, and rollback.
Otherwise it is not extension. It is uncontrolled code generation.
Why This Is Not Just n8n, Claude Code, or a Channel Gateway
I find it useful to separate agent systems by center of gravity.
There are coding agents. Their strongest loop is repository, files, shell, git, tests, patch, code intelligence.
There are automation platforms. Their strongest loop is workflow graph, triggers, integrations, deterministic automation.
There are channel gateways. Their strongest loop is Telegram, Slack, WhatsApp, email, webchat, and routing messages between channels.
Spectrion is built from a different center:
native agent runtime
+ mobile / desktop / CLI
+ tools
+ workflows
+ proactive monitors
+ memory
+ task state
+ watchdog / steward
+ device mesh
+ policies / approvals / audit
Coding can be part of the system.
Workflow can be part of the system.
Channels can be part of the system.
But the center of gravity is the agent operating loop.
Not "where does the user send a message", but "how does the task live, execute, get checked, and continue".
In that sense, Spectrion is not trying to be only an n8n replacement, only a coding harness, or only a message gateway.
n8n/Make/Zapier are good as external automation graphs. In Spectrion, workflow lives inside the agent runtime: it can call the same tools, pass through the same policies, request approvals, return proactive alerts into the reasoning loop, and use a tool the agent just created.
Claude Code and similar coding agents are strong in the software engineering loop. Spectrion can also do code/workspace tasks on Mac, but it is not limited to IDE or terminal: the same runtime should live on iPhone, Mac, CLI, schedules, workflows, memory, device mesh, and Business policies.
A channel gateway is useful as a message entry point. But if the message does not enter a shared execution layer with task state, watchdog, tools, memory, policies, and background loop, it remains message routing rather than a full environment for agentic work.
Current Architecture in Short
If you put everything together, it looks like this:
Inputs:
chat message
scheduled task
proactive alert
workflow event
subagent result
channel message
voice command
App Intent / widget / share extension
heartbeat
|
v
AgentRuntime:
context preparation
active task envelope
session working memory
memory retrieval
project context injection
skill activation
tool activation
model selection
provider request building
execution lane acquire
LLM loop
tool parsing
tool execution
approval wait
result handling
artifact handling
continuation logic
compaction / rehydration
|
v
Execution layer:
ContextManager
MessageCompactor
MicroCompactor
ActiveTaskEnvelopeStore
ProviderManager
ProviderRequestBuilder
ProviderVisibleContextDiagnostics
ToolCatalog
ToolExecutor
AgentArtifactContract
A2UIRenderer
WorkflowEngine
ProactiveExecutionQueue
HeartbeatManager
ChannelManager
ApprovalManager
ProjectContextLoader
ProjectFileTreeService
ProjectSearchService
ProjectWorkspaceAuditLog
WorkspaceMutationJournal
TodoManager
TaskStore
AgentSteward
ChatWatchdog
Memory V2
SemanticMemory / VectorStore
KnowledgeBase
ConversationRecall FTS
SkillCatalog
HookManager
ExecutionLaneManager
Device Mesh
Business Policy Gate
|
v
Outputs:
assistant message
tool artifact
rendered native UI
notification
task update
memory proposal
workflow schedule
approval request
workspace mutation snapshot
subagent run
channel response
voice/TTS response
Live Activity surface update
audit event
For me, this is the essence of Spectrion.
Not a set of separate automation features.
One runtime through which different forms of agentic work pass.
Where This Architecture Is Useful
It works best not for questions like:
answer X
But for tasks with a loop:
observe -> think -> act -> verify -> continue
Examples:
Every morning at 8:30, look at my calendar,
reminders, and open tasks.
Make a short briefing:
what matters today,
where time conflicts exist,
what I promised but did not schedule.
If something requires a decision, create a task and notify me.
Or:
Help me prepare a feature launch.
Check copy, gather bugs,
prepare the changelog, write the post,
make the release checklist,
and do not stop until every item is closed.
Or:
Watch this page.
If a new version appears,
briefly explain the changes
and create a task to update the project.
Or:
Let CLI monitor the server,
Mac do heavy operations,
and iPhone only notify me and receive decisions.
These are tasks where a normal chat loop quickly becomes too weak an abstraction.
Conclusion
I did not start with "build a chat and then attach tools".
From day one I wanted an agent that can live beyond one message.
Externally it looked like an AI agent for iPhone.
Internally the architectural bet was different:
agent runtime first,
chat interface second
The model can change. Tools can expand through Shop or be created by the agent. Workflows can be built from conversation. Tasks can continue in the background. Memory needs boundaries. Watchdog should verify that the agent did not stop too early. Policies must apply not only in the prompt, but in the executor. iPhone, Mac, and CLI should be parts of one agentic loop.
For me the main conclusion is:
an AI agent is not an LLM with functions
An AI agent is a runtime that accepts events, carries state, executes tools, verifies results, and continues work after a single message has ended.
That is what I am building in Spectrion.
Where to look:
- Website: https://spectrion.app
- App Store: https://apps.apple.com/app/spectrion-agent-ai/id6759151825
