One of the biggest mistakes teams make with AI agents is applying traditional observability patterns to non-deterministic systems.
A classic backend request is relatively stable:
Request → Service → Database → Response
An agent execution is not.
```
Request
  ↓
Planning
  ↓
Memory Retrieval
  ↓
Tool Calls
  ↓
Validation
  ↓
Retries
  ↓
Response
```
Two identical prompts may generate completely different execution paths.
That changes observability entirely.
The real challenge is no longer monitoring infrastructure.
It’s understanding reasoning.
This is where AWS AgentCore becomes interesting.
Not as another framework.
But as a runtime layer for operating probabilistic systems.
What You Must Observe in an Agentic Platform
Most teams only track:
- latency
- token usage
- request count
That is not enough.
You need reasoning-level telemetry.
At minimum, your platform should expose:
| Signal | Why it matters |
|---|---|
| reasoning depth | detects recursive loops |
| tool execution graph | explains orchestration complexity |
| retries | detects unstable planning |
| memory context size | identifies memory drift |
| tokens per successful execution | measures reasoning efficiency |
| planning duration | exposes unstable agent behavior |
Without those signals, debugging becomes almost impossible.
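One way to make these signals visible on dashboards, not only inside traces, is to emit them as OpenTelemetry metrics. A minimal sketch; the metric names and the points where you record them are assumptions, not part of any standard:

```python
from opentelemetry import metrics

meter = metrics.get_meter("agent.observability")

# Illustrative instrument names; use whatever fits your naming scheme.
reasoning_depth = meter.create_histogram(
    "agent.reasoning.depth",
    description="Planning and replanning depth per execution",
)
tool_retries = meter.create_counter(
    "agent.tool.retries",
    description="Tool retries across executions",
)
tokens_per_success = meter.create_histogram(
    "agent.tokens.per_successful_execution",
    description="Total tokens consumed by executions that succeeded",
)

# Recorded from your agent loop (hypothetical hook points):
reasoning_depth.record(4, {"agent": "sre-assistant"})
tool_retries.add(1, {"tool": "cloudwatch"})
tokens_per_success.record(5300, {"agent": "sre-assistant"})
```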
Cognitive Tracing with OpenTelemetry
The most effective pattern today is treating reasoning steps like distributed tracing spans.
Every cognitive boundary becomes observable:
- planning
- retrieval
- tool execution
- validation
- retries
With Strands and OpenTelemetry, this is already straightforward.
```python
from strands import Agent
from strands.telemetry import StrandsTelemetry

# Export reasoning spans to an OpenTelemetry collector
StrandsTelemetry().setup_otlp_exporter(
    endpoint="http://otel-collector:4318/v1/traces"
)

agent = Agent(
    model="anthropic.claude-sonnet-4-20250514-v1:0",
    system_prompt="You are an SRE assistant"
)

response = agent(
    "Investigate elevated API latency"
)
```
Once exported to Datadog, Grafana, or CloudWatch, traces stop being about infrastructure execution alone.
They start showing reasoning behavior.
That is the key shift.
What Actually Breaks in Production
One thing surprised me the first time we pushed an agentic workflow under real traffic.
Nothing was technically failing.
CPU was fine.
Memory was fine.
Latency looked acceptable.
But the platform still felt unstable.
The issue was hidden inside the reasoning traces.
One specific agent kept re-planning after failed CloudWatch queries.
The execution looked roughly like this:
```
trace_id=7f21...
planner
├── retrieve_memory
├── tool:cloudwatch
├── retry:cloudwatch
├── replan
├── tool:cloudwatch
├── retry:cloudwatch
└── escalate
```
At first, we completely missed it because our traces were sampled by latency.
And ironically, those executions were not even the slowest ones.
The real issue was cognitive instability.
The agent was stuck in a planning/retry loop that slowly amplified token usage and tool fanout.
What made it worse was that our OTEL collector started struggling with span cardinality, because tool names and dynamic arguments were generating too many unique combinations.
At some point, traces became harder to query than the incident itself.
That incident completely changed how I think about observability for agents.
Infrastructure telemetry alone was useless.
We needed reasoning telemetry.
Failures Rarely Look Like Failures
The hardest part of operating agentic systems is that failures rarely look like traditional failures.
Most executions technically succeed.
The problem is that reasoning becomes unstable.
For example, a problematic trace may look like this:
```
trace_id=7f21...
planner
├── retrieve_memory
├── tool:cloudwatch
├── retry:cloudwatch
├── replan
├── tool:cloudwatch
├── retry:cloudwatch
└── escalate
```
From an infrastructure perspective, nothing failed.
But cognitively, the agent is clearly unstable:
- retries increase
- planning loops appear
- tool orchestration degrades
- latency compounds with each replan
This is exactly why traces must expose reasoning semantics.
Otherwise production incidents remain invisible.
GenAI Semantic Conventions Matter
One important evolution is the emergence of GenAI semantic conventions for OpenTelemetry.
Traditional spans expose infrastructure metadata.
Agentic spans must expose reasoning metadata.
For example:
```
gen_ai.request.model
gen_ai.operation.name
gen_ai.tool.name
gen_ai.usage.input_tokens
gen_ai.usage.output_tokens
gen_ai.reasoning.depth
```
Without standardized semantics, traces become impossible to query consistently across platforms.
This is especially important once multiple agents and tools interact together.
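As a rough sketch of what this looks like in instrumentation code (the span name and values are placeholders, and the last attribute is custom rather than part of the standard conventions):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("invoke_model") as span:
    # Standard GenAI semantic convention attributes
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "anthropic.claude-sonnet-4-20250514-v1:0")
    span.set_attribute("gen_ai.usage.input_tokens", 1250)
    span.set_attribute("gen_ai.usage.output_tokens", 310)
    # Custom attribute from the list above, not part of the standard
    span.set_attribute("gen_ai.reasoning.depth", 3)
```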
Tool Graphs Matter More Than Request Latency
Most production issues come from tool orchestration.
A single request may generate:
```
Prompt
├── Query CloudWatch
├── Query ECS Logs
├── Retry Metrics Query
├── Validate Correlation
└── Generate RCA
```
Without graph-level tracing, teams cannot explain:
- retry explosions
- unstable planning
- latency spikes
- cost drift
A simple but effective pattern is tracing every tool independently.
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_tool(tool_name, tool_fn):
    def wrapper(*args, **kwargs):
        # Every tool call becomes its own span in the reasoning trace
        with tracer.start_as_current_span(f"tool:{tool_name}"):
            return tool_fn(*args, **kwargs)
    return wrapper
```
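Wrapping a tool is then a one-liner. The query_cloudwatch function below is hypothetical; any callable works:

```python
# Hypothetical tool; every call now produces a "tool:cloudwatch" span.
def query_cloudwatch(query: str):
    ...

query_cloudwatch = traced_tool("cloudwatch", query_cloudwatch)
```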
This is what makes non-deterministic systems debuggable.
A Practical Observability Architecture
A minimal production-ready architecture increasingly looks like this:
```
Strands Agents
  ↓
AWS AgentCore Runtime
  ↓
OpenTelemetry Collector
  ↓
Datadog / Grafana / CloudWatch
```
The key idea is simple:
- traces capture reasoning paths
- metrics expose runtime stability
- logs preserve execution context
- semantic spans reconstruct cognitive behavior
This is the operational foundation.
Semantic Sampling
This was another painful lesson.
Traditional sampling strategies break badly with agents.
Initially, we sampled traces exactly like a normal backend:
- high latency
- errors
- throughput spikes
That turned out to be a mistake.
Some of the most problematic executions were not slow.
They were cognitively unstable.
For example:
- recursive replanning
- abnormal reasoning depth
- excessive tool fanout
- retry storms
None of those necessarily produce infrastructure alerts.
Which means the most valuable traces can easily disappear during sampling.
We eventually switched toward semantic sampling instead.
```yaml
keep_trace_if:
  - reasoning_depth > 5
  - retries > 3
  - tool_fanout > 10
```
That single change made debugging dramatically easier.
Especially because some problematic executions were generating massive traces that we had originally filtered out to reduce storage costs.
Ironically, those were exactly the traces we needed during incidents, particularly for identifying agents that looked healthy from an infrastructure perspective but behaved erratically internally.
This is probably one of the biggest operational differences between classic distributed systems and agentic runtimes.
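The rule above is pseudo-config. Wherever the decision actually runs (a tail-sampling processor, a custom exporter, a post-processing job), the logic itself is simple. A sketch, assuming finished spans expose a name and an attributes dict, and reusing the custom attribute names from this article:

```python
# Sketch of a tail-based sampling decision over the spans of one finished trace.
# Thresholds mirror the pseudo-config above; attribute and span names are assumptions.
def keep_trace(spans) -> bool:
    reasoning_depth = max(
        (s.attributes.get("gen_ai.reasoning.depth", 0) for s in spans),
        default=0,
    )
    retries = sum(1 for s in spans if s.name.startswith("retry:"))
    tool_fanout = sum(1 for s in spans if s.name.startswith("tool:"))
    return reasoning_depth > 5 or retries > 3 or tool_fanout > 10
```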
What We Ended Up Instrumenting
After a few incidents, we eventually standardized around a small set of signals.
Not because they were theoretically interesting.
Because they consistently explained unstable executions.
| Signal | Why it matters |
|---|---|
| reasoning_depth | detects recursive replanning |
| tool_fanout | identifies orchestration explosion |
| retry_count | exposes unstable reasoning |
| memory_context_size | detects context drift |
| planning_duration | identifies degraded planners |
One useful pattern was attaching these attributes directly to spans.
A simplified trace payload looked roughly like this:
```json
{
  "trace_id": "7f21",
  "span": "tool:cloudwatch",
  "attributes": {
    "gen_ai.tool.name": "cloudwatch",
    "gen_ai.reasoning.depth": 4,
    "gen_ai.retry_count": 2,
    "gen_ai.memory.context_size": 18234
  }
}
```
That made problematic executions immediately easier to identify.
Especially during recursive planning incidents.
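Producing a payload like that does not require anything exotic; it is just span attributes. A minimal sketch, with illustrative values that your planner and retry logic would supply in practice:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Illustrative values; in practice they come from the planner and retry logic.
with tracer.start_as_current_span("tool:cloudwatch") as span:
    span.set_attribute("gen_ai.tool.name", "cloudwatch")
    span.set_attribute("gen_ai.reasoning.depth", 4)
    span.set_attribute("gen_ai.retry_count", 2)
    span.set_attribute("gen_ai.memory.context_size", 18234)
```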
What Teams Should Actually Build
Two requests with identical latency may have radically different reasoning complexity.
Which means sampling only on throughput or latency is dangerous.
As covered above, the most valuable traces are often:
- recursive reasoning loops
- excessive retries
- abnormal tool fanout
- unstable planning chains
This is why agentic platforms increasingly require semantic sampling, and it is one of the biggest differences between traditional distributed tracing and agentic observability.
If you are building an agentic platform today, focus on four things:
- Instrument reasoning steps with OpenTelemetry
- Trace every tool execution independently
- Monitor reasoning depth and retries
- Correlate reasoning traces with infrastructure telemetry
That is the real operational foundation.
Not prompt engineering.
Conclusion
AI agents are forcing observability to evolve.
Traditional systems required distributed tracing for infrastructure.
Agentic systems now require distributed tracing for reasoning.
And that is probably the most important shift introduced by platforms like AWS AgentCore.