One of the biggest mistakes teams make with AI agents is applying traditional observability patterns to non-deterministic systems.
A classic backend request is relatively stable:
Request → Service → Database → Response
An agent execution is not.
```
Request
  ↓
Planning
  ↓
Memory Retrieval
  ↓
Tool Calls
  ↓
Validation
  ↓
Retries
  ↓
Response
```
Two identical prompts may generate completely different execution paths.
That changes observability entirely.
The real challenge is no longer monitoring infrastructure.
It’s understanding reasoning.
This is where AWS AgentCore becomes interesting.
Not as another framework.
But as a runtime layer for operating probabilistic systems.
What You Must Observe in an Agentic Platform
Most teams only track:
- latency
- token usage
- request count
That is not enough.
You need reasoning-level telemetry.
At minimum, your platform should expose:
| Signal | Why it matters |
|---|---|
| reasoning depth | detects recursive loops |
| tool execution graph | explains orchestration complexity |
| retries | detects unstable planning |
| memory context size | identifies memory drift |
| tokens per successful execution | measures reasoning efficiency |
| planning duration | exposes unstable agent behavior |
Without those signals, debugging becomes almost impossible.
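One way to make these signals visible on dashboards, not only inside traces, is to emit them as OpenTelemetry metrics. A minimal sketch; the metric names and the points where you record them are assumptions, not part of any standard:

```python
from opentelemetry import metrics

meter = metrics.get_meter("agent.observability")

# Illustrative instrument names; use whatever fits your naming scheme.
reasoning_depth = meter.create_histogram(
    "agent.reasoning.depth",
    description="Planning and replanning depth per execution",
)
tool_retries = meter.create_counter(
    "agent.tool.retries",
    description="Tool retries across executions",
)
tokens_per_success = meter.create_histogram(
    "agent.tokens.per_successful_execution",
    description="Total tokens consumed by executions that succeeded",
)

# Recorded from your agent loop (hypothetical hook points):
reasoning_depth.record(4, {"agent": "sre-assistant"})
tool_retries.add(1, {"tool": "cloudwatch"})
tokens_per_success.record(5300, {"agent": "sre-assistant"})
```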
Cognitive Tracing with OpenTelemetry
The most effective pattern today is treating reasoning steps like distributed tracing spans.
Every cognitive boundary becomes observable:
- planning
- retrieval
- tool execution
- validation
- retries
With Strands and OpenTelemetry, this is already straightforward.
```python
from strands import Agent
from strands.telemetry import StrandsTelemetry

# Export reasoning spans to an OpenTelemetry collector
StrandsTelemetry().setup_otlp_exporter(
    endpoint="http://otel-collector:4318/v1/traces"
)

agent = Agent(
    model="anthropic.claude-sonnet-4-20250514-v1:0",
    system_prompt="You are an SRE assistant"
)

response = agent(
    "Investigate elevated API latency"
)
```
Once exported to Datadog, Grafana, or CloudWatch, traces stop being about infrastructure execution alone.
They start showing reasoning behavior.
That is the key shift.
What Actually Breaks in Production
One thing surprised me the first time we pushed an agentic workflow under real traffic.
Nothing was technically failing.
CPU was fine.
Memory was fine.
Latency looked acceptable.
But the platform still felt unstable.
The issue was hidden inside the reasoning traces.
One specific agent kept re-planning after failed CloudWatch queries.
The execution looked roughly like this:
```
trace_id=7f21...
planner
├── retrieve_memory
├── tool:cloudwatch
├── retry:cloudwatch
├── replan
├── tool:cloudwatch
├── retry:cloudwatch
└── escalate
```
At first, we completely missed it because our traces were sampled by latency.
And ironically, those executions were not even the slowest ones.
The real issue was cognitive instability.
The agent was stuck in a planning/retry loop that slowly amplified token usage and tool fanout.
What made it worse was that our OTEL collector started struggling with span cardinality, because tool names and dynamic arguments were generating too many unique combinations.
At some point, traces became harder to query than the incident itself.
That incident completely changed how I think about observability for agents.
Infrastructure telemetry alone was useless.
We needed reasoning telemetry.
Failures Rarely Look Like Failures
The hardest part of operating agentic systems is that failures rarely look like traditional failures.
Most executions technically succeed.
The problem is that reasoning becomes unstable.
For example, a problematic trace may look like this:
```
trace_id=7f21...
planner
├── retrieve_memory
├── tool:cloudwatch
├── retry:cloudwatch
├── replan
├── tool:cloudwatch
├── retry:cloudwatch
└── escalate
```
From an infrastructure perspective, nothing failed.
But cognitively, the agent is clearly unstable:
- retries increase
- planning loops appear
- tool orchestration degrades
- latency compounds with each replan
This is exactly why traces must expose reasoning semantics.
Otherwise production incidents remain invisible.
GenAI Semantic Conventions Matter
One important evolution is the emergence of GenAI semantic conventions for OpenTelemetry.
Traditional spans expose infrastructure metadata.
Agentic spans must expose reasoning metadata.
For example:
```
gen_ai.request.model
gen_ai.operation.name
gen_ai.tool.name
gen_ai.usage.input_tokens
gen_ai.usage.output_tokens
gen_ai.reasoning.depth
```
Without standardized semantics, traces become impossible to query consistently across platforms.
This is especially important once multiple agents and tools interact together.
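As a rough sketch of what this looks like in instrumentation code (the span name and values are placeholders, and the last attribute is custom rather than part of the standard conventions):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("invoke_model") as span:
    # Standard GenAI semantic convention attributes
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "anthropic.claude-sonnet-4-20250514-v1:0")
    span.set_attribute("gen_ai.usage.input_tokens", 1250)
    span.set_attribute("gen_ai.usage.output_tokens", 310)
    # Custom attribute from the list above, not part of the standard
    span.set_attribute("gen_ai.reasoning.depth", 3)
```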
Tool Graphs Matter More Than Request Latency
Most production issues come from tool orchestration.
A single request may generate:
```
Prompt
├── Query CloudWatch
├── Query ECS Logs
├── Retry Metrics Query
├── Validate Correlation
└── Generate RCA
```
Without graph-level tracing, teams cannot explain:
- retry explosions
- unstable planning
- latency spikes
- cost drift
A simple but effective pattern is tracing every tool independently.
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_tool(tool_name, tool_fn):
    def wrapper(*args, **kwargs):
        # Every tool call becomes its own span in the reasoning trace
        with tracer.start_as_current_span(f"tool:{tool_name}"):
            return tool_fn(*args, **kwargs)
    return wrapper
```
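Wrapping a tool is then a one-liner. The query_cloudwatch function below is hypothetical; any callable works:

```python
# Hypothetical tool; every call now produces a "tool:cloudwatch" span.
def query_cloudwatch(query: str):
    ...

query_cloudwatch = traced_tool("cloudwatch", query_cloudwatch)
```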
This is what makes non-deterministic systems debuggable.
A Practical Observability Architecture
A minimal production-ready architecture increasingly looks like this:
```
Strands Agents
  ↓
AWS AgentCore Runtime
  ↓
OpenTelemetry Collector
  ↓
Datadog / Grafana / CloudWatch
```
The key idea is simple:
- traces capture reasoning paths
- metrics expose runtime stability
- logs preserve execution context
- semantic spans reconstruct cognitive behavior
This is the operational foundation.
Semantic Sampling
This was another painful lesson.
Traditional sampling strategies break badly with agents.
Initially, we sampled traces exactly like a normal backend:
- high latency
- errors
- throughput spikes
That turned out to be a mistake.
Some of the most problematic executions were not slow.
They were cognitively unstable.
For example:
- recursive replanning
- abnormal reasoning depth
- excessive tool fanout
- retry storms
None of those necessarily produce infrastructure alerts.
Which means the most valuable traces can easily disappear during sampling.
We eventually switched toward semantic sampling instead.
```yaml
keep_trace_if:
  - reasoning_depth > 5
  - retries > 3
  - tool_fanout > 10
```
That single change made debugging dramatically easier.
Especially because some problematic executions were generating massive traces that we had originally filtered out to reduce storage costs.
Ironically, those were exactly the traces we needed during incidents, particularly for identifying agents that looked healthy from an infrastructure perspective but behaved erratically internally.
This is probably one of the biggest operational differences between classic distributed systems and agentic runtimes.
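The rule above is pseudo-config. Wherever the decision actually runs (a tail-sampling processor, a custom exporter, a post-processing job), the logic itself is simple. A sketch, assuming finished spans expose a name and an attributes dict, and reusing the custom attribute names from this article:

```python
# Sketch of a tail-based sampling decision over the spans of one finished trace.
# Thresholds mirror the pseudo-config above; attribute and span names are assumptions.
def keep_trace(spans) -> bool:
    reasoning_depth = max(
        (s.attributes.get("gen_ai.reasoning.depth", 0) for s in spans),
        default=0,
    )
    retries = sum(1 for s in spans if s.name.startswith("retry:"))
    tool_fanout = sum(1 for s in spans if s.name.startswith("tool:"))
    return reasoning_depth > 5 or retries > 3 or tool_fanout > 10
```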
What We Ended Up Instrumenting
After a few incidents, we eventually standardized around a small set of signals.
Not because they were theoretically interesting.
Because they consistently explained unstable executions.
| Signal | Why it matters |
|---|---|
| reasoning_depth | detects recursive replanning |
| tool_fanout | identifies orchestration explosion |
| retry_count | exposes unstable reasoning |
| memory_context_size | detects context drift |
| planning_duration | identifies degraded planners |
One useful pattern was attaching these attributes directly to spans.
A simplified trace payload looked roughly like this:
```json
{
  "trace_id": "7f21",
  "span": "tool:cloudwatch",
  "attributes": {
    "gen_ai.tool.name": "cloudwatch",
    "gen_ai.reasoning.depth": 4,
    "gen_ai.retry_count": 2,
    "gen_ai.memory.context_size": 18234
  }
}
```
That made problematic executions immediately easier to identify.
Especially during recursive planning incidents.
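Producing a payload like that does not require anything exotic; it is just span attributes. A minimal sketch, with illustrative values that your planner and retry logic would supply in practice:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Illustrative values; in practice they come from the planner and retry logic.
with tracer.start_as_current_span("tool:cloudwatch") as span:
    span.set_attribute("gen_ai.tool.name", "cloudwatch")
    span.set_attribute("gen_ai.reasoning.depth", 4)
    span.set_attribute("gen_ai.retry_count", 2)
    span.set_attribute("gen_ai.memory.context_size", 18234)
```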
What Teams Should Actually Build
Two requests with identical latency may have radically different reasoning complexity.
Which means sampling only on throughput or latency is dangerous.
As covered above, the most valuable traces are often:
- recursive reasoning loops
- excessive retries
- abnormal tool fanout
- unstable planning chains
This is why agentic platforms increasingly require semantic sampling, and it is one of the biggest differences between traditional distributed tracing and agentic observability.
If you are building an agentic platform today, focus on four things:
- Instrument reasoning steps with OpenTelemetry
- Trace every tool execution independently
- Monitor reasoning depth and retries
- Correlate reasoning traces with infrastructure telemetry
That is the real operational foundation.
Not prompt engineering.
Conclusion
AI agents are forcing observability to evolve.
Traditional systems required distributed tracing for infrastructure.
Agentic systems now require distributed tracing for reasoning.
And that is probably the most important shift introduced by platforms like AWS AgentCore.