Everything you know about observability needs a slight rethink when you move to serverless. Let me save you the weeks of frustration I went through.
Cold starts matter now
Your p99 latency is dominated by cold starts. A function that runs in 50ms warm can take 2 seconds cold. If you measure latency without splitting warm vs cold, your metric is lying to you.
Instrument: request_duration_ms tagged with cold_start: true|false. Alert on warm p99, not blended p99.
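A minimal sketch of that split, assuming a Python Lambda: module scope runs once per container, so a module-level flag is true exactly on cold starts. The metric name and the JSON log schema here are illustrative, not a specific vendor's API.

```python
import json
import time

# Module scope executes once per container: the first invocation
# (the cold start) sees this as True, warm invocations see False.
_COLD_START = True

def handler(event, context=None):
    global _COLD_START
    cold = _COLD_START
    _COLD_START = False

    start = time.perf_counter()
    # ... actual work would go here ...
    duration_ms = (time.perf_counter() - start) * 1000

    # Emit the metric as a structured log line (hypothetical schema)
    # so your pipeline can compute warm p99 and cold p99 separately.
    print(json.dumps({
        "metric": "request_duration_ms",
        "value": round(duration_ms, 2),
        "cold_start": cold,
    }))
    return {"cold_start": cold}
```

With the tag in place, your alert queries filter on cold_start: false so a burst of cold starts can't page you for "latency regressions" that are really concurrency spikes.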
No persistent process, no persistent buffers
Your usual tracing SDK probably buffers events and flushes periodically. On serverless, the runtime can die between invocations. You lose the buffered events.
Fix: flush at the end of every invocation, synchronously. Yes, it adds latency. Yes, it's worth it.
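The pattern is a try/finally around the handler body. This sketch uses a toy in-memory buffer standing in for a real tracing SDK's client; the class and method names are hypothetical.

```python
class TelemetryBuffer:
    """Stand-in for a tracing SDK's in-memory event buffer."""
    def __init__(self):
        self.events = []
        self.flushed = []

    def record(self, event):
        self.events.append(event)

    def flush(self):
        # In a real SDK this is a blocking network call. Synchronous
        # on purpose: fire-and-forget sends may never complete once
        # the runtime freezes the container.
        self.flushed.extend(self.events)
        self.events.clear()

telemetry = TelemetryBuffer()

def handler(event, context=None):
    try:
        telemetry.record({"name": "invocation.start"})
        # ... actual work ...
        return {"ok": True}
    finally:
        # Flush before returning: after the handler returns, the
        # runtime may freeze or kill the container at any time.
        telemetry.flush()
```

The finally block matters: it flushes even when the handler raises, which is exactly when you most want the telemetry.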
Logs are your friend, not your traces
Traces are hard across serverless boundaries. Events are not. If your architecture is heavily event-driven, invest in structured logging and a log-based observability pattern before you invest in traces.
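One JSON object per log line is the whole trick: it makes every field queryable in log-based tooling like CloudWatch Logs Insights. A minimal sketch using Python's stdlib logging; the field names are illustrative.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Extra structured fields attached via log(..., **fields).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("app")
_handler = logging.StreamHandler(sys.stdout)
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)
logger.setLevel(logging.INFO)

def log(msg, **fields):
    logger.info(msg, extra={"fields": fields})

# Every event carries a correlation id so you can stitch an
# event's journey together across functions with a single query.
log("order.received", order_id="ord-123", correlation_id="evt-9")
```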
Cost is the metric
On EC2, you don't think about per-request cost. On Lambda, you obsessively do. Every millisecond of compute is money. Instrument execution time and memory explicitly: at the same memory allocation, a function that runs in 150ms instead of 50ms costs three times as much.
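The arithmetic is worth wiring into a helper. Lambda bills duration times allocated memory in GB-seconds; the rate below is illustrative (roughly the published x86 rate at time of writing), so check current pricing before relying on the absolute numbers.

```python
def invocation_cost_usd(duration_ms, memory_mb,
                        price_per_gb_second=0.0000166667):
    """Approximate compute cost of one invocation.

    Ignores the per-request charge and free tier; the rate is an
    illustrative default, not a guaranteed price.
    """
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second

# Same 512 MB allocation, 3x the duration -> 3x the compute cost.
fast = invocation_cost_usd(50, 512)
slow = invocation_cost_usd(150, 512)
```

Emit this alongside your duration metric and cost regressions show up in the same dashboard as latency regressions.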
Fan-out visibility
When one event triggers 1,000 downstream function invocations, you need to see the tree. Most dashboards weren't built for this. Either lean on X-Ray or CloudWatch Logs Insights tricks, or push to a tool that handles fan-out explicitly.
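Whatever tool you use, the prerequisite is that every downstream event carries the root's id. A sketch of that envelope, with hypothetical field names: trace_id identifies the root event that started the fan-out, parent_id the immediate producer, so the tree can be rebuilt from logs alone.

```python
import uuid

def fan_out(parent_event, child_type):
    """Derive a downstream event that inherits the root trace id."""
    return {
        "id": str(uuid.uuid4()),
        "type": child_type,
        # Root events have no trace_id yet; they are their own root.
        "trace_id": parent_event.get("trace_id", parent_event["id"]),
        "parent_id": parent_event["id"],
    }

# One root event fans out to three children; grandchildren keep
# pointing at the same root, so one query finds the whole tree.
root = {"id": str(uuid.uuid4()), "type": "order.created"}
children = [fan_out(root, "notify.user") for _ in range(3)]
grandchild = fan_out(children[0], "send.email")
```

Group your structured logs by trace_id and the 1,000-invocation tree collapses into one queryable unit.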
Serverless isn't harder to observe. It's differently observed. Adjust your instincts or you'll miss the real problems.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com