Andrew Tan

Posted on • Originally published at layline.io

You Didn't Escape Airflow's Complexity — You Just Distributed It

Adding Kestra, Dagster, or Prefect alongside Airflow doesn't reduce orchestration complexity. It multiplies it. Here's what the hidden coordination debt actually looks like — and what to do about it.

The typical data team's orchestration stack evolves in reasonable steps. You start with Airflow. It's fine. The team knows it. DAGs run on schedule. Five years in, Airflow is running 300 DAGs and everyone's quietly afraid to touch the base image.

Then you need something modern. A new hire pushes for Prefect — it's Python-native, the developer experience is better, and the UI is cleaner. So you start new projects in Prefect and leave the old ones in Airflow.

Then an ML team shows up. They want Dagster because asset-centric thinking and lineage tracking fit their feature store work. Reasonable. You add Dagster.

Nobody made a bad decision. Each tool was the right call in context. But the team is now paying for three schedulers, three sets of workers, three monitoring dashboards, and three mental models. When data flows from Airflow into Dagster before going to a Prefect-orchestrated API call, the lineage breaks. You can see each step in isolation. You cannot see the whole chain.

This is the orchestration tax. And it's nearly universal in companies that have been building data infrastructure for more than two years.

How the tax shows up
The hidden bill appears in three places most teams don't measure.

The coordination seam. When Pipeline A (Airflow) needs to trigger Pipeline B (Dagster), how does it do that? Usually: a file drop, a database flag, an API call, or — most common — a Slack message between humans who own each system. That "integration" is now load-bearing. When it breaks, it fails silently. You find out three hours later when the Dagster pipeline ran on yesterday's data.

Some teams end up with an entire engineer dedicated to maintaining what they internally call "the glue layer." That's a full-time role writing Python scripts to make three orchestration tools pretend they're one.
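To make that concrete, here is a minimal sketch of the kind of glue this role produces: an Airflow task that triggers a downstream Dagster job over an internal webhook. The URL, payload, and names are hypothetical, and the Airflow 2.x idioms shown (PythonOperator, schedule_interval) vary by version. The point is that the task reports success as long as the POST returns 200, whether or not the downstream run ever starts.

```python
# Hypothetical glue: an Airflow DAG that triggers a downstream Dagster job
# via an internal webhook. URL, payload, and names are illustrative only.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

TRIGGER_URL = "https://dagster.internal.example.com/hooks/feature-refresh"


def trigger_downstream(**context):
    # The entire contract between the two systems is this POST body.
    # If the webhook accepts the request but the run never starts,
    # nothing here will notice: the Airflow task still reports success.
    resp = requests.post(
        TRIGGER_URL,
        json={"run_date": context["ds"]},
        timeout=30,
    )
    resp.raise_for_status()


with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="trigger_dagster_feature_refresh",
        python_callable=trigger_downstream,
    )
```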

The debugging maze. A data quality issue surfaces in the BI tool. The number is wrong. Where did it go wrong? You start at the Airflow logs. The DAG succeeded. You check Prefect — the event flow succeeded. You check Dagster — the assets materialized. Somewhere in the handoff between systems, something went sideways, and there is no unified view of what happened.

The MTTR (mean time to resolution) for cross-system failures is consistently 3-5x higher than for single-system failures across the teams that track this. The debugging cost is the biggest hidden piece.

The context-switching toll. Airflow's scheduler thinks in cron expressions and task dependencies. Dagster thinks in assets and freshness policies. Prefect thinks in flows and deployments. Each has its own authentication model, its own secret management, its own way to handle retries. Engineers become fluent in all three — which means they're expert in none of them, and every tool transition costs cognitive overhead that doesn't show up in any sprint tracker.
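
For a feel of how different those mental models are, here is a rough sketch of the same "refresh the orders data nightly" intent expressed in each tool's idiom. The decorator entry points (Airflow's TaskFlow API, Dagster assets, Prefect flows) are real, but the bodies are placeholders, the names are invented, and exact signatures vary by version.

```python
# Three idioms for roughly the same intent; bodies omitted, names invented.

# Airflow: cron schedules and task dependencies (TaskFlow API, Airflow 2.x)
from datetime import datetime
from airflow.decorators import dag, task


@dag(schedule="0 2 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_orders():
    @task
    def extract(): ...

    @task
    def load(raw): ...

    load(extract())


nightly_orders()

# Dagster: assets that get materialized, with lineage tracked per asset
from dagster import asset


@asset
def orders_table():
    ...


# Prefect: flows that are deployed and scheduled separately from the code
from prefect import flow


@flow(retries=2)
def refresh_orders():
    ...
```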

The Kestra situation
This is why Kestra's marketing resonates. Their pitch — "you can run Airflow, Spark, dbt, and custom scripts, all from one orchestrator" — addresses the multi-tool frustration directly.

But there's a difference between a single pane of glass and a single source of truth. Kestra can wrap your existing tools. That's useful. It doesn't actually reduce the distributed coordination problem. You've added another tool on top of three tools.

The orchestration sprawl isn't a UI problem. It's a data flow ownership problem. Who owns the event that triggers the chain? Who owns the schema of the data passing between systems? Who's responsible when the handoff between step 2 and step 3 fails?

A new orchestration layer at the top doesn't answer those questions. It just adds one more system to look at when you're debugging at 2 AM.

What actually helps
Let's be direct about what works versus what just moves the problem around.

Works: consolidating around one model, aggressively. Pick the tool that handles 80% of your current workload well, migrate everything you can, and live with the friction of moving legacy jobs. It's painful for six months. After that, you have one scheduling model, one set of workers, one place to look when things fail. The teams that do this consistently report a 40-60% reduction in incident response time within a year.

Works: treating inter-system handoffs as first-class data. If you have to run multiple tools for legitimate reasons (e.g., ML pipelines genuinely do benefit from Dagster's asset model), make every handoff an explicit, monitored data transfer. Not a file drop. Not a database flag that someone added to a table four years ago. A defined schema, with observability, with retries, with alerting. The glue becomes part of your system design rather than an accident of it; a concrete sketch of what that can look like follows below.

Doesn't work: adding observability on top of fragmentation. Another dashboard showing all three systems' status doesn't fix the coordination problem — it just makes the distributed failure visible in more places. You need fewer things to observe, not better tools for observing more things.

Doesn't work: migration theater. "We're migrating to Dagster over the next 18 months" is not a plan. It's a statement that the pain isn't quite bad enough yet to do the actual work. Until you actually retire the old tool, you're just adding integration surface area while you plan.
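
To make the "handoffs as first-class data" point concrete, here is a minimal sketch of an explicit contract between two orchestrators: a defined schema for the payload, a publish step with retries, and a loud alert when it fails. Pydantic is real; the bus client and paging hook are stand-ins for whatever your platform provides.

```python
# A sketch of an explicit, monitored handoff between two orchestrators.
# The schema library (pydantic v2) is real; the bus and paging hook are
# placeholders for your own infrastructure.
from datetime import date, datetime

from pydantic import BaseModel


class FeatureRefreshRequest(BaseModel):
    """The contract between the upstream ETL and the downstream feature job."""
    run_date: date
    source_table: str
    row_count: int
    produced_at: datetime


def publish_handoff(bus, page_oncall, payload: FeatureRefreshRequest,
                    max_attempts: int = 3) -> None:
    """Validate the payload, publish with retries, and alert on final failure."""
    body = payload.model_dump_json()  # schema is enforced before anything ships
    for attempt in range(1, max_attempts + 1):
        try:
            bus.publish(topic="feature-refresh-requests", message=body)
            return
        except Exception as exc:
            if attempt == max_attempts:
                page_oncall(f"Handoff failed after {attempt} attempts: {exc}")
                raise
```

The design choice that matters here is that the schema is validated before anything ships and the final failure pages someone, instead of a file quietly not appearing.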

The batch/streaming piece
One real reason teams run multiple orchestration tools is that batch and streaming genuinely have different requirements. Airflow schedules jobs. Kafka processes streams. Different paradigms, different tooling — and if you're trying to serve both in the same data platform, you end up with two separate workflow management systems.

This is worth naming directly: when a platform handles both batch and streaming within the same deployment model, the same workflow definitions, and the same operations tooling, the team that runs the nightly ETL can also own the real-time event processing. Not because anyone is reinventing Airflow or Kafka, but because the split between "scheduled" and "event-driven" shouldn't require two separate engineering specialties and two separate monitoring systems.

The goal isn't to replace everything you have. It's to stop paying the tax.

The actual question
The conversation usually comes around to the same question: "Is this actually a problem worth solving, or just the nature of building data systems?"

Fair. Every company has technical debt. Not every debt is worth paying off.

Here's a simple way to think about it: if your on-call rotation includes "check all three schedulers" as a step in every runbook, you're paying the orchestration tax every week. If a new data engineer needs a month to become productive because they have to learn the mental models of multiple tools, you're paying it every hire. If your debugging process requires cross-referencing three different log systems, you're paying it every incident.

Add that up. Then decide.
