Wassim Soltani

I Built a Game to Understand Fly.io's Orchestrator: flyd Operator Sim!

I've recently been digging into Fly.io's infrastructure, particularly flyd, their orchestration server, and superfly/fsm, the library that powers its stateful operations. To get a real feel for the operational challenges, I built an interactive simulation game: flyd Operator Sim.

Play it here: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim


🤔 Why Build a Simulation?

Fly.io's platform is impressive. Their blog posts and public infra-log show how complex flyd is to operate, and the superfly/fsm library reflects their focus on robust state management.

I wanted to explore:

  • What kind of incidents can actually occur on a worker node running flyd?

  • How does an operator diagnose and respond to these issues?

  • What's the impact of different actions on system health and application uptime?

  • What role do Finite State Machines (FSMs) play in managing complex operations like machine migrations, even if they're abstracted away from the operator in a crisis?

Building a sim felt like the best way to learn.
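
To make the FSM question above concrete, here is a minimal sketch of a machine migration modeled as a finite state machine. The state names, event names, and transition table are my own assumptions for illustration; they are not taken from flyd, from superfly/fsm (which is a Go library), or from the game's source.

```typescript
// Hypothetical migration states and events, invented for this sketch.
type MigrationState =
  | "pending"
  | "draining"
  | "copying_volume"
  | "starting"
  | "done"
  | "failed";

type MigrationEvent = "start" | "drained" | "copied" | "started" | "error";

// Allowed transitions. Any (state, event) pair not listed here is rejected,
// which is what keeps the recorded state from drifting into nonsense.
const transitions: Record<
  MigrationState,
  Partial<Record<MigrationEvent, MigrationState>>
> = {
  pending: { start: "draining", error: "failed" },
  draining: { drained: "copying_volume", error: "failed" },
  copying_volume: { copied: "starting", error: "failed" },
  starting: { started: "done", error: "failed" },
  done: {},
  failed: {},
};

// Apply one event to the current state, or throw if the move is not allowed.
function step(state: MigrationState, event: MigrationEvent): MigrationState {
  const next = transitions[state][event];
  if (next === undefined) {
    throw new Error(`invalid transition: ${state} + ${event}`);
  }
  return next;
}

// Replay a sequence of events from an initial state.
function run(
  events: MigrationEvent[],
  initial: MigrationState = "pending",
): MigrationState {
  return events.reduce(step, initial);
}
```

Calling `run(["start", "drained", "copied", "started"])` walks the happy path to `"done"`, while `step("pending", "copied")` throws. A forced override, like the risky one in the game, amounts to bypassing this table and writing a state directly.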


✨ Introducing: flyd Operator Sim!

In flyd Operator Sim, you're an on-call engineer for a Fly.io region. Your goal:

  • Monitor worker health (CPU, memory, flyd status).
  • Respond to incidents like flyd stalls, containerd sync issues, network partitions, and storage corruption (many inspired by the infra-log).
  • Act using tools like flyd restarts, worker drains, log inspection, and (risky!) FSM overrides.
  • Maintain uptime over a simulated period.
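
As a rough illustration of the monitoring loop described in the list above, here is a minimal sketch of a worker health model. The type names, the 90% thresholds, and the uptime formula are all assumptions I made for this example; the game's actual rules may differ.

```typescript
// All names and thresholds here are hypothetical, chosen for illustration only.
type FlydStatus = "running" | "stalled" | "stopped";

interface WorkerNode {
  id: string;
  cpuPercent: number; // 0 to 100
  memoryPercent: number; // 0 to 100
  flyd: FlydStatus;
}

type Health = "healthy" | "degraded" | "critical";

// Classify a worker from the three signals the operator watches.
function classify(w: WorkerNode): Health {
  if (w.flyd !== "running") return "critical"; // flyd down outranks resource pressure
  if (w.cpuPercent > 90 || w.memoryPercent > 90) return "degraded";
  return "healthy";
}

// Uptime as the share of sampled ticks in which the worker was not critical.
function uptimePercent(samples: Health[]): number {
  if (samples.length === 0) return 100;
  const up = samples.filter((h) => h !== "critical").length;
  return (up / samples.length) * 100;
}
```

Treating a non-running flyd as critical regardless of CPU or memory reflects the idea that a stalled orchestrator is worse than a busy one, but that ordering is a design choice of this sketch.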

Game Objective & Progression:
Your main goal is to maintain high application uptime across your workers for 7 simulated days. Each day lasts about 5 minutes in real time. To make things more interesting, you start with one worker, and an additional worker is added each day, up to a maximum of four, increasing your responsibilities and potential points of failure!
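
The progression above can be sketched in a few lines. This assumes that "an additional worker is added each day" means day 1 has one worker, day 2 has two, and so on until the cap; the constant and function names are mine, not the game's.

```typescript
const TOTAL_DAYS = 7;
const MINUTES_PER_DAY = 5; // "about 5 minutes" per the description above
const MAX_WORKERS = 4;

// Number of workers the operator is responsible for on a given day (1-based).
function workersOnDay(day: number): number {
  if (!Number.isInteger(day) || day < 1 || day > TOTAL_DAYS) {
    throw new RangeError(`day must be an integer from 1 to ${TOTAL_DAYS}`);
  }
  return Math.min(day, MAX_WORKERS);
}

// Worker count for each simulated day: [1, 2, 3, 4, 4, 4, 4]
const schedule = Array.from({ length: TOTAL_DAYS }, (_, i) => workersOnDay(i + 1));

// Approximate real-time length of a full run: 7 * 5 = 35 minutes
const totalMinutes = TOTAL_DAYS * MINUTES_PER_DAY;
```

Under these assumptions a full run takes roughly 35 minutes, and the last four days are played at the maximum of four workers.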


🎓 What I Learned

  1. Orchestration is Complex: Simulating even a part of it showed me the immense challenge of managing global infrastructure.
  2. State Management is Crucial (and complicated): The game reinforced how vital accurate state is for flyd and why a solid FSM library like superfly/fsm is essential, especially in light of incidents like containerd desyncs.
  3. Observability is Non-Negotiable: Good metrics and logs (which the game simulates access to) are critical for diagnosing issues, a theme evident in Fly.io's own infra-log.
  4. Operational Trade-offs: The sim touches on the pressure of quick fixes versus safer, slower solutions.

🤓 Tech Stack

Built with: Next.js, TypeScript, Tailwind CSS, Radix UI (shadcn), and React Context.


💭 Try It & Share Your Thoughts!

This was a personal learning project, but I hope others find it useful or fun.

What incidents should I add next? How can it be a better learning tool? Let me know!

Thanks for reading 💖

Top comments (6)

Dotallio

Love how you turned complex infra ops into an interactive sim - I feel like more platforms need stuff like this for onboarding.

Have you thought about adding cascading failures or partial network outages to make it even closer to real-life chaos?

Wassim Soltani

That sounds like the perfect next step!

I'd love to keep adding incident types and gameplay mechanics that can potentially help teach more infra concepts and mirror real-life chaos.

It would be awesome if more people decide to jump in and help improve the sim!

Fraser Young

This project really flies above and beyond! I had a bug-tastic time simulating those incidents. 🪰

Wassim Soltani

Not the fly emoji 😂

Nathan Tarbert

This is honestly genius, props for building something to actually see how it works in practice.

Wassim Soltani

Thanks for the comment! Would love to see more projects like this. I think it makes learning quicker, easier and a lot more fun!