Alan West

How to Migrate a Production Stack to a New Region Without Downtime

Last spring I moved an entire production stack — database, object storage, app servers, mail — from one cloud region to another. Different continent, different provider, same users. The plan looked clean on paper. The execution? Less clean.

If you've ever tried to physically relocate a running system, you know the feeling. Stuff that worked locally suddenly doesn't. DNS lies to you. Half your users hit the old server, half hit the new one, and the database that was supposed to be "caught up" is mysteriously 12 minutes behind.

This post is about why naive cutovers fail and how to do one that actually works.

The problem: "just point DNS at the new server"

I've seen this approach proposed in a dozen migration docs. Spin up the new stack, dump and restore the database, change the DNS record, done. It almost never works that cleanly, and the reason is rarely a single thing — it's a pile of small things that all hit at once.

Here's what actually goes wrong:

  • DNS TTLs aren't honored. Some resolvers cache aggressively past your TTL. You'll have traffic hitting the old IP for hours, sometimes days.
  • In-flight writes get lost. Anything written to the old database between your snapshot and the cutover is gone unless you've planned for it.
  • Stateful sessions break. Users logged into the old box get bounced to the new one with no session.
  • Async jobs disappear. Queue workers on the old machine pick up jobs and then get killed mid-execution.
  • Email auth breaks silently. Your new IP isn't in the SPF record, so deliverability tanks for 48 hours before anyone notices.

Individually, each of these is fixable. Together, on cutover day, they form a small disaster.

Root cause: you have two sources of truth

The core issue is that during cutover, you briefly run two copies of your system that both think they're authoritative. Writes can land on either. Reads can come from either. Whichever side has "the truth" depends on which resolver a client used and when.

The fix is to never have two writable sources of truth at the same time. You either:

  1. Keep the old side authoritative and replicate forward to the new one (then cut writes over atomically), or
  2. Put both behind a proxy that decides for you.

I've done both. The replication approach is simpler when your stack is a typical web app with a relational database. Let's walk through that one.

Step 1: Lower your DNS TTLs days in advance

This is the single most important thing and it has to happen before anything else. If your A record has a 24-hour TTL, your cutover window is effectively 24 hours long.

Drop it to 60 seconds at least 48 hours before the migration. In a BIND-style zone file:

; Before — long TTL is fine for steady state
app.example.com.  86400  IN  A  203.0.113.10

; Two days before cutover — drop way down
app.example.com.  60     IN  A  203.0.113.10

Yes, your DNS query volume goes up. That's fine; it's temporary. After the migration settles, raise it back.
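
To confirm the new TTL is what resolvers actually see, query the record directly; the second column of dig's answer is the TTL in seconds:

# The TTL field should now read 60 (or less, if answered from a cache)
dig +noall +answer app.example.com A
app.example.com.  60  IN  A  203.0.113.10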

Step 2: Set up logical replication to the new database

For Postgres, logical replication lets you stream changes from the old DB to the new one while the old one stays live. On the publisher (old DB):

-- Create a publication for the tables you want to replicate
CREATE PUBLICATION migration_pub FOR ALL TABLES;

On the subscriber (new DB), after you've restored the initial schema:

-- Point at the old DB and start streaming changes
CREATE SUBSCRIPTION migration_sub
  CONNECTION 'host=old.db.internal port=5432 dbname=app user=replicator password=...'
  PUBLICATION migration_pub;

A few things to know before you do this in production:

  • Tables without a primary key (or other replica identity) are a problem: Postgres will refuse UPDATEs and DELETEs on them once they're published. Add keys first (see the sketch after this list).
  • Sequences don't replicate. You have to manually advance them on the new side before cutover or you'll get duplicate IDs; the setval example in Step 3 shows one way.
  • Large objects, certain extensions, and some custom types are excluded. Check the official Postgres docs on logical replication restrictions before you assume coverage.

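For that first caveat, a minimal sketch; audit_log and its id column are stand-ins for whatever keyless tables you find:

-- Preferred: give the table a real primary key
ALTER TABLE audit_log ADD PRIMARY KEY (id);

-- Fallback when you can't: replicate using the whole row as the identity
-- (works, but every UPDATE/DELETE ships the full old row over the wire)
ALTER TABLE audit_log REPLICA IDENTITY FULL;
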
Watch the replication lag. When it's consistently under a second, you're ready.
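
One way to watch it from the publisher side. Postgres names the replication slot after the subscription by default, so migration_sub below is an assumption:

-- Byte lag per replication slot, on the old (publisher) database
SELECT slot_name,
       pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) AS bytes_behind
FROM pg_replication_slots
WHERE slot_name = 'migration_sub';

-- Or time-based lag (Postgres 10+)
SELECT application_name, replay_lag FROM pg_stat_replication;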

Step 3: Cut writes over atomically

This is the moment that has to be clean. The pattern I use:

  1. Put the app into read-only mode (a feature flag, a maintenance middleware, whatever).
  2. Wait for in-flight writes to drain and replication lag to hit zero.
  3. Promote the new database (drop the subscription, advance sequences, verify a sample of rows; there's a SQL sketch after this list).
  4. Flip the app config to point at the new DB.
  5. Re-enable writes on the new side.

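The promote step from that list, sketched in SQL; users and users_id_seq are stand-ins for your real tables and sequences:

-- On the new database, once replication lag is zero
-- (run outside a transaction; this also removes the slot on the publisher)
DROP SUBSCRIPTION migration_sub;

-- Advance each sequence past the highest ID the old side handed out
SELECT setval('users_id_seq', coalesce(max(id), 1)) FROM users;
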
A tiny middleware to enforce read-only mode looks like this in Express:

// Block writes during the cutover window
function readOnlyGuard(req, res, next) {
  // Allow safe methods through
  const safe = ['GET', 'HEAD', 'OPTIONS'];
  if (safe.includes(req.method)) return next();

  // Respect a runtime flag so we can flip it without redeploying
  if (process.env.READ_ONLY === 'true') {
    return res.status(503).json({
      error: 'Maintenance in progress, try again in a few minutes',
    });
  }
  next();
}
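
// Mount it before any route handlers so every request passes through:
//   app.use(readOnlyGuard);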

The whole window, if you've prepped properly, is usually 2–5 minutes. Long enough to be noticed, short enough that nobody opens a support ticket.

Step 4: Don't forget the boring stuff

The database gets all the attention, but the things that bite you after cutover are usually:

  • SPF / DKIM / DMARC. Add the new sending IP to your SPF record before cutting over, not after; there's an example after this list. DNS propagation here matters as much as it does for your A records.
  • Outbound webhooks. If any third-party service has your old IP allowlisted, update it ahead of time.
  • Cron jobs. Make sure they're disabled on the old box before you turn off the database. Otherwise you'll have a zombie cron firing requests at a server that no longer has a backend.
  • Object storage. Mirror it ahead of time and do a final rsync during the read-only window. Don't try to migrate it during the cutover itself.
  • Backups. Verify the new backup system actually works by restoring a snapshot to a scratch instance. "It ran without errors" is not the same as "it works."
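
For the SPF item, the fix is a one-line change to a TXT record; 198.51.100.20 stands in for the new sending IP:

; Before: only the old IP is authorized to send as example.com
example.com.  3600  IN  TXT  "v=spf1 ip4:203.0.113.10 ~all"

; Before cutover: authorize both, and keep both until the old box is gone
example.com.  3600  IN  TXT  "v=spf1 ip4:203.0.113.10 ip4:198.51.100.20 ~all"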

Prevention: design for the next migration

You will do this again. Maybe not next month, but eventually. A few habits that make future moves cheaper:

  • Keep configuration in environment variables, not baked into images. Region, DB host, bucket name — all env vars, no exceptions.
  • Use a config service or feature flag for read-only mode from day one. Don't write it under pressure.
  • Run a quarterly DR drill that includes restoring a backup to a different region. If you've never tested it, it doesn't work.
  • Document your DNS records and TTLs somewhere that isn't "the registrar UI." A simple text file in the repo is fine.

The migration itself was 90 minutes of focused work on a Saturday morning. The prep was about three weeks of part-time tinkering, mostly lowering TTLs, fixing tables without primary keys, and untangling hardcoded region strings I'd forgotten about. That ratio — weeks of prep, minutes of execution — is the right shape for this kind of work. If your cutover plan looks short, you haven't planned enough yet.

Top comments (1)

Rahul Joshi

A masterclass in operational resilience that proves zero-downtime regional migrations are about data synchronization and DNS strategy rather than just infrastructure replication. It’s a great blueprint for anyone looking to balance high availability with the high-stakes complexity of a production cutover.