You believe a server needs a human watching it.
That belief is costing you sleep, attention, and eventually — data.
It's the default assumption of every small operation running monitoring software on-premise. Someone opens the dashboard each morning. Someone notices when it goes dark. Someone restarts the process. That someone is the weakest link in your entire architecture — not the hardware, not the network, not the code. The human.
AbayaTrack runs on Windows laptops. Three to five of them. Each one holding a browser tab open on the employee check-in dashboard. Each one silently assuming the server laptop is fine. When it isn't — when it crashes at 11am because Windows decided to apply an update — check-ins stop. Productivity data goes dark. And nobody knows until someone notices the screen.
The goal is not a perfect system. The goal is a system whose failures are invisible, self-correcting, and sub-15-second. Every architecture decision that follows is measured against that single standard.
Autonomy does not mean the system never fails. It means the system recovers before the failure becomes visible to anyone who matters.
There Are Exactly Four Ways AbayaTrack Goes Dark
Not ten. Not a hundred. Four. Name them precisely, and each one becomes solvable.
Failure Mode 1 — The Process Crash
Node.js is a single process. An unhandled exception kills it entirely. One bad database query, one null pointer, one malformed payload — the process exits, the port closes, and every client sees a network error simultaneously.
Think of it as a chef who stops cooking the moment one customer complains loudly. The entire kitchen closes. Everyone waits. PM2 is the restaurant manager who immediately sends in the sous chef — no announcement, no delay.
// ecosystem.config.js — PM2 process guardian
module.exports = {
  apps: [{
    name: 'abayatrack',
    script: './src/server.js',
    instances: 'max',               // use every CPU core
    exec_mode: 'cluster',           // zero-downtime restarts
    max_memory_restart: '400M',     // catch memory leaks early
    restart_delay: 3000,
    exp_backoff_restart_delay: 100, // don't crash-loop forever
    watch: false,                   // NEVER watch in production
  }]
}
# One-time setup — survives Windows reboots (uses the pm2-windows-startup package)
npm install -g pm2 pm2-windows-startup
pm2 start ecosystem.config.js && pm2-startup install && pm2 save
Failure Mode 2 — The Sleep Problem
Windows power management is optimised for laptops, not servers. A closed lid or an idle timer suspends the whole machine: the Node.js process freezes, every port closes, and the laptop lies dormant — technically alive, completely unresponsive. It's a shop that locks its doors mid-business-day but leaves the lights on. From outside, you cannot tell if anyone is home.
// src/utils/keepAwake.js
// Sends a harmless key press every 4 minutes.
// Windows cannot sleep if the keyboard is "active."
import { exec } from 'node:child_process';
import { logger } from './logger.js'; // the app's logger (adjust the path to match the project)

export function preventSleep() {
  setInterval(() => {
    exec(
      `powershell -command "$wsh = New-Object -ComObject WScript.Shell; $wsh.SendKeys('{SCROLLLOCK}')"`,
      (err) => { if (err) logger.warn('Keep-awake failed'); }
    );
  }, 4 * 60 * 1000);
}
Set this once in Windows power settings as well:
Control Panel → Power Options → When I close the lid → Do Nothing (plugged in)
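Then wire the keep-awake loop into server startup. A minimal sketch, assuming src/server.js is the entry point named in the PM2 config:

// src/server.js (entry point referenced by ecosystem.config.js)
import { preventSleep } from './utils/keepAwake.js';

preventSleep(); // start the 4-minute keep-awake loop before anything else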
Failure Mode 3 — The Hardware Death
The primary laptop loses power, overheats, or dies. Every other laptop keeps displaying the dashboard, but the server is gone. The reason this is solvable: you already own 3–5 laptops. Any one of them can serve as primary. The problem is coordination — none of them currently know who should lead.
Failure Mode 4 — The Sync Gap
Cloudflare D1 is reachable via the internet. The internet is not guaranteed. D1 sync fails silently, records queue locally, and if the queue grows unbounded — data consistency erodes. The fix is architectural: write locally first, sync asynchronously, alert only when the queue exceeds a threshold.
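A minimal sketch of that pattern, assuming a local SQLite store via better-sqlite3 and a pushToD1() helper that wraps whatever D1 client the project actually uses; the file path, table, and threshold are illustrative:

// src/sync/d1SyncQueue.js (illustrative sketch of the local-first pattern)
import Database from 'better-sqlite3';

const db = new Database('./data/abayatrack.db'); // local store, path illustrative
const QUEUE_ALERT_THRESHOLD = 500;               // alert once this many rows are unsynced

// 1. Writes always land locally first; the request never waits on the internet.
export function recordCheckIn(checkIn) {
  db.prepare(
    'INSERT INTO checkins (id, employee_id, checked_in_at, synced) VALUES (?, ?, ?, 0)'
  ).run(checkIn.id, checkIn.employeeId, checkIn.checkedInAt);
}

// 2. A background loop drains the queue whenever D1 is reachable.
export async function syncPending(pushToD1) {
  const pending = db.prepare('SELECT * FROM checkins WHERE synced = 0 LIMIT 100').all();
  for (const row of pending) {
    try {
      await pushToD1(row); // network call; may fail, and that is fine
      db.prepare('UPDATE checkins SET synced = 1 WHERE id = ?').run(row.id);
    } catch {
      break;               // D1 unreachable: stop and retry on the next tick
    }
  }
  // 3. Alert only when the backlog crosses the threshold.
  const backlog = db.prepare('SELECT COUNT(*) AS n FROM checkins WHERE synced = 0').get().n;
  if (backlog > QUEUE_ALERT_THRESHOLD) {
    console.warn(`D1 sync backlog: ${backlog} unsynced check-ins`);
  }
}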
Four Tiers of Resilience — Each One a Force Multiplier
| Tier | Name | What It Does | Recovery Time | Cost |
|---|---|---|---|---|
| 1 | Guardian | PM2 restarts crashed processes, survives reboots | ~3 seconds | Free |
| 2 | Watchdog | Detects frozen event loops, pings peers every 15s | ~15 seconds | Free |
| 3 | Failover | Elects a new leader when primary is unreachable | ~15 seconds | Free |
| 4 | Offline Buffer | Client-side cache syncs check-ins on reconnect | Seamless | Free |
Every tier costs nothing but engineering time. Together, they take AbayaTrack from a fragile single point of failure to a distributed system that recovers from most failures before anyone notices.
No Load Balancer. No Orchestrator. The Laptops Elect Their Own Leader.
Traditional redundancy requires a middleman: a load balancer, a cluster manager, a Kubernetes control plane. Each is brilliant. Each is also a toll booth — another component that can fail, another bill, another configuration surface.
Fewer components = fewer failure modes.
The laptops talk directly to each other. Every 15 seconds, each one asks: "Is the primary alive?" If the answer is no, the first healthy peer promotes itself. No external arbitration. No round-trip to a cloud service. Decision made in milliseconds, entirely on the local network.
// src/ha/peerElection.js
// Every laptop runs this. They all watch each other.
// Requires Node 18+ for global fetch and AbortSignal.timeout.
import { logger } from '../utils/logger.js'; // the app's logger (adjust the path to match the project)

const PEERS = [
  'http://192.168.1.101:3000', // Laptop 1 — preferred primary
  'http://192.168.1.102:3000', // Laptop 2 — first standby
  'http://192.168.1.103:3000', // Laptop 3 — second standby
];

let currentLeader = PEERS[0];

async function electLeader() {
  for (const peer of PEERS) {
    try {
      const res = await fetch(`${peer}/health`, {
        signal: AbortSignal.timeout(3000) // 3s — no waiting around
      });
      if (res.ok) {
        if (peer !== currentLeader) {
          logger.warn(`Leader elected: ${peer}`);
          currentLeader = peer;
        }
        return peer; // first healthy peer wins — done
      }
    } catch {
      logger.warn(`Peer down: ${peer}`); // try next
    }
  }
  return null; // total outage — show offline banner
}

// The rest of the app can ask who is currently leading.
export function getLeader() {
  return currentLeader;
}

// Run every 15 seconds. Costs microseconds of CPU.
setInterval(electLeader, 15_000);
electLeader(); // run immediately on boot
Why this works without a coordinator: The peer list is ordered by preference, and every laptop uses the same list in the same order. As long as they can all see the same local network, they independently elect the same leader — no consensus protocol, no extra round-trips, no split-brain. This is the Bully Algorithm in spirit, simplified for a small, trusted local network.
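The /health endpoint each peer probes can stay deliberately dumb. A sketch, assuming an Express app (the framework is not specified above, so treat this as illustrative):

// src/routes/health.js (polled by peerElection.js every 15 seconds)
import express from 'express';

const router = express.Router();
const bootTime = Date.now();

router.get('/health', (req, res) => {
  // Responding at all is the real signal: it proves the event loop is not frozen.
  // The payload is just extra diagnostics for humans.
  res.json({
    status: 'ok',
    uptimeSeconds: Math.round((Date.now() - bootTime) / 1000),
    pid: process.pid,
  });
});

export default router;

Mount it with app.use(router) in the server entry point and the election loop has everything it needs.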
The Night Shift Manager
There are three managers in the building. One is the primary — the most senior, the one everyone reports to. She handles every check-in, every task log, every sync to headquarters in the cloud.
The other two do their own work during the day. But every 15 seconds, quietly, they glance over at the primary's desk. Not intrusively. Just a check: Is she still there? Still responsive? Is the light on?
At 11:14am on a Tuesday, the primary's laptop restarts for a Windows update. Her desk goes dark. The second manager notices within 15 seconds. She walks to the front desk and — without ceremony, without calling anyone, without escalating — begins handling check-ins herself. The queue of pending syncs she's been holding locally flushes to headquarters. Nobody at the check-in station experiences a gap.
The primary comes back online at 11:18am. She sees the second manager at her desk, acknowledges the handoff, and resumes her role. No meeting was held. No incident report filed. No supervisor paged.
The building kept running. That is the whole point.
This Is Not Infrastructure. It Is Operational Trust.
| Technical Decision | Business Outcome |
|---|---|
| PM2 cluster mode + startup hook | Zero manual restarts after Windows updates |
| Peer election every 15s | Primary hardware failure invisible to users |
| Client-side offline buffer | No data gaps during server transitions |
| Local-first writes | Audit trail stays clean during outages |
| No cloud orchestrator | Adding resilience = one laptop added to a config file |
The Questions Engineers Actually Ask
Does this work if two laptops are on different subnets?
No — not without modification. The peer election engine uses direct HTTP on local IPs. For a single office network behind one router, this is never an issue.
What happens to data recorded during a failover gap?
Nothing is lost. The client-side offline cache holds every check-in submission locally in the browser. When the newly elected leader comes online, the cache flushes and syncs. The gap between primary going down and standby becoming active — approximately 15 seconds — produces at most a brief offline banner. No data gaps.
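A minimal sketch of that browser-side buffer, using localStorage and the online event; the endpoint path and storage key are illustrative:

// public/js/offlineBuffer.js (runs in the browser, not on the server)
const KEY = 'abayatrack.pendingCheckins';

export async function submitCheckIn(checkIn) {
  try {
    const res = await fetch('/api/checkins', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(checkIn),
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
  } catch {
    // Server unreachable (crash, failover window, Wi-Fi blip): buffer locally.
    const queue = JSON.parse(localStorage.getItem(KEY) || '[]');
    queue.push(checkIn);
    localStorage.setItem(KEY, JSON.stringify(queue));
  }
}

// Flush the buffer whenever connectivity returns.
window.addEventListener('online', async () => {
  const queue = JSON.parse(localStorage.getItem(KEY) || '[]');
  localStorage.setItem(KEY, '[]');
  for (const checkIn of queue) {
    await submitCheckIn(checkIn); // re-buffers automatically if it fails again
  }
});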
Does Cloudflare D1 handle concurrent writes from multiple laptops?
Yes, with caveats. D1 uses SQLite under the hood — single-writer per database. In this failover architecture, only the elected leader syncs to D1 at any time. Multiple simultaneous writers would require conflict resolution via ON CONFLICT DO UPDATE with a last_write_wins strategy on the updated_at column.
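A sketch of that conflict clause as the sync worker might issue it; the table and column names are illustrative, not the real schema:

// Last-write-wins upsert for D1 (SQLite dialect), executed by whichever
// D1 client the sync worker uses. Table and columns are illustrative.
const UPSERT_CHECKIN = `
  INSERT INTO checkins (id, employee_id, checked_in_at, updated_at)
  VALUES (?1, ?2, ?3, ?4)
  ON CONFLICT(id) DO UPDATE SET
    employee_id   = excluded.employee_id,
    checked_in_at = excluded.checked_in_at,
    updated_at    = excluded.updated_at
  WHERE excluded.updated_at > checkins.updated_at
`;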
How does PM2 survive a Windows update reboot?
Via the startup hook. Running pm2-startup install (from the pm2-windows-startup package) registers a startup entry that resurrects PM2's saved process list when Windows comes back up. On boot — regardless of who triggered the restart — the hook fires, PM2 resurrects, and the Node.js process is live within 30 seconds.
Is this UAE PDPL compliant for employee monitoring data?
The architecture alone is not sufficient. Compliance requires written employee consent, a data retention policy enforced in code, encrypted local cache, and a documented right-to-access mechanism. The resilience layer here is compliant-ready — compliance itself requires additional application-level controls.
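For illustration only, here is one shape "a data retention policy enforced in code" could take, assuming the same better-sqlite3 store as the sync sketch above; the retention period is a policy decision, not an engineering default:

// src/jobs/retention.js (illustrative; schema and period are placeholders)
import Database from 'better-sqlite3';

const db = new Database('./data/abayatrack.db');
const RETENTION_DAYS = 90; // set by the written policy, not by this file

// Purge check-in records older than the retention window, once a day.
setInterval(() => {
  db.prepare("DELETE FROM checkins WHERE checked_in_at < datetime('now', ?)")
    .run(`-${RETENTION_DAYS} days`);
}, 24 * 60 * 60 * 1000);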
It Is Not About Running Forever. It Is About Recovering Invisibly.
Every system fails. The only question is whether the failure is visible to the people it serves.
AbayaTrack on supervised laptops fails visibly — someone has to notice, someone has to act. AbayaTrack with peer election, PM2 guardianship, and client-side buffering fails invisibly.
Fewer components. Fewer coordinators. Fewer assumptions about hardware reliability. The laptops you already own become a quorum. The data stays whole. The operation continues.
Not a server with a backup. A consensus system that happens to run on Windows.
Where in your current setup does a single human decision sit between operational continuity and a data gap — and what would it take to remove that human from the critical path?
Abir Abbas builds distributed systems from first principles — peer election, D1 sync, PM2 hardening, compliance layer, the full stack → Abir.abbas@proton.me | Read the same on Medium → medium.com/@md.abir1203
Watch the build on camera → YouTube.com/@wavelinkd