Edith Asante
Building a Self-Service Sandbox Platform from Scratch

This is part of my HNG DevOps internship series. Follow along as I document every stage.


A Quick Recap

Stage 0 was about securing a Linux server. Stage 1 was deploying an API behind Nginx. Stage 2 was containerizing a microservices app. Stage 3 was building a DDoS detection engine. Stage 4 was writing a declarative deployment tool. Stage 5 is the most ambitious yet.

This time there was no starter code. No bugs to fix. No existing app to containerize. I had to build the entire platform from scratch — a self-service system where users can spin up isolated temporary environments, deploy apps into them, simulate outages, monitor health, and have everything auto-destroyed when the lifetime expires. Think of it as a miniature internal Heroku with a chaos engineering toggle.


The Task

The platform had to do all of this on a single Linux VM:

  • Environment Lifecycle — create and destroy isolated Docker environments on demand with a configurable TTL
  • Auto Cleanup Daemon — a background process that scans every 60 seconds and destroys expired environments automatically
  • Dynamic Nginx Routing — every new environment gets its own Nginx config written and reloaded automatically
  • Log Shipping — container logs captured and queryable by environment ID
  • Health Monitoring — a poller that hits every environment's /health endpoint every 30 seconds and marks environments as degraded after 3 consecutive failures
  • Outage Simulation — a script that can crash, pause, disconnect, or stress-test any environment on demand
  • Control API — a REST API with 6 endpoints wrapping all the scripts
  • Makefile — every action available as a make target

The stack was Docker, Docker Compose, Nginx, Bash, Python 3, and Flask. Everything had to spin up with one command.


Step 1: Repo Structure and Scaffold

Before writing a single line of logic I set up the repo structure exactly as specified:

devops-sandbox/
├── platform/
│   ├── create_env.sh
│   ├── destroy_env.sh
│   ├── cleanup_daemon.sh
│   ├── simulate_outage.sh
│   └── api.py
├── nginx/
│   ├── nginx.conf
│   └── conf.d/
├── monitor/
│   └── health_poller.sh
├── logs/
├── envs/
├── Makefile
├── docker-compose.yml
├── README.md
├── .env.example
└── .gitignore

Getting this right first saved a lot of headaches later. Every script references paths relative to the project root, and if those paths don't exist at runtime the scripts fail silently. I also set chmod +x on all shell scripts immediately — forgetting this causes confusing permission errors later.

The .gitignore was set up to exclude envs/, logs/, and .env from the start. These directories contain runtime state and secrets that should never be committed.
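
The entries themselves are short:

# Runtime state and secrets -- never committed
envs/
logs/
.env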


Step 2: The Demo App

The platform needed something to run inside each environment. The task made clear that the demo app is not the project — the platform is. So I kept it simple: a Flask app with two routes.

@app.route("/")
def index():
    return jsonify({
        "message": "Hello from the sandbox!",
        "env_id": ENV_ID
    })

@app.route("/health")
def health():
    return jsonify({"status": "ok", "env_id": ENV_ID}), 200

The /health route is the critical one. The health poller depends on it. Every environment container gets its ENV_ID injected as an environment variable so you can always tell which container you are talking to.

The app binds to 0.0.0.0, not 127.0.0.1. This is a mistake I see constantly. If you bind to localhost inside a container, nothing outside the container can reach it — including Nginx.


Step 3: Nginx Dynamic Routing

Nginx is the front door for every environment. The key insight is that nginx.conf never needs to change. It just includes everything in conf.d/:

events {}   # required by Nginx even when empty

http {
    include /etc/nginx/conf.d/*.conf;

    # Fallback when nothing in conf.d matches
    server {
        listen 80 default_server;
        return 404 "No environment found\n";
    }
}

When create_env.sh runs, it writes a new file to nginx/conf.d/$ENV_ID.conf and reloads Nginx. When destroy_env.sh runs, it deletes that file and reloads Nginx again. No manual config editing ever.

The conf.d/ directory is mounted as a Docker volume into the Nginx container. This means files written to nginx/conf.d/ on the host appear immediately inside the container. Only a reload is needed, not a rebuild.

One critical mistake to avoid: never write the Nginx config before the container is running. Nginx resolves upstream hostnames when it loads a config, so a file pointing at a container that doesn't exist yet makes the reload fail, and if Nginx is ever restarted with that file still on disk it won't start at all. The order matters — start the container first, then write the config.
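
For illustration, a generated per-environment file might look like the sketch below. Hostname-based routing is an assumption on my part; the real template lives in create_env.sh:

# nginx/conf.d/<env-id>.conf, written by create_env.sh (shape is illustrative)
server {
    listen 80;
    server_name myapp-1718000000.sandbox.local;   # hypothetical per-env hostname

    location / {
        # The container name resolves via Docker's embedded DNS
        # on the environment's dedicated network
        proxy_pass http://myapp-1718000000:5000;
        proxy_set_header Host $host;
    }
}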


Step 4: Environment Lifecycle

create_env.sh is the heart of the platform. It has to do six things in the right order:

  1. Generate a unique env ID from the name and a timestamp suffix
  2. Create a dedicated Docker network for the environment
  3. Connect the Nginx container to that network
  4. Start the app container on that network with a sandbox.env=$ENV_ID label
  5. Write the Nginx config and reload
  6. Write the state file to envs/$ENV_ID.json atomically
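
Condensed, steps 1 through 5 look something like this. The naming scheme is illustrative, and write_nginx_conf stands in for the config templating from Step 3:

ENV_NAME="$1"
ENV_ID="${ENV_NAME}-$(date +%s)"                       # 1. name + timestamp suffix
NETWORK_NAME="sandbox-net-${ENV_ID}"
CONTAINER_NAME="${ENV_ID}"

docker network create "$NETWORK_NAME"                  # 2. dedicated network
docker network connect "$NETWORK_NAME" sandbox-nginx   # 3. attach Nginx to it

docker run -d --name "$CONTAINER_NAME" \
    --network "$NETWORK_NAME" \
    --label "sandbox.env=$ENV_ID" \
    -e ENV_ID="$ENV_ID" \
    demo-app:latest                                    # 4. app container, labeled

write_nginx_conf "$ENV_ID" "$CONTAINER_NAME"           # 5. config, then...
docker exec sandbox-nginx nginx -s reload              #    ...reload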

The atomic write is important. The cleanup daemon reads these state files in a loop. If a write crashes halfway, the daemon reads garbage and fails. The fix is to write to a temp file first and then mv it into place:

TEMP_FILE=$(mktemp "$ENVS_DIR/.tmp.XXXXXX")
cat > "$TEMP_FILE" << JSON
{
  "id": "$ENV_ID",
  "name": "$ENV_NAME",
  "container": "$CONTAINER_NAME",
  "network": "$NETWORK_NAME",
  "created_at": "$CREATED_AT",
  "ttl": $TTL,
  "status": "running"
}
JSON
mv "$TEMP_FILE" "$ENVS_DIR/$ENV_ID.json"

mv is atomic on Linux when source and destination are on the same filesystem. The daemon either reads the complete file or nothing.

destroy_env.sh reverses all of this in the correct order — kill the log shipper first, stop and remove containers, disconnect Nginx from the network, remove the network, delete the Nginx config, reload Nginx, archive logs, delete the state file. Order matters here too. You cannot remove a network while containers are still connected to it.
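
Sketched end to end, with the shipper PID path and archive location as assumptions:

kill "$(cat "logs/$ENV_ID/shipper.pid")" 2>/dev/null || true   # log shipper first
docker rm -f "$CONTAINER_NAME"                                 # stop + remove app
docker network disconnect "$NETWORK_NAME" sandbox-nginx || true
docker network rm "$NETWORK_NAME"                              # legal only once empty
rm -f "nginx/conf.d/$ENV_ID.conf"
docker exec sandbox-nginx nginx -s reload
mv "logs/$ENV_ID" "logs/archive-$ENV_ID" 2>/dev/null || true   # archive logs
rm -f "envs/$ENV_ID.json"                                      # state file last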


Step 5: The Cleanup Daemon

The daemon runs in an infinite loop with a 60 second sleep. On each iteration it reads every file in envs/, computes how much time has passed since created_at, and calls destroy_env.sh if the TTL has been exceeded.

CREATED_EPOCH=$(date -d "$CREATED_AT" +%s)
NOW_EPOCH=$(date -u +%s)
EXPIRES_AT=$((CREATED_EPOCH + TTL))

if [[ "$NOW_EPOCH" -ge "$EXPIRES_AT" ]]; then
    bash "$DESTROY_SCRIPT" "$ENV_ID"
fi

One thing that breaks this: not using nullglob. If envs/ is empty, *.json expands to the literal string *.json, and the loop tries to process a file called *.json, which doesn't exist. Fix:

shopt -s nullglob
STATE_FILES=("$ENVS_DIR"/*.json)
shopt -u nullglob
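
Wrapped in its main loop, the daemon is not much more than this (jq for parsing the state files is my assumption; the real script may parse differently):

while true; do
    shopt -s nullglob
    for STATE_FILE in "$ENVS_DIR"/*.json; do
        ENV_ID=$(jq -r '.id' "$STATE_FILE")
        CREATED_AT=$(jq -r '.created_at' "$STATE_FILE")
        TTL=$(jq -r '.ttl' "$STATE_FILE")
        # ...TTL comparison from above, calling destroy_env.sh on expiry...
    done
    shopt -u nullglob
    sleep 60
done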

Every action is timestamped and written to logs/cleanup.log. The daemon runs in the background with nohup and its PID is saved so make down can stop it cleanly.


Step 6: Health Monitoring

The health poller runs every 30 seconds. For each active environment it finds the container's IP address, hits GET /health, measures the latency, and writes the result to logs/$ENV_ID/health.log.
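
Finding the container's IP is a single docker inspect call. This sketch assumes each app container sits on exactly one network, its dedicated sandbox network:

# Grab the IP from the only network the app container is attached to
CONTAINER_IP=$(docker inspect \
    -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \
    "$CONTAINER_NAME")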

Getting latency right was harder than expected. My first approach used date +%s%N for nanosecond timestamps. This failed because %N is a GNU date extension, and the date build on the VM didn't support it. The numbers came out as something like 14209454ms for a request that obviously took under a second.

The fix was to use curl's own built-in timing:

RESULT=$(curl -s -o /dev/null \
  -w "%{http_code} %{time_total}" \
  --max-time 5 \
  "http://$CONTAINER_IP:5000/health")

HTTP_STATUS=$(echo "$RESULT" | awk '{print $1}')
TIME_SEC=$(echo "$RESULT" | awk '{print $2}')
# Convert fractional seconds to integer milliseconds
LATENCY=$(awk -v t="$TIME_SEC" 'BEGIN { printf "%d", t * 1000 }')

curl's %{time_total} gives you wall clock time in seconds as a decimal. Multiply by 1000 and you have milliseconds. Accurate and reliable.

After 3 consecutive failures the poller marks the environment as degraded by updating the state file. It also resets the fail counter and restores the status to running when checks pass again. The status update uses the same atomic write pattern as the lifecycle scripts.
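
With jq (my assumption; any JSON-aware edit works), the degraded transition reuses that pattern in three lines:

TEMP_FILE=$(mktemp "$ENVS_DIR/.tmp.XXXXXX")
jq '.status = "degraded"' "$ENVS_DIR/$ENV_ID.json" > "$TEMP_FILE"
mv "$TEMP_FILE" "$ENVS_DIR/$ENV_ID.json"   # same atomic rename as before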


Step 7: Outage Simulation

The simulation script accepts --env and --mode flags. The modes map directly to Docker commands (a dispatch sketch follows the list):

  • crash → docker kill (SIGKILL, not graceful)
  • pause → docker pause
  • network → docker network disconnect
  • recover → inspects current state and reverses whichever mode is active
  • stress → stress-ng inside the container for 60 seconds
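
The dispatch is a case statement. The stress-ng flags are illustrative and assume the tool is baked into the demo image; recover is a function sketched further down:

case "$MODE" in
    crash)   docker kill "$CONTAINER" ;;                # SIGKILL, no grace period
    pause)   docker pause "$CONTAINER" ;;
    network) docker network disconnect "$NETWORK_NAME" "$CONTAINER" ;;
    stress)  docker exec -d "$CONTAINER" stress-ng --cpu 2 --timeout 60s ;;
    recover) recover "$CONTAINER" ;;
    *)       echo "Unknown mode: $MODE" >&2; exit 1 ;;
esac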

The guard at the top of the script is not optional. It checks whether the target container name matches any protected service names and refuses to run if it does:

PROTECTED=("sandbox-nginx" "cleanup_daemon" "sandbox-api")
for PROTECTED_NAME in "${PROTECTED[@]}"; do
    if [[ "$CONTAINER" == *"$PROTECTED_NAME"* ]]; then
        echo "ERROR: Refusing to simulate outage against protected container"
        exit 1
    fi
done

Without this guard, nothing stops someone from passing the Nginx container ID and taking down the entire platform.

The recover mode was the most interesting to write. It does not know which mode caused the problem — it just inspects the current state and fixes whatever is wrong. Paused? Unpause. Exited? Restart. Network disconnected? Reconnect. This makes recover genuinely useful rather than just a wrapper around one specific undo.
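
A sketch of that inspect-and-fix logic; .State.Status from docker inspect reveals which failure mode is active (the network reattach assumes the env's network name is in scope):

recover() {
    local STATE
    STATE=$(docker inspect -f '{{.State.Status}}' "$1")
    case "$STATE" in
        paused)      docker unpause "$1" ;;
        exited|dead) docker start "$1" ;;
        running)
            # Running but unreachable: assume the network was disconnected
            docker network connect "$NETWORK_NAME" "$1" 2>/dev/null || true
            ;;
    esac
}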


Step 8: The Control API

The Flask API wraps all the scripts via subprocess.run. It has 6 endpoints:

POST   /envs              → create env
GET    /envs              → list active envs + TTL remaining
DELETE /envs/:id          → destroy env
GET    /envs/:id/logs     → last 100 lines of app.log
GET    /envs/:id/health   → last 10 health check results
POST   /envs/:id/outage   → trigger simulation
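
A sketch of the create endpoint; the argument order passed to create_env.sh is my assumption:

import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/envs", methods=["POST"])
def create_env():
    body = request.get_json(force=True)
    # Hypothetical calling convention: create_env.sh <name> <ttl-seconds>
    result = subprocess.run(
        ["bash", "platform/create_env.sh", body["name"], str(body.get("ttl", 300))],
        capture_output=True, text=True, timeout=120,
    )
    if result.returncode != 0:
        return jsonify({"error": result.stderr.strip()}), 500
    return jsonify({"message": result.stdout.strip()}), 201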

The TTL remaining calculation happens in Python:

from datetime import datetime, timezone

def ttl_remaining(env):
    created = datetime.fromisoformat(
        env["created_at"].replace("Z", "+00:00")
    )
    now = datetime.now(timezone.utc)
    elapsed = (now - created).total_seconds()
    return max(0, int(env["ttl"] - elapsed))

The API runs inside a Docker container with the project directory mounted as a volume and the Docker socket mounted so it can execute Docker commands. This is the standard pattern for tools that need to manage Docker from inside Docker.
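
In compose terms, that pattern looks roughly like this (the host port is an assumption):

services:
  api:
    build: ./platform
    container_name: sandbox-api
    ports:
      - "8080:8080"
    volumes:
      - .:/app                                      # project dir: scripts + state
      - /var/run/docker.sock:/var/run/docker.sock   # lets the API drive host Docker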


Step 9: The Makefile

Every action has a make target. The two most important ones are up and down.

make up starts Nginx and the API via Docker Compose, then starts the cleanup daemon and health poller as background processes with nohup, saving their PIDs to files:

up:
    docker compose up -d --build
    # Each recipe line gets its own shell, so $! must be read
    # in the same line that launched the background job
    nohup bash platform/cleanup_daemon.sh > logs/cleanup.log 2>&1 & echo $$! > logs/cleanup_daemon.pid
    nohup bash monitor/health_poller.sh > logs/poller.log 2>&1 & echo $$! > logs/health_poller.pid

make down reads those PID files and kills the processes cleanly:

down:
    @if [ -f logs/cleanup_daemon.pid ]; then \
        kill $$(cat logs/cleanup_daemon.pid) 2>/dev/null || true; \
        rm -f logs/cleanup_daemon.pid; \
    fi

Makefile syntax has one rule that catches everyone: recipe indentation must use tabs, not spaces. If you use spaces, make throws a cryptic missing separator error that gives no hint the real problem is whitespace.


Problems I Hit Along the Way

Docker permission denied on a fresh VM — The ubuntu user is not in the docker group by default. Fix: sudo usermod -aG docker $USER followed by newgrp docker.

Nginx crashing on startup — I left a sample example.conf file in nginx/conf.d/ as a reference. Nginx tried to resolve the upstream hostname example:5000 on startup, failed, and crashed. The fix was obvious in hindsight: delete the sample file before starting Nginx.

Disk full during Docker build — docker system prune -af recovered the space. The build cache had accumulated several GB from previous builds and test runs.

demo-app:latest image lost after prune — docker system prune -a removes every image not used by at least one container. After cleaning disk space the demo app image was gone. Always rebuild the demo app image after a prune: docker build -t demo-app:latest ./demo-app.

Health log latency showing 14 million milliseconds — Caused by date +%s%N not being supported. Fixed by switching to curl's %{time_total} timing.


The Big Picture

What we built, and why it matters:

  • Dedicated Docker network per environment — complete isolation; environments cannot interfere with each other
  • Atomic state file writes — prevents corruption when the daemon and scripts write concurrently
  • Nginx config as code — dynamic routing without touching the main config
  • Log shipper PID tracking — prevents zombie processes on destroy
  • Guard in simulation script — prevents accidental destruction of platform infrastructure
  • Health-based degraded detection — automated observability without external tooling
  • REST API over raw scripts — makes the platform programmable from other tools

The hardest part of this task was not any single script. It was understanding the correct order of operations. Create the container before writing the Nginx config. Kill the log shipper before removing the container. Disconnect the network before removing it. Write state files atomically. These ordering constraints are not obvious until something breaks, and when they break they break in confusing ways.

That is the difference between infrastructure that works in a demo and infrastructure that works at 3am when something goes wrong.


Stage 5 complete. Find me on Dev.to | GitHub
