Building SwiftDeploy: From Declarative Deployments to Policy-Gated Releases
Introduction
SwiftDeploy is a DevOps deployment tool I built as part of the HNG DevOps Track. The project started in Stage 4A as a deployment automation tool, then evolved in Stage 4B into a safer and more observable deployment system.
In Stage 4A, I built the deployment engine. SwiftDeploy could read a manifest.yaml file, generate infrastructure configuration files, deploy the stack, and switch between stable and canary modes.
In Stage 4B, I extended the same project with observability, Open Policy Agent policy enforcement, chaos testing, status monitoring, and audit reporting.
The goal was to move from simply starting containers to building a deployment tool that can answer important safety questions before making changes:
Is the host healthy enough to deploy?
Is the canary safe enough to promote?
What happened during deployment?
Can we prove the system was checked before action was taken?
The Design: A Tool That Writes Its Own Infrastructure
The main design principle of SwiftDeploy is that manifest.yaml is the single source of truth.
Instead of manually editing docker-compose.yml and nginx.conf, the CLI reads values from the manifest and generates the required infrastructure files from templates.
The flow is:
manifest.yaml
|
v
swiftdeploy init
|
v
nginx.conf + docker-compose.yml
|
v
docker compose up
This means the generated files can be deleted and regenerated from the manifest at any time, which makes the deployment repeatable and easier to grade, test, and maintain.
A simplified version of my manifest looks like this:
services:
  image: 10johnny-swiftdeploy-stage4b:latest
  port: 3000
  mode: stable
  version: "1.0.0"
  restart_policy: unless-stopped
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
policy:
  opa_url: http://localhost:8181
  thresholds:
    min_disk_free_gb: 10
    max_cpu_load: 2.0
    max_error_rate_percent: 1
    max_p99_latency_ms: 500
The manifest controls the service image, service port, deployment mode, Nginx port, Docker network, restart policy, and policy thresholds.
Architecture Diagram
+----------------------+
| Operator |
| python swiftdeploy |
+----------+-----------+
|
| queries policies
v
+----------------------+
| OPA Policy Engine |
| localhost:8181 only |
+----------------------+
User / curl / Browser
|
v
+----------------------+
| Nginx Container |
| Host Port: 8080 |
| X-Deployed-By header |
| Access logs |
+----------+-----------+
|
| internal Docker network
v
+----------------------+
| Python API Service |
| Internal Port:3000 |
| / |
| /healthz |
| /chaos |
| /metrics |
+----------------------+
Only Nginx is exposed to the host. The Python service is not exposed directly. OPA is also not exposed through Nginx; it is only reachable by the CLI through 127.0.0.1:8181.
This design prevents public leakage of the OPA API and keeps all user-facing traffic going through Nginx.
Stage 4A: Deployment Lifecycle
In Stage 4A, SwiftDeploy supported the core lifecycle commands:
python swiftdeploy init
python swiftdeploy validate
python swiftdeploy deploy
python swiftdeploy promote canary
python swiftdeploy promote stable
python swiftdeploy teardown
python swiftdeploy teardown --clean
Init
The init command reads manifest.yaml and generates:
nginx.conf
docker-compose.yml
These files are generated in the project root, not in a separate generated folder.
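Under the hood, init is just template rendering: load the manifest with PyYAML and feed it to Jinja2 templates from the templates/ directory. Here is a minimal sketch of that idea; the template filenames and function name are illustrative, not the exact SwiftDeploy code:

# Minimal sketch of manifest-driven generation (file and function names are illustrative).
import yaml
from jinja2 import Environment, FileSystemLoader

def render_configs():
    with open("manifest.yaml") as f:
        manifest = yaml.safe_load(f)

    env = Environment(loader=FileSystemLoader("templates"))

    # Render each template with the manifest as context and write it to the project root.
    for template_name, output_path in [
        ("nginx.conf.j2", "nginx.conf"),
        ("docker-compose.yml.j2", "docker-compose.yml"),
    ]:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_path, "w") as out:
            out.write(rendered)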
Validate
The validate command performs pre-flight checks:
1. manifest.yaml exists and is valid YAML
2. all required fields are present
3. Docker image exists locally
4. Nginx port is free
5. nginx.conf syntax is valid
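A rough sketch of how two of these checks can be implemented, assuming a socket bind test for the port and docker image inspect for the image; the helper names are illustrative:

# Illustrative pre-flight checks (simplified; not the exact SwiftDeploy code).
import socket
import subprocess

def port_is_free(port: int) -> bool:
    # Try to bind the port; if the bind fails, something else is already listening.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("0.0.0.0", port))
            return True
        except OSError:
            return False

def image_exists_locally(image: str) -> bool:
    # `docker image inspect` exits non-zero when the image is not present locally.
    result = subprocess.run(["docker", "image", "inspect", image], capture_output=True)
    return result.returncode == 0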
Deploy
The deploy command regenerates the config files, starts the Docker Compose stack, and waits until /healthz confirms that the application is healthy.
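The health wait can be a simple polling loop against /healthz. A minimal sketch, assuming the Nginx port from the manifest and the JSON shape shown later in this post:

# Poll /healthz until the stack reports healthy or the timeout expires (sketch).
import time
import requests

def wait_for_healthy(url="http://localhost:8080/healthz", timeout=60, interval=2):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(url, timeout=3)
            if resp.status_code == 200 and resp.json().get("status") == "ok":
                return True
        except requests.RequestException:
            pass  # the stack may still be starting
        time.sleep(interval)
    return False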
Promote
The promote canary command updates manifest.yaml, regenerates the compose file, restarts only the service container, and confirms the mode through /healthz.
The promote stable command switches the deployment back to stable mode.
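In outline, promotion flips the mode in the manifest, re-renders the compose file, and recreates only the service container. A sketch under those assumptions; the compose service name service matches the container list shown later, and --no-deps keeps Nginx and OPA running:

# Outline of a mode switch (assumed manifest layout and service name).
import subprocess
import yaml

def promote(mode: str):
    with open("manifest.yaml") as f:
        manifest = yaml.safe_load(f)
    manifest["services"]["mode"] = mode
    with open("manifest.yaml", "w") as f:
        yaml.safe_dump(manifest, f)

    # Re-render docker-compose.yml from the updated manifest (see the render sketch in the Init section).
    render_configs()

    # Recreate only the application container; Nginx and OPA stay up.
    subprocess.run(
        ["docker", "compose", "up", "-d", "--no-deps", "--force-recreate", "service"],
        check=True,
    )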
Stage 4B: Adding Observability — The Eyes
Stage 4B required the API service to expose a /metrics endpoint in Prometheus text format.
I added metrics using prometheus_client.
The application now tracks:
http_requests_total
http_request_duration_seconds
app_uptime_seconds
app_mode
chaos_active
These metrics show:
request throughput
HTTP status codes
request latency
application uptime
current deployment mode
active chaos state
Example test:
curl http://localhost:8080/metrics
A successful metrics output includes values like:
http_requests_total{method="GET",path="/healthz",status_code="200"}
http_request_duration_seconds_bucket
app_uptime_seconds
app_mode
chaos_active
This gave SwiftDeploy the visibility it needed before making promotion decisions.
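For illustration, the metric names above can be declared with prometheus_client roughly like this. The label names follow the sample output, while the numeric encoding of app_mode is an assumption:

# Sketch of the metric definitions using prometheus_client (names match the output above).
from prometheus_client import Counter, Histogram, Gauge, generate_latest

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status_code"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["method", "path"],
)
UPTIME = Gauge("app_uptime_seconds", "Seconds since the app started")
MODE = Gauge("app_mode", "Current deployment mode (assumed: 0 = stable, 1 = canary)")
CHAOS = Gauge("chaos_active", "Whether chaos injection is currently active")

# The /metrics route can simply return generate_latest() with the Prometheus text content type.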
The Guardrails: OPA Policy Enforcement
Stage 4B also added Open Policy Agent (OPA).
OPA became the policy decision engine. The important rule is that the CLI should not make the allow or deny decision by itself. The CLI collects data, sends it to OPA, and OPA returns a structured decision with reasons.
The policies live inside the policies/ directory:
policies/
├── infrastructure.rego
└── canary.rego
Each policy domain answers a different question.
Infrastructure Policy
The infrastructure policy answers:
Is the host safe enough for deployment?
It denies deployment if:
disk free space is below 10GB
CPU load is above 2.0
This is the hard gate for deployment. If the host does not meet the required safety threshold, swiftdeploy deploy fails before starting the stack.
The CLI sends host data to OPA, including disk and CPU information. OPA then returns a decision like:
Decision: ALLOW
- Infrastructure policy passed
or a denial reason such as:
Decision: DENY
- Disk free 5.20GB is below required 10GB
This makes failures clear to the operator.
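In practice this gate is one small call to the OPA data API. A minimal sketch using psutil and requests; the policy package path swiftdeploy/infrastructure and the input field names are assumptions that would have to match infrastructure.rego:

# Sketch of the infrastructure gate: gather host stats and ask OPA for a decision.
import psutil
import requests

def check_infrastructure(opa_url="http://localhost:8181"):
    disk = psutil.disk_usage("/")
    load1 = psutil.getloadavg()[0]  # 1-minute load average

    opa_input = {
        "input": {
            "disk_free_gb": disk.free / (1024 ** 3),
            "cpu_load": load1,
        }
    }
    # The package path is illustrative; it must match the package declared in infrastructure.rego.
    resp = requests.post(f"{opa_url}/v1/data/swiftdeploy/infrastructure", json=opa_input)
    return resp.json().get("result", {})  # e.g. {"allow": false, "deny": ["Disk free 5.20GB is below required 10GB"]}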
Canary Safety Policy
The canary policy answers:
Is the canary safe enough to promote?
It denies promotion if:
error rate is above 1%
P99 latency is above 500ms
Before promotion, SwiftDeploy scrapes /metrics, calculates the error rate and P99 latency, and sends that data to OPA.
This prevents promoting a canary that is already showing unhealthy behavior.
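The calculation itself only needs the counter and histogram samples from the Prometheus text output. A sketch of one way to do it with prometheus_client's parser; this is an approximation, not the exact SwiftDeploy code:

# Sketch: derive error rate and approximate P99 latency from the /metrics output.
import requests
from prometheus_client.parser import text_string_to_metric_families

def canary_stats(metrics_url="http://localhost:8080/metrics"):
    text = requests.get(metrics_url, timeout=5).text
    total = errors = 0.0
    buckets = {}  # histogram upper bound (le) -> cumulative count across label sets

    for family in text_string_to_metric_families(text):
        for s in family.samples:
            if s.name == "http_requests_total":
                total += s.value
                if s.labels.get("status_code", "").startswith("5"):
                    errors += s.value
            elif s.name == "http_request_duration_seconds_bucket":
                buckets[s.labels["le"]] = buckets.get(s.labels["le"], 0.0) + s.value

    error_rate = errors / total * 100 if total else 0.0

    # Approximate P99: the smallest bucket bound that covers 99% of all observations.
    p99_ms = 0.0
    observations = buckets.get("+Inf", 0.0)
    for le, count in sorted(buckets.items(), key=lambda kv: float(kv[0])):
        if observations and count >= 0.99 * observations:
            p99_ms = float(le) * 1000
            break
    return error_rate, p99_ms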
Why Policy Isolation Matters
Policy isolation is important because each policy should own one responsibility.
The infrastructure policy only checks host safety.
The canary policy only checks application safety before promotion.
This means a change to the canary policy does not require changing the infrastructure policy. Each domain owns one question and one set of data.
This also makes debugging easier. The CLI can show which policy passed or failed:
Policy Compliance:
- Infrastructure: PASS
- Canary: PASS
or:
Policy Compliance:
- Infrastructure: PASS
- Canary: FAIL
That makes the reason for blocking a deployment or promotion clear.
No OPA Leakage
OPA is added as a sidecar container in Docker Compose, but it is not routed through Nginx.
The OPA service is bound to localhost only:
ports:
  - "127.0.0.1:8181:8181"
This means the CLI can reach OPA, but users coming through the Nginx port cannot access the OPA API.
That satisfies the no-leakage requirement because Nginx only forwards user traffic to the application service, not to the policy engine.
Gated Deploy Flow
The new deploy flow is:
python swiftdeploy deploy
|
v
generate config files
|
v
start OPA sidecar
|
v
collect host stats
|
v
send input to OPA infrastructure policy
|
v
deploy only if OPA allows
A successful deploy shows:
OPA is reachable by swiftdeploy CLI
Policy domain: infrastructure
Decision: ALLOW
- Infrastructure policy passed
OK Stack is healthy --> http://localhost:8080
Gated Promote Flow
The new promote flow is:
python swiftdeploy promote canary
|
v
start/check OPA
|
v
scrape /metrics
|
v
calculate error rate and P99 latency
|
v
send input to OPA canary policy
|
v
promote only if OPA allows
A successful canary promotion shows:
Policy domain: canary
Decision: ALLOW
- Canary safety policy passed
OK Mode confirmed: canary
The canary response also includes:
X-Mode: canary
This makes it easy to confirm that the service is really running in canary mode.
The Chaos: Testing Slow and Error States
The API includes a /chaos endpoint that is active only in canary mode.
It supports slow mode:
{ "mode": "slow", "duration": 3 }
It also supports error mode:
{ "mode": "error", "rate": 0.5 }
Slow mode delays responses. Error mode causes some requests to return HTTP 500 responses.
This makes it possible to test whether the metrics endpoint, status command, and canary policy can detect unhealthy behavior.
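A simplified Flask-style sketch of what such an endpoint can look like; the real handler also checks that the app is in canary mode and flips the chaos_active gauge, which this sketch omits:

# Simplified sketch of a chaos endpoint (illustrative, not the exact SwiftDeploy handler).
import random
import time
from flask import Flask, request, jsonify

app = Flask(__name__)
chaos = {"mode": None, "duration": 0, "rate": 0.0}

@app.route("/chaos", methods=["POST"])
def set_chaos():
    cfg = request.get_json(force=True)
    chaos.update(mode=cfg.get("mode"),
                 duration=cfg.get("duration", 0),
                 rate=cfg.get("rate", 0.0))
    return jsonify(chaos)

@app.before_request
def apply_chaos():
    if request.path == "/chaos":
        return  # never inject chaos into the chaos control endpoint itself
    if chaos["mode"] == "slow":
        time.sleep(chaos["duration"])            # delay every response
    elif chaos["mode"] == "error" and random.random() < chaos["rate"]:
        return jsonify({"error": "chaos"}), 500  # fail a fraction of requests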
Example request:
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d "{\"mode\":\"error\",\"rate\":0.5}"
After injecting chaos, repeated requests increase the error count in /metrics. Then python swiftdeploy status can show a higher error rate or latency.
Example status view:
Requests: 30
Error Rate: 5.0%
P99 Latency: 700ms
Policy Compliance:
- Infrastructure: PASS
- Canary: FAIL
This is the purpose of canary safety. The system should detect the failure before allowing promotion.
Status Command
I added:
python swiftdeploy status
The status command scrapes /metrics, calculates live request statistics, checks policy compliance, and appends every scrape to:
history.jsonl
The command displays information like:
Requests: 8.0
Error Rate: 0.0%
P99 Latency: 75.0ms
Policy Compliance:
- Infrastructure: PASS
- Canary: PASS
This gives the operator a terminal dashboard for the system.
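Because history.jsonl is append-only JSON Lines, each status run only has to add one record. A sketch of that append step; the field names are illustrative:

# Sketch: append one status snapshot to history.jsonl (field names are illustrative).
import json
from datetime import datetime, timezone

def record_status(stats: dict, path="history.jsonl"):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "event": "status",
        **stats,  # e.g. requests, error_rate, p99_latency_ms, policy results
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")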
Audit Command
I also added:
python swiftdeploy audit
This command reads history.jsonl and generates:
audit_report.md
The audit report is written in GitHub Flavored Markdown and includes:
deployment timeline
policy checks
mode changes
status scrapes
policy violations
Example structure:
# SwiftDeploy Audit Report
## Timeline
| Time | Event | Summary |
|---|---|---|
| 2026-05-06T23:00:16+00:00 | deploy | Deploy in stable mode |
| 2026-05-06T23:07:36+00:00 | mode_change | canary -> stable |
## Policy Violations
No policy violations recorded.
This gives the project traceability. Instead of only knowing the current state, I can review what happened during the deployment and testing process.
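Generating the report is then a single pass over history.jsonl. A rough sketch of building the timeline table; the event and field names follow the example above but are otherwise assumptions:

# Sketch: build a Markdown timeline from history.jsonl (simplified).
import json

def write_audit_report(history_path="history.jsonl", report_path="audit_report.md"):
    lines = ["# SwiftDeploy Audit Report", "", "## Timeline", "",
             "| Time | Event | Summary |", "|---|---|---|"]
    with open(history_path) as f:
        for raw in f:
            if not raw.strip():
                continue
            entry = json.loads(raw)
            lines.append(f"| {entry.get('timestamp')} | {entry.get('event')} | {entry.get('summary', '')} |")
    with open(report_path, "w") as out:
        out.write("\n".join(lines) + "\n")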
Debugging Journey: Problems I Found and Fixed
While testing Stage 4B, I encountered some realistic DevOps issues. These issues helped me understand the importance of testing, logs, and container permissions.
1. Docker Desktop Was Not Running
At one point, Docker commands failed because Docker Desktop was not running.
The error showed that the Docker API could not be reached.
The fix was to open Docker Desktop, wait until the daemon was running, and then rebuild the image.
This reminded me that local DevOps workflows depend on the container runtime being available before running Docker commands.
2. PowerShell Tried to Run Python Code
While adding the /metrics route, I accidentally pasted Python code directly into PowerShell.
PowerShell tried to interpret lines like:
@app.route("/metrics")
def metrics():
as PowerShell commands, which caused parser errors.
The fix was to write the full Python file using Set-Content and a PowerShell here-string:
@'
# Python code here
'@ | Set-Content -Path app\main.py -Encoding UTF8
This made the update safer and avoided broken copy-and-paste edits.
3. Missing Canary Policy
During implementation, I created the infrastructure policy first. Later, I added the canary policy separately.
The canary policy was important because Stage 4B required at least two separate Rego policies. One policy handles infrastructure safety, while the other handles canary safety.
This improved the policy design and matched the requirement that policy domains should be separated.
4. Nginx Was Restarting Because of Dropped Capabilities
The most important issue I fixed was with Nginx.
I used:
cap_drop:
  - ALL
This is a security best practice, but Nginx still needed some capabilities during startup. The container kept restarting with this error:
chown("/var/cache/nginx/client_temp", 101) failed (1: Operation not permitted)
The service container was healthy and OPA was running, but Nginx could not stay up. Because of that, curl http://localhost:8080/healthz failed.
The fix was to keep cap_drop: ALL, but add back only the capabilities Nginx needed:
cap_add:
  - NET_BIND_SERVICE
  - CHOWN
  - SETUID
  - SETGID
After this fix, the stack became healthy:
OK Stack is healthy --> http://localhost:8080
The containers also showed the correct state:
nginx Up
opa Up on 127.0.0.1:8181
service Up and healthy
This was a good lesson in balancing security and functionality. Dropping all capabilities is safe, but some containers need a small set of capabilities to start correctly.
Final Verification
After fixing the Nginx capability issue, I tested the complete flow again.
The health endpoint worked:
curl http://localhost:8080/healthz
Response:
{"mode":"stable","status":"ok","uptime_seconds":2.85}
The metrics endpoint worked:
curl http://localhost:8080/metrics
It returned Prometheus metrics including:
http_requests_total
http_request_duration_seconds
app_uptime_seconds
app_mode
chaos_active
The status command worked:
python swiftdeploy status
Output included:
Policy Compliance:
- Infrastructure: PASS
- Canary: PASS
The audit command worked:
python swiftdeploy audit
Output:
OK audit_report.md generated
Nginx access logs also showed the required pipe-delimited format:
2026-05-06T23:00:16+00:00 | 200 | 0.045s | 172.20.0.3:3000 | GET /healthz HTTP/1.1
How to Run the Project
Clone the repository:
git clone https://github.com/10Johnny/swiftdeploy-stage4a.git
cd swiftdeploy-stage4a
Install CLI dependencies:
pip install pyyaml jinja2 requests psutil
Build the service image:
docker build -t 10johnny-swiftdeploy-stage4b:latest .
Generate the infrastructure files:
python swiftdeploy init
Deploy the stack:
python swiftdeploy deploy
Check health:
curl http://localhost:8080/healthz
Check metrics:
curl http://localhost:8080/metrics
Promote to canary:
python swiftdeploy promote canary
Check canary header:
curl -I http://localhost:8080/
Run the status dashboard:
python swiftdeploy status
Generate an audit report:
python swiftdeploy audit
Tear down:
python swiftdeploy teardown --clean
Repository Structure
manifest.yaml
swiftdeploy
Dockerfile
README.md
app/
templates/
policies/
Generated files are created in the root folder:
nginx.conf
docker-compose.yml
Audit files include:
history.jsonl
audit_report.md
Lessons Learned
This project taught me that deployment automation is more than starting containers.
I learned how to:
generate infrastructure files from templates
use Docker Compose to manage multiple services
use Nginx as a reverse proxy
expose Prometheus-style metrics
use OPA for policy decisions
separate policy logic from CLI logic
build policy-gated deploy and promote flows
debug container permission issues
create audit reports from deployment history
The biggest lesson was that a deployment tool should provide safety, visibility, and traceability.
A good deployment tool should not only ask:
Can I start the containers?
It should also ask:
Is the host healthy?
Is the canary safe?
What policy allowed this action?
What happened during deployment?
Can I prove it later?
SwiftDeploy Stage 4B helped me understand how observability and policy enforcement make deployments safer and more reliable.
Conclusion
SwiftDeploy started in Stage 4A as a declarative deployment CLI. In Stage 4B, I extended it with metrics, OPA policy checks, canary safety, chaos testing, status monitoring, and audit reporting.
The final result is a deployment tool that can generate its own infrastructure files, observe the running service, enforce safety rules, and keep a record of what happened.