Building SwiftDeploy: From Declarative Deployments to Policy-Gated Releases
Introduction
SwiftDeploy is a DevOps deployment tool I built as part of the HNG DevOps Track. The project started in Stage 4A as a deployment automation tool, then evolved in Stage 4B into a safer and more observable deployment system.
In Stage 4A, I built the deployment engine. SwiftDeploy could read a manifest.yaml file, generate infrastructure configuration files, deploy the stack, and switch between stable and canary modes.
In Stage 4B, I extended the same project with observability, Open Policy Agent policy enforcement, chaos testing, status monitoring, and audit reporting.
The goal was to move from simply starting containers to building a deployment tool that can answer important safety questions before making changes:
Is the host healthy enough to deploy?
Is the canary safe enough to promote?
What happened during deployment?
Can we prove the system was checked before action was taken?
The Design: A Tool That Writes Its Own Infrastructure
The main design principle of SwiftDeploy is that manifest.yaml is the single source of truth.
Instead of manually editing docker-compose.yml and nginx.conf, the CLI reads values from the manifest and generates the required infrastructure files from templates.
The flow is:
manifest.yaml
|
v
swiftdeploy init
|
v
nginx.conf + docker-compose.yml
|
v
docker compose up
This means the generated files can be deleted and regenerated from the manifest at any time, which makes the deployment repeatable and easier to grade, test, and maintain.
A simplified version of my manifest looks like this:
services:
  image: 10johnny-swiftdeploy-stage4b:latest
  port: 3000
  mode: stable
  version: "1.0.0"
  restart_policy: unless-stopped
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
policy:
  opa_url: http://localhost:8181
  thresholds:
    min_disk_free_gb: 10
    max_cpu_load: 2.0
    max_error_rate_percent: 1
    max_p99_latency_ms: 500
The manifest controls the service image, service port, deployment mode, Nginx port, Docker network, restart policy, and policy thresholds.
Architecture Diagram
+----------------------+
| Operator |
| python swiftdeploy |
+----------+-----------+
|
| queries policies
v
+----------------------+
| OPA Policy Engine |
| localhost:8181 only |
+----------------------+
User / curl / Browser
|
v
+----------------------+
| Nginx Container |
| Host Port: 8080 |
| X-Deployed-By header |
| Access logs |
+----------+-----------+
|
| internal Docker network
v
+----------------------+
| Python API Service |
| Internal Port:3000 |
| / |
| /healthz |
| /chaos |
| /metrics |
+----------------------+
Only Nginx is exposed to the host. The Python service is not exposed directly. OPA is also not exposed through Nginx; it is only reachable by the CLI through 127.0.0.1:8181.
This design prevents public leakage of the OPA API and keeps all user-facing traffic going through Nginx.
Stage 4A: Deployment Lifecycle
In Stage 4A, SwiftDeploy supported the core lifecycle commands:
python swiftdeploy init
python swiftdeploy validate
python swiftdeploy deploy
python swiftdeploy promote canary
python swiftdeploy promote stable
python swiftdeploy teardown
python swiftdeploy teardown --clean
Init
The init command reads manifest.yaml and generates:
nginx.conf
docker-compose.yml
These files are generated in the project root, not in a separate generated folder.
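Under the hood, init is just template rendering: load the manifest with PyYAML and feed it to Jinja2 templates from the templates/ directory. Here is a minimal sketch of that idea; the template filenames and function name are illustrative, not the exact SwiftDeploy code:

# Minimal sketch of manifest-driven generation (file and function names are illustrative).
import yaml
from jinja2 import Environment, FileSystemLoader

def render_configs():
    with open("manifest.yaml") as f:
        manifest = yaml.safe_load(f)

    env = Environment(loader=FileSystemLoader("templates"))

    # Render each template with the manifest as context and write it to the project root.
    for template_name, output_path in [
        ("nginx.conf.j2", "nginx.conf"),
        ("docker-compose.yml.j2", "docker-compose.yml"),
    ]:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_path, "w") as out:
            out.write(rendered)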
Validate
The validate command performs pre-flight checks:
1. manifest.yaml exists and is valid YAML
2. all required fields are present
3. Docker image exists locally
4. Nginx port is free
5. nginx.conf syntax is valid
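A rough sketch of how two of these checks can be implemented, assuming a socket bind test for the port and docker image inspect for the image; the helper names are illustrative:

# Illustrative pre-flight checks (simplified; not the exact SwiftDeploy code).
import socket
import subprocess

def port_is_free(port: int) -> bool:
    # Try to bind the port; if the bind fails, something else is already listening.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("0.0.0.0", port))
            return True
        except OSError:
            return False

def image_exists_locally(image: str) -> bool:
    # `docker image inspect` exits non-zero when the image is not present locally.
    result = subprocess.run(["docker", "image", "inspect", image], capture_output=True)
    return result.returncode == 0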
Deploy
The deploy command regenerates the config files, starts the Docker Compose stack, and waits until /healthz confirms that the application is healthy.
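The health wait can be a simple polling loop against /healthz. A minimal sketch, assuming the Nginx port from the manifest and the JSON shape shown later in this post:

# Poll /healthz until the stack reports healthy or the timeout expires (sketch).
import time
import requests

def wait_for_healthy(url="http://localhost:8080/healthz", timeout=60, interval=2):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(url, timeout=3)
            if resp.status_code == 200 and resp.json().get("status") == "ok":
                return True
        except requests.RequestException:
            pass  # the stack may still be starting
        time.sleep(interval)
    return False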
Promote
The promote canary command updates manifest.yaml, regenerates the compose file, restarts only the service container, and confirms the mode through /healthz.
The promote stable command switches the deployment back to stable mode.
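In outline, promotion flips the mode in the manifest, re-renders the compose file, and recreates only the service container. A sketch under those assumptions; the compose service name service matches the container list shown later, and --no-deps keeps Nginx and OPA running:

# Outline of a mode switch (assumed manifest layout and service name).
import subprocess
import yaml

def promote(mode: str):
    with open("manifest.yaml") as f:
        manifest = yaml.safe_load(f)
    manifest["services"]["mode"] = mode
    with open("manifest.yaml", "w") as f:
        yaml.safe_dump(manifest, f)

    # Re-render docker-compose.yml from the updated manifest (see the render sketch in the Init section).
    render_configs()

    # Recreate only the application container; Nginx and OPA stay up.
    subprocess.run(
        ["docker", "compose", "up", "-d", "--no-deps", "--force-recreate", "service"],
        check=True,
    )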
Stage 4B: Adding Observability — The Eyes
Stage 4B required the API service to expose a /metrics endpoint in Prometheus text format.
I added metrics using prometheus_client.
The application now tracks:
http_requests_total
http_request_duration_seconds
app_uptime_seconds
app_mode
chaos_active
These metrics show:
request throughput
HTTP status codes
request latency
application uptime
current deployment mode
active chaos state
Example test:
curl http://localhost:8080/metrics
A successful metrics output includes values like:
http_requests_total{method="GET",path="/healthz",status_code="200"}
http_request_duration_seconds_bucket
app_uptime_seconds
app_mode
chaos_active
This gave SwiftDeploy the visibility it needed before making promotion decisions.
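For illustration, the metric names above can be declared with prometheus_client roughly like this. The label names follow the sample output, while the numeric encoding of app_mode is an assumption:

# Sketch of the metric definitions using prometheus_client (names match the output above).
from prometheus_client import Counter, Histogram, Gauge, generate_latest

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status_code"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["method", "path"],
)
UPTIME = Gauge("app_uptime_seconds", "Seconds since the app started")
MODE = Gauge("app_mode", "Current deployment mode (assumed: 0 = stable, 1 = canary)")
CHAOS = Gauge("chaos_active", "Whether chaos injection is currently active")

# The /metrics route can simply return generate_latest() with the Prometheus text content type.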
The Guardrails: OPA Policy Enforcement
Stage 4B also added Open Policy Agent (OPA).
OPA became the policy decision engine. The important rule is that the CLI should not make the allow or deny decision by itself. The CLI collects data, sends it to OPA, and OPA returns a structured decision with reasons.
The policies live inside the policies/ directory:
policies/
├── infrastructure.rego
└── canary.rego
Each policy domain answers a different question.
Infrastructure Policy
The infrastructure policy answers:
Is the host safe enough for deployment?
It denies deployment if:
disk free space is below 10GB
CPU load is above 2.0
This is the hard gate for deployment. If the host does not meet the required safety threshold, swiftdeploy deploy fails before starting the stack.
The CLI sends host data to OPA, including disk and CPU information. OPA then returns a decision like:
Decision: ALLOW
- Infrastructure policy passed
or a denial reason such as:
Decision: DENY
- Disk free 5.20GB is below required 10GB
This makes failures clear to the operator.
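In practice this gate is one small call to the OPA data API. A minimal sketch using psutil and requests; the policy package path swiftdeploy/infrastructure and the input field names are assumptions that would have to match infrastructure.rego:

# Sketch of the infrastructure gate: gather host stats and ask OPA for a decision.
import psutil
import requests

def check_infrastructure(opa_url="http://localhost:8181"):
    disk = psutil.disk_usage("/")
    load1 = psutil.getloadavg()[0]  # 1-minute load average

    opa_input = {
        "input": {
            "disk_free_gb": disk.free / (1024 ** 3),
            "cpu_load": load1,
        }
    }
    # The package path is illustrative; it must match the package declared in infrastructure.rego.
    resp = requests.post(f"{opa_url}/v1/data/swiftdeploy/infrastructure", json=opa_input)
    return resp.json().get("result", {})  # e.g. {"allow": false, "deny": ["Disk free 5.20GB is below required 10GB"]}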
Canary Safety Policy
The canary policy answers:
Is the canary safe enough to promote?
It denies promotion if:
error rate is above 1%
P99 latency is above 500ms
Before promotion, SwiftDeploy scrapes /metrics, calculates the error rate and P99 latency, and sends that data to OPA.
This prevents promoting a canary that is already showing unhealthy behavior.
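The calculation itself only needs the counter and histogram samples from the Prometheus text output. A sketch of one way to do it with prometheus_client's parser; this is an approximation, not the exact SwiftDeploy code:

# Sketch: derive error rate and approximate P99 latency from the /metrics output.
import requests
from prometheus_client.parser import text_string_to_metric_families

def canary_stats(metrics_url="http://localhost:8080/metrics"):
    text = requests.get(metrics_url, timeout=5).text
    total = errors = 0.0
    buckets = {}  # histogram upper bound (le) -> cumulative count across label sets

    for family in text_string_to_metric_families(text):
        for s in family.samples:
            if s.name == "http_requests_total":
                total += s.value
                if s.labels.get("status_code", "").startswith("5"):
                    errors += s.value
            elif s.name == "http_request_duration_seconds_bucket":
                buckets[s.labels["le"]] = buckets.get(s.labels["le"], 0.0) + s.value

    error_rate = errors / total * 100 if total else 0.0

    # Approximate P99: the smallest bucket bound that covers 99% of all observations.
    p99_ms = 0.0
    observations = buckets.get("+Inf", 0.0)
    for le, count in sorted(buckets.items(), key=lambda kv: float(kv[0])):
        if observations and count >= 0.99 * observations:
            p99_ms = float(le) * 1000
            break
    return error_rate, p99_ms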
Why Policy Isolation Matters
Policy isolation is important because each policy should own one responsibility.
The infrastructure policy only checks host safety.
The canary policy only checks application safety before promotion.
This means a change to the canary policy does not require changing the infrastructure policy. Each domain owns one question and one set of data.
This also makes debugging easier. The CLI can show which policy passed or failed:
Policy Compliance:
- Infrastructure: PASS
- Canary: PASS
or:
Policy Compliance:
- Infrastructure: PASS
- Canary: FAIL
That makes the reason for blocking a deployment or promotion clear.
No OPA Leakage
OPA is added as a sidecar container in Docker Compose, but it is not routed through Nginx.
The OPA service is bound to localhost only:
ports:
  - "127.0.0.1:8181:8181"
This means the CLI can reach OPA, but users coming through the Nginx port cannot access the OPA API.
That satisfies the no-leakage requirement because Nginx only forwards user traffic to the application service, not to the policy engine.
Gated Deploy Flow
The new deploy flow is:
python swiftdeploy deploy
|
v
generate config files
|
v
start OPA sidecar
|
v
collect host stats
|
v
send input to OPA infrastructure policy
|
v
deploy only if OPA allows
A successful deploy shows:
OPA is reachable by swiftdeploy CLI
Policy domain: infrastructure
Decision: ALLOW
- Infrastructure policy passed
OK Stack is healthy --> http://localhost:8080
Gated Promote Flow
The new promote flow is:
python swiftdeploy promote canary
|
v
start/check OPA
|
v
scrape /metrics
|
v
calculate error rate and P99 latency
|
v
send input to OPA canary policy
|
v
promote only if OPA allows
A successful canary promotion shows:
Policy domain: canary
Decision: ALLOW
- Canary safety policy passed
OK Mode confirmed: canary
The canary response also includes:
X-Mode: canary
This makes it easy to confirm that the service is really running in canary mode.
The Chaos: Testing Slow and Error States
The API includes a /chaos endpoint that is active only in canary mode.
It supports slow mode:
{ "mode": "slow", "duration": 3 }
It also supports error mode:
{ "mode": "error", "rate": 0.5 }
Slow mode delays responses. Error mode causes some requests to return HTTP 500 responses.
This makes it possible to test whether the metrics endpoint, status command, and canary policy can detect unhealthy behavior.
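A simplified Flask-style sketch of what such an endpoint can look like; the real handler also checks that the app is in canary mode and flips the chaos_active gauge, which this sketch omits:

# Simplified sketch of a chaos endpoint (illustrative, not the exact SwiftDeploy handler).
import random
import time
from flask import Flask, request, jsonify

app = Flask(__name__)
chaos = {"mode": None, "duration": 0, "rate": 0.0}

@app.route("/chaos", methods=["POST"])
def set_chaos():
    cfg = request.get_json(force=True)
    chaos.update(mode=cfg.get("mode"),
                 duration=cfg.get("duration", 0),
                 rate=cfg.get("rate", 0.0))
    return jsonify(chaos)

@app.before_request
def apply_chaos():
    if request.path == "/chaos":
        return  # never inject chaos into the chaos control endpoint itself
    if chaos["mode"] == "slow":
        time.sleep(chaos["duration"])            # delay every response
    elif chaos["mode"] == "error" and random.random() < chaos["rate"]:
        return jsonify({"error": "chaos"}), 500  # fail a fraction of requests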
Example request:
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d "{\"mode\":\"error\",\"rate\":0.5}"
After injecting chaos, repeated requests increase the error count in /metrics. Then python swiftdeploy status can show a higher error rate or latency.
Example status view:
Requests: 30
Error Rate: 5.0%
P99 Latency: 700ms
Policy Compliance:
- Infrastructure: PASS
- Canary: FAIL
This is the purpose of canary safety. The system should detect the failure before allowing promotion.
Status Command
I added:
python swiftdeploy status
The status command scrapes /metrics, calculates live request statistics, checks policy compliance, and appends every scrape to:
history.jsonl
The command displays information like:
Requests: 8.0
Error Rate: 0.0%
P99 Latency: 75.0ms
Policy Compliance:
- Infrastructure: PASS
- Canary: PASS
This gives the operator a terminal dashboard for the system.
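Because history.jsonl is append-only JSON Lines, each status run only has to add one record. A sketch of that append step; the field names are illustrative:

# Sketch: append one status snapshot to history.jsonl (field names are illustrative).
import json
from datetime import datetime, timezone

def record_status(stats: dict, path="history.jsonl"):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "event": "status",
        **stats,  # e.g. requests, error_rate, p99_latency_ms, policy results
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")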
Audit Command
I also added:
python swiftdeploy audit
This command reads history.jsonl and generates:
audit_report.md
The audit report is written in GitHub Flavored Markdown and includes:
deployment timeline
policy checks
mode changes
status scrapes
policy violations
Example structure:
# SwiftDeploy Audit Report
## Timeline
| Time | Event | Summary |
|---|---|---|
| 2026-05-06T23:00:16+00:00 | deploy | Deploy in stable mode |
| 2026-05-06T23:07:36+00:00 | mode_change | canary -> stable |
## Policy Violations
No policy violations recorded.
This gives the project traceability. Instead of only knowing the current state, I can review what happened during the deployment and testing process.
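Generating the report is then a single pass over history.jsonl. A rough sketch of building the timeline table; the event and field names follow the example above but are otherwise assumptions:

# Sketch: build a Markdown timeline from history.jsonl (simplified).
import json

def write_audit_report(history_path="history.jsonl", report_path="audit_report.md"):
    lines = ["# SwiftDeploy Audit Report", "", "## Timeline", "",
             "| Time | Event | Summary |", "|---|---|---|"]
    with open(history_path) as f:
        for raw in f:
            if not raw.strip():
                continue
            entry = json.loads(raw)
            lines.append(f"| {entry.get('timestamp')} | {entry.get('event')} | {entry.get('summary', '')} |")
    with open(report_path, "w") as out:
        out.write("\n".join(lines) + "\n")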
Debugging Journey: Problems I Found and Fixed
While testing Stage 4B, I encountered some realistic DevOps issues. These issues helped me understand the importance of testing, logs, and container permissions.
1. Docker Desktop Was Not Running
At one point, Docker commands failed because Docker Desktop was not running.
The error showed that the Docker API could not be reached.
The fix was to open Docker Desktop, wait until the daemon was running, and then rebuild the image.
This reminded me that local DevOps workflows depend on the container runtime being available before running Docker commands.
2. PowerShell Tried to Run Python Code
While adding the /metrics route, I accidentally pasted Python code directly into PowerShell.
PowerShell tried to interpret lines like:
@app.route("/metrics")
def metrics():
as PowerShell commands, which caused parser errors.
The fix was to write the full Python file using Set-Content and a PowerShell here-string:
@'
# Python code here
'@ | Set-Content -Path app\main.py -Encoding UTF8
This made the update safer and avoided broken copy-and-paste edits.
3. Missing Canary Policy
During implementation, I created the infrastructure policy first. Later, I added the canary policy separately.
The canary policy was important because Stage 4B required at least two separate Rego policies. One policy handles infrastructure safety, while the other handles canary safety.
This improved the policy design and matched the requirement that policy domains should be separated.
4. Nginx Was Restarting Because of Dropped Capabilities
The most important issue I fixed was with Nginx.
I used:
cap_drop:
  - ALL
This is a security best practice, but Nginx still needed some capabilities during startup. The container kept restarting with this error:
chown("/var/cache/nginx/client_temp", 101) failed (1: Operation not permitted)
The service container was healthy and OPA was running, but Nginx could not stay up. Because of that, curl http://localhost:8080/healthz failed.
The fix was to keep cap_drop: ALL, but add back only the capabilities Nginx needed:
cap_add:
  - NET_BIND_SERVICE
  - CHOWN
  - SETUID
  - SETGID
After this fix, the stack became healthy:
OK Stack is healthy --> http://localhost:8080
The containers also showed the correct state:
nginx Up
opa Up on 127.0.0.1:8181
service Up and healthy
This was a good lesson in balancing security and functionality. Dropping all capabilities is safe, but some containers need a small set of capabilities to start correctly.
Final Verification
After fixing the Nginx capability issue, I tested the complete flow again.
The health endpoint worked:
curl http://localhost:8080/healthz
Response:
{"mode":"stable","status":"ok","uptime_seconds":2.85}
The metrics endpoint worked:
curl http://localhost:8080/metrics
It returned Prometheus metrics including:
http_requests_total
http_request_duration_seconds
app_uptime_seconds
app_mode
chaos_active
The status command worked:
python swiftdeploy status
Output included:
Policy Compliance:
- Infrastructure: PASS
- Canary: PASS
The audit command worked:
python swiftdeploy audit
Output:
OK audit_report.md generated
Nginx access logs also showed the required pipe-delimited format:
2026-05-06T23:00:16+00:00 | 200 | 0.045s | 172.20.0.3:3000 | GET /healthz HTTP/1.1
How to Run the Project
Clone the repository:
git clone https://github.com/10Johnny/swiftdeploy-stage4a.git
cd swiftdeploy-stage4a
Install CLI dependencies:
pip install pyyaml jinja2 requests psutil
Build the service image:
docker build -t 10johnny-swiftdeploy-stage4b:latest .
Generate the infrastructure files:
python swiftdeploy init
Deploy the stack:
python swiftdeploy deploy
Check health:
curl http://localhost:8080/healthz
Check metrics:
curl http://localhost:8080/metrics
Promote to canary:
python swiftdeploy promote canary
Check canary header:
curl -I http://localhost:8080/
Run the status dashboard:
python swiftdeploy status
Generate an audit report:
python swiftdeploy audit
Tear down:
python swiftdeploy teardown --clean
Repository Structure
manifest.yaml
swiftdeploy
Dockerfile
README.md
app/
templates/
policies/
Generated files are created in the root folder:
nginx.conf
docker-compose.yml
Audit files include:
history.jsonl
audit_report.md
Lessons Learned
This project taught me that deployment automation is more than starting containers.
I learned how to:
generate infrastructure files from templates
use Docker Compose to manage multiple services
use Nginx as a reverse proxy
expose Prometheus-style metrics
use OPA for policy decisions
separate policy logic from CLI logic
build policy-gated deploy and promote flows
debug container permission issues
create audit reports from deployment history
The biggest lesson was that a deployment tool should provide safety, visibility, and traceability.
A good deployment tool should not only ask:
Can I start the containers?
It should also ask:
Is the host healthy?
Is the canary safe?
What policy allowed this action?
What happened during deployment?
Can I prove it later?
SwiftDeploy Stage 4B helped me understand how observability and policy enforcement make deployments safer and more reliable.
Conclusion
SwiftDeploy started in Stage 4A as a declarative deployment CLI. In Stage 4B, I extended it with metrics, OPA policy checks, canary safety, chaos testing, status monitoring, and audit reporting.
The final result is a deployment tool that can generate its own infrastructure files, observe the running service, enforce safety rules, and keep a record of what happened.