Melvin Bucio

How I Built a Free Video & Audio Tool Suite for $20/Month

I got tired of video editing tools that either charged money, added watermarks, or made you create an account just to do something simple like remove silence from a recording.
So five weeks ago I built my own. And then kept building. It's now a full video and audio tool suite with 8 tools, getting organic traffic from Google, ChatGPT, and Copilot, all for about $16-20/month in infrastructure costs.
Here's exactly how it's built.

The stack

Frontend: Pure static HTML on Vercel. No React, no Next.js, no build step. Just HTML, Tailwind CDN, and vanilla JS. Vercel's free tier handles it.
Backend: FastAPI on Railway with 2 worker replicas
Queue: Redis + ARQ (async job queue for Python)
Storage: Cloudflare R2 with 1-day lifecycle rules
Processing: FFmpeg for everything

Monthly cost breakdown:

Railway (2 replicas): ~$10-12
Cloudflare R2: ~$1-2
Redis (Railway): ~$3-5
Vercel: $0
Total: $16-20/month

How the processing pipeline works
Every tool follows the same pattern:

1. User uploads a file to FastAPI via the browser
2. FastAPI streams it to Cloudflare R2
3. FastAPI enqueues a job in Redis via ARQ
4. A worker replica picks up the job, downloads the file from R2, runs FFmpeg, uploads the result back to R2
5. Frontend polls /status every 500ms until the job is complete
6. User gets a presigned download URL, file auto-deletes after 15 minutes

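The worker side of that loop can be sketched roughly like this. Everything here is an assumption about the wiring, not the actual code: `ctx["storage"]` stands in for a thin wrapper over the R2 client, `ctx["redis"]` for the status store, and `output_key` for whatever key layout the real service uses.

```python
import asyncio
from pathlib import Path


def output_key(job_id: str, ext: str) -> str:
    # Hypothetical R2 key layout: results/<job_id>.<ext>
    return f"results/{job_id}.{ext}"


async def process_job(ctx: dict, job_id: str, in_key: str) -> str:
    """One job: download from R2, run FFmpeg, upload, write status."""
    src = Path(f"/tmp/{job_id}-in")
    dst = Path(f"/tmp/{job_id}-out.mp4")

    # boto3 is synchronous, so every R2 call goes through a thread
    await asyncio.to_thread(ctx["storage"].download_file, in_key, str(src))

    # Run FFmpeg as a subprocess; real args depend on the tool
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg", "-y", "-i", str(src),
        "-c:v", "libx264", "-crf", "23", str(dst),
    )
    await proc.wait()

    key = output_key(job_id, "mp4")
    await asyncio.to_thread(ctx["storage"].upload_file, str(dst), key)
    await ctx["redis"].set(f"status:{job_id}", "done")
    return key
```

The frontend only ever sees the UUID job id and the status key, which is what makes the whole thing stateless.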
The interesting technical decisions
No user accounts, ever
This was a deliberate architectural decision, not just a UX choice. No accounts means no user database, no sessions, no auth system, no password resets, no GDPR compliance headaches, no data breach liability. Every request is stateless. Files are identified by a UUID job ID, not a user ID.
The tradeoff is you can't offer saved history or preferences. That's fine for a free utility tool. People just want to process a file and leave.
15-minute file deletion
Files are deleted from R2 automatically via lifecycle rules after 1 day, but a cleanup job also runs 15 minutes after each job completes. This isn't just a privacy feature. It keeps R2 storage costs near zero since files never accumulate.
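A deferred cleanup like that maps naturally onto ARQ, whose `enqueue_job` accepts a `_defer_by` delay. The function names and key layout below are illustrative, not the real ones:

```python
import asyncio
from datetime import timedelta

CLEANUP_DELAY = timedelta(minutes=15)


async def enqueue_cleanup(redis, job_id: str) -> None:
    # arq's enqueue_job can defer execution; the worker runs it 15 min later
    await redis.enqueue_job("delete_job_files", job_id, _defer_by=CLEANUP_DELAY)


async def delete_job_files(ctx: dict, job_id: str) -> None:
    # Belt and braces: the 1-day R2 lifecycle rule catches anything this misses
    for key in (f"uploads/{job_id}", f"results/{job_id}"):
        await asyncio.to_thread(ctx["storage"].delete_object, key)
```

The lifecycle rule is the backstop; the deferred job is what actually keeps storage near zero.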
max_jobs=1 per worker
ARQ supports concurrent jobs per worker, but FFmpeg is CPU-bound. Running two FFmpeg processes on the same Railway instance causes them to compete for CPU and both slow down. Setting max_jobs=1 means each worker processes one file at a time. With 2 replicas you get 2 simultaneous jobs, enough for current traffic and easy to scale by adding replicas.
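In ARQ that's a one-attribute change on the worker settings class. A minimal sketch (the timeout value is an assumption, and `process_job` here is just a placeholder for the real job function):

```python
async def process_job(ctx: dict, job_id: str, key: str):
    ...  # download from R2, run FFmpeg, upload the result (defined elsewhere)


class WorkerSettings:
    # arq reads these class attributes when the worker boots
    functions = [process_job]
    max_jobs = 1        # one FFmpeg process per replica, never two competing
    job_timeout = 600   # generous ceiling for long videos (assumed value)
```

Concurrency then scales horizontally: two replicas, two simultaneous jobs.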
The boto3 mistake that caused 504s
This one took me an embarrassingly long time to catch.
boto3 (the AWS/R2 SDK) is synchronous. My first version called it directly inside a FastAPI async endpoint. Under any meaningful upload load, the event loop blocked while the file transferred to R2, requests piled up, and the server started returning 504s.
The fix was one line:
await asyncio.to_thread(storage.upload_file, tmp_path, key)
This runs the blocking boto3 call in a thread pool, freeing the event loop to handle other requests during the upload. Obvious in hindsight, painful to debug live.
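You can see the effect without boto3 at all. Here `time.sleep` stands in for the synchronous upload: called directly, it stalls the event loop; wrapped in `asyncio.to_thread`, three "uploads" overlap instead of serializing.

```python
import asyncio
import time


def blocking_upload(path: str, key: str) -> str:
    # Stand-in for boto3's synchronous upload_file call
    time.sleep(0.2)
    return key


async def bad_handler() -> str:
    # Calling it directly blocks the event loop for the whole transfer
    return blocking_upload("/tmp/clip.mp4", "uploads/clip.mp4")


async def good_handler() -> str:
    # The one-line fix: run the blocking call in the default thread pool
    return await asyncio.to_thread(blocking_upload, "/tmp/clip.mp4", "uploads/clip.mp4")


async def main() -> float:
    start = time.monotonic()
    # Three concurrent "requests" finish in ~0.2 s instead of ~0.6 s
    await asyncio.gather(good_handler(), good_handler(), good_handler())
    return time.monotonic() - start
```

Swap `good_handler` for `bad_handler` and the total jumps to roughly three times as long, which under real load is exactly how requests pile up into 504s.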
The silence removal pipeline is not one command
A naive implementation uses a single silenceremove filter:
ffmpeg -i input.mp4 -af silenceremove=stop_periods=-1:stop_duration=0.5:stop_threshold=-35dB output.mp4
This works but gives you very little control over cut padding and re-encode behavior. The actual implementation uses a two-pass approach: silencedetect to find the boundaries, then segment-cut and re-stitch with concat. More code but much better results, especially for speech with natural breath gaps you want to preserve.
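A rough sketch of the two-pass idea: parse `silencedetect`'s log lines into keep-segments (with padding around each cut so breaths survive), then build a trim-and-concat filtergraph for the second FFmpeg pass. The `pad` default and helper names are my assumptions, not the site's actual code.

```python
import re


def keep_segments(log: str, total: float, pad: float = 0.15) -> list[tuple[float, float]]:
    """Turn silencedetect log output into (start, end) spans to keep."""
    starts = [float(m) for m in re.findall(r"silence_start: ([\d.]+)", log)]
    ends = [float(m) for m in re.findall(r"silence_end: ([\d.]+)", log)]
    segments, cursor = [], 0.0
    for s, e in zip(starts, ends):
        if s - pad > cursor:
            # Keep speech up to just past where silence begins
            segments.append((cursor, min(s + pad, total)))
        # Resume just before the silence ends
        cursor = max(cursor, e - pad)
    if cursor < total:
        segments.append((cursor, total))
    return segments


def concat_filter(segments: list[tuple[float, float]]) -> str:
    """Build an ffmpeg filter_complex that trims each span and concatenates."""
    parts, labels = [], ""
    for i, (s, e) in enumerate(segments):
        parts.append(f"[0:v]trim={s}:{e},setpts=PTS-STARTPTS[v{i}];"
                     f"[0:a]atrim={s}:{e},asetpts=PTS-STARTPTS[a{i}];")
        labels += f"[v{i}][a{i}]"
    parts.append(f"{labels}concat=n={len(segments)}:v=1:a=1[v][a]")
    return "".join(parts)
```

The first pass is just `ffmpeg -i input.mp4 -af silencedetect=noise=-35dB:d=0.5 -f null -` with the log captured from stderr; the padding is what preserves natural breath gaps.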

What each tool taught me

Remove silence: Two-pass silence detection beats single-command filters for speech content
Extract audio: libmp3lame at 192k constant bitrate is the right default for spoken audio. VBR (-q:a) is fine for music but creates surprises when input is voice with quiet sections
Compress video: CRF 23 with libx264 veryfast is the sweet spot for quality vs speed on Railway's hardware
Mute video: Stream copy (-c:v copy -an) makes muting essentially instant since no re-encode is needed
Trim video: Re-encoding is worth the extra seconds vs stream copy. With stream copy, cuts snap to the nearest keyframe, producing off-by-seconds errors that users notice
Video to GIF: Palette generation in a single FFmpeg invocation matters far more than I expected. Without it, GIF banding is obvious even at small sizes
Resize to 9:16: The blur bars effect requires splitting into two streams in the filter graph. Non-obvious but produces much better results than black bars
MP4 to MP3: Same backend as extract audio, different SEO surface. One FFmpeg function, two landing pages, two different keyword clusters
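For the GIF point specifically, the single-invocation palette trick looks like this: split the stream, generate a palette from one branch, and apply it to the other in the same filtergraph. A sketch of the command builder (the fps/width defaults are assumptions):

```python
def gif_cmd(src: str, dst: str, fps: int = 12, width: int = 480) -> list[str]:
    # One FFmpeg invocation: split, build a palette, then map colors with it
    graph = (f"[0:v]fps={fps},scale={width}:-1:flags=lanczos,split[a][b];"
             "[a]palettegen[p];[b][p]paletteuse")
    return ["ffmpeg", "-y", "-i", src, "-filter_complex", graph, dst]
```

Without the palettegen/paletteuse pair, FFmpeg falls back to a generic 256-color palette, which is where the banding comes from.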

The SEO side
Since the goal is organic traffic, I put real effort into the SEO infrastructure from day one: FAQPage, HowTo, and WebApplication JSON-LD schema on every tool page, comparison pages targeting "free Descript alternative" queries, blog posts targeting long-tail keywords, and Spanish versions of all pages.

Search Console performance after 5 weeks

Five weeks in: 27 clicks in Google Search Console, average position 5.5, and organic referrals already coming in from ChatGPT and Copilot.
The AI referrals were the most surprising thing. ChatGPT and Copilot were recommending the site before Google was sending meaningful traffic. Plain-language description paragraphs on each tool page ("This tool removes silence from video and audio. Upload your file and download a clean version in seconds.") seem to help AI systems understand and cite the tools accurately.

What surprised me overall
Static HTML is underrated. No build pipeline, no framework updates, no hydration errors, instant Vercel deploys. For a tool suite where each page is mostly the same structure with different copy, it's the right call. I would make the same decision again.
The other surprise was how well the cost scales. Eight tools, all running through the same pipeline, for $16-20/month. FFmpeg does the heavy lifting and Railway scales horizontally by just adding replicas. The unit economics are genuinely good.

What's next
More tools. Each one is a new keyword cluster, a new internal link target, and a new surface for AI citation. The goal is eventually a comprehensive free video utility suite, the way iLovePDF did it for PDF tools, built one tool at a time.
If you're building something similar or have questions about the FFmpeg pipeline, the ARQ setup, or the R2 lifecycle configuration, drop a comment. Happy to go deeper on any of it.
You can try what I built at vidclean.net.
