William

Posted on May 9

Gemma 4 on Real Hardware: Local Inference, Cloud API, and a Three-Tier Architecture That Actually Works

#gemmachallenge #gemma #devchallenge #webdev


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Let me be straight with you before we get into any of this.

I am 58 years old. Retired. I build products for enjoyment — income is just a by-product of the effort. ( and yes that took me a long time to figure out, but once you do, everything changes ) I live in Minesing, Ontario. I have a cottage in Bracebridge. I work on projects at night, usually with a smoke and in my housecoat. ( don't judge me, it works )

I have no computer science degree, no coding background, and no team. I am what people in this community politely call a "vibe coder" — which is a kind way of saying I build things with AI tools and a terminal and a lot of stubbornness. And push through. ( and push back ;) ) Stubbornness is 100% non-negotiable when you're working long hours with different AIs at 58. The tools are incredible and they will also confidently break your app and tell you it's your fault. Every. Single. Time. ( you get used to it, I promise )

Over the last year I built three live products from scratch:

Events Arena — a live sports prediction platform with Soccer, NHL, and expandable arenas for any type of event, from organized sports to internet creator events ( yes that last part is a whole thing now )
SaaS Price DB — a REST API tracking pricing data for 1,000+ B2B SaaS tools
TripSync — an AI travel planner that takes plain-English trip descriptions and returns real destinations with flight estimates and booking links

Zero users on most of them. I wrote about that honestly on May 9th. ( zero judgement if you go read it, it's a fun honest post ) But here's what I didn't fully explain in that post:

I built TripSync for myself.

Every winter I travel for 3-4 months. Multiple countries. Multiple cities. I move around — I don't sit in one place and call it a vacation. Canadian winters are not for me anymore, and we've done Mexico, the States, and DR. ( Canadian winters, not for me, hard pass, no thank you, never again ) Planning those trips used to mean hours of browser tabs, scattered booking sites, trying to hold flight prices and hotel options and visa requirements in my head at the same time — while also figuring out what order to visit cities in so the flights make sense and we get the best experience at the lowest cost. Premium economy for long flights, economy for short ones. ( yes I have a system, don't we all )

TripSync was my solution to my own problem. I am the user. I use it to plan my own travel — this coming January through May 2027 and any trips between now and then. Whether anyone else ever uses it is a question I'm still working on. ( working on it, not losing sleep over it, there is a difference )

This post is about what I tried to add to it during the Gemma 4 Challenge, and what I actually found. Not what I hoped to find. What I actually found. ( spoiler: I was surprised )

The Fear Every Solo Builder Has But Won't Say Out Loud

I'll be honest about why I'm here because that's the kind of post this is.

I'm in this contest for three reasons: the prize money, the exposure for my projects, and genuinely wanting to help other builders who are in the same position I was six months ago. All three matter. None of them are secret. ( refreshing right? )

But the thing that got me paying attention to Gemma 4 specifically? The API bill.

I try to build all my projects at zero budget. When a project makes some income, I reinvest a part of that into growth. If it makes nothing, I don't put myself in a worse position than I was before the project existed. That's the rule. ( simple rule, hard to follow when you're excited about an idea at 2am, but still the rule )

Every search on TripSync hits an AI API. Every destination card, every itinerary, every refinement — that's a call to a cloud model. Right now I'm on free tiers. But free tiers end. Traffic grows. And suddenly the thing you built for enjoyment is costing you money before it's made you a cent. ( we've all seen this movie and it doesn't end well )

So when the Gemma 4 Challenge came up — Google's open model family that runs locally, for free, no API calls, no data leaving the device — I thought: this is the exact problem I think about. Let me actually try it on a real app.

What Gemma 4 Is, In Plain English

Skip this if you already know. Stay if you're like me and need someone to explain it without assuming a PhD is hiding in your back pocket. ( it's not, I checked )

Gemma 4 is a family of AI models Google has released as open weights. You download the model and run it on your own hardware. No subscription. No per-token charges. No sending user data to a third-party server. ( yes, really, I know, wild )

Four models in the family:

E2B / E4B — tiny, run on phones and Raspberry Pis ( yes a Raspberry Pi, yes that little $75 thing )
12B — sweet spot for most developer laptops
27B dense — the workhorse, needs more RAM
26B MoE — efficient, built for high-throughput reasoning

I have a MacBook Pro M1 with 16GB of unified memory. Not a beast of a machine. The kind of setup a lot of solo builders actually have sitting on their desk next to a cold coffee. ( always a cold coffee )

The question I wanted to answer: can Gemma 4 run meaningfully on real hardware that real people actually own?

Choosing the Right Model

An M1 MacBook Pro with 16GB of unified memory has a practical ceiling. Push past it and the system starts swapping to SSD, which turns a 1-second response into a several-minute crawl. Not usable. ( I tried to ignore this ceiling, the ceiling won )

I pulled gemma4 via Ollama — the full default model, 9.6GB. On my M1 with 16GB unified memory, this is right at the edge of what's practical, and that's intentional. I wanted the real performance story at the limit of consumer hardware, not on some server with 64GB of RAM that none of us have. ( none of us have that, right? right? )

What I didn't fully appreciate until I tried it: Apple Silicon unified memory isn't traditional RAM. The CPU and GPU share the same pool, so the GPU can work on the model weights without copying them into separate video memory, and Ollama runs inference on that GPU through Apple's Metal. The model performs noticeably better on M1 than the spec sheet implies. ( Apple does some things right, I'll give them that )

The choice wasn't "what's the biggest model I can technically load." It was "what's the right model for hardware a solo builder actually owns." That's a different question, and I think it's the more interesting one.

What I Actually Built — And What I Didn't

I want to be honest here because I've read enough contest posts that oversell what was built. ( we all have, no names, you know who you are ) I'm not going to do that to you.

TripSync runs on Flask on Render. The public site at tripsync-ilao.onrender.com runs three AI modes — Cloud AI via Groq, Gemma 4 Expert via the Gemini API, and Local AI via Ollama on your own machine. Live users can experience Gemma 4 right now in Expert mode without any local setup.

The local Ollama mode requires cloning the repo and running it yourself. That's the honest truth about what "local" means — it runs on your machine, not mine.

Here's what I added:

1. Pulled Gemma 4 via Ollama

ollama pull gemma4

9.6GB. I watched the progress bar and thought — this is the whole model. On my laptop. No monthly bill. ( I actually said "wild" out loud to nobody at 2pm on a Saturday, the housecoat was involved )

2. Added dedicated local and Gemma API endpoints in server.py

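from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/api/tripsync-local', methods=['POST'])
def tripsync_local():
    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True) or {}
    prompt = build_prompt(data)
    result = call_ollama(prompt)
    # An empty result means Ollama did not answer, so report the service as unavailable
    if not result:
        return jsonify({"error": "Local AI unavailable — is Ollama running?"}), 503
    parsed = extract_json_safe(result)
    # The model answered, but no usable JSON could be pulled out of its text
    if not parsed:
        return jsonify({"error": "Could not parse response"}), 500
    return jsonify(parsed)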
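The endpoint leans on two helpers that aren't shown in this post, `call_ollama` and `extract_json_safe`. Below is a minimal sketch of what they might look like, assuming Ollama is listening on its default local address and using its `/api/generate` endpoint with streaming turned off. The function bodies, the timeout, and the default arguments are my guesses for illustration, not the versions in the repo.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def call_ollama(prompt, model="gemma4", url=OLLAMA_URL, timeout=120):
    """Send one prompt to a local Ollama server; return the text, or None on failure."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON
        return None
    return body.get("response") if isinstance(body, dict) else None


def extract_json_safe(text):
    """Return the first JSON object found in model output, or None if there isn't one."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at index i and ignores what follows
            obj, _ = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            continue
        return obj
    return None
```

Scanning for the first `{` matters because models often wrap their JSON in a friendly sentence or two, and a plain `json.loads` on the whole reply would fail on that.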

3. Added a three-mode toggle to the UI

One click cycles through Cloud AI, Gemma 4 Expert, and Local AI. Persists in localStorage. ( one click, that's it, I love when things are actually simple )

Three changes. Three modes. I'm a vibe coder — if I can't understand what I'm building, I can't build it. Simple on purpose. Always. ( complexity is not a flex, it's a future problem you're creating for yourself )

What Actually Happened When I Ran It

Like anything in life, the first run is the slow one: the model has to be loaded from disk into memory before it can answer. Cold start on a 9.6GB model takes 30-45 seconds. The spinner spins. You sit there. You wonder if you broke something. ( you probably did break something, but not this time )

Then it responds.

And after that it's just use and maintain. Once Gemma 4 is warm in memory: under 1 second per query.
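One way to hide that first-load wait is to ask Ollama to load the model before any real query arrives. Ollama's generate endpoint loads a model into memory when it gets a request with no prompt, and the `keep_alive` field controls how long the model stays loaded afterwards. The sketch below assumes the default local address; the function names and the 30-minute value are hypothetical choices, not something TripSync does today.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_prewarm_payload(model="gemma4", keep_alive="30m"):
    """Body for a load-only request: no prompt, so Ollama just loads the model."""
    return {"model": model, "keep_alive": keep_alive}


def prewarm(url=OLLAMA_URL, timeout=120, **kwargs):
    """Ask Ollama to load the model now; return True if it answered, False otherwise."""
    data = json.dumps(build_prewarm_payload(**kwargs)).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Ollama isn't running or didn't respond in time; the app can still start
        return False
```

Calling `prewarm()` once when the Flask app starts would move the 30-45 second load to startup instead of the first search.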

I sat back on a Saturday afternoon with the light coming through the window and just looked at the screen for a moment. A model running entirely on my MacBook — no internet required for the inference — returning detailed travel itineraries faster than a cloud API call. ( I may have done a small fist pump, I'm not confirming or denying )

I tested it with Thailand. Seven days, solo traveller, flying from Toronto, budget accommodation. Chiang Mai came back as the top result. I've actually been to Chiang Mai. The recommendations — Doi Suthep sunrise, Khao Soi noodles, the elephant sanctuary, the night bazaar — those are real. That captures the vibes of that city in a way that only someone who's actually been there would recognize. ( Gemma 4 has not been to Chiang Mai, and yet, somehow, it gets it )

I expected to write a post about why it didn't quite work. That's not the post I'm writing. ( surprised myself on this one )

The API Cost Math

On Groq's free tier: roughly 250-333 TripSync searches per day before hitting limits. For a travel app with real traffic that's nothing — a slow Tuesday afternoon. ( a very slow Tuesday )

With Gemma 4 running locally: unlimited. Zero marginal cost. The only constraint is hardware I already own.
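A range like 250-333 searches per day is what you get when you divide a daily token allowance by the tokens one search uses. The figures in this sketch are assumptions picked to show the arithmetic (a 500,000-token daily allowance and 1,500 to 2,000 tokens per search); they are not Groq's published limits or measured TripSync usage.

```python
def searches_per_day(daily_token_limit, tokens_per_search):
    """How many whole searches fit inside a daily token allowance."""
    return daily_token_limit // tokens_per_search


# Assumed figures, for illustration only
DAILY_TOKENS = 500_000

print(searches_per_day(DAILY_TOKENS, 2_000))  # heavier search: prints 250
print(searches_per_day(DAILY_TOKENS, 1_500))  # lighter search: prints 333
```

The same function makes the local case obvious: with no daily token limit, there is nothing to divide.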

The honest caveat: my local Gemma 4 runs on my MacBook, not on Render's servers. Getting it in front of real public users as a true local experience requires either a VPS with enough RAM or the user running it themselves. What I've proven is that Gemma 4 is production-quality for this use case. The remaining work is infrastructure, not model capability. ( important distinction, write that down )

The Privacy Thing I Didn't Expect to Care About

I travel a lot. I type things like "romantic trip for two, budget $4,000, leaving Toronto in January, somewhere warm, private villa preferred" into travel tools.

That's personal. Budget. Travel dates. Who I'm with. Where I'm going.

Right now the cloud modes send that to an API. Those services have privacy policies. I'm not suggesting anything nefarious. But the data leaves my machine. ( it just does, that's the reality )

When I ran the same search in Local AI mode, knowing it went from browser to local Flask to Ollama on my own hardware and back without ever touching an external server, I felt something I didn't expect.

Relief. Genuine relief.

There's a version of TripSync I want to build where users who care about privacy actually have it as a real, working, verifiable option. Not a marketing claim. Gemma 4 makes that possible. Given that I built this tool for my own travel planning, that's not abstract to me. It's personal. ( built it for me, privacy matters to me, simple )
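A small step toward "verifiable" could be a check that the AI endpoint the app is configured to call really is on the same machine. This is a sketch under my own assumptions: the function name is hypothetical, and it only inspects the configured URL. It does not prove that nothing else on the machine sends data out.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url):
    """True if the URL's host is this machine (localhost or a loopback IP)."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        # By convention this name points at the local machine
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP address, so it's some other hostname out on the network
        return False


print(is_loopback_url("http://localhost:11434/api/generate"))  # prints True
print(is_loopback_url("https://api.groq.com/openai/v1"))       # prints False
```

An app could run this at startup and only show a "stays on your device" label when it returns True.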

What's Next

Pre-warm on server start — eliminate that cold-start wait for local mode
Performance comparison UI — show users side-by-side response times, let them choose with real information
Production local deployment — VPS with enough RAM to serve Gemma 4 to real public users without local setup ( the goal, working toward it )

The live app: tripsync-ilao.onrender.com
The code: github.com/Tripsync-justmeMedia/tripsync

What I'd Actually Tell Another Builder

Not a motivational poster. Real advice.

What do you enjoy?

Build something around that. Once that's done — if you enjoyed the process of building it, then building anything is possible for you. Keep going. If you didn't enjoy it, maybe the answer isn't more building. Maybe it's marketing, or social media, or brand building. Or maybe nothing digital at all — maybe it's time to focus on physical things. A real community with real people. Writing stories, keeping a blog. Getting outside and touching grass. ( seriously, touch grass, it resets everything )

When things feel tough, step away. Take a breath. It resets you mentally, and that's usually when the right choices rise to the top. ( I have learned this the hard way more than once )

The point isn't to become a developer. The point is to find what gives you energy and do more of that.

For me, it's building things at night in my housecoat in Minesing with Gemma 4 running on my MacBook. Zero API bill. Zero regrets.

Be good to yourself and others. ☮️

William Commu — Just Me Media
Minesing, Ontario | Cottage in Bracebridge
@nightowl on DEV
TripSync: tripsync-ilao.onrender.com
