If this is useful, a ❤️ helps others find it.
I've shipped 7 Mac apps in the past year. Every AI feature in them runs on free tools.
Here's the exact stack — what I use, why, and where the limits are.
Cloud AI: Gemini API (Google AI Studio)
What: Gemini 2.5 Flash Preview via REST API
Cost: Free tier — 500 requests/day, no credit card
Use for: Log diagnosis, document analysis, text classification, anything needing strong reasoning
The free tier is genuinely sufficient for developer tools with intermittent AI use. I've never hit the daily limit in normal usage.
Get a key at aistudio.google.com — takes 2 minutes.
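A minimal request is just one curl call. This is a sketch, not my exact production code; the model id `gemini-2.5-flash` and the prompt text are assumptions — check aistudio.google.com for the current preview id:

```shell
# Build the request payload (prompt text is just an example)
cat > /tmp/gemini_request.json <<'EOF'
{
  "contents": [
    { "parts": [ { "text": "Diagnose this crash log: signal SIGABRT in main thread" } ] }
  ]
}
EOF

# Call the API only if a key is set (get one at aistudio.google.com)
if [ -n "${GEMINI_API_KEY:-}" ]; then
  curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" \
    -H "x-goog-api-key: $GEMINI_API_KEY" \
    -H "Content-Type: application/json" \
    -d @/tmp/gemini_request.json
fi
```

The response comes back as JSON with the generated text nested under `candidates`; for app use you'd parse that rather than print it raw.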
Local AI: Ollama
What: Run open-source LLMs locally
Cost: Free, open source
Use for: Privacy-sensitive processing, offline use, high-volume tasks
# Install
brew install ollama
# Pull a model
ollama pull gemma2
# Run
ollama run gemma2
Models I actually use:
- gemma2: good general reasoning, runs on 8GB RAM
- qwen2.5-coder:1.5b: fast code autocomplete, tiny footprint
- qwen3:8b: best quality/size ratio for chat
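Beyond the CLI, Ollama exposes a local REST API on port 11434, which is how an app would typically call it. A minimal sketch (the model name and prompt are placeholders; the request only fires if the daemon is running):

```shell
# Build a request payload for Ollama's /api/generate endpoint
cat > /tmp/ollama_request.json <<'EOF'
{
  "model": "gemma2",
  "prompt": "Classify this log line as ERROR, WARNING, or INFO: connection refused",
  "stream": false
}
EOF

# Only call the API if the Ollama daemon is actually up
if curl -sf http://localhost:11434/api/version > /dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate -d @/tmp/ollama_request.json
fi
```

Setting `"stream": false` returns one JSON object instead of a token stream, which is simpler to handle from a desktop app.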
Local AI for Code: Ollama + Continue.dev
What: VS Code extension that uses your local Ollama models for autocomplete
Cost: Free
Use for: Code completion without sending code to any cloud
// .continue/config.json
{
"models": [{
"title": "Qwen 2.5 Coder",
"provider": "ollama",
"model": "qwen2.5-coder:1.5b"
}],
"tabAutocompleteModel": {
"title": "Autocomplete",
"provider": "ollama",
"model": "qwen2.5-coder:1.5b"
}
}
OCR: Apple Vision Framework
What: macOS built-in OCR engine
Cost: Free, ships with every Mac
Use for: Extracting text from scanned PDFs, images
No API call. No internet. Runs entirely on-device via a Swift sidecar in Tauri.
The decision rule
Data is sensitive (medical, legal, financial)?
→ Ollama or Apple Vision (local only)
Need strong reasoning, complex analysis?
→ Gemini API (with PII filtering)
Need code autocomplete?
→ Ollama + Continue.dev
Need OCR on Mac?
→ Apple Vision Framework
Total monthly cost
$0.
Not "free tier with a credit card on file." Actually zero. No payment method required for any of these.
Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
X → @hiyoyok
Top comments (6)
This is almost identical to my setup, but I'm very new. I'm running a q4_K_M quant of qwen2.5-coder, or qwen3-coder-next's q5_K_M, though I'm on Linux with VS Code and Continue.dev.
I'll say again, I'm super new, and it's probably an issue with my prompting or settings, but I'm finding that after I create a generalized framework and front-end for what I'd consider a pretty straightforward app, either Continue or the LLM becomes bloated and ends up breaking everything during minor edits or additions, or loops for literally hours.
Is there anything specific you've done to combat this issue, or is it just a me thing? 🤣
Thanks for this post, though. It allowed me to see that I am indeed close. I gave up last week and started building a basic website instead.
Definitely not just you — this trips up almost everyone starting out with local LLMs!
The short answer: context window rot. Once you've been in the same chat for a while, the model loses track of the big picture and starts contradicting itself. It's not your prompting — it's just how these models work.
This is actually a key difference from paid cloud models like Claude or GPT. Those have longer context windows and smarter handling of long conversations. With local LLMs, context management is entirely your responsibility — the model won't clean up after itself.
Quickest fix: treat each task as a fresh session. When something breaks and the LLM starts looping, don't keep pushing — open a new chat, paste only the relevant file, and describe exactly what you want changed.
Also worth trying: write a short CONTEXT.md in your project with the app's structure and rules, and paste it at the top of each new session. It forces you to think clearly AND gives the model a clean anchor.
Your setup (qwen3-coder q5_K_M) is genuinely good; you're not limited by hardware here. Local LLMs just require a bit more discipline from the user side to get the best out of them. 😄
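To make that concrete, a minimal CONTEXT.md might look something like this (the app and rules here are an invented example, not from my actual projects):

```markdown
# CONTEXT.md (example)

## App
Hypothetical note-taking app: Tauri shell, web front-end, SQLite storage.

## Structure
- src/: front-end components
- src-tauri/: Rust commands and app config

## Rules
- Keep components small and single-purpose.
- Never change the DB schema without asking first.
- Make minimal edits; don't refactor unrelated code.
```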
Glad the post helped — good luck with the build!
Thanks for the tips! I've implemented a few variations of each one you mentioned. I'm very glad to know that if I'm able to start generating a bit of passive income, it would be well worth checking out some subscription models, for more than just these reasons.
I've tried creating a new session for every task, while keeping tailored .md files and rules so that the context stays as "un-bloated" as possible.
I have a decent setup to do some learning and practicing: 1TB SSD, i7 8600, RTX 2070, 64GB RAM. 8GB of VRAM is causing my bottleneck, which I'm sure it is for you too.
I have my 2020 MacBook Pro (8GB M1) in front of my two PC screens. I've considered incorporating it somehow to take a bit of the load off. Glad I'm pushing in the right direction.
Thanks again! 😊
The .md rules file approach is exactly right; you're already thinking like someone who's been doing this a while!
And yes, 8GB VRAM is the real bottleneck on the RTX side. But here's the thing: your M1 MacBook might actually be more useful than you think. Ollama on Apple Silicon uses Unified Memory, so the full 8GB is shared between CPU and GPU with no hard split. It handles mid-size models surprisingly well, and for lighter tasks it can genuinely take load off your main machine.
Also worth checking out: Google Antigravity (antigravity.google). It's Google's new agent-first IDE, free during public preview, with generous Gemini usage included. The big advantage here is that it runs on cloud-side LLMs — so you don't need beefy local hardware at all. Works on Linux, Windows, and macOS, so both your machines are covered. One heads-up though — as with most preview-stage tools, there's a chance your inputs are used for model training, so I'd avoid putting sensitive or commercial code through it.
As for paid models — honestly, even occasional use of Claude or GPT for the tricky architectural decisions is worth it. You don't need a full subscription right away; just knowing when to reach for the right tool makes a big difference.
Sounds like you're pushing in exactly the right direction. Good luck with the build! 😄
Awesome!
I was getting a bit intimidated thinking that my approaches, because I tried several, were all wrong. It's very relieving and reassuring to know that I wasn't wasting time, and that I, indeed, have a foundation to build on that was all productive and correct 🤩
Even with the M1 8G, I wasn't underrating it much at all, because I've heard in a couple of places about the MacBook's ability to do what you're explaining here, so that's very convenient.
And I think you just pushed me to dedicate a bit of time one day this week to giving Google's Antigravity a shot. I've heard things here and there about it, and all good so far. One frustration of mine that prevents me from trying new software or apps is checking it out and getting into the groove of things and having a good time learning, and then BOOM, you hit a paywall. I kinda just assumed that I would hit one early on with Antigravity because I figured that dealing with Cloud models would blow through tokens and such.
Still learning a lot though!
Thanks for the advice! It's all been phenomenal. 🙃
Cheers!