Foreword
In 2026, open-source LLMs aren't lab experiments anymore. Meta's Llama 4, Alibaba's Qwen 3, DeepSeek-R1 from China — they've caught up with or beaten closed-source models on many benchmarks. And thanks to tools like Ollama and llama.cpp, anyone with a mid-range computer can run their own AI locally. No GPU clusters, no API subscription fees. Even a 16GB MacBook can handle 7B~13B parameter models.
Let's cut to the chase: hardware requirements, tool choices, deployment steps, and four gotchas I've run into.
Chapter 1: Why Run LLMs Locally
1. Data Privacy
Ship your code, contracts, or medical records to a cloud API and you lose control of where that data goes. Run locally and everything stays on your machine; nobody else sees your prompts or the model's outputs.
2. Latency and Tokens
Cloud APIs have network lag and rate limits. A local model lives in your VRAM — response is instant, no queue, no token billing. Ask as many questions as you want, no "overage" messages.
3. Works Offline
On a plane, on the subway, or in a dead-zone meeting room: if your laptop is open, AI is available. When the network is unreliable, local deployment is pretty much the only option.
Chapter 2: Hardware Requirements
The "LLMs need a monster GPU" stereotype keeps a lot of people from even trying. With quantization (GGUF/INT4), you can cut VRAM needs to a quarter of the full-precision model.
Key Points
- A 16GB Mac is surprisingly good for this: Apple's unified memory lets the GPU use system RAM as "VRAM," so 16GB runs quantized 7B~13B models comfortably.
- 6GB VRAM works too. DeepSeek-R1 1.5B distilled, INT4 quantized, takes about 1GB. Dialogue quality holds up fine.
- CPU-only inference with llama.cpp is slow but usable. A Raspberry Pi 5 runs Mistral-7B-Q4 at ~1.2s/token. A regular laptop does much better.
Chapter 3: Tool Comparison
1. Ollama — Best All-Rounder
Best for developers who want something that just works.
Ollama has the strongest community and the lowest learning curve. It's a model manager plus inference engine in one, and it supports macOS, Windows, and Linux.
ollama run deepseek-r1:7b
ollama run qwen3:8b
Exposes a REST API at localhost:11434 (its native /api endpoints, plus an OpenAI-compatible /v1 endpoint):

import requests

# stream=False returns a single JSON object instead of newline-delimited chunks
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "deepseek-r1:7b",
          "prompt": "Implement a simple web server in Python",
          "stream": False},
)
print(response.json()["response"])
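Because the /v1 endpoint speaks the OpenAI protocol, the official openai Python client works too if you point base_url at Ollama; the api_key value is required by the client but ignored by the local server:

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Summarize what GGUF quantization does."}],
)
print(reply.choices[0].message.content)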
Pair it with Open WebUI for a ChatGPT-style interface.
2. llama.cpp — Low-Spec Hero
Best for weak hardware or CPU-only setups.
Written in C/C++, heavily optimized for low-end devices. Supports CPU+GPU hybrid inference, and GGUF quantization shrinks model sizes dramatically.
If your MacBook has Intel integrated graphics, llama.cpp is about the only way to run a 7B model.
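If you'd rather drive llama.cpp from Python than from its CLI, the llama-cpp-python bindings wrap the same engine. A minimal sketch, assuming you've already downloaded a GGUF file (the file name and layer count below are placeholders to adjust for your hardware):

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # any local GGUF file
    n_ctx=2048,        # context window
    n_gpu_layers=20,   # layers to offload to the GPU; 0 = pure CPU inference
)
out = llm("Q: What is GGUF? A:", max_tokens=128)
print(out["choices"][0]["text"])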
3. vLLM — For Production
Best for serving APIs at high concurrency.
PagedAttention plus continuous batching gets you 10x+ throughput over a naive Transformers-based server. First choice if you need to expose a local model as a service.
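For a feel of the Python API, here's a minimal offline-inference sketch; the model name is just an example pulled from Hugging Face, so pick one your VRAM can actually hold:

from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# generate() batches prompts and returns one RequestOutput per prompt
outputs = llm.generate(["Explain PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)

If you want an OpenAI-compatible HTTP service instead of in-process calls, running vllm serve with the same model name starts a server on port 8000 by default.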
4. LM Studio — No Terminal Needed
Best for users who don't want to touch a command line.
Search, download, and run models through a clean GUI. Zero coding. Works well on Windows and macOS.
Chapter 4: Hands-On — Running DeepSeek-R1 with Ollama
Full walkthrough, 15 minutes start to finish.
Step 1: Install
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
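However you install it, Ollama listens on localhost:11434 once the server is running. A quick sanity check from Python (or just open these URLs in a browser):

import requests

# The version endpoint answers as soon as the Ollama server is up
print(requests.get("http://localhost:11434/api/version").json())

# List the models you've pulled so far (empty right after a fresh install)
print(requests.get("http://localhost:11434/api/tags").json())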
Step 2: Pull and Run
ollama pull deepseek-r1:7b
ollama run deepseek-r1:7b
First download is about 4~5GB (INT4 quantized). Speed depends on your connection.
Step 3: Open WebUI (Optional)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 for a chat interface.
Step 4: API Integration
curl http://localhost:11434/api/chat \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [{"role": "user", "content": "Explain what RAG is."}]
  }'
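Note that /api/chat streams newline-delimited JSON by default, so the curl above prints one small chunk per token. From Python you can either set "stream": false as in the earlier example, or consume the stream as it arrives:

import json
import requests

with requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "deepseek-r1:7b",
          "messages": [{"role": "user", "content": "Explain what RAG is."}]},
    stream=True,
) as r:
    for line in r.iter_lines():
        if line:
            chunk = json.loads(line)
            # each chunk carries a fragment of the assistant's reply
            print(chunk["message"]["content"], end="", flush=True)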
Chapter 5: Four Gotchas
Gotcha 1: Models Eat Your System Drive
Ollama stores models on your system drive by default. Each one is 4~10GB, so a handful will fill up your C: drive.
Fix: set the storage path before you start.
export OLLAMA_MODELS=/path/to/your/models
Gotcha 2: OOM from Insufficient VRAM
You download a 70B model, it won't run, and it freezes your computer.
Fix: download quantized versions. A 7B model at Q4_K_M needs 3.5~4GB. Skip the unquantized FP16 version: a native 7B model needs around 14GB of VRAM.
Not sure about your hardware? Check https://www.canirun.ai/ first.
Gotcha 3: Download Speeds That Make You Cry
Downloading from HuggingFace or Ollama's official sources can be painfully slow in certain regions.
Fix: use a mirror, or grab GGUF files from a regional model hub such as ModelScope.
Gotcha 4: Disappointing Output Quality
Same prompt, your local model gives garbage answers.
Fix:
- Use the right quantization (Q4_K_M beats Q2 by a lot)
- Tweak the sampling parameters (temperature, top_p); see the sketch after this list
- Use the DeepSeek-R1 series for reasoning tasks
- Use CodeLlama or DeepSeek-Coder for coding tasks
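For the parameter tweaking, Ollama's native API accepts an options object per request. A minimal sketch; the values are illustrative starting points, not tuned recommendations:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",
        "prompt": "List three trade-offs of 4-bit quantization.",
        "stream": False,
        # lower temperature = more deterministic; raise it for creative tasks
        "options": {"temperature": 0.3, "top_p": 0.9},
    },
)
print(response.json()["response"])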
Conclusion
Running open-source LLMs locally isn't a geek-only thing anymore. It's a skill every developer should have. Privacy, zero latency, offline access, no usage caps — the benefits keep stacking up.
Hardware isn't the barrier: 6GB VRAM works, 16GB Mac works, even a Raspberry Pi gets a seat at the table. Tools are mature: Ollama for one-click setup, LM Studio for GUI lovers, vLLM for production loads.
Why send your AI to the cloud when your machine is right there?
Got an ordinary laptop? Start with Ollama + DeepSeek-R1 1.5B. Download takes a few minutes. The moment it runs, you'll know what "AI under your control" feels like.
Originally published at:
https://auraimagai.com/en/run-open-source-llms-locally-from-ollama/


