Marcin Firmuga
How I Taught My Offline AI to Remember, Watch, and Warn, Without Any Cloud (Part 2)

Part 1 covered how hck_GPT routes messages through 9 layers and decides between rules and a local LLM. If you missed it:
Part 1 - Intent Scoring, Hybrid Routing, Temperature per Intent.

That was the brain. This is the memory and the instincts.

Part 1's AI was reactive. You ask, it answers. Close the app, everything is gone. No history. No learning. No initiative.

That's not how a PC companion should work. If I've been running PC Workman for two weeks, it should know that my CPU averages 28% and today's 67% is unusual. It should notice that Chrome has been eating 2GB RAM for an hour and mention it before I ask. It should remember my GPU model without me telling it twice.

So I built three systems that Part 1 didn't have: a persistent knowledge base that survives restarts, a metrics store that snapshots hardware data every 5 minutes into SQLite, and a proactive monitor that watches your system and pushes alerts without being asked.

All offline. All local. Your data never leaves your machine.


1. The Knowledge Base: AI That Remembers Across Restarts

Part 1 had session memory,
a Python dict that dies when the app closes. Useful for
"we talked about RAM 3 messages ago" but useless for
"your GPU is an RTX 3060 with 6GB VRAM"
which shouldn't need re-scanning every launch.

Four tables, each with a different job:

CREATE TABLE IF NOT EXISTS hardware_profile (
    key     TEXT PRIMARY KEY,
    value   TEXT NOT NULL,
    updated REAL NOT NULL
);

CREATE TABLE IF NOT EXISTS usage_patterns (
    metric  TEXT PRIMARY KEY,
    value   TEXT NOT NULL,
    updated REAL NOT NULL
);

CREATE TABLE IF NOT EXISTS user_facts (
    key        TEXT PRIMARY KEY,
    value      TEXT NOT NULL,
    source     TEXT NOT NULL DEFAULT 'detected',
    confidence REAL NOT NULL DEFAULT 1.0,
    created_at REAL NOT NULL
);

CREATE TABLE IF NOT EXISTS conversation_log (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT    NOT NULL,
    timestamp  REAL    NOT NULL,
    role       TEXT    NOT NULL,
    message    TEXT    NOT NULL
);

hardware_profile stores things that rarely change:
CPU model, GPU name, VRAM, motherboard, RAM speed, OS version.

Scanned once via psutil + WMI, then cached.
The hardware_is_fresh() method checks if the last scan was within 24 hours - if yes, skip the scan.
No wasted resources on startup.
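A freshness check like that is a one-query affair. Here is a minimal sketch against the `hardware_profile` schema above — the function name comes from the article, but the body and the `conn` parameter are my assumptions, not the project's code:

```python
import sqlite3
import time

FRESH_WINDOW_S = 24 * 3600  # re-scan hardware at most once per day

def hardware_is_fresh(conn: sqlite3.Connection) -> bool:
    """True when the newest hardware_profile row is under 24 hours old."""
    row = conn.execute("SELECT MAX(updated) FROM hardware_profile").fetchone()
    last_scan = row[0] if row and row[0] is not None else 0.0
    return (time.time() - last_scan) < FRESH_WINDOW_S
```

An empty table yields `MAX(updated) = NULL`, which the sketch treats as "never scanned" — so a first launch always triggers the scan.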

usage_patterns stores things that change slowly:
average CPU load, peak hours, top apps, detected use-case
("gaming" vs "development" vs "office").
Updated periodically from the stats engine.

user_facts stores things the AI inferred or the user stated:
preferred language, PC usage type, whether they play games.
Each fact has a confidence score — detected facts start at 1.0, inferred facts can be lower.
This is how hck_GPT knows to greet you in Polish without asking every session.

conversation_log keeps the last 500 messages across sessions
(pruned monthly). Not for replaying conversations but for pattern detection. "User asks about temperature every Monday" is a signal.

The knowledge base builds a context summary for every AI response:

def build_knowledge_summary(self) -> str:
    hw    = self.get_all_hardware()
    facts = self.get_all_facts()
    lines = []

    if hw:
        lines.append("Hardware:")
        for key, label in _HW_LABELS:
            if key in hw and hw[key] is not None:
                lines.append(f"  {label}: {hw[key]}")

    if facts:
        lines.append("User facts:")
        for k, v in list(facts.items())[:8]:
            lines.append(f"  {k}: {v}")

    return "\n".join(lines) if lines else "(knowledge base empty)"

This gets injected into every Ollama prompt. The LLM doesn't need to ask "what GPU do you have?" because it already knows, from the first message of every session.
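The injection step could look roughly like this. To be clear: only the fact that `build_knowledge_summary()` output goes into the prompt comes from the article — the prompt layout, function name, and history trimming below are my assumptions:

```python
def build_prompt(knowledge_summary: str, history: list[str], user_msg: str) -> str:
    """Assemble one prompt string for the local LLM, with the knowledge
    base summary injected ahead of the conversation (layout is assumed)."""
    parts = [
        "You are hck_GPT, an offline PC companion.",
        "Known context about this machine and user:",
        knowledge_summary,
    ]
    if history:
        parts.append("Recent conversation:")
        parts.extend(history[-6:])  # keep the prompt short
    parts.append(f"User: {user_msg}")
    return "\n".join(parts)
```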


2. Metrics Store: 90 Days of Hardware History in SQLite

Session memory remembers what happened in the last 30 minutes.
The knowledge base remembers your hardware profile.
But what about "was my CPU hotter yesterday than today?"

The Metrics Store answers that. Every 5 minutes, a background thread snapshots 20+ sensor values into SQLite:

SNAPSHOT_INTERVAL = 300   # 5 minutes between snapshots
RETENTION_DAYS    = 90    # auto-prune after 90 days

CREATE TABLE IF NOT EXISTS deepmonitor_snapshots (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    ts            REAL    NOT NULL,
    date_str      TEXT    NOT NULL,
    cpu_load      REAL,
    cpu_temp      REAL,
    cpu_mhz       REAL,
    cpu_power     REAL,
    gpu_temp      REAL,
    gpu_load      REAL,
    gpu_vram_pct  REAL,
    gpu_power     REAL,
    ram_pct       REAL,
    swap_pct      REAL,
    mb_temp_sys   REAL,
    mb_volt_12v   REAL,
    mb_volt_5v    REAL,
    mb_volt_33v   REAL,
    disk_json     TEXT
);

This table shares the same hck_stats.db file as the main stats engine — one database, WAL mode, concurrent reads and writes without locks.
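Enabling that is two PRAGMAs at connection time. A sketch of how the shared connection might be opened — the function name and the `synchronous=NORMAL` pairing are assumptions; WAL mode itself is what the article describes:

```python
import sqlite3

def open_stats_db(path: str = "hck_stats.db") -> sqlite3.Connection:
    """Open the shared stats database in WAL mode so the stats engine,
    the metrics writer and the UI can read and write concurrently."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")    # readers don't block the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # common WAL pairing (assumption)
    conn.row_factory = sqlite3.Row             # rows addressable by column name
    return conn
```

WAL's key property here: a long read from the UI never blocks the 5-minute snapshot writer, and vice versa.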

The writer loop staggers its first write by 60 seconds so the UI finishes loading first:

def _writer_loop(self) -> None:
    self._stop.wait(60)  # let UI boot
    while not self._stop.is_set():
        try:
            self._save_snapshot()
            self._prune_old_rows()
        except Exception as e:
            log.debug("writer error: %s", e)
        self._stop.wait(SNAPSHOT_INTERVAL)

Auto-prune runs after every snapshot — rows older than 90 days get deleted. Database stays small (~5-10 MB per month).
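The prune itself is one `DELETE` with a timestamp cutoff. A sketch under the schema above — the function shape is an assumption, the 90-day window is from the article:

```python
import sqlite3
import time

RETENTION_DAYS = 90

def prune_old_rows(conn: sqlite3.Connection) -> int:
    """Delete snapshots older than the retention window; return rows removed."""
    cutoff = time.time() - RETENTION_DAYS * 86400
    cur = conn.execute("DELETE FROM deepmonitor_snapshots WHERE ts < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```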

The critical part is what happens on startup:

def _load_historical_baselines(self) -> None:
    since = time.time() - 7 * 86400
    # conn: the shared sqlite3 connection (row_factory = sqlite3.Row)
    row = conn.execute("""
        SELECT
          AVG(CASE WHEN cpu_load >= 0 THEN cpu_load END) AS cpu_av,
          AVG(CASE WHEN ram_pct  >= 0 THEN ram_pct  END) AS ram_av,
          MIN(CASE WHEN cpu_load >= 0 THEN cpu_load END) AS cpu_lo,
          MAX(CASE WHEN cpu_load >= 0 THEN cpu_load END) AS cpu_hi,
          COUNT(*) AS n
        FROM deepmonitor_snapshots WHERE ts >= ?
    """, (since,)).fetchone()

    if not row or not row["n"]:
        return  # no snapshots yet (fresh install) - nothing to preload

    live_sensors.update({
        "_hist_cpu_avg_7d": round(row["cpu_av"], 1),
        "_hist_ram_avg_7d": round(row["ram_av"], 1),
    })

On app boot, the store loads 7-day min/max/avg baselines into live memory. From the very first message, hck_GPT can say "your CPU is at 67% — but your 7-day average is 28%, something is off." Not because it guessed. Because it has 2,016 data points from the last week (288 snapshots/day × 7 days).

Without this, the AI would need to run for hours before it could make any comparison. With it, historical context is available from the first second.

The public API is simple — two methods cover 90% of use cases:

rows = metrics_store.get_history(hours=24)     # raw snapshots
summary = metrics_store.daily_summary(days=7)  # per-day aggregates
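Under the hood, `daily_summary()` is presumably a `GROUP BY date_str` over the snapshot table. A sketch with an assumed column set — the real aggregates may differ:

```python
import sqlite3

def daily_summary(conn: sqlite3.Connection, days: int = 7):
    """Per-day aggregates over recent snapshots (column choice assumed)."""
    return conn.execute("""
        SELECT date_str,
               ROUND(AVG(cpu_load), 1) AS cpu_avg,
               ROUND(MAX(cpu_temp), 1) AS cpu_temp_max,
               ROUND(AVG(ram_pct), 1)  AS ram_avg,
               COUNT(*)                AS samples
        FROM deepmonitor_snapshots
        WHERE ts >= strftime('%s', 'now') - ? * 86400
        GROUP BY date_str
        ORDER BY date_str
    """, (days,)).fetchall()
```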

3. The Proactive Monitor: AI That Speaks First

This is the part that makes people say "wait, it does WHAT?"

Most AI assistants wait for input. You type, they respond. Passive.

hck_GPT's Proactive Monitor is a daemon thread that runs every 45 seconds, checks system state, and pushes alerts to the chat panel without anyone asking.

CPU_HIGH_PCT     = 85.0
CPU_CRIT_PCT     = 95.0
RAM_HIGH_PCT     = 88.0
RAM_CRIT_PCT     = 93.0
DISK_LOW_GB      = 4.0
THROTTLE_RATIO   = 0.60
CHECK_INTERVAL_S = 45
MIN_GAP_SAME_S   = 300   # don't repeat same alert for 5 min

Seven conditions monitored:

  1. CPU sustained high - not a spike.
    Two consecutive checks above 85% before alerting.
    One check could be a game loading.
    Two means something is wrong.

  2. CPU critical - above 95%. Immediate alert.
    No waiting for second check.

  3. RAM high - above 88%. Suggests checking what's eating memory.

  4. RAM critical - above 93%. Pagefile is getting hit.

  5. CPU throttling - current frequency divided by max frequency below 60%. Your CPU is being thermally limited.

  6. Disk nearly full - any partition below 4 GB free.

  7. Long session - PC running for many hours. Gentle reminder.
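The two-strike rule from condition 1 fits in a few lines of state. A sketch of the idea — class and method names are mine, the thresholds are the article's:

```python
from typing import Optional

CPU_HIGH_PCT = 85.0
CPU_CRIT_PCT = 95.0

class CpuWatch:
    """Two-strike check for sustained CPU load (sketch, not project code)."""

    def __init__(self) -> None:
        self._strikes = 0

    def check(self, cpu_pct: float) -> Optional[str]:
        if cpu_pct >= CPU_CRIT_PCT:
            self._strikes = 0
            return "cpu_crit"        # critical: alert immediately
        if cpu_pct >= CPU_HIGH_PCT:
            self._strikes += 1
            if self._strikes >= 2:   # two consecutive checks above 85%
                self._strikes = 0
                return "cpu_high"
            return None              # first strike: could be a game loading
        self._strikes = 0            # load dropped: reset the counter
        return None
```

Called once per 45-second cycle, this fires `cpu_high` only when two consecutive readings stay above the threshold.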

Every alert has a bilingual message pool, Polish and English versions with slight variations so the same alert doesn't read identically twice:

_MSGS = {
    "cpu_high": {
        "en": [
            "hck_GPT: ⚠ CPU sustained at {val}%. Type 'top processes' to see who's responsible.",
            "hck_GPT: CPU {val}% — something's eating it. Type 'top' to find out what.",
            "hck_GPT: Heads up — CPU at {val}%. Expected, or something sneaky in the background?",
        ],
        "pl": [
            "hck_GPT: ⚠ CPU na {val}% od dłuższego czasu. Wpisz 'top procesy' żeby zobaczyć winowajcę.",
            "hck_GPT: CPU {val}% — coś go zjada. Jeśli to nie Ty, to kto? Wpisz 'top'.",
        ],
    },
}

The anti-spam logic is critical. Without it, a sustained CPU spike would flood the chat with warnings every 45 seconds:

MIN_GAP_SAME_S = 300  # 5 minutes between same alert type

def _should_push(self, key: str) -> bool:
    now = time.time()
    last = self._last_push.get(key, 0)
    if now - last < MIN_GAP_SAME_S:
        return False
    self._last_push[key] = now
    return True

The push mechanism uses callbacks — the UI registers a function, the monitor calls it from the background thread:

proactive_monitor.register_push(
    lambda msg: root.after(0, lambda: panel.add_message(msg))
)

root.after(0, ...) schedules the message on tkinter's main thread. Background thread never touches the GUI directly. No race conditions. No crashes.

There's also a silent banner callback for non-intrusive status updates — "All quiet" vs "CPU 87% SPIKE | 2 alerts" in the hck_GPT status bar.


4. Bilingual Vocabulary — 854 Lines, Zero Translation API

Part 1 mentioned that Polish and English patterns are "defined separately." Here's what that actually looks like at scale.

vocabulary.py is 854 lines. 25+ intents. Each intent has a list of trigger patterns in both languages, mixed together:

INTENT_PATTERNS = {
    "hw_cpu": [
        # Polish tokens
        "procesor", "rdzeń", "rdzenie", "taktowanie",
        # English tokens
        "cpu", "processor", "cores", "boost",
        # Polish multi-word (high bonus)
        "jaki procesor", "jaki mam procesor", "ile rdzeni",
        "co mam za procesor", "powiedz mi o procesorze",
        # English multi-word
        "what cpu", "my cpu", "which processor",
        "tell me about my cpu", "cpu details",
    ],
}

No translation step.
No language detection preprocessing.
Both languages live in the same list.
The intent parser scores them all the same way — exact match, prefix match, typo tolerance. If someone types "jaki mam procesor" they get the same intent (hw_cpu) with the same confidence as "what cpu do I have."

The scoring comment at the top of the file explains the math:

# Pattern scoring (in parser.py):
#   - Multi-word phrases:  len(words) * 1.5  (biggest bonus)
#   - Exact single token:  1.0
#   - Partial prefix:      0.4
#   - Normalised:          min(1.0, score / 3.0)
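Turned into code, that math could look like this. A sketch that implements the documented weights but skips the typo tolerance the real parser has — the function name and matching details are assumptions:

```python
def score_intent(query: str, patterns: list[str]) -> float:
    """Score one intent against a query using the documented weights:
    multi-word phrase = len(words) * 1.5, exact token = 1.0,
    prefix = 0.4, then normalise by 3.0 and cap at 1.0."""
    q = query.lower()
    tokens = q.split()
    score = 0.0
    for pat in patterns:
        words = pat.split()
        if len(words) > 1:
            if pat in q:
                score += len(words) * 1.5  # multi-word phrases: biggest bonus
        elif pat in tokens:
            score += 1.0                   # exact single-token match
        elif any(t.startswith(pat) or pat.startswith(t) for t in tokens):
            score += 0.4                   # partial prefix match
    return min(1.0, score / 3.0)
```

With both "what cpu" and "jaki mam procesor" in the pattern list, the English and Polish phrasings land on the same normalised score — no language detection needed.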

Adding more multi-word phrases to an intent raises its confidence, making it more likely to hit the 0.60 threshold for the rule engine.
Ambiguous queries stay below threshold and go to Ollama.
The vocabulary file IS the tuning knob: no hyperparameters to optimize, just patterns to add.

This week I added 12 new intents based on questions from LinkedIn followers:

  • battery_drain - "Which process is draining my battery?"
  • session_compare - "Was it better yesterday?"
  • pc_changes - "What changed since yesterday?"
  • startup_safety - "Is it safe to disable this?"
  • browser_cache - "Is my browser slow because of caching?"
  • swap_analysis - "Which processes are using swap?"
  • network_usage - "Which process is using my network?"

Each one: 10-20 patterns, Polish + English, with bilingual response handlers reading live data from psutil and the metrics store.

What Part 1 + Part 2 Give You Together

| Layer | Part 1 | Part 2 |
|---|---|---|
| Input processing | Intent parser, confidence scoring | Bilingual vocabulary (854 lines) |
| Decision | 9-layer routing, hybrid engine | |
| Response | Rule engine + Ollama, temperature per intent | |
| Short-term memory | Session dict, trend tracking | |
| Long-term memory | | User Knowledge Base (SQLite, AppData) |
| Historical data | | Metrics Store (snapshots q5min, 90-day retention) |
| Proactive behavior | | Daemon monitor (7 conditions, anti-spam, bilingual alerts) |
| Startup context | | 7-day baselines loaded on boot |

Part 1 built a brain that responds. Part 2 gave it memory that persists, data that accumulates, and instincts that act without being asked.


What's Still Missing

Cross-session conversation context.
The knowledge base stores hardware profiles and usage patterns, but conversation topics don't persist. If you asked about RAM yesterday, today's session doesn't know.
The conversation_log table has the data — the response builder just doesn't query it yet.
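The missing query would be short. A sketch of what pulling cross-session context out of `conversation_log` might look like — the function is hypothetical, only the table schema comes from the article:

```python
import sqlite3
import time

def recent_topics(conn: sqlite3.Connection, days: int = 3, limit: int = 20) -> list[str]:
    """Return the user's recent messages, newest first, so a response
    builder could surface cross-session context (sketch, not project code)."""
    since = time.time() - days * 86400
    rows = conn.execute("""
        SELECT message FROM conversation_log
        WHERE role = 'user' AND timestamp >= ?
        ORDER BY timestamp DESC LIMIT ?
    """, (since, limit))
    return [r[0] for r in rows]
```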

Predictive alerts. The proactive monitor reacts to current state.
It doesn't predict. "Your RAM tends to spike around 9 PM on weekdays" is possible with the metrics store data - the SQL is straightforward, the pattern detection isn't built yet.
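For what it's worth, the SQL really is short: an hour-of-day aggregate over the snapshots is the raw material for a "RAM spikes around 9 PM" signal. A sketch under the schema above — the function and thresholds on top of it are the unbuilt part:

```python
import sqlite3

def ram_by_hour(conn: sqlite3.Connection, days: int = 30):
    """Average RAM usage per hour of day over recent snapshots -
    rows sorted so the spikiest hours come first."""
    return conn.execute("""
        SELECT strftime('%H', ts, 'unixepoch', 'localtime') AS hour,
               ROUND(AVG(ram_pct), 1) AS ram_avg,
               COUNT(*)               AS samples
        FROM deepmonitor_snapshots
        WHERE ts >= strftime('%s', 'now') - ? * 86400
        GROUP BY hour
        ORDER BY ram_avg DESC
    """, (days,)).fetchall()
```

The hard part is not this query but deciding when an hourly average is anomalous enough to alert on — that pattern detection is what isn't built yet.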

Automatic vocabulary expansion. Every unrecognized query requires manual pattern addition.

A feedback loop that logs low-confidence queries and suggests new patterns would cut maintenance time significantly.


The Offline-First Philosophy

Every feature in this article stores data locally. SQLite in AppData. Snapshots in the shared stats database.
Alerts computed from psutil readings. Nothing is sent anywhere.

**This isn't a limitation - it's a design choice.** When your app monitors CPU temperatures, reads the Windows registry, scans running processes, and tracks usage patterns over 90 days, the data is sensitive by definition.
Sending it to a cloud API for every interaction isn't just slow. It's a trust problem.

The optional Ollama integration (coming before v2.0) will give users a choice: stay fully offline with the Core AI, or enable a local LLM for smarter conversations.
The key word is "local." Even the LLM runs on your hardware.

Your data. Your machine. Your choice.


My Project

PC Workman is open source. MIT licensed.
The .exe doesn't require Python.

Download: GitHub Releases
Part 1: How I Built an Offline AI Assistant in Python

All my links: linktr.ee/marcin_firmuga

Currently doing Google's AI certification (Umiejętności Jutra 3.0 through SGH).
Curious how it'll change the way I think about this engine.


Marcin Firmuga | Python Developer | HCK_Labs
Building PC Workman publicly between retail shifts. 800+ hours, 28 GitHub stars, and an AI that now remembers what it told you last week.
