Junior Martins

Iris: an offline visual assistant in Brazilian Portuguese powered by Gemma 4

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4.

TL;DR

  • Iris is an Android visual assistant for blind and low-vision users.
  • It describes what the camera sees, out loud, in Brazilian Portuguese.
  • Gemma 4 multimodal runs 100% on the phone via LiteRT-LM — no cloud, no telemetry, no INTERNET permission.
  • Three intent-specific modes: Continuous, Question, Reading.

What I Built

I built Iris, an offline Android visual assistant for blind and low-vision users.

The interaction is intentionally simple: point the phone camera at something, tap the screen, and Iris speaks what it sees in Brazilian Portuguese.

Iris has three modes:

  • Continuous Mode describes the scene in front of the user.
  • Question Mode lets the user ask a spoken question about the current camera image.
  • Reading Mode reads visible text aloud, such as a label, medicine box, sign, or package.

A fourth control, Repeat, replays the last description — useful for users who could not catch it all the first time.

The core constraint was not only "can Gemma 4 understand the image?" — the real question was whether this could still be useful when the camera is looking at private, everyday scenes and the user may not have a reliable network connection.

That led to the main technical decision: everything runs on the phone.

No server. No cloud API. No telemetry. No INTERNET permission in the Android manifest.

Demo

The demo shows Iris running on a Galaxy S21 with airplane mode visible. The right side of the video is the actual phone screen; the left side explains what each mode is doing.

The video is not meant to hide the on-device latency. On a Galaxy S21, the first spoken sentence arrives in roughly 25–30 seconds after the tap. The wait is part of the point: Gemma 4 is doing the multimodal work locally, on the phone, with no help from a server.

Code

Repository:

https://github.com/juniormartinxo/iris

The project is a native Android app built with Kotlin, Jetpack Compose, CameraX, Android TTS/STT, and LiteRT-LM.

Why This App Needed To Be Offline

A visual assistant sees the user's world.

That can mean a medicine label, a document on a desk, a room at home, a person nearby, a screen, or an object the user is trying to identify. For this kind of app, "send the image to a server" is not a neutral implementation detail. It changes the privacy model of the product.

I wanted Iris to have a stronger boundary:

  • camera frames stay on the device;
  • speech recognition uses Android's on-device recognizer;
  • speech output uses Android Text-to-Speech;
  • Gemma 4 runs through LiteRT-LM locally;
  • the app does not request Internet access.

The Android manifest only asks for the permissions the app needs to see and hear:

<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />

There is no android.permission.INTERNET.

That means the app cannot silently fall back to a network call unless a future version explicitly changes the permission model.

How I Used Gemma 4

Iris uses Gemma 4 E2B as the primary model, packaged as a .litertlm bundle and loaded with LiteRT-LM on Android.

I chose E2B because the target device is a phone, not a workstation. The challenge prompt asks builders to be intentional about model selection, and for this project the priority was clear: local multimodal inference with the smallest model that could still make the experience meaningful.

The app can also look for an E4B bundle as a fallback, but the loading order prefers E2B:

// Candidates in priority order: E2B first, E4B as a fallback if present.
val MODEL_CANDIDATES: List<Pair<String, String>> = listOf(
    "E2B" to "gemma-4-E2B-it.litertlm",
    "E4B" to "gemma-4-E4B-it.litertlm",
)

For the current demo, E2B is the right trade-off: it keeps the app viable on devices like the Galaxy S21 and A55 while still giving Iris native multimodal understanding.

The app sends Gemma 4 one camera frame plus a mode-specific prompt:

  • in Continuous Mode, the prompt asks for a short scene description focused on obstacles, relevant objects, visible text, and people;
  • in Question Mode, Android's on-device speech recognizer captures the user's question, then Gemma 4 answers using the current image;
  • in Reading Mode, the prompt asks Gemma 4 to read visible text from top to bottom and left to right.

All prompts ask for Brazilian Portuguese output because the product is designed for PT-BR users first.
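As a rough illustration of how the mode drives the prompt (the enum, function name, and exact wording below are simplified for this article, not the strings used verbatim in the repository):

// Simplified sketch of the mode-to-prompt mapping; wording is illustrative.
enum class IrisMode { CONTINUOUS, QUESTION, READING }

fun buildPrompt(mode: IrisMode, userQuestion: String? = null): String = when (mode) {
    IrisMode.CONTINUOUS ->
        "Descreva a cena em poucas frases curtas, em português do Brasil. " +
            "Priorize obstáculos, objetos relevantes, textos visíveis e pessoas."
    IrisMode.QUESTION ->
        "Responda em português do Brasil usando apenas o que aparece na imagem. " +
            "Pergunta do usuário: ${userQuestion.orEmpty()}"
    IrisMode.READING ->
        "Leia o texto visível na imagem, de cima para baixo e da esquerda para a direita, " +
            "em português do Brasil."
}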

Architecture

The runtime path is deliberately small:

CameraX preview
  -> capture current frame
  -> downsample image to a 512px longest edge
  -> run frame quality checks
  -> Gemma 4 E2B through LiteRT-LM
  -> sentence buffer
  -> Android Text-to-Speech
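The downsample step keeps the inference input small and predictable. A simplified version of that step (not the exact code from the repository) looks roughly like this:

import android.graphics.Bitmap

// Scale the captured frame so its longest edge is at most 512 px.
// Simplified for the article; the real pipeline may differ in details.
fun downsampleForInference(frame: Bitmap, longestEdge: Int = 512): Bitmap {
    val maxSide = maxOf(frame.width, frame.height)
    if (maxSide <= longestEdge) return frame
    val scale = longestEdge.toFloat() / maxSide
    val width = (frame.width * scale).toInt().coerceAtLeast(1)
    val height = (frame.height * scale).toInt().coerceAtLeast(1)
    return Bitmap.createScaledBitmap(frame, width, height, true)
}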

There are a few details in that pipeline that matter.

1. Frame quality checks before inference

On-device multimodal inference is expensive enough that Iris should not waste a run on a useless image.

Before calling Gemma 4, the app calculates a small set of image quality signals:

  • brightness;
  • luminance variance;
  • a simple blur score.

If the image is too dark, too bright, too empty, or too blurry, Iris speaks a direct instruction like:

"Imagem muito escura. Verifique a iluminação ou se há algo cobrindo a câmera."

That is better than waiting through a slow inference only to get a weak answer.
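A simplified sketch of what those checks can look like (the thresholds and the blur heuristic here are illustrative, not the exact values used in the app):

import android.graphics.Bitmap
import android.graphics.Color
import kotlin.math.abs

data class FrameQuality(val brightness: Double, val variance: Double, val blurScore: Double)

// Sample a grid of pixels and compute cheap quality signals.
fun assessFrame(bitmap: Bitmap, sampleStep: Int = 8): FrameQuality {
    val lumas = mutableListOf<Double>()
    for (y in 0 until bitmap.height step sampleStep) {
        for (x in 0 until bitmap.width step sampleStep) {
            val p = bitmap.getPixel(x, y)
            lumas += 0.299 * Color.red(p) + 0.587 * Color.green(p) + 0.114 * Color.blue(p)
        }
    }
    val mean = lumas.average()
    val variance = lumas.map { (it - mean) * (it - mean) }.average()
    // Rough sharpness proxy: average difference between neighboring samples.
    val blurScore = lumas.zipWithNext { a, b -> abs(a - b) }.average()
    return FrameQuality(mean, variance, blurScore)
}

// Thresholds below are illustrative; a null result means the frame is usable.
fun frameProblem(q: FrameQuality): String? = when {
    q.brightness < 40 -> "Imagem muito escura. Verifique a iluminação."
    q.brightness > 220 -> "Imagem muito clara. Afaste a câmera da fonte de luz."
    q.variance < 100 -> "A imagem parece vazia. Aponte a câmera para outra direção."
    q.blurScore < 2.0 -> "Imagem tremida. Segure o celular firme e tente novamente."
    else -> null
}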

2. Sentence streaming into TTS

LiteRT-LM returns text chunks. Iris does not wait for the entire response before speaking.

A small SentenceBuffer collects chunks until it sees a sentence boundary, then sends each sentence to Android TTS. The user hears the first useful sentence as soon as possible, while the model can still finish the rest of the response.

For accessibility, this is not just a polish detail. Silence during a long local inference can feel like the app froze. Iris speaks "Aguarde." ("Please wait.") as a progress beacon every 15 seconds while inference is still running. The moment the first sentence arrives from the model, the loading overlay disappears and the description starts streaming onto the screen, one sentence at a time, so visual and audio feedback stay in sync.
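Conceptually, the buffer is small. This sketch captures the idea; the real SentenceBuffer in the repository may differ in API and edge-case handling:

class SentenceBuffer(private val onSentence: (String) -> Unit) {
    private val pending = StringBuilder()

    // Append a streamed chunk; emit every complete sentence found so far.
    fun append(chunk: String) {
        pending.append(chunk)
        var boundary = pending.indexOfFirst { it == '.' || it == '!' || it == '?' }
        while (boundary >= 0) {
            val sentence = pending.substring(0, boundary + 1).trim()
            if (sentence.isNotEmpty()) onSentence(sentence) // e.g. queue on Android TTS
            pending.delete(0, boundary + 1)
            boundary = pending.indexOfFirst { it == '.' || it == '!' || it == '?' }
        }
    }

    // Flush whatever is left when the model signals the end of the response.
    fun flush() {
        val rest = pending.toString().trim()
        if (rest.isNotEmpty()) onSentence(rest)
        pending.clear()
    }
}

Each completed sentence can then be queued on the speech engine with TextToSpeech.QUEUE_ADD, so playback starts while the rest of the response is still being generated.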

3. Modes are prompt engineering exposed as UI

I did not want one generic "describe" button to handle every situation.

Blind and low-vision users often know what they need:

  • "What is in front of me?"
  • "Is there a silver object here?"
  • "Read this medicine box."

The three modes let the UI choose the prompt shape before the model runs. This keeps the responses shorter, more predictable, and easier to listen to.

4. TalkBack-aware behavior

Iris is built around audio, but Android accessibility already has an audio layer: TalkBack.

When TalkBack is enabled, Iris avoids speaking over UI announcements. When TalkBack is not enabled, Iris uses its own TTS announcements for mode changes, loading states, errors, and tutorial flow.

Every primary control has a contentDescription, including:

  • Continuous Mode;
  • Question Mode;
  • Reading Mode;
  • Repeat.

This lets the same interface work for screen-reader users and for users who are not running TalkBack.
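Detecting whether a screen reader is active is a standard Android check; something along these lines (simplified) is enough to choose between deferring to TalkBack and using Iris's own TTS announcements:

import android.content.Context
import android.view.accessibility.AccessibilityManager

// Returns true when TalkBack (or another touch-exploration screen reader) is running.
fun isScreenReaderActive(context: Context): Boolean {
    val am = context.getSystemService(Context.ACCESSIBILITY_SERVICE) as AccessibilityManager
    return am.isEnabled && am.isTouchExplorationEnabled
}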

Performance on the Galaxy S21

| Step | Time / size |
| --- | --- |
| Cold start to "Ready" (model load) | ~10–30 s |
| Tap → first spoken word | ~25–35 s |
| Tap → full response done | ~35–50 s |
| Memory while idle (model loaded) | ~3 GB |
| APK size (debug) | ~99 MB |
| Model bundle (sideloaded .litertlm) | ~2.59 GB |

These are not best-in-class numbers. They are the cost of doing multimodal inference fully on-device on a 2021 phone. The UX is designed around making that wait legible, not invisible — that is what the progress beacon and the sentence streaming are for.

Numbers above are approximate, taken on a single Galaxy S21 unit. They will vary by device, by ambient thermal state, and by how much memory the rest of the system is using.

What Gemma 4 Unlocked

The important thing about Gemma 4 here is not only that it can caption an image.

It lets Iris treat visual assistance as a flexible language problem:

  • a scene can be summarized;
  • a user question can be grounded in the image;
  • visible text can be read in context;
  • the output can be constrained to short, spoken PT-BR sentences.

Without a multimodal model, this app would likely become a collection of separate systems: one detector, one OCR pipeline, one classifier, one rules engine, and a lot of glue code. That can be the right architecture for a mature product, but it is much heavier for a fast prototype.

Gemma 4 made it possible to build one coherent loop:

image + intent -> useful spoken answer

What Was Hard

The hardest part was not wiring the camera to the model. It was making the experience feel reliable enough for an audio-first user.

Some examples:

  • The app must tell the user when the model is still working.
  • A second tap should cancel or replace the previous flow cleanly.
  • If speech recognition is unavailable offline, the app should explain that instead of failing silently.
  • If the camera points at a blank wall, the model should not invent a detailed scene.
  • If the device is too hot, the app should refuse another expensive inference instead of making the phone worse (a sketch of that guard follows below).
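For the thermal guard, one straightforward option (used here as an illustration, not necessarily the exact check in the app) is Android's thermal status API, available from Android 10:

import android.content.Context
import android.os.Build
import android.os.PowerManager

// Refuse a new inference when the device reports severe thermal throttling.
// The threshold is illustrative; older Android versions simply skip the check.
fun isTooHotForInference(context: Context): Boolean {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.Q) return false
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    return pm.currentThermalStatus >= PowerManager.THERMAL_STATUS_SEVERE
}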

Some other details would have been invisible in a tech demo but mattered for a real app: guarding the camera tap area against overlay-based tap-jacking, capping the crash log file size so a failure loop cannot fill the disk, synchronizing access to the speech recognizer so it cannot be torn down mid-utterance. Each one came from imagining what a confused or panicked user would experience.

Iris is still a prototype, but these details are the difference between a tech demo and something that starts to behave like a product.

Limitations

The current version has real constraints.

First, local inference on a Galaxy S21 can take a while. That is the cost of running a multimodal model offline on a phone-class device. The app is designed to keep the user informed during that wait, but latency is still the biggest UX limitation.

Second, Iris is not a navigation or safety-critical tool. It can describe what the camera sees, but it should not be used as the only source of truth for crossing streets, avoiding hazards, or making medical decisions.

Third, Reading Mode uses Gemma 4's multimodal ability rather than a dedicated OCR engine. That keeps the architecture simple and demonstrates what the model can do, but a production version might combine Gemma 4 with a specialized OCR pipeline for speed and fidelity.

Try It Yourself

The model is sideloaded because it is large. The app expects the LiteRT-LM bundle under the app's external files directory.

Short version:

./gradlew :app:assembleDebug

adb install -r app/build/outputs/apk/debug/iris-0.1.0-debug.apk

adb push gemma-4-E2B-it.litertlm \
  /sdcard/Android/data/com.iris/files/

Model source:

https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm

The README includes the full setup notes, including Android version, memory, storage, and offline PT-BR TTS/STT requirements.

What's Next

The next improvements I would make are:

  • test the same build across more Android devices;
  • improve latency with better model/runtime configuration;
  • add a more explicit first-run checklist for offline language packs;
  • explore a hybrid Reading Mode that uses dedicated OCR when speed matters;
  • keep the permission model strict: no Internet unless the product direction changes openly.

Acknowledgments

  • The Google AI Edge team for LiteRT-LM, which makes running Gemma 4 on Android tractable.
  • litert-community on Hugging Face for the .litertlm model bundles.
  • Anthropic Claude Code, which served as a pair-programming assistant during development of the Android code, the UX logic, and parts of this article.

Closing

Iris is small, but it represents the kind of app I want to see more often: local-first, accessibility-first, and built for a real language community instead of assuming English by default.

Gemma 4 made the prototype possible because it brought multimodal understanding close enough to the device. LiteRT-LM made it possible to run that loop inside Android.

The result is simple:

The camera sees. Gemma 4 understands. Iris speaks.

And it does that without sending the image anywhere.
