The landscape of mobile development is shifting beneath our feet. For years, the "Smart" in smartphone relied almost exclusively on the cloud. We sent a request, waited for a server in a distant data center to process it, and received a response. But with the advent of Gemini Nano and Google’s AICore, the intelligence is moving directly onto the silicon in our pockets.
Building a Chat UI for an on-device Large Language Model (LLM) like Gemini Nano is not just another exercise in creating a list of text bubbles. It is a fundamental departure from the traditional CRUD (Create, Read, Update, Delete) applications we’ve built for a decade. It requires a deep understanding of hardware orchestration, asynchronous data streams, and state management that can handle the heavy lifting of generative AI without freezing the user interface.
In this guide, we will dive deep into the architectural paradigms of on-device AI, explore why AICore is a game-changer for Android developers, and implement a production-grade chat interface using Jetpack Compose and Kotlin Coroutines.
(This article is based on the ebook On-Device GenAI with Android Kotlin)
The Architectural Paradigm of On-Device AI Interfaces
When you build a standard chat app—think WhatsApp or Slack—the data flow is discrete. You send a message, it hits a database, and a notification triggers a fetch on the other end. In the world of Generative AI (GenAI), this model breaks down.
The Challenge of the "Token Stream"
The core theoretical challenge in GenAI is managing what we call the Token Stream. LLMs do not generate sentences; they generate text one token at a time. If you were to wait for Gemini Nano to finish generating a 500-word response before displaying it, the user would be staring at a "Thinking..." spinner for five to ten seconds. In the world of modern UX, that is an eternity.
To solve this, your UI must be designed as a reactive sink. It needs to be capable of receiving a continuous, high-frequency stream of data and updating the display in real-time. This ensures a sense of immediacy, making the AI feel like it is "typing" its thoughts as they occur.
AICore: The System-Level AI Provider
Why can't we just bundle a model file in our APK and call it a day? The answer lies in the constraints of mobile hardware. LLMs are resource monsters. They demand massive amounts of RAM (often several gigabytes) and require direct, low-level access to the Neural Processing Unit (NPU).
If every app on a user’s phone bundled its own version of Gemini Nano, the device’s storage would vanish, and the RAM would be so fragmented that the OS would constantly kill background processes. Google’s solution is AICore.
AICore acts as a system-level service, much like Google Play Services. It provides several critical advantages for the modern Android developer:
- Shared Memory Architecture: The model is loaded into system memory once. Whether the user is using your app, a notes app, or a messaging app, they all interface with the same resident model, drastically reducing the total memory footprint.
- Seamless Model Updates: Google can refine the model weights, improve safety filters, and optimize performance via Play Store updates to AICore. As a developer, you don't need to push a new APK just because the underlying LLM got smarter.
- Hardware Orchestration: This is perhaps the most vital role. AICore manages the handoff between the CPU, GPU, and NPU. It balances "tokens-per-second" against thermal throttling. It knows when to push the NPU to its limit and when to scale back to prevent the user's phone from becoming uncomfortably hot.
The Model Loading Analogy: It’s Not Just a Class
Loading a local LLM is a "heavy lift." To help visualize this, think of the initial loading process as being similar to a Room database migration.
When you perform a complex database migration, you are dealing with disk I/O, schema validation, and data integrity checks. If you do this on the main thread, the app hangs. Loading Gemini Nano similarly involves allocating large contiguous blocks of memory for the model weights (mobile SoCs share RAM between the CPU, GPU, and NPU), verifying model checksums, and "warming up" the NPU. If the model is not already resident in memory, the first request will pay a "cold start" latency penalty.
Your UI must explicitly account for this. A professional AI app isn't just Loading or Success. It needs a state machine that handles Initializing, ModelLoading, Ready, and InferenceInProgress.
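As a minimal sketch, those lifecycle states can be captured in a simple Kotlin enum (the name ModelState is illustrative, not part of any SDK):

// Minimal sketch of the model-lifecycle states named above.
enum class ModelState { Initializing, ModelLoading, Ready, InferenceInProgress }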
Connecting Modern Kotlin to AI Workflows
To implement this architecture, we lean on modern Kotlin (2.x) and its coroutines ecosystem. These tools aren't just syntactic sugar; they are the engine that makes high-performance AI possible on mobile.
1. Kotlin Flow for Real-Time Streaming
Since Gemini Nano emits tokens incrementally, Flow is the non-negotiable choice for data transport. Specifically, we use Flow<String> to stream the response. Unlike a static List, a Flow allows the UI to append text to the last message bubble in real-time.
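As a self-contained illustration, here is a minimal sketch with a fake token source (fakeTokenStream is a stand-in, not a real API) showing how runningFold turns a stream of tokens into the progressively growing text a message bubble renders:

import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Stand-in for the Gemini Nano token stream.
fun fakeTokenStream(): Flow<String> = flow {
    listOf("On", "-device", " inference", " feels", " instant.").forEach { token ->
        delay(50) // simulate per-token generation latency
        emit(token)
    }
}

fun main() = runBlocking {
    fakeTokenStream()
        .runningFold("") { acc, token -> acc + token } // accumulate tokens into partial text
        .drop(1) // skip the empty seed value
        .collect { partialText -> println(partialText) } // what the last bubble would render
}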
2. Coroutines and Dispatcher Management
AI inference is computationally expensive. While AICore handles the heavy lifting, the coordination of prompts and the processing of the resulting stream must happen on Dispatchers.Default. If you attempt to process these tokens on the Main thread, you will drop frames, and your beautiful Compose animations will stutter.
3. Kotlin Serialization for Prompt Engineering
Modern AI development relies heavily on structured prompts. Using kotlinx.serialization, we can define "Prompt Templates" as data classes. This ensures that the input sent to Gemini Nano is consistent, type-safe, and follows the specific formatting required for the model to understand context.
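For example, a prompt template might look like this minimal sketch (the fields and the render format are illustrative assumptions, not a format Gemini Nano mandates):

import kotlinx.serialization.Serializable
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

// Illustrative prompt template; field names are assumptions, not an SDK contract.
@Serializable
data class PromptTemplate(
    val systemInstruction: String,
    val userQuery: String,
    val maxWords: Int = 200
)

// Flatten the structured template into the raw string handed to the model.
fun PromptTemplate.render(): String =
    "$systemInstruction\n\nUser: $userQuery\n(Answer in at most $maxWords words.)"

fun main() {
    val template = PromptTemplate(
        systemInstruction = "You are a concise assistant.",
        userQuery = "Explain NPUs in one paragraph."
    )
    println(Json.encodeToString(template)) // type-safe logging/persistence of the exact prompt
    println(template.render())
}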
The State Machine of a Chat UI
Before we look at the code, we must define the state. A GenAI Chat UI is best represented as a Finite State Machine (FSM):
- IDLE: The user is typing. The system is waiting.
- PROMPTING: The request is sent to AICore. The UI shows a "Thinking..." indicator.
- STREAMING: Tokens are arriving. The UI is actively appending text to the latest message.
- COMPLETED: The LLM has emitted the end_of_turn token. The UI transitions back to a state where the user can send a follow-up.
- ERROR: The model failed (e.g., safety filters triggered or an out-of-memory condition). The UI must provide a recovery path.
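Translated into Kotlin, this FSM is naturally expressed as a sealed hierarchy; here is a minimal sketch (names and payloads are illustrative):

// Minimal sketch of the conversation-level FSM described above.
sealed interface ConversationState {
    data object Idle : ConversationState                               // user is typing
    data object Prompting : ConversationState                          // "Thinking..." indicator
    data class Streaming(val partialText: String) : ConversationState  // tokens arriving
    data object Completed : ConversationState                          // end_of_turn received
    data class Error(val reason: String) : ConversationState           // recovery path required
}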
Implementation: The Technical Stack
Let's look at how to build this. We will use Hilt for Dependency Injection to ensure our AI repository is a singleton, preventing multiple instances from attempting to lock the NPU hardware.
Gradle Dependencies
First, ensure your build.gradle.kts is equipped with the necessary libraries for MediaPipe (which powers the Gemini Nano integration) and Jetpack Compose.
dependencies {
    // MediaPipe GenAI for Gemini Nano
    implementation("com.google.mediapipe:tasks-genai:0.10.14")

    // Jetpack Compose
    implementation("androidx.compose.ui:ui:1.7.0")
    implementation("androidx.compose.material3:material3:1.2.0")
    implementation("androidx.lifecycle:lifecycle-viewmodel-compose:2.8.0")
    implementation("androidx.lifecycle:lifecycle-runtime-compose:2.8.0")

    // Hilt for Dependency Injection
    implementation("com.google.dagger:hilt-android:2.51")
    kapt("com.google.dagger:hilt-compiler:2.51")

    // Coroutines & Serialization
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.0")
    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.3")
}
The Data Layer: Hardware-Aware Repository
The repository is where the "magic" happens. It abstracts the MediaPipe LlmInference engine and provides a clean Flow for the ViewModel to consume.
@Singleton
class OnDeviceChatRepository @Inject constructor(
    @ApplicationContext private val context: Context
) {
    private var llmInference: LlmInference? = null

    // Heavy lift: model loading stays off the main thread.
    suspend fun initializeModel(modelPath: String) = withContext(Dispatchers.Default) {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath)
            .setMaxTokens(1024)   // cap on prompt + response tokens
            .setTemperature(0.7f) // creativity vs. determinism trade-off
            .setTopK(40)          // sampling pool size per token
            .build()
        llmInference = LlmInference.createFromOptions(context, options)
    }

    fun generateResponseStream(prompt: String): Flow<String> = callbackFlow {
        val inference = llmInference ?: throw IllegalStateException("Model not initialized")
        // Generate response asynchronously to keep the flow non-blocking
        inference.generateResponseAsync(prompt) { partialResult, done ->
            trySend(partialResult)
            if (done) {
                channel.close()
            }
        }
        awaitClose { /* Cleanup resources if necessary */ }
    }.flowOn(Dispatchers.Default)
}
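One thing the repository above leaves out is teardown. LlmInference wraps native resources, so the repository should expose an explicit release hook; a minimal sketch (MediaPipe task objects provide close() for exactly this):

// Inside OnDeviceChatRepository: release the native engine when it is no longer needed.
fun close() {
    llmInference?.close()
    llmInference = null
}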
The ViewModel: Orchestrating State
The ViewModel acts as the bridge. It takes user input, updates the UI to show the user's message, and then manages the stream coming back from the AI.
@HiltViewModel
class ChatViewModel @Inject constructor(
    private val repository: OnDeviceChatRepository
) : ViewModel() {

    private val _uiState = MutableStateFlow(ChatUiState())
    val uiState: StateFlow<ChatUiState> = _uiState.asStateFlow()

    fun sendMessage(userText: String) {
        if (userText.isBlank()) return

        // 1. Add user message to the list
        val userMsg = ChatMessage(userText, isUser = true)
        _uiState.update { it.copy(messages = it.messages + userMsg, isTyping = true) }

        viewModelScope.launch {
            var fullAiResponse = ""
            // 2. Collect the stream from the repository
            repository.generateResponseStream(userText)
                .onStart {
                    // Add an empty placeholder for the AI response
                    _uiState.update { it.copy(messages = it.messages + ChatMessage("", isUser = false)) }
                }
                .catch { e ->
                    // Surface failures (safety filters, OOM, uninitialized model) in the
                    // placeholder bubble instead of crashing the scope -- the FSM's ERROR state.
                    _uiState.update { state ->
                        val msgs = state.messages.toMutableList()
                        msgs[msgs.lastIndex] = msgs[msgs.lastIndex].copy(text = "Something went wrong: ${e.message}")
                        state.copy(messages = msgs)
                    }
                }
                .collect { token ->
                    fullAiResponse += token
                    // 3. Update the last message in the list with the new token
                    _uiState.update { state ->
                        val updatedMessages = state.messages.toMutableList()
                        val lastIdx = updatedMessages.lastIndex
                        updatedMessages[lastIdx] = updatedMessages[lastIdx].copy(text = fullAiResponse)
                        state.copy(messages = updatedMessages)
                    }
                }
            _uiState.update { it.copy(isTyping = false) }
        }
    }
}
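The ViewModel above relies on two small model classes that the snippets never define. A minimal sketch of what they might look like:

// Minimal sketch of the state holders assumed by ChatViewModel.
data class ChatMessage(
    val text: String,
    val isUser: Boolean
)

data class ChatUiState(
    val messages: List<ChatMessage> = emptyList(),
    val isTyping: Boolean = false
)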
The UI Layer: Jetpack Compose Chat Screen
In Compose, we use LazyColumn to render the messages. A key trick here is using LaunchedEffect to auto-scroll to the bottom as the AI "types."
@Composable
fun ChatScreen(viewModel: ChatViewModel) {
    val uiState by viewModel.uiState.collectAsStateWithLifecycle()
    var inputText by remember { mutableStateOf("") }
    val listState = rememberLazyListState()

    // Auto-scroll logic
    LaunchedEffect(uiState.messages.size, uiState.messages.lastOrNull()?.text) {
        if (uiState.messages.isNotEmpty()) {
            listState.animateScrollToItem(uiState.messages.size - 1)
        }
    }

    Column(modifier = Modifier.fillMaxSize().padding(16.dp)) {
        LazyColumn(
            state = listState,
            modifier = Modifier.weight(1f).fillMaxWidth(),
            verticalArrangement = Arrangement.spacedBy(8.dp)
        ) {
            items(uiState.messages) { message ->
                ChatBubble(message)
            }
        }
        Row(verticalAlignment = Alignment.CenterVertically) {
            TextField(
                value = inputText,
                onValueChange = { inputText = it },
                modifier = Modifier.weight(1f),
                placeholder = { Text("Ask Gemini Nano...") }
            )
            IconButton(onClick = {
                viewModel.sendMessage(inputText)
                inputText = ""
            }) {
                Icon(Icons.Default.Send, contentDescription = "Send")
            }
        }
    }
}
Performance Pitfalls to Avoid
Building for on-device AI requires a higher level of discipline than standard app development. Here are the most common pitfalls:
- Main Thread Inference: Never, ever call the AI model on the Main thread. Even a small model will block the UI for hundreds of milliseconds, leading to "Application Not Responding" (ANR) errors.
- Memory Management: Local LLMs are heavy. If you are not using AICore and are instead bundling your own TFLite model, you must manually close the Interpreter or LlmInference instance in the ViewModel's onCleared() method to prevent massive native memory leaks (see the sketch after this list).
- Ignoring Lifecycle: Use collectAsStateWithLifecycle(). If the user moves the app to the background, you want the UI collection to pause to save battery, even if the AI continues to process the current prompt in the background.
- Over-Recomposition: When streaming tokens, the state updates rapidly. Ensure your ChatBubble composables are optimized and use remember for any expensive UI calculations to keep the frame rate smooth.
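To make the Memory Management pitfall concrete, here is a minimal sketch of the cleanup hook, wiring onCleared() to the close() function added to the repository earlier:

// Inside ChatViewModel: tie the native model's lifetime to the ViewModel's.
override fun onCleared() {
    repository.close() // the teardown hook sketched in the repository section
    super.onCleared()
}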
Conclusion: The New Frontier
Creating a Chat UI with Jetpack Compose for Gemini Nano is more than just a UI task; it's a lesson in modern systems architecture. By leveraging AICore, we move away from the "Cloud-First" mentality and toward a "Privacy-First, Latency-Zero" future.
The combination of Kotlin's reactive streams and Compose's declarative UI provides the perfect foundation for this new era of mobile computing. As on-device NPUs continue to evolve, the gap between what a phone can do and what a server can do will continue to shrink.
Let's Discuss
- Given the memory constraints of mobile devices, do you think AICore's shared model approach is the right move, or should developers have the freedom to bundle custom, fine-tuned models despite the storage cost?
- How do you see the role of the "Mobile Developer" changing as prompt engineering and local inference become standard parts of the Android SDK?
The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook On-Device GenAI with Android Kotlin: Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models. You can find it here: Leanpub.com
Also check out the other programming & AI ebooks, covering Python, TypeScript, C#, Swift, and Kotlin: Leanpub.com
Android Kotlin & AI Masterclass:
Book 1: On-Device GenAI. Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models.
Book 2: Edge AI Performance. Optimizing hardware acceleration via NPU (Neural Processing Unit), GPU, and DSP. Advanced quantization and model pruning.
Book 3: Android AI Agents. Building autonomous apps that use Tool Calling, Function Injection, and Screen Awareness to perform tasks for the user.