Your voice agent is mid-conversation when Anthropic's API returns a 529 overloaded error. The user is waiting. Your code throws. The call drops.
This is the failure mode most voice pipelines aren't built for, and it's getting worse, not better. When every LLM call in your pipeline goes to a single provider, a regional outage at that provider stalls every downstream voice agent. The fix isn't more retries on the same model; it's an automatic switch to a different one.
This tutorial walks you through adding automatic LLM fallbacks to a voice pipeline using AssemblyAI's LLM Gateway. With one extra parameter in your request, the Gateway will automatically retry failed calls on a backup model—Claude to Gemini to GPT—without you writing a line of retry logic. By the end, you'll have a runnable Python pipeline that transcribes live audio with Universal-3 Pro Streaming, routes the transcript through a primary LLM with a fallback chain, and stays online when any single provider does not.
Why fallbacks matter more for voice than for chat
In a chat app, an LLM error means a spinner and a retry button. In a Voice AI pipeline, it means dead air. The user is on the phone, waiting for a response, and a five-second silence while you reconnect to a different provider already feels like a hang-up.
Three failure modes that fallbacks solve:
- Provider rate limits. OpenAI, Anthropic, and Google all enforce per-account TPM (tokens per minute) ceilings. A traffic spike on a Monday morning sales line can blow through your default tier before lunch.
- Regional outages. Provider status pages show a real distribution of multi-hour incidents per quarter. If your only LLM call is to a single model, your uptime is capped at theirs.
- Model deprecations. A model gets sunset on short notice. Without a fallback configured, every voice session that hits the deprecated model fails until you ship a code change.
LLM Gateway sits in front of every supported provider. You point your client at one endpoint, specify a primary model, and list one or two fallbacks. When the primary fails—overloaded, rate-limited, or unavailable—the Gateway transparently retries on the next model in line and returns the response as if nothing went wrong.
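Concretely, the routing is driven by two extra fields in an otherwise ordinary chat completions body. A minimal request shape looks like this (same models as the rest of this tutorial; the full, runnable version comes in Step 2):

{
    "model": "kimi-k2.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
        {"model": "claude-sonnet-4-6"},
        {"model": "gemini-2.5-flash"}
    ],
    "fallback_config": {"depth": 2}
}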
What you'll build
A Python voice pipeline that:
- Streams microphone audio to AssemblyAI's Universal-3 Pro streaming speech-to-text model
- On end-of-turn, sends the final transcript to LLM Gateway with kimi-k2.5 as the primary model and claude-sonnet-4-6 as the fallback
- Prints the agent's response—and logs which model actually handled the call
You'll also see how to chain multiple fallbacks, override prompts per fallback model, and tune retry behavior.
Stack:
- AssemblyAI Universal-3 Pro Streaming (speech-to-text)
- AssemblyAI LLM Gateway (LLM routing with fallbacks)
- Python 3.9+
Setup
Install the dependencies:
pip install assemblyai requests python-dotenv pyaudio
Create a .env file with your API key:
ASSEMBLYAI_API_KEY=your_key_here
You only need one key. The same key authenticates both the streaming STT WebSocket and the LLM Gateway endpoint—no separate accounts with OpenAI, Anthropic, or Google required.
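If you want to fail fast on a missing key before opening any sockets, a quick check like this (purely optional) does the job:

import os
from dotenv import load_dotenv

load_dotenv()
if not os.getenv("ASSEMBLYAI_API_KEY"):
    raise RuntimeError("ASSEMBLYAI_API_KEY is not set - check your .env file")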
Step 1: Connect to Universal-3 Pro Streaming
For a voice agent, you want the lowest-latency path from speech to text, then immediately hand the transcript to the LLM. We'll use AssemblyAI's v3 streaming API, which returns immutable final transcripts in roughly 300ms.
import os

import requests  # used by respond_with_fallback in Step 2
from dotenv import load_dotenv
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingParameters,
    StreamingEvents,
    BeginEvent,
    TurnEvent,
    TerminationEvent,
    StreamingError,
)

load_dotenv()
ASSEMBLYAI_API_KEY = os.getenv("ASSEMBLYAI_API_KEY")


def on_begin(client: StreamingClient, event: BeginEvent):
    print(f"Session started: {event.id}")


def on_turn(client: StreamingClient, event: TurnEvent):
    if event.end_of_turn:
        print(f"\nUser: {event.transcript}")
        respond_with_fallback(event.transcript)
    else:
        print(f"\rPartial: {event.transcript}", end="")


def on_error(client: StreamingClient, error: StreamingError):
    print(f"STT error: {error}")


def on_terminated(client: StreamingClient, event: TerminationEvent):
    print("Session terminated")
The on_turn handler is where the LLM call happens. Every time the user finishes speaking, we hand the final transcript to respond_with_fallback—the function we're about to define.
Step 2: Add the fallback chain
Here's the part that matters. The request below is a standard chat completions call with two additions: a fallbacks array and a fallback_config block.
def respond_with_fallback(user_text: str):
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "kimi-k2.5",  # Primary: fast, low-latency
            "messages": [
                {"role": "system", "content": "You are a helpful voice assistant. Keep responses to one or two short sentences."},
                {"role": "user", "content": user_text},
            ],
            "max_tokens": 200,
            "fallbacks": [
                {"model": "claude-sonnet-4-6"},  # First fallback
                {"model": "gemini-2.5-flash"},   # Second fallback
            ],
            "fallback_config": {"depth": 2},  # Try up to two fallbacks
        },
        timeout=10,
    )

    if response.status_code != 200:
        print(f"All models failed: {response.text}")
        return

    result = response.json()
    actual_model = result.get("model")
    reply = result["choices"][0]["message"]["content"]
    print(f"Agent ({actual_model}): {reply}")
    return reply
A few details that matter for production:
- The model field in the response reflects which model actually answered. If your primary failed and the Gateway used Claude instead, you'll see claude-sonnet-4-6 in the response—and you'll only be billed for that model. The sketch after this list shows one way to log it.
- Without a fallbacks array, the Gateway still does one automatic retry on the primary after 500ms (default fallback_config.retry: true). That handles transient blips. The fallback array handles outright failures.
- fallback_config.depth controls how many fallbacks to try. Setting it to 2 means the Gateway will try the primary, then the first fallback, then the second.
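To surface fallback activity in your logs, compare the model you asked for with the model the Gateway reports back. A minimal sketch you could call from respond_with_fallback (the PRIMARY_MODEL constant is just a name used here):

PRIMARY_MODEL = "kimi-k2.5"

def log_which_model_answered(result: dict) -> None:
    # The Gateway's response echoes the model that actually produced the answer.
    actual_model = result.get("model", "unknown")
    if actual_model != PRIMARY_MODEL:
        print(f"[fallback] {PRIMARY_MODEL} did not answer; handled by {actual_model}")
    else:
        print(f"[primary] answered by {PRIMARY_MODEL}")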
Step 3: Choose the right primary and fallback models
Latency and capability vary widely across providers. For voice, you want a fast primary because the user is waiting in real time, and a more reliable secondary in case the fast one is overloaded.
Pulled from the LLM Gateway model list, here are sensible voice agent pairings:
| Use case | Primary | Fallback 1 | Fallback 2 | Why |
|---|---|---|---|---|
| Latency-critical (phone agent) | kimi-k2.5 (~1.2s) | gemini-2.5-flash-lite (~1.1s) | gpt-5-nano (~3.2s) | All low latency; different providers |
| Quality-first (clinical, legal) | claude-sonnet-4-6 | gemini-2.5-pro | gpt-5.1 | Highest quality models in each provider |
| Balanced (most consumer apps) | gpt-5.2 (~1.6s) | claude-haiku-4-5-20251001 (~4.1s) | kimi-k2.5 | Speed + cross-provider redundancy |
The key constraint: your fallbacks should be on different providers from the primary. A Claude Sonnet to Claude Haiku fallback won't help during an Anthropic outage—both calls hit the same upstream.
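If you want these pairings in code rather than prose, one option is a small dictionary of named chains so each use case stays a one-line choice. The chain contents mirror the table above; the structure itself is just a suggestion:

# Named fallback chains mirroring the pairings above.
FALLBACK_CHAINS = {
    "latency_critical": {
        "model": "kimi-k2.5",
        "fallbacks": [{"model": "gemini-2.5-flash-lite"}, {"model": "gpt-5-nano"}],
    },
    "quality_first": {
        "model": "claude-sonnet-4-6",
        "fallbacks": [{"model": "gemini-2.5-pro"}, {"model": "gpt-5.1"}],
    },
    "balanced": {
        "model": "gpt-5.2",
        "fallbacks": [{"model": "claude-haiku-4-5-20251001"}, {"model": "kimi-k2.5"}],
    },
}

# Usage: merge the chosen chain into the request body.
# body = {**FALLBACK_CHAINS["latency_critical"], "messages": messages,
#         "max_tokens": 200, "fallback_config": {"depth": 2}}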
Step 4: Override fields per fallback
Sometimes a fallback model needs a different prompt. Maybe your primary uses a 4,000-token system prompt that your cheaper fallback doesn't have the context window for. Or you want the fallback to be more concise to keep latency in check.
LLM Gateway lets you override any request field per fallback:
"fallbacks": [
{
"model": "claude-sonnet-4-6",
"messages": [
{"role": "system", "content": "Be very concise. One sentence max."},
{"role": "user", "content": user_text},
],
"max_tokens": 80,
},
],
Any field you don't override is inherited from the original request. This is especially useful when your primary is tuned with a long, detailed system prompt and you want a stripped-down version on the backup.
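For instance, a fallback that only tightens the length budget and lowers temperature, while inheriting the original messages, could look like this (the temperature value is just an illustration):

"fallbacks": [
    {
        "model": "gemini-2.5-flash",
        "max_tokens": 80,     # overridden
        "temperature": 0.3,   # overridden
        # messages and every other field come from the original request
    },
],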
Step 5: Putting it all together
Wire the streaming client to your fallback-enabled response function:
import assemblyai as aai


def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=ASSEMBLYAI_API_KEY,
            api_host="streaming.assemblyai.com",
        )
    )

    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            speech_model="u3-rt-pro",
            sample_rate=16000,
        )
    )

    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)


if __name__ == "__main__":
    main()
Run it, speak into your microphone, and watch the printed model name. Then—if you want to see the fallback fire—set the primary model parameter to a deliberately invalid string like "this-model-does-not-exist". The Gateway will fail the primary, immediately route to your first fallback, and return a normal response with the fallback model name in the output.
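A low-friction way to run that test is an environment flag, so the broken model name never ships by accident. The FORCE_FALLBACK_TEST variable is just a name chosen for this sketch:

import os

# Use a nonexistent primary only when explicitly testing fallback behavior.
primary_model = (
    "this-model-does-not-exist"
    if os.getenv("FORCE_FALLBACK_TEST") == "1"
    else "kimi-k2.5"
)
# Pass primary_model as the "model" field inside respond_with_fallback.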
What this gets you in production
Three things change in your voice pipeline as soon as fallbacks are in place:
- Provider outages stop being your incidents. When Anthropic, OpenAI, or Google has a regional issue, your voice sessions keep flowing—they just route through whichever provider is healthy. You don't get paged.
- Rate-limit spikes self-heal. A traffic spike that would have hit your TPM ceiling on the primary now spreads across providers automatically.
- Model migrations are zero-downtime. When a new model ships, you can flip the primary to the new model and keep the old one as a fallback. If anything goes wrong, traffic falls back automatically while you debug.
You can layer more on top of this—separate fallback chains per use case, EU-resident endpoints for GDPR compliance, prompt caching to amortize cost—but the single fallbacks array gets you 90% of the resilience for two extra lines of JSON.
What to build next
- Pair fallbacks with streaming chat completions so the user hears the first sentence while the LLM is still generating the rest (a rough sketch follows this list).
- Add tool calling to let your voice agent look up orders, schedule callbacks, or transfer to a human—same fallback behavior carries through.
- Consolidate to one API. If you're managing this on top of a separate STT provider, AssemblyAI's Voice Agent API bundles speech understanding, LLM reasoning, and voice generation into a single WebSocket—same fallback patterns apply at the LLM layer, and there's nothing to wire together.
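As a starting point for the streaming item above, here is a rough sketch. It assumes the Gateway accepts the OpenAI-style stream parameter and returns server-sent events; check the LLM Gateway docs for the exact streaming contract before relying on it. It reuses the ASSEMBLYAI_API_KEY constant from earlier.

import json
import requests

def stream_reply(user_text: str):
    with requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "kimi-k2.5",
            "messages": [{"role": "user", "content": user_text}],
            "fallbacks": [{"model": "claude-sonnet-4-6"}],
            "stream": True,  # assumption: OpenAI-style streaming flag
        },
        stream=True,
        timeout=30,
    ) as response:
        # Assumption: server-sent events in the common "data: {json}" format.
        for line in response.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            chunk = json.loads(payload)
            delta = chunk["choices"][0].get("delta", {}).get("content", "")
            print(delta, end="", flush=True)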
Voice agents need to be built for the failures that will actually happen, not the happy path. Fallbacks turn LLM availability from a single point of failure into a non-event.
Frequently asked questions
What is an LLM fallback and why does my voice pipeline need one?
An LLM fallback is a backup model that automatically takes over when your primary model fails—whether from a provider outage, rate limit, or transient error. Voice pipelines need fallbacks because a failed LLM call means dead air on a live call, which is much worse than a failed text request that the user can retry. With AssemblyAI's LLM Gateway, you specify a fallbacks array in your request and the Gateway transparently retries on the next model if the primary fails—no custom retry logic required.
How does AssemblyAI's LLM Gateway handle automatic LLM failover?
LLM Gateway accepts a fallbacks array of up to two backup models per request. If the primary model fails, the Gateway automatically retries the request with the first fallback, then the second, until one succeeds. The response payload reflects the model that actually answered, and you're billed only for that model. By default, the Gateway also performs one automatic retry on the primary after 500 ms to handle transient errors before falling back to a different provider.
Which LLM providers does AssemblyAI's LLM Gateway support for fallback chains?
LLM Gateway supports 25+ models across Anthropic Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5), OpenAI GPT (GPT-5.2, 5.1, 5, 4.1, mini, nano, gpt-oss), Google Gemini (3 Flash Preview, 2.5 Pro/Flash/Flash-Lite), Alibaba Cloud Qwen, and Moonshot AI Kimi. For voice fallback chains, the key constraint is to chain across different providers—a Claude to Claude fallback won't help during an Anthropic outage because both calls hit the same upstream.
How do I add automatic LLM fallbacks to my voice pipeline?
Add a fallbacks array to your chat/completions request body—that's it. The Gateway handles retries, model switching, and billing automatically. A typical voice agent pairing is kimi-k2.5 as the primary (~1.2s latency), claude-sonnet-4-6 as the first fallback for higher quality, and gemini-2.5-flash-lite as a second fallback for additional provider redundancy. Set fallback_config.depth: 2 to use both backups.
Can I customize the prompt or temperature for each fallback model?
Yes—LLM Gateway lets you override any request field per fallback. This is useful when your primary uses a long, detailed system prompt that a smaller fallback can't accommodate, or when you want the fallback to be more concise to keep latency in check. Any field you don't override on the fallback is inherited from the original request, so you only need to specify what changes.
How does billing work when an LLM Gateway request falls back to a different model?
You're charged only for the model that actually returned the response, at that model's per-token rate. If your primary fails and the Gateway retries with a fallback, you pay only for the fallback model's tokens—not for the failed primary attempt. All usage shows up on a single AssemblyAI invoice across providers, with no markup on top of model rates.
What's the difference between LLM Gateway fallbacks and writing my own retry logic?
LLM Gateway fallbacks handle the entire retry-and-route flow inside the Gateway, so your application code makes one request and gets one response—no custom timeout handling, no model-switching logic, no per-provider error mapping. Writing it yourself works for chat apps where a few seconds of retry latency is fine, but in a voice pipeline every second of dead air costs you, and built-in fallbacks fire faster than client-side retries because the Gateway is already inside the network path.