
Sub-200ms Voice AI: The Engineering Behind Real-Time Agents

voice-ai · latency · engineering

In my previous post on voice AI architecture, I laid out the full pipeline and the latency challenge. The numbers were sobering: a naive implementation easily lands above 1 second of total latency. That's not good enough. The best voice agents today target sub-200ms time-to-first-byte, meaning the user hears the beginning of the response within 200ms of finishing their sentence.

Getting there requires rethinking every component. Here's how.

The Latency Budget

Let's start with where the time actually goes in a conventional pipeline.

Latency breakdown of a sequential voice AI pipeline. Each stage adds delay before the user hears a response.
Voice Activity Detection:    10-30ms
Speech-to-Text (ASR):       50-300ms
LLM Inference:              200-800ms
Text-to-Speech (TTS):       100-400ms
Network Round-trips:         50-150ms

If you run these stages sequentially, you're looking at roughly 410ms on a good day and nearly 1.7 seconds on a bad one. The goal isn't to make each component faster in isolation. It's to restructure the pipeline so components overlap and the user hears audio before the system has finished thinking.
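To make the budget concrete, here is a small sketch that sums the table above for the fully sequential case. The stage names and bounds come straight from the breakdown; the "best"/"worst" framing is just the two ends of each range.

```python
# Latency budget from the table above, as (best_ms, worst_ms) per stage.
STAGES_MS = {
    "vad": (10, 30),
    "asr": (50, 300),
    "llm": (200, 800),
    "tts": (100, 400),
    "network": (50, 150),
}

def sequential_latency(which):
    """Total latency if every stage waits for the previous one to finish."""
    idx = 0 if which == "best" else 1
    return sum(bounds[idx] for bounds in STAGES_MS.values())

print(sequential_latency("best"))   # 410 ms on a good day
print(sequential_latency("worst"))  # 1680 ms on a bad one
```

Even the best case blows past a 200ms target, which is why the rest of this post is about overlapping stages rather than shaving each one.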

Voice Activity Detection

Before you can process speech, you need to know when the user has stopped talking. This sounds trivial; it isn't. Voice Activity Detection (VAD) determines the boundary between "user is speaking" and "user is waiting for a response."

A naive approach waits for silence, typically 500-800ms of no audio. That's half your latency budget gone before you've done anything useful. Modern VAD systems use lightweight neural models trained to detect speech endpoints with as little as 200ms of trailing silence. Some architectures go further, using prosodic cues like falling intonation to predict that the user is about to stop speaking, letting the system begin processing before the utterance is fully complete.

The risk with aggressive endpoint detection is false triggers. Cutting the user off mid-sentence is worse than being slightly slow. The tuning here is empirical and depends heavily on your use case.
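A minimal sketch of silence-based endpointing makes the tradeoff visible. Real systems use a small neural VAD rather than raw energy; the 20ms frame size, energy threshold, and 200ms trailing-silence window here are illustrative assumptions, and the silence window is exactly the knob you tune against false triggers.

```python
FRAME_MS = 20              # one audio frame per loop iteration
ENDPOINT_SILENCE_MS = 200  # aggressive endpoint: 200 ms of trailing silence
ENERGY_THRESHOLD = 0.01    # frames below this RMS energy count as silence

def detect_endpoint(frame_energies):
    """Return the index of the frame where the utterance is considered
    finished, or None if the user never stopped speaking."""
    silence_frames_needed = ENDPOINT_SILENCE_MS // FRAME_MS
    silent_run = 0
    started_speaking = False
    for i, energy in enumerate(frame_energies):
        if energy >= ENERGY_THRESHOLD:
            started_speaking = True
            silent_run = 0  # any speech resets the trailing-silence counter
        elif started_speaking:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None

# 200 ms of speech followed by silence: endpoint fires 200 ms in.
print(detect_endpoint([0.5] * 10 + [0.0] * 20))  # 19
```

Raising `ENDPOINT_SILENCE_MS` to 600 makes the detector safer but hands back 400ms of the budget; lowering it risks cutting the user off at every mid-sentence pause.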

Streaming ASR

Batch ASR waits for the complete utterance, then transcribes it. Streaming ASR transcribes in real-time, emitting partial results as the user speaks.

With streaming ASR, your LLM can begin processing the first words of the utterance before the user has finished talking. Deepgram's streaming API emits interim results with latency under 100ms. The tradeoff is accuracy. Interim transcripts are less reliable than final transcripts, especially for technical vocabulary or proper nouns. The system needs to handle revisions gracefully when the final transcript differs from the interim.
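The revision-handling logic looks roughly like this. The event shape (`is_final`, `transcript`) is a simplified stand-in for what streaming APIs like Deepgram's emit; the field names here are assumptions, not the exact API.

```python
def consume_stream(events):
    """Accumulate finalized ASR segments while tracking the interim tail.

    Interim transcripts may be revised, so only is_final segments are
    committed; the interim tail is what downstream stages may speculate on.
    """
    committed = []
    interim = ""
    for event in events:
        if event["is_final"]:
            committed.append(event["transcript"])
            interim = ""  # the final result supersedes any interim text
        else:
            interim = event["transcript"]  # each interim replaces the last
    return " ".join(committed), interim

events = [
    {"is_final": False, "transcript": "can you tell"},
    {"is_final": False, "transcript": "can you tell me about"},
    {"is_final": True,  "transcript": "can you tell me about pricing"},
]
final, tail = consume_stream(events)
print(final)  # can you tell me about pricing
```

The key design point: anything built on `tail` must be disposable, because the final transcript is allowed to contradict it.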

Speculative Generation

This is the most impactful optimization and the least discussed. Speculative generation means the LLM starts generating a response based on partial input, then adjusts if the full input changes the meaning.

For many conversational patterns, the first few words of a user's utterance are enough to predict the intent. "Can you tell me about..." is almost certainly a question. "I'd like to cancel..." is almost certainly a cancellation request. The system can begin generating a response skeleton before the user finishes speaking.

When the full utterance arrives, the system either confirms the speculative response and continues streaming it, or discards it and starts fresh. In practice, the speculative path is correct often enough that the average latency drops significantly.
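The control flow can be sketched as follows. The intent-prefix table and the `generate` stub are illustrative assumptions standing in for a real intent classifier and LLM call; the confirm-or-discard logic is the point.

```python
# Hypothetical prefix -> intent table; a real system would use a classifier.
INTENT_PREFIXES = {
    "can you tell me about": "question",
    "i'd like to cancel": "cancellation",
}

def guess_intent(utterance):
    text = utterance.lower()
    for prefix, intent in INTENT_PREFIXES.items():
        if text.startswith(prefix):
            return intent
    return None

def respond(partial, full, generate):
    """Start generating from the partial utterance; keep the speculative
    response only if the full utterance confirms the same intent."""
    speculative_intent = guess_intent(partial)
    speculative = generate(speculative_intent) if speculative_intent else None
    final_intent = guess_intent(full)
    if speculative is not None and final_intent == speculative_intent:
        return speculative          # speculation confirmed: zero added latency
    return generate(final_intent)   # speculation wrong: regenerate from scratch

canned = {"question": "Sure, here's what I know...",
          "cancellation": "I can help with that cancellation."}
reply = respond("Can you tell me about",
                "Can you tell me about your pricing?",
                lambda intent: canned.get(intent, "Could you rephrase?"))
```

In a real system `generate` is an async LLM call launched before the endpoint fires, so the confirm branch costs nothing and only the discard branch pays full generation latency.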

Response Caching

Not every response needs to be generated from scratch. Common patterns deserve cached responses. Greetings, confirmations, clarification requests, and frequently asked questions can be pre-synthesized and stored as audio. When the intent matches a cached pattern, the system plays the pre-generated audio instantly while generating a more tailored follow-up in the background.
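A cache lookup on the matched intent is all the hot path needs. The pre-synthesized "audio" below is placeholder bytes; in practice these entries would be TTS outputs generated at deploy time. All names here are illustrative assumptions.

```python
# Hypothetical intent -> pre-synthesized audio cache, filled at deploy time.
AUDIO_CACHE = {
    "greeting": b"<pre-synthesized greeting audio>",
    "confirmation": b"<pre-synthesized confirmation audio>",
}

def first_audio(intent, synthesize):
    """Play cached audio instantly on an intent match; otherwise fall back
    to synthesizing from scratch. Returns (audio, served_from_cache)."""
    cached = AUDIO_CACHE.get(intent)
    if cached is not None:
        return cached, True   # cache hit: effectively zero synthesis latency
    return synthesize(intent), False

audio, from_cache = first_audio("greeting", lambda i: b"<fresh audio>")
```

The tailored follow-up generation would run in the background while the cached audio plays, which is what buys the pipeline its head start.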

This is similar to how human conversation works. We have stock phrases for common situations that we can produce without thinking, buying time for the more complex response that follows.

Model Distillation

The LLM is usually the largest latency contributor. Smaller models respond faster, but they're less capable. Model distillation gives you the best of both worlds: train a smaller, faster model to mimic the outputs of a larger, more capable one on your specific domain. This is the same model optimization philosophy I applied to vision models at Myelin -- different domain, same principle of trading generality for speed within a narrow use case.

For a customer service voice agent, you don't need GPT-4 level reasoning. You need fast, accurate responses within a narrow domain. A distilled model with 1-3 billion parameters, fine-tuned on your specific conversation patterns, can match the quality of a much larger model at a fraction of the inference time.
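The core of distillation is the training objective: the student is pushed toward the teacher's softened output distribution. This is a plain-Python sketch of that loss for clarity; a real setup would compute it inside a deep-learning framework over batches of logits, and the temperature value is the usual tunable knob.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's and student's softened
    distributions; minimizing it makes the student mimic the teacher."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
```

A student whose logits agree with the teacher's scores a lower loss than one that disagrees, which is exactly the gradient signal that transfers the large model's behavior into the small one.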

The Engineering Mindset

Real-time AI requires a fundamentally different way of thinking. In batch AI, you optimize for quality. In real-time AI, you optimize for the perception of quality within a time constraint. A slightly worse answer delivered in 200ms is better than a perfect answer delivered in 2 seconds.

Every millisecond matters. Every component gets scrutinized. Every architectural decision is filtered through the question: does this add latency, and if so, is the quality improvement worth it? That discipline, that obsession with the budget, is what separates a demo from a product.