Realtime Models vs STT-LLM-TTS Pipeline: Choosing the Right Architecture for Voice AI
A comprehensive comparison of speech-to-speech realtime models versus the traditional STT→LLM→TTS pipeline for building voice AI applications with LiveKit.
Building voice AI applications requires a fundamental architectural decision: should you use a realtime speech-to-speech model or the traditional STT→LLM→TTS pipeline? Each approach has distinct trade-offs in latency, cost, and programmatic control.
This guide breaks down both architectures to help you choose the right approach for your use case.
Table of Contents
- The Two Architectures
- Latency Comparison
- Cost Analysis
- Programmatic Limitations
- Customizability Deep Dive
- The Hybrid Approach
- Decision Framework
- Implementation Examples
- Conclusion
The Two Architectures
Realtime Models (Speech-to-Speech)
Realtime models consume and produce speech directly, bypassing intermediate text conversion. A single model handles the entire conversation flow.
flowchart LR
subgraph User
A[🎤 Speech Input]
F[🔊 Speech Output]
end
subgraph Agent["Realtime Model"]
B[Audio In]
C[["🧠 Single Model\n(Speech-to-Speech)"]]
D[Audio Out]
end
A --> B
B --> C
C --> D
D --> F
style C fill:#e1f5fe,stroke:#01579b
style Agent fill:#f5f5f5,stroke:#333
Available options:
- OpenAI Realtime API
- Google Gemini Live API
- xAI Grok Voice Agent API
- Amazon Nova Sonic
- Ultravox
STT→LLM→TTS Pipeline
The pipeline approach chains three specialized models together:
- STT (Speech-to-Text): Transcribes user audio to text
- LLM (Large Language Model): Generates a response
- TTS (Text-to-Speech): Synthesizes the response as audio
flowchart LR
subgraph User
A[🎤 Speech Input]
G[🔊 Speech Output]
end
subgraph Agent["Voice Pipeline"]
B[Audio In]
C[["🎯 STT\n(AssemblyAI, Deepgram, etc.)"]]
D[["🧠 LLM\n(GPT-4, Claude, Gemini, etc.)"]]
E[["🗣️ TTS\n(Cartesia, ElevenLabs, etc.)"]]
F[Audio Out]
end
A --> B
B --> C
C -->|"Text"| D
D -->|"Text"| E
E --> F
F --> G
style C fill:#fff3e0,stroke:#e65100
style D fill:#e1f5fe,stroke:#01579b
style E fill:#f3e5f5,stroke:#7b1fa2
style Agent fill:#f5f5f5,stroke:#333
This modular architecture lets you mix and match providers for each component.
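For instance, swapping the TTS provider is a one-line change while the rest of the session stays put. A minimal sketch using descriptor strings that appear elsewhere in this guide:

from livekit.agents import AgentSession

# Same STT and LLM, two different TTS providers; only the tts line differs
session_cartesia = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
)

session_elevenlabs = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="elevenlabs/flash-v2.5:voice-id",
)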
Architecture Comparison at a Glance
flowchart TB
subgraph Realtime["🚀 Realtime Model"]
direction LR
R1[Audio] --> R2[Single Model] --> R3[Audio]
end
subgraph Pipeline["🔧 STT→LLM→TTS Pipeline"]
direction LR
P1[Audio] --> P2[STT] --> P3[LLM] --> P4[TTS] --> P5[Audio]
end
subgraph Hybrid["⚡ Hybrid Approach"]
direction LR
H1[Audio] --> H2["Realtime Model\n(text mode)"] --> H3[TTS] --> H4[Audio]
end
style Realtime fill:#e8f5e9,stroke:#2e7d32
style Pipeline fill:#fff3e0,stroke:#ef6c00
style Hybrid fill:#e3f2fd,stroke:#1565c0
Latency Comparison
gantt
title Response Latency Comparison
dateFormat X
axisFormat %L ms
section Realtime
Audio Processing + Response :0, 200
section Pipeline
STT Processing :0, 100
LLM Inference :100, 250
TTS Synthesis :250, 350
section Hybrid
Realtime (text mode) :0, 180
TTS Synthesis :180, 280
Illustrative timing—actual latency varies by provider and configuration.
Realtime Models: Lower End-to-End Latency
Realtime models process audio directly without intermediate text conversion, eliminating:
- Serialization/deserialization overhead between models
- Multiple network round trips
- Text tokenization delays
Built-in turn detection runs server-side, reducing latency further.
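As a concrete example, the built-in detection can be tuned through the OpenAI plugin. This is a sketch only: the TurnDetection type and its fields come from the OpenAI Python SDK, and the exact import path may vary across SDK versions.

from livekit.agents import AgentSession
from livekit.plugins import openai
from openai.types.beta.realtime.session import TurnDetection  # path may differ by SDK version

session = AgentSession(
    llm=openai.realtime.RealtimeModel(
        turn_detection=TurnDetection(
            type="server_vad",        # server-side VAD decides when the user has finished
            threshold=0.5,
            prefix_padding_ms=300,
            silence_duration_ms=500,
        ),
    ),
)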
Pipeline: Higher Latency, More Optimization Options
The pipeline introduces latency at each stage, but offers several mitigations:
# Enable preemptive generation to start responding before turn ends
session = AgentSession(
    preemptive_generation=True,
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
)
Other optimizations include:
- Turn detector model for context-aware end-of-turn detection
- Streaming at each stage to reduce perceived latency
- Provider selection based on latency characteristics
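The turn detector model from the list above plugs into the same session object. A minimal sketch, assuming the livekit-plugins-turn-detector package is installed:

from livekit.agents import AgentSession
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    # Context-aware end-of-turn detection instead of silence-based VAD alone
    turn_detection=MultilingualModel(),
    preemptive_generation=True,
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
)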
Cost Analysis
Realtime Models: Premium Pricing
Realtime models typically carry premium per-minute pricing. Additionally, if you want LiveKit's turn detection model for more natural conversation flow, you must add a separate STT plugin, which incurs extra cost.
Pipeline: Granular Cost Control
The pipeline approach lets you optimize costs by selecting providers for each component:
| Component | Budget Option | Premium Option |
|---|---|---|
| STT | AssemblyAI ($0.0025/min) | ElevenLabs Scribe ($0.0105/min) |
| LLM | GPT-4o mini ($0.15/M input tokens) | GPT-5.2 ($1.75/M input tokens) |
| TTS | Inworld ($10/M chars) | ElevenLabs Multilingual ($300/M chars) |
A budget-conscious deployment, wired up in the sketch after this list, might use:
- AssemblyAI Universal-Streaming for STT
- GPT-4o mini for the LLM
- Deepgram Aura-1 for TTS
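Wired into a session, that budget stack might look like the following. The GPT-4o mini and Deepgram Aura-1 descriptor strings are assumptions; confirm the exact identifiers in each plugin's documentation.

from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/universal-streaming:en",  # low per-minute STT cost
    llm="openai/gpt-4o-mini",                 # budget LLM tier (assumed identifier)
    tts="deepgram/aura-asteria-en",           # assumed Aura-1 voice identifier
)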
Programmatic Limitations
Realtime Model Constraints
Realtime models come with several important limitations:
No Interim Transcription
Realtime models don’t provide interim transcription results. User input transcriptions are often delayed and may arrive after the agent’s response. If you need real-time transcripts (for UI display or logging), you’ll need to add a separate STT plugin.
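One way to get timely transcripts back is to attach an STT plugin alongside the realtime model so it handles transcription only. A sketch, with an assumed STT descriptor string:

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    # Realtime model still drives the conversation
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    # Separate STT used only to produce live user transcripts
    stt="assemblyai/universal-streaming:en",
)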
No Scripted Speech
The session.say() method requires a TTS plugin. With realtime models, you must use generate_reply():
# This won't work with realtime models alone
await session.say("Welcome to our service!")

# Instead, use generate_reply with instructions
session.generate_reply(
    instructions="Greet the user by saying exactly: Welcome to our service!"
)
The output isn’t guaranteed to match your script exactly.
Conversation History Issues
Current realtime models only support loading conversation history in text format. This limits their ability to interpret emotional context from previous exchanges. With OpenAI’s Realtime API, loading extensive history can cause the model to respond in text-only mode, even when configured for audio.
Limited Customization Points
Realtime models expose only one audio processing node:
async def realtime_audio_output_node(
    self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
) -> AsyncIterable[rtc.AudioFrame]:
    # Adjust output audio before publishing
    async for frame in Agent.default.realtime_audio_output_node(self, audio, model_settings):
        yield frame
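As an illustration of what that single hook allows, here is a hypothetical agent that lowers output volume before publishing. It assumes 16-bit PCM frames and uses numpy for the sample math; the class name is made up for the example.

import numpy as np

from livekit import rtc
from livekit.agents import Agent

class QuieterAgent(Agent):
    async def realtime_audio_output_node(self, audio, model_settings):
        async for frame in Agent.default.realtime_audio_output_node(self, audio, model_settings):
            samples = np.frombuffer(frame.data, dtype=np.int16)  # 16-bit PCM samples
            scaled = (samples * 0.8).astype(np.int16)            # reduce volume to 80%
            yield rtc.AudioFrame(
                data=scaled.tobytes(),
                sample_rate=frame.sample_rate,
                num_channels=frame.num_channels,
                samples_per_channel=frame.samples_per_channel,
            )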
Pipeline: Full Programmatic Control
The pipeline architecture exposes three customizable nodes:
flowchart LR
subgraph Customization["Customizable Pipeline Nodes"]
A[Audio In] --> B
subgraph STT["stt_node()"]
B[Transcribe Audio]
end
B -->|Text| C
subgraph LLM["llm_node()"]
C[Generate Response]
end
C -->|Text| D
subgraph TTS["tts_node()"]
D[Synthesize Speech]
end
D --> E[Audio Out]
end
B -.->|"• Filter filler words\n• Noise reduction\n• Custom STT provider"| B
C -.->|"• RAG injection\n• Prompt modification\n• Structured output"| C
D -.->|"• Pronunciation rules\n• SSML processing\n• Volume control"| D
style STT fill:#fff3e0,stroke:#e65100
style LLM fill:#e1f5fe,stroke:#01579b
style TTS fill:#f3e5f5,stroke:#7b1fa2
class MyAgent(Agent):
    async def stt_node(self, audio, model_settings):
        """Transcribe input audio to text"""
        async for event in Agent.default.stt_node(self, audio, model_settings):
            # Filter filler words, apply noise reduction, etc.
            yield event

    async def llm_node(self, chat_ctx, tools, model_settings):
        """Generate response with full context control"""
        # Inject RAG context, modify prompts, handle structured output
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            yield chunk

    async def tts_node(self, text, model_settings):
        """Synthesize speech with custom pronunciation"""
        async def adjust_pronunciation(input_text):
            async for chunk in input_text:
                # Apply pronunciation rules, SSML, etc.
                yield chunk.replace("API", "A P I")

        async for frame in Agent.default.tts_node(self, adjust_pronunciation(text), model_settings):
            yield frame
Scripted Speech Support
The pipeline fully supports exact scripted output:
# Speak exact text with optional pre-synthesized audio
await session.say(
    "Your order has been confirmed. Order number: 12345.",
    allow_interruptions=False,
)
Structured LLM Output
Pipeline agents can use structured output for TTS style control:
from typing import TypedDict

class ResponseEmotion(TypedDict):
    voice_instructions: str  # "Speak warmly and enthusiastically"
    response: str            # The actual spoken text

# LLM returns structured JSON, TTS applies voice instructions
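The glue step is small: parse the JSON the LLM returns and route each field to the right place. A standalone sketch (the helper name is hypothetical; wiring it into tts_node is left to your agent):

import json

def split_structured_reply(raw: str) -> tuple[str, str]:
    # Parse a JSON reply shaped like ResponseEmotion above
    data = json.loads(raw)
    return data["voice_instructions"], data["response"]

instructions, spoken = split_structured_reply(
    '{"voice_instructions": "Speak warmly and enthusiastically", '
    '"response": "Your order has been confirmed."}'
)
# Synthesize only `spoken`; pass `instructions` to a TTS that accepts style hints.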
Customizability Deep Dive
Where Pipeline Wins
| Capability | Description |
|---|---|
| Voice selection | Pair any TTS voice with any LLM |
| Provider flexibility | Swap STT, LLM, or TTS independently |
| Transcript manipulation | Filter/modify STT output before LLM processing |
| Response filtering | Modify LLM output before speech synthesis |
| Exact scripted speech | Use session.say() for compliance-critical output |
| Cost optimization | Mix budget and premium providers strategically |
| Real-time transcripts | Display live transcription in your UI |
Where Realtime Wins
| Capability | Description |
|---|---|
| Emotional understanding | Better comprehension of tone, hesitation, verbal cues |
| Expressive output | More natural speech with emotional context |
| Simpler configuration | Single model setup |
| Native video input | Gemini Live supports true video understanding |
| Lower baseline latency | No inter-model communication overhead |
The Hybrid Approach
You can combine realtime speech comprehension with a separate TTS for output control:
flowchart LR
subgraph User
A[🎤 Speech Input]
F[🔊 Speech Output]
end
subgraph Agent["Hybrid Architecture"]
B[Audio In]
C[["🧠 Realtime Model\n(text output mode)"]]
D[["🗣️ TTS\n(Your choice)"]]
E[Audio Out]
end
A --> B
B -->|"Audio"| C
C -->|"Text"| D
D --> E
E --> F
style C fill:#e1f5fe,stroke:#01579b
style D fill:#f3e5f5,stroke:#7b1fa2
style Agent fill:#e8f5e9,stroke:#2e7d32
from livekit.plugins import openai

session = AgentSession(
    # Realtime model for speech understanding (text output only)
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    # Your preferred TTS for speech output
    tts="cartesia/sonic-3",
)
This “half-cascade” architecture provides:
- Realtime speech comprehension with emotional understanding
- Full control over output voice and pronunciation
- Support for session.say() with exact scripts
- Workaround for conversation history issues
Decision Framework
Choose Realtime Models When:
- Emotional intelligence matters — Customer service, therapy, coaching applications
- Rapid prototyping — Get something working quickly with minimal configuration
- Video + voice integration — Using Gemini Live for multimodal input
- Latency is critical — Every millisecond counts
Choose Pipeline When:
- Compliance requirements — Need exact scripted disclosures or confirmations
- Budget constraints — Want to optimize cost per component
- Real-time transcription — UI needs to display what the user is saying
- Custom pronunciation — Industry-specific terminology, brand names
- Provider flexibility — Want to switch components without rewriting code
- Conversation history — Loading and continuing previous sessions
Choose Hybrid When:
- Best of both worlds — Realtime understanding with TTS voice control
- Specific voice requirements — Need a particular TTS provider’s voices
- Gradual migration — Moving from pipeline to realtime incrementally
Implementation Examples
Basic Pipeline Setup
from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)
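In a full worker, the session is created and started inside an entrypoint. A sketch following the standard LiveKit Agents pattern; the instructions string is illustrative:

from livekit import agents
from livekit.agents import Agent, AgentSession

async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt="assemblyai/universal-streaming:en",
        llm="openai/gpt-4.1-mini",
        tts="cartesia/sonic-3",
    )
    # Join the room and hand the conversation to the agent
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))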
Basic Realtime Setup
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
)
Hybrid Setup
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    tts="elevenlabs/flash-v2.5:voice-id",
)
Conclusion
There’s no universally “better” architecture—the right choice depends on your specific requirements:
| Priority | Recommendation |
|---|---|
| Maximum control | Pipeline |
| Lowest latency | Realtime |
| Lowest cost | Pipeline (budget providers) |
| Emotional expressiveness | Realtime or Hybrid |
| Compliance/scripted output | Pipeline |
| Simplest setup | Realtime |
For production applications requiring full control, the pipeline remains the most flexible choice. For emotionally intelligent interactions where exact output control is less critical, realtime models offer a compelling experience with simpler implementation.
The hybrid approach offers an interesting middle ground, combining realtime comprehension with TTS output control—worth considering if you’re on the fence.
This analysis is based on the LiveKit Agents documentation as of January 2026. Model capabilities and pricing evolve rapidly—always check the latest documentation for current information.