Realtime Models vs STT-LLM-TTS Pipeline: Choosing the Right Architecture for Voice AI

A comprehensive comparison of speech-to-speech realtime models versus the traditional STT→LLM→TTS pipeline for building voice AI applications with LiveKit.

Building voice AI applications requires a fundamental architectural decision: should you use a realtime speech-to-speech model or the traditional STT→LLM→TTS pipeline? Each approach has distinct trade-offs in latency, cost, and programmatic control.

This guide breaks down both architectures to help you choose the right approach for your use case.


The Two Architectures

Realtime Models (Speech-to-Speech)

Realtime models consume and produce speech directly, bypassing intermediate text conversion. A single model handles the entire conversation flow.

flowchart LR
    subgraph User
        A[🎤 Speech Input]
        F[🔊 Speech Output]
    end

    subgraph Agent["Realtime Model"]
        B[Audio In]
        C[["🧠 Single Model\n(Speech-to-Speech)"]]
        D[Audio Out]
    end

    A --> B
    B --> C
    C --> D
    D --> F

    style C fill:#e1f5fe,stroke:#01579b
    style Agent fill:#f5f5f5,stroke:#333

Available options:

  • OpenAI Realtime API
  • Google Gemini Live API
  • xAI Grok Voice Agent API
  • Amazon Nova Sonic
  • Ultravox

STT→LLM→TTS Pipeline

The pipeline approach chains three specialized models together:

  1. STT (Speech-to-Text): Transcribes user audio to text
  2. LLM (Large Language Model): Generates a response
  3. TTS (Text-to-Speech): Synthesizes the response as audio

flowchart LR
    subgraph User
        A[🎤 Speech Input]
        G[🔊 Speech Output]
    end

    subgraph Agent["Voice Pipeline"]
        B[Audio In]
        C[["🎯 STT\n(AssemblyAI, Deepgram, etc.)"]]
        D[["🧠 LLM\n(GPT-4, Claude, Gemini, etc.)"]]
        E[["🗣️ TTS\n(Cartesia, ElevenLabs, etc.)"]]
        F[Audio Out]
    end

    A --> B
    B --> C
    C -->|"Text"| D
    D -->|"Text"| E
    E --> F
    F --> G

    style C fill:#fff3e0,stroke:#e65100
    style D fill:#e1f5fe,stroke:#01579b
    style E fill:#f3e5f5,stroke:#7b1fa2
    style Agent fill:#f5f5f5,stroke:#333

This modular architecture lets you mix and match providers for each component.
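
For example, here is a minimal sketch of the same session built from explicit plugin classes rather than the string descriptors shown elsewhere in this post. The plugin packages are real LiveKit plugins, but the specific model names and the voice ID are placeholders, not recommendations.

# Sketch: the same pipeline assembled from explicit plugin objects.
# Model names and the voice ID are placeholders to swap for your own.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai

session = AgentSession(
    stt=deepgram.STT(model="nova-3"),         # or assemblyai.STT(), openai.STT(), ...
    llm=openai.LLM(model="gpt-4o-mini"),      # or anthropic.LLM(), google.LLM(), ...
    tts=cartesia.TTS(voice="your-voice-id"),  # or elevenlabs.TTS(), ...
)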

Architecture Comparison at a Glance

flowchart TB
    subgraph Realtime["🚀 Realtime Model"]
        direction LR
        R1[Audio] --> R2[Single Model] --> R3[Audio]
    end

    subgraph Pipeline["🔧 STT→LLM→TTS Pipeline"]
        direction LR
        P1[Audio] --> P2[STT] --> P3[LLM] --> P4[TTS] --> P5[Audio]
    end

    subgraph Hybrid["⚡ Hybrid Approach"]
        direction LR
        H1[Audio] --> H2["Realtime Model\n(text mode)"] --> H3[TTS] --> H4[Audio]
    end

    style Realtime fill:#e8f5e9,stroke:#2e7d32
    style Pipeline fill:#fff3e0,stroke:#ef6c00
    style Hybrid fill:#e3f2fd,stroke:#1565c0

Latency Comparison

gantt
    title Response Latency Comparison
    dateFormat X
    axisFormat %L ms

    section Realtime
    Audio Processing + Response    :0, 200

    section Pipeline
    STT Processing                 :0, 100
    LLM Inference                  :100, 250
    TTS Synthesis                  :250, 350

    section Hybrid
    Realtime (text mode)           :0, 180
    TTS Synthesis                  :180, 280

Illustrative timing—actual latency varies by provider and configuration.

Realtime Models: Lower End-to-End Latency

Realtime models process audio directly without intermediate text conversion, eliminating:

  • Serialization/deserialization overhead between models
  • Multiple network round trips
  • Text tokenization delays

Built-in turn detection runs server-side, reducing latency further.

Pipeline: Higher Latency, More Optimization Options

The pipeline introduces latency at each stage, but offers several mitigations:

# Enable preemptive generation to start responding before turn ends
session = AgentSession(
    preemptive_generation=True,
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
)

Other optimizations include:

  • Turn detector model for context-aware end-of-turn detection (sketched below)
  • Streaming at each stage to reduce perceived latency
  • Provider selection based on latency characteristics
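
Here is a rough sketch of adding the turn detector to the pipeline session. The import path and class name follow the LiveKit turn-detector plugin, but verify them against the version you have installed.

# Sketch: pipeline session with the turn detector model for context-aware
# end-of-turn detection. The detector runs on STT text; VAD still handles
# raw speech activity.
from livekit.agents import AgentSession
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
    turn_detection=MultilingualModel(),
    vad=silero.VAD.load(),
)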

Cost Analysis

Realtime Models: Premium Pricing

Realtime models typically carry premium per-minute pricing. Additionally, if you need LiveKit’s turn detection model (for more natural conversation flow), you must add a separate STT plugin—incurring extra cost.
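
As a sketch of what that combination looks like (the turn-detector import path is the same assumption as above), the realtime session ends up carrying an STT plugin alongside the realtime model:

# Sketch: realtime model plus the extra STT plugin that LiveKit's turn
# detector needs, since the detector operates on transcribed text.
from livekit.agents import AgentSession
from livekit.plugins import openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    stt="assemblyai/universal-streaming:en",  # billed per minute on top of the realtime model
    turn_detection=MultilingualModel(),
)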

Pipeline: Granular Cost Control

The pipeline approach lets you optimize costs by selecting providers for each component:

| Component | Budget Option | Premium Option |
|-----------|---------------|----------------|
| STT | AssemblyAI ($0.0025/min) | ElevenLabs Scribe ($0.0105/min) |
| LLM | GPT-4o mini ($0.15/M input tokens) | GPT-5.2 ($1.75/M input tokens) |
| TTS | Inworld ($10/M chars) | ElevenLabs Multilingual ($300/M chars) |

A budget-conscious deployment might use the following (see the sketch after this list):

  • AssemblyAI Universal-Streaming for STT
  • GPT-4o mini for the LLM
  • Deepgram Aura-1 for TTS
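
A sketch of that stack as a session config. The descriptor strings are written by analogy with the ones used earlier in this post; the Deepgram Aura string in particular is an assumption to check against current docs.

# Budget-leaning pipeline sketch. The Deepgram Aura descriptor is an
# assumption; substitute the identifier from the provider docs.
from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4o-mini",
    tts="deepgram/aura-asteria-en",
)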

Programmatic Limitations

Realtime Model Constraints

Realtime models come with several important limitations:

No Interim Transcription

Realtime models don’t provide interim transcription results. User input transcriptions are often delayed and may arrive after the agent’s response. If you need real-time transcripts (for UI display or logging), you’ll need to add a separate STT plugin.
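
A sketch of that workaround: pair the realtime model with a streaming STT plugin and subscribe to transcription events. The event and field names follow recent LiveKit Agents releases, but verify them against your version.

# Sketch: live user transcripts from a separate STT plugin while a realtime
# model drives the conversation.
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    stt="assemblyai/universal-streaming:en",  # transcripts come from here
)

@session.on("user_input_transcribed")
def on_transcript(event):
    # Forward interim and final transcripts to your UI or logs
    print(event.transcript, "(final)" if event.is_final else "(interim)")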

No Scripted Speech

The session.say() method requires a TTS plugin. With realtime models, you must use generate_reply():

# This won't work with realtime models alone
await session.say("Welcome to our service!")

# Instead, use generate_reply with instructions
session.generate_reply(
    instructions="Greet the user by saying exactly: Welcome to our service!"
)

The output isn’t guaranteed to match your script exactly.

Conversation History Issues

Current realtime models only support loading conversation history in text format. This limits their ability to interpret emotional context from previous exchanges. With OpenAI’s Realtime API, loading extensive history can cause the model to respond in text-only mode, even when configured for audio.
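
If you do need to resume an earlier conversation with a realtime model, seeding text-only history looks roughly like this. The ChatContext API names follow LiveKit Agents, but treat the details as assumptions to verify.

# Sketch: loading prior turns as text-only history. Emotional nuance from the
# original audio is not preserved.
from livekit.agents import Agent, ChatContext

chat_ctx = ChatContext.empty()
chat_ctx.add_message(role="user", content="I'd like to reschedule my appointment.")
chat_ctx.add_message(role="assistant", content="Sure, which day works best for you?")

agent = Agent(
    instructions="Continue the existing conversation naturally.",
    chat_ctx=chat_ctx,
)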

Limited Customization Points

Realtime models expose only one audio processing node:

class MyAgent(Agent):
    async def realtime_audio_output_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> AsyncIterable[rtc.AudioFrame]:
        # Adjust output audio before publishing
        async for frame in Agent.default.realtime_audio_output_node(self, audio, model_settings):
            yield frame

Pipeline: Full Programmatic Control

The pipeline architecture exposes three customizable nodes:

flowchart LR
    subgraph Customization["Customizable Pipeline Nodes"]
        A[Audio In] --> B

        subgraph STT["stt_node()"]
            B[Transcribe Audio]
        end

        B -->|Text| C

        subgraph LLM["llm_node()"]
            C[Generate Response]
        end

        C -->|Text| D

        subgraph TTS["tts_node()"]
            D[Synthesize Speech]
        end

        D --> E[Audio Out]
    end

    B -.->|"• Filter filler words\n• Noise reduction\n• Custom STT provider"| B
    C -.->|"• RAG injection\n• Prompt modification\n• Structured output"| C
    D -.->|"• Pronunciation rules\n• SSML processing\n• Volume control"| D

    style STT fill:#fff3e0,stroke:#e65100
    style LLM fill:#e1f5fe,stroke:#01579b
    style TTS fill:#f3e5f5,stroke:#7b1fa2

For example, all three nodes can be overridden on a single agent:

class MyAgent(Agent):
    async def stt_node(self, audio, model_settings):
        """Transcribe input audio to text"""
        async for event in Agent.default.stt_node(self, audio, model_settings):
            # Filter filler words, apply noise reduction, etc.
            yield event

    async def llm_node(self, chat_ctx, tools, model_settings):
        """Generate response with full context control"""
        # Inject RAG context, modify prompts, handle structured output
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            yield chunk

    async def tts_node(self, text, model_settings):
        """Synthesize speech with custom pronunciation"""
        async def adjust_pronunciation(input_text):
            async for chunk in input_text:
                # Apply pronunciation rules, SSML, etc.
                yield chunk.replace("API", "A P I")

        async for frame in Agent.default.tts_node(self, adjust_pronunciation(text), model_settings):
            yield frame

Scripted Speech Support

The pipeline fully supports exact scripted output:

# Speak exact text with optional pre-synthesized audio
await session.say(
    "Your order has been confirmed. Order number: 12345.",
    allow_interruptions=False,
)

Structured LLM Output

Pipeline agents can use structured output for TTS style control:

from typing import TypedDict

class ResponseEmotion(TypedDict):
    voice_instructions: str  # e.g., "Speak warmly and enthusiastically"
    response: str            # The actual spoken text

# The LLM returns structured JSON; the tts_node applies voice_instructions

Customizability Deep Dive

Where Pipeline Wins

| Capability | Description |
|------------|-------------|
| Voice selection | Pair any TTS voice with any LLM |
| Provider flexibility | Swap STT, LLM, or TTS independently |
| Transcript manipulation | Filter/modify STT output before LLM processing |
| Response filtering | Modify LLM output before speech synthesis |
| Exact scripted speech | Use session.say() for compliance-critical output |
| Cost optimization | Mix budget and premium providers strategically |
| Real-time transcripts | Display live transcription in your UI |

Where Realtime Wins

| Capability | Description |
|------------|-------------|
| Emotional understanding | Better comprehension of tone, hesitation, verbal cues |
| Expressive output | More natural speech with emotional context |
| Simpler configuration | Single model setup |
| Native video input | Gemini Live supports true video understanding |
| Lower baseline latency | No inter-model communication overhead |
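
The native video row applies specifically to Gemini Live through the Google plugin. Here is a hedged sketch of enabling it; the plugin path and RoomInputOptions usage follow LiveKit's docs, but confirm parameter names against the versions you use.

# Sketch: a Gemini Live realtime session with room video enabled so the model
# can reason over the user's camera or screen share.
from livekit.agents import Agent, AgentSession, JobContext, RoomInputOptions
from livekit.plugins import google

async def entrypoint(ctx: JobContext):
    session = AgentSession(
        llm=google.beta.realtime.RealtimeModel(voice="Puck"),
    )
    await session.start(
        agent=Agent(instructions="You can see and discuss the user's video feed."),
        room=ctx.room,
        room_input_options=RoomInputOptions(video_enabled=True),
    )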

The Hybrid Approach

You can combine realtime speech comprehension with a separate TTS for output control:

flowchart LR
    subgraph User
        A[🎤 Speech Input]
        F[🔊 Speech Output]
    end

    subgraph Agent["Hybrid Architecture"]
        B[Audio In]
        C[["🧠 Realtime Model\n(text output mode)"]]
        D[["🗣️ TTS\n(Your choice)"]]
        E[Audio Out]
    end

    A --> B
    B -->|"Audio"| C
    C -->|"Text"| D
    D --> E
    E --> F

    style C fill:#e1f5fe,stroke:#01579b
    style D fill:#f3e5f5,stroke:#7b1fa2
    style Agent fill:#e8f5e9,stroke:#2e7d32

In code, the hybrid session looks like this:

from livekit.plugins import openai

session = AgentSession(
    # Realtime model for speech understanding (text output only)
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    # Your preferred TTS for speech output
    tts="cartesia/sonic-3",
)

This “half-cascade” architecture provides:

  • Realtime speech comprehension with emotional understanding
  • Full control over output voice and pronunciation
  • Support for session.say() with exact scripts
  • Workaround for conversation history issues

Decision Framework

Choose Realtime Models When:

  • Emotional intelligence matters — Customer service, therapy, coaching applications
  • Rapid prototyping — Get something working quickly with minimal configuration
  • Video + voice integration — Using Gemini Live for multimodal input
  • Latency is critical — Every millisecond counts

Choose Pipeline When:

  • Compliance requirements — Need exact scripted disclosures or confirmations
  • Budget constraints — Want to optimize cost per component
  • Real-time transcription — UI needs to display what the user is saying
  • Custom pronunciation — Industry-specific terminology, brand names
  • Provider flexibility — Want to switch components without rewriting code
  • Conversation history — Loading and continuing previous sessions

Choose Hybrid When:

  • Best of both worlds — Realtime understanding with TTS voice control
  • Specific voice requirements — Need a particular TTS provider’s voices
  • Gradual migration — Moving from pipeline to realtime incrementally

Implementation Examples

Basic Pipeline Setup

from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)

Basic Realtime Setup

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
)

Hybrid Setup

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    tts="elevenlabs/flash-v2.5:voice-id",
)

Conclusion

There’s no universally “better” architecture—the right choice depends on your specific requirements:

| Priority | Recommendation |
|----------|----------------|
| Maximum control | Pipeline |
| Lowest latency | Realtime |
| Lowest cost | Pipeline (budget providers) |
| Emotional expressiveness | Realtime or Hybrid |
| Compliance/scripted output | Pipeline |
| Simplest setup | Realtime |

For production applications requiring full control, the pipeline remains the most flexible choice. For emotionally intelligent interactions where exact output control is less critical, realtime models offer a compelling experience with simpler implementation.

The hybrid approach offers an interesting middle ground, combining realtime comprehension with TTS output control—worth considering if you’re on the fence.


This analysis is based on the LiveKit Agents documentation as of January 2026. Model capabilities and pricing evolve rapidly—always check the latest documentation for current information.


This post is licensed under CC BY 4.0 by the author.