Realtime Models vs STT-LLM-TTS Pipeline: Choosing the Right Architecture for Voice AI

A comprehensive comparison of speech-to-speech realtime models versus the traditional STT→LLM→TTS pipeline for building voice AI applications with LiveKit.

Building voice AI applications requires a fundamental architectural decision: should you use a realtime speech-to-speech model or the traditional STT→LLM→TTS pipeline? Each approach has distinct trade-offs in latency, cost, and programmatic control.

This guide breaks down both architectures to help you choose the right approach for your use case.


The Two Architectures

Realtime Models (Speech-to-Speech)

Realtime models consume and produce speech directly, bypassing intermediate text conversion. A single model handles the entire conversation flow.

flowchart LR
    subgraph User
        A[🎤 Speech Input]
        F[🔊 Speech Output]
    end

    subgraph Agent["Realtime Model"]
        B[Audio In]
        C[["🧠 Single Model\n(Speech-to-Speech)"]]
        D[Audio Out]
    end

    A --> B
    B --> C
    C --> D
    D --> F

    style C fill:#e1f5fe,stroke:#01579b
    style Agent fill:#f5f5f5,stroke:#333

Available options:

  • OpenAI Realtime API
  • Google Gemini Live API
  • xAI Grok Voice Agent API
  • Amazon Nova Sonic
  • Ultravox

STT→LLM→TTS Pipeline

The pipeline approach chains three specialized models together:

  1. STT (Speech-to-Text): Transcribes user audio to text
  2. LLM (Large Language Model): Generates a response
  3. TTS (Text-to-Speech): Synthesizes the response as audio

flowchart LR
    subgraph User
        A[🎤 Speech Input]
        G[🔊 Speech Output]
    end

    subgraph Agent["Voice Pipeline"]
        B[Audio In]
        C[["🎯 STT\n(AssemblyAI, Deepgram, etc.)"]]
        D[["🧠 LLM\n(GPT-4, Claude, Gemini, etc.)"]]
        E[["🗣️ TTS\n(Cartesia, ElevenLabs, etc.)"]]
        F[Audio Out]
    end

    A --> B
    B --> C
    C -->|"Text"| D
    D -->|"Text"| E
    E --> F
    F --> G

    style C fill:#fff3e0,stroke:#e65100
    style D fill:#e1f5fe,stroke:#01579b
    style E fill:#f3e5f5,stroke:#7b1fa2
    style Agent fill:#f5f5f5,stroke:#333

This modular architecture lets you mix and match providers for each component.
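
For example, here is a minimal sketch of the same session built from explicit plugin classes rather than the string descriptors shown elsewhere in this post. The plugin packages are real LiveKit plugins, but the specific model names and the voice ID are placeholders, not recommendations.

# Sketch: the same pipeline assembled from explicit plugin objects.
# Model names and the voice ID are placeholders to swap for your own.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai

session = AgentSession(
    stt=deepgram.STT(model="nova-3"),         # or assemblyai.STT(), openai.STT(), ...
    llm=openai.LLM(model="gpt-4o-mini"),      # or anthropic.LLM(), google.LLM(), ...
    tts=cartesia.TTS(voice="your-voice-id"),  # or elevenlabs.TTS(), ...
)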

Architecture Comparison at a Glance

flowchart TB
    subgraph Realtime["🚀 Realtime Model"]
        direction LR
        R1[Audio] --> R2[Single Model] --> R3[Audio]
    end

    subgraph Pipeline["🔧 STT→LLM→TTS Pipeline"]
        direction LR
        P1[Audio] --> P2[STT] --> P3[LLM] --> P4[TTS] --> P5[Audio]
    end

    subgraph Hybrid["⚡ Hybrid Approach"]
        direction LR
        H1[Audio] --> H2["Realtime Model\n(text mode)"] --> H3[TTS] --> H4[Audio]
    end

    style Realtime fill:#e8f5e9,stroke:#2e7d32
    style Pipeline fill:#fff3e0,stroke:#ef6c00
    style Hybrid fill:#e3f2fd,stroke:#1565c0

Latency Comparison

gantt
    title Response Latency Comparison
    dateFormat X
    axisFormat %L ms

    section Realtime
    Audio Processing + Response    :0, 200

    section Pipeline
    STT Processing                 :0, 100
    LLM Inference                  :100, 250
    TTS Synthesis                  :250, 350

    section Hybrid
    Realtime (text mode)           :0, 180
    TTS Synthesis                  :180, 280

Illustrative timing—actual latency varies by provider and configuration.

Realtime Models: Lower End-to-End Latency

Realtime models process audio directly without intermediate text conversion, eliminating:

  • Serialization/deserialization overhead between models
  • Multiple network round trips
  • Text tokenization delays

Built-in turn detection runs server-side, reducing latency further.

Pipeline: Higher Latency, More Optimization Options

The pipeline introduces latency at each stage, but offers several mitigations:

# Enable preemptive generation to start responding before turn ends
session = AgentSession(
    preemptive_generation=True,
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
)

Other optimizations include:

  • Turn detector model for context-aware end-of-turn detection (sketched below)
  • Streaming at each stage to reduce perceived latency
  • Provider selection based on latency characteristics
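
Here is a rough sketch of adding the turn detector to the pipeline session. The import path and class name follow the LiveKit turn-detector plugin, but verify them against the version you have installed.

# Sketch: pipeline session with the turn detector model for context-aware
# end-of-turn detection. The detector runs on STT text; VAD still handles
# raw speech activity.
from livekit.agents import AgentSession
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
    turn_detection=MultilingualModel(),
    vad=silero.VAD.load(),
)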

Cost Analysis

Realtime Models: Premium Pricing

Realtime models typically carry premium per-minute pricing. Additionally, if you need LiveKit’s turn detection model (for more natural conversation flow), you must add a separate STT plugin—incurring extra cost.
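
As a sketch of what that combination looks like (the turn-detector import path is the same assumption as above), the realtime session ends up carrying an STT plugin alongside the realtime model:

# Sketch: realtime model plus the extra STT plugin that LiveKit's turn
# detector needs, since the detector operates on transcribed text.
from livekit.agents import AgentSession
from livekit.plugins import openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    stt="assemblyai/universal-streaming:en",  # billed per minute on top of the realtime model
    turn_detection=MultilingualModel(),
)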

Pipeline: Granular Cost Control

The pipeline approach lets you optimize costs by selecting providers for each component:

| Component | Budget Option | Premium Option |
|-----------|---------------|----------------|
| STT | AssemblyAI ($0.0025/min) | ElevenLabs Scribe ($0.0105/min) |
| LLM | GPT-4o mini ($0.15/M input tokens) | GPT-5.2 ($1.75/M input tokens) |
| TTS | Inworld ($10/M chars) | ElevenLabs Multilingual ($300/M chars) |

A budget-conscious deployment might use the following (see the sketch after this list):

  • AssemblyAI Universal-Streaming for STT
  • GPT-4o mini for the LLM
  • Deepgram Aura-1 for TTS
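
A sketch of that stack as a session config. The descriptor strings are written by analogy with the ones used earlier in this post; the Deepgram Aura string in particular is an assumption to check against current docs.

# Budget-leaning pipeline sketch. The Deepgram Aura descriptor is an
# assumption; substitute the identifier from the provider docs.
from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4o-mini",
    tts="deepgram/aura-asteria-en",
)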

Programmatic Limitations

Realtime Model Constraints

Realtime models come with several important limitations:

No Interim Transcription

Realtime models don’t provide interim transcription results. User input transcriptions are often delayed and may arrive after the agent’s response. If you need real-time transcripts (for UI display or logging), you’ll need to add a separate STT plugin.
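
A sketch of that workaround: pair the realtime model with a streaming STT plugin and subscribe to transcription events. The event and field names follow recent LiveKit Agents releases, but verify them against your version.

# Sketch: live user transcripts from a separate STT plugin while a realtime
# model drives the conversation.
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
    stt="assemblyai/universal-streaming:en",  # transcripts come from here
)

@session.on("user_input_transcribed")
def on_transcript(event):
    # Forward interim and final transcripts to your UI or logs
    print(event.transcript, "(final)" if event.is_final else "(interim)")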

No Scripted Speech

The session.say() method requires a TTS plugin. With realtime models, you must use generate_reply():

# This won't work with realtime models alone
await session.say("Welcome to our service!")

# Instead, use generate_reply with instructions
session.generate_reply(
    instructions="Greet the user by saying exactly: Welcome to our service!"
)

The output isn’t guaranteed to match your script exactly.

Conversation History Issues

Current realtime models only support loading conversation history in text format. This limits their ability to interpret emotional context from previous exchanges. With OpenAI’s Realtime API, loading extensive history can cause the model to respond in text-only mode, even when configured for audio.
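
If you do need to resume an earlier conversation with a realtime model, seeding text-only history looks roughly like this. The ChatContext API names follow LiveKit Agents, but treat the details as assumptions to verify.

# Sketch: loading prior turns as text-only history. Emotional nuance from the
# original audio is not preserved.
from livekit.agents import Agent, ChatContext

chat_ctx = ChatContext.empty()
chat_ctx.add_message(role="user", content="I'd like to reschedule my appointment.")
chat_ctx.add_message(role="assistant", content="Sure, which day works best for you?")

agent = Agent(
    instructions="Continue the existing conversation naturally.",
    chat_ctx=chat_ctx,
)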

Limited Customization Points

Realtime models expose only one audio processing node:

class MyAgent(Agent):
    async def realtime_audio_output_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> AsyncIterable[rtc.AudioFrame]:
        # Adjust output audio before publishing
        async for frame in Agent.default.realtime_audio_output_node(self, audio, model_settings):
            yield frame

Pipeline: Full Programmatic Control

The pipeline architecture exposes three customizable nodes:

flowchart LR
    subgraph Customization["Customizable Pipeline Nodes"]
        A[Audio In] --> B

        subgraph STT["stt_node()"]
            B[Transcribe Audio]
        end

        B -->|Text| C

        subgraph LLM["llm_node()"]
            C[Generate Response]
        end

        C -->|Text| D

        subgraph TTS["tts_node()"]
            D[Synthesize Speech]
        end

        D --> E[Audio Out]
    end

    B -.->|"• Filter filler words\n• Noise reduction\n• Custom STT provider"| B
    C -.->|"• RAG injection\n• Prompt modification\n• Structured output"| C
    D -.->|"• Pronunciation rules\n• SSML processing\n• Volume control"| D

    style STT fill:#fff3e0,stroke:#e65100
    style LLM fill:#e1f5fe,stroke:#01579b
    style TTS fill:#f3e5f5,stroke:#7b1fa2

For example, all three nodes can be overridden on a single agent:

class MyAgent(Agent):
    async def stt_node(self, audio, model_settings):
        """Transcribe input audio to text"""
        async for event in Agent.default.stt_node(self, audio, model_settings):
            # Filter filler words, apply noise reduction, etc.
            yield event

    async def llm_node(self, chat_ctx, tools, model_settings):
        """Generate response with full context control"""
        # Inject RAG context, modify prompts, handle structured output
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            yield chunk

    async def tts_node(self, text, model_settings):
        """Synthesize speech with custom pronunciation"""
        async def adjust_pronunciation(input_text):
            async for chunk in input_text:
                # Apply pronunciation rules, SSML, etc.
                yield chunk.replace("API", "A P I")

        async for frame in Agent.default.tts_node(self, adjust_pronunciation(text), model_settings):
            yield frame

Scripted Speech Support

The pipeline fully supports exact scripted output:

# Speak exact text with optional pre-synthesized audio
await session.say(
    "Your order has been confirmed. Order number: 12345.",
    allow_interruptions=False,
)

Structured LLM Output

Pipeline agents can use structured output for TTS style control:

from typing import TypedDict

class ResponseEmotion(TypedDict):
    voice_instructions: str  # e.g., "Speak warmly and enthusiastically"
    response: str            # The actual spoken text

# The LLM returns structured JSON; the tts_node applies voice_instructions

Customizability Deep Dive

Where Pipeline Wins

| Capability | Description |
|------------|-------------|
| Voice selection | Pair any TTS voice with any LLM |
| Provider flexibility | Swap STT, LLM, or TTS independently |
| Transcript manipulation | Filter/modify STT output before LLM processing |
| Response filtering | Modify LLM output before speech synthesis |
| Exact scripted speech | Use session.say() for compliance-critical output |
| Cost optimization | Mix budget and premium providers strategically |
| Real-time transcripts | Display live transcription in your UI |

Where Realtime Wins

| Capability | Description |
|------------|-------------|
| Emotional understanding | Better comprehension of tone, hesitation, verbal cues |
| Expressive output | More natural speech with emotional context |
| Simpler configuration | Single model setup |
| Native video input | Gemini Live supports true video understanding |
| Lower baseline latency | No inter-model communication overhead |
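
The native video row applies specifically to Gemini Live through the Google plugin. Here is a hedged sketch of enabling it; the plugin path and RoomInputOptions usage follow LiveKit's docs, but confirm parameter names against the versions you use.

# Sketch: a Gemini Live realtime session with room video enabled so the model
# can reason over the user's camera or screen share.
from livekit.agents import Agent, AgentSession, JobContext, RoomInputOptions
from livekit.plugins import google

async def entrypoint(ctx: JobContext):
    session = AgentSession(
        llm=google.beta.realtime.RealtimeModel(voice="Puck"),
    )
    await session.start(
        agent=Agent(instructions="You can see and discuss the user's video feed."),
        room=ctx.room,
        room_input_options=RoomInputOptions(video_enabled=True),
    )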

The Hybrid Approach

You can combine realtime speech comprehension with a separate TTS for output control:

flowchart LR
    subgraph User
        A[🎤 Speech Input]
        F[🔊 Speech Output]
    end

    subgraph Agent["Hybrid Architecture"]
        B[Audio In]
        C[["🧠 Realtime Model\n(text output mode)"]]
        D[["🗣️ TTS\n(Your choice)"]]
        E[Audio Out]
    end

    A --> B
    B -->|"Audio"| C
    C -->|"Text"| D
    D --> E
    E --> F

    style C fill:#e1f5fe,stroke:#01579b
    style D fill:#f3e5f5,stroke:#7b1fa2
    style Agent fill:#e8f5e9,stroke:#2e7d32

In code, the hybrid session looks like this:

from livekit.plugins import openai

session = AgentSession(
    # Realtime model for speech understanding (text output only)
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    # Your preferred TTS for speech output
    tts="cartesia/sonic-3",
)

This “half-cascade” architecture provides:

  • Realtime speech comprehension with emotional understanding
  • Full control over output voice and pronunciation
  • Support for session.say() with exact scripts
  • Workaround for conversation history issues

Decision Framework

Choose Realtime Models When:

  • Emotional intelligence matters — Customer service, therapy, coaching applications
  • Rapid prototyping — Get something working quickly with minimal configuration
  • Video + voice integration — Using Gemini Live for multimodal input
  • Latency is critical — Every millisecond counts

Choose Pipeline When:

  • Compliance requirements — Need exact scripted disclosures or confirmations
  • Budget constraints — Want to optimize cost per component
  • Real-time transcription — UI needs to display what the user is saying
  • Custom pronunciation — Industry-specific terminology, brand names
  • Provider flexibility — Want to switch components without rewriting code
  • Conversation history — Loading and continuing previous sessions

Choose Hybrid When:

  • Best of both worlds — Realtime understanding with TTS voice control
  • Specific voice requirements — Need a particular TTS provider’s voices
  • Gradual migration — Moving from pipeline to realtime incrementally

Implementation Examples

Basic Pipeline Setup

from livekit.agents import AgentSession

session = AgentSession(
    stt="assemblyai/universal-streaming:en",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)

Basic Realtime Setup

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
)

Hybrid Setup

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    tts="elevenlabs/flash-v2.5:voice-id",
)

Conclusion

There’s no universally “better” architecture—the right choice depends on your specific requirements:

| Priority | Recommendation |
|----------|----------------|
| Maximum control | Pipeline |
| Lowest latency | Realtime |
| Lowest cost | Pipeline (budget providers) |
| Emotional expressiveness | Realtime or Hybrid |
| Compliance/scripted output | Pipeline |
| Simplest setup | Realtime |

For production applications requiring full control, the pipeline remains the most flexible choice. For emotionally intelligent interactions where exact output control is less critical, realtime models offer a compelling experience with simpler implementation.

The hybrid approach offers an interesting middle ground, combining realtime comprehension with TTS output control—worth considering if you’re on the fence.


This analysis is based on the LiveKit Agents documentation as of January 2026. Model capabilities and pricing evolve rapidly—always check the latest documentation for current information.


This post is licensed under CC BY 4.0 by the author.