The Complete Guide to LiveKit Voice AI: STT, TTS, LLM & Realtime Models Compared

A comprehensive comparison of all Speech-to-Text, Text-to-Speech, LLM, and Realtime model options available in LiveKit Agents, including pricing, pros/cons, and recommendations.

Building voice AI applications requires choosing the right combination of models for speech recognition, language understanding, and speech synthesis. LiveKit Agents provides a unified framework for integrating these components, but with dozens of options available, selecting the right stack can be overwhelming.

This guide breaks down every option available in LiveKit, complete with pricing, pros and cons, and recommendations for different use cases.

Speech-to-Text (STT) Options

STT models transcribe spoken audio into text in real-time. They’re the first component in the traditional voice AI pipeline.

LiveKit Inference (Managed)

These are available through LiveKit Cloud with automatic billing—no separate API keys needed.

Deepgram

| Model | Price/min | Languages |
|---|---|---|
| Flux | $0.0077 | English only |
| Nova-3 (mono) | $0.0077 | 8 languages |
| Nova-3 (multi) | $0.0092 | Multilingual |
| Nova-2 | $0.0058 | 33 languages |
| Nova-2 Medical/Phone | $0.0058 | English |

Pros:

  • Excellent accuracy, especially Nova-3
  • Built-in phrase endpointing for turn detection (Flux)
  • Strong multilingual support
  • Medical-specific model available
  • Native streaming support

Cons:

  • Higher cost than AssemblyAI
  • Flux is English-only

AssemblyAI

| Model | Price/min | Languages |
|---|---|---|
| Universal-Streaming | $0.0025 | English |
| Universal-Streaming-Multilingual | $0.0025 | 6 languages |

Pros:

  • Cheapest option at $0.0025/min
  • Sophisticated semantic turn detection
  • Good accuracy
  • Extra params: keyterms_prompt, confidence thresholds

Cons:

  • Limited language support (6 languages)
  • Fewer model variants than Deepgram

Cartesia (Ink Whisper)

| Model | Price/min | Languages |
|---|---|---|
| Ink Whisper | $0.0030 ($0.0023 Scale) | 98 languages |

Pros:

  • Widest language support (98 languages)
  • Low cost
  • Good for international apps

Cons:

  • Newer, less proven than Deepgram/AssemblyAI
  • No built-in turn detection

ElevenLabs (Scribe V2)

| Model | Price/min | Languages |
|---|---|---|
| Scribe V2 Realtime | $0.0105 | 41 languages |

Pros:

  • Strong multilingual support (41 languages)
  • Automatic language detection
  • Good for pairing with ElevenLabs TTS

Cons:

  • Most expensive option
  • No built-in turn detection

Plugin-Based STT Options (Bring Your Own Key)

| Provider | Streaming | Turn Detection | Languages | Notes |
|---|---|---|---|---|
| OpenAI (gpt-4o-transcribe) | No* | No | 57+ | Very accurate, needs VAD/StreamAdapter |
| Groq (Whisper) | No* | No | 57+ | Fast inference, cheap via Groq pricing |
| Google Cloud | Yes | No | 125+ | Enterprise, requires GCP account |
| Azure Speech | Yes | No | 100+ | Enterprise, good accuracy |

*Requires VAD + StreamAdapter for streaming use
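For the non-streaming providers, LiveKit's `StreamAdapter` does the wrapping. A configuration sketch (assuming the `livekit-agents`, OpenAI, and Silero plugin packages are installed; treat the model string as illustrative):

```python
from livekit.agents import stt
from livekit.plugins import openai, silero

# Non-streaming STTs (OpenAI, Groq Whisper) need a VAD to segment audio.
# StreamAdapter buffers speech between VAD events and sends each complete
# utterance to the underlying STT as a single request.
vad = silero.VAD.load()
wrapped_stt = stt.StreamAdapter(
    stt=openai.STT(model="gpt-4o-transcribe"),
    vad=vad,
)
# wrapped_stt can now be passed anywhere a streaming STT is expected,
# e.g. AgentSession(stt=wrapped_stt, ...)
```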

STT Recommendation Guide

| Use Case | Recommendation |
|---|---|
| Lowest cost | AssemblyAI ($0.0025/min) |
| Best accuracy (English) | Deepgram Nova-3 or Flux |
| Most languages | Cartesia Ink Whisper (98 langs) |
| Turn detection built-in | Deepgram Flux or AssemblyAI |
| Enterprise/compliance | Azure or Google Cloud |
| Budget + multilingual | Cartesia ($0.003/min, 98 langs) |

STT Pricing Summary (per minute)

AssemblyAI:     $0.0025  (cheapest)
Cartesia:       $0.0030
Deepgram Nova-2: $0.0058
Deepgram Nova-3: $0.0077
ElevenLabs:     $0.0105  (most expensive)
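At these per-minute rates, monthly STT spend scales linearly with audio volume. A quick back-of-the-envelope helper (rates hard-coded from the summary above; the 10,000-minute figure is just an example volume):

```python
# Per-minute STT rates from the summary above (USD)
STT_RATES = {
    "AssemblyAI": 0.0025,
    "Cartesia": 0.0030,
    "Deepgram Nova-2": 0.0058,
    "Deepgram Nova-3": 0.0077,
    "ElevenLabs": 0.0105,
}

def monthly_stt_cost(provider: str, minutes_per_month: float) -> float:
    """Estimated monthly STT cost in USD for a given audio volume."""
    return round(STT_RATES[provider] * minutes_per_month, 2)

# 10,000 minutes/month (~5.5 hours of audio per day):
print(monthly_stt_cost("AssemblyAI", 10_000))   # → 25.0
print(monthly_stt_cost("ElevenLabs", 10_000))   # → 105.0
```

At that volume the cheapest and most expensive options differ by about $80/month.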

Text-to-Speech (TTS) Options

TTS models convert text into natural-sounding speech. Quality, latency, and voice selection vary significantly between providers.

LiveKit Inference (Managed)

Inworld

| Model | Price/1M chars | Languages |
|---|---|---|
| Inworld TTS 1 | $5.00 | 12 languages |
| Inworld TTS 1 Max | $10.00 | 12 languages |

Pros:

  • Cheapest option ($5/1M chars)
  • Good multilingual support (12 languages)
  • Natural, warm voices
  • Designed for interactive/gaming use cases

Cons:

  • Smaller voice library than ElevenLabs/Cartesia
  • No custom voice cloning via Inference
  • Less known in voice AI space

Deepgram (Aura)

| Model | Price/1M chars | Languages |
|---|---|---|
| Aura-1 | $15.00 | English only |
| Aura-2 | $30.00 ($27 Scale) | English, Spanish |

Pros:

  • Low cost for English-only apps
  • Good for pairing with Deepgram STT (same vendor)
  • Professional, natural-sounding voices
  • Fast latency

Cons:

  • Very limited language support (English + Spanish only)
  • Smaller voice selection
  • No voice cloning

Rime

| Model | Price/1M chars | Languages |
|---|---|---|
| Mist V2 | $30.00 ($20 Scale) | 4 languages |
| Arcana V2 | $40.00 ($30 Scale) | 4 languages |

Pros:

  • Good voice quality
  • Distinctive voice personalities (Gen-Z, expressive)
  • Good for younger/casual brand voices

Cons:

  • Limited language support (en, es, fr, de only)
  • Smaller voice library
  • Mid-range pricing

Cartesia (Sonic)

| Model | Price/1M chars | Languages |
|---|---|---|
| Sonic / Sonic-2 / Sonic-3 / Turbo | $50.00 ($37.50 Scale) | 15-42 languages |

Pros:

  • Excellent voice quality
  • Strong multilingual support (Sonic-3: 42 languages)
  • Emotion controls (excited, sad, etc.)
  • Speed and volume adjustments
  • Large voice library
  • Lowest latency (time-to-first-byte)

Cons:

  • Mid-high pricing
  • No voice cloning via Inference
  • Same price across all model variants

ElevenLabs

| Model | Price/1M chars | Languages |
|---|---|---|
| Flash v2/v2.5 | $150.00 ($60 Scale) | 1-32 languages |
| Turbo v2/v2.5 | $150.00 ($60 Scale) | 1-32 languages |
| Multilingual v2 | $300.00 ($120 Scale) | 29 languages |

Pros:

  • Best voice quality (most human-like)
  • Huge voice library
  • Industry-leading expressiveness
  • Voice cloning support (via plugin)
  • Good multilingual support

Cons:

  • Most expensive option by far
  • No custom/cloned voices via Inference
  • Flash vs Turbo pricing identical

Plugin-Based TTS Options (Bring Your Own Key)

| Provider | Voice Cloning | Languages | Notes |
|---|---|---|---|
| OpenAI | No | 57+ | Simple, good quality, limited voices |
| Azure Speech | Yes | 100+ | Enterprise, SSML support |
| Google Cloud | No | 40+ | WaveNet/Neural2 voices |
| Amazon Polly | No | 30+ | NTTS voices, cost-effective |
| Hume | No | English | Emotionally expressive AI |
| LMNT | Yes | English | Fast, low latency |
| PlayHT | Yes | 29+ | Good cloning, emotions |

TTS Recommendation Guide

| Use Case | Recommendation |
|---|---|
| Lowest cost | Inworld ($5/1M chars) |
| Best quality | ElevenLabs (premium) or Cartesia |
| English-only budget | Deepgram Aura-1 ($15/1M) |
| Most languages | Cartesia Sonic-3 (42 langs) |
| Voice cloning | ElevenLabs or PlayHT (plugin) |
| Gaming/interactive | Inworld |
| Casual/young brand | Rime Arcana |
| Enterprise/compliance | Azure or Google Cloud |
| Lowest latency | Cartesia Sonic |

TTS Pricing Summary (per 1M characters)

Inworld TTS 1:      $5.00   (cheapest)
Inworld TTS 1 Max:  $10.00
Deepgram Aura-1:    $15.00
Deepgram Aura-2:    $30.00
Rime Mist V2:       $30.00
Rime Arcana:        $40.00
Cartesia Sonic:     $50.00
ElevenLabs Turbo:   $150.00
ElevenLabs Multi:   $300.00 (most expensive)

TTS Cost per Hour of Speech Output

Assuming ~15,000 characters per hour of speech:

| Provider | Cost/hour |
|---|---|
| Inworld TTS 1 | ~$0.075 |
| Deepgram Aura-1 | ~$0.23 |
| Cartesia | ~$0.75 |
| ElevenLabs Turbo | ~$2.25 |
| ElevenLabs Multi | ~$4.50 |
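These per-hour figures follow directly from the per-character rates; a sketch of the arithmetic, using the ~15,000 chars/hour assumption stated above:

```python
CHARS_PER_HOUR = 15_000  # rough speech output per hour, as assumed above

def tts_cost_per_hour(price_per_million_chars: float) -> float:
    """Cost in USD for one hour of synthesized speech."""
    return CHARS_PER_HOUR / 1_000_000 * price_per_million_chars

print(tts_cost_per_hour(5.00))    # Inworld TTS 1 → 0.075
print(tts_cost_per_hour(300.00))  # ElevenLabs Multilingual → 4.5
```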

Large Language Model (LLM) Options

The LLM is the brain of your voice agent, handling reasoning, responses, and tool orchestration.

LiveKit Inference (Managed)

OpenAI GPT Models

| Model | Input/1M | Cached/1M | Output/1M |
|---|---|---|---|
| GPT-4o | $2.50 | $1.25 | $10.00 |
| GPT-4o mini | $0.15 | $0.075 | $0.60 |
| GPT-4.1 | $2.00 | $0.50 | $8.00 |
| GPT-4.1 mini | $0.40 | $0.10 | $1.60 |
| GPT-4.1 nano | $0.10 | $0.025 | $0.40 |
| GPT-5 | $1.25 | $0.125 | $10.00 |
| GPT-5 mini | $0.25 | $0.025 | $2.00 |
| GPT-5 nano | $0.05 | $0.005 | $0.40 |
| GPT-5.1 | $1.25 | $0.125 | $10.00 |
| GPT-5.2 | $1.75 | $0.175 | $14.00 |

Pros:

  • Industry standard, excellent tool calling
  • Wide range of model sizes (nano to full)
  • Strong reasoning capabilities (GPT-5+)
  • Cached input discounts (50-75% off)
  • Available via Azure or OpenAI endpoints

Cons:

  • Premium pricing for flagship models
  • Output tokens more expensive than input

Google Gemini

| Model | Input/1M | Cached/1M | Output/1M |
|---|---|---|---|
| Gemini 2.0 Flash Lite | $0.075 | N/A | $0.30 |
| Gemini 2.0 Flash | $0.10 | N/A | $0.40 |
| Gemini 2.5 Flash Lite | $0.10 | $0.01 | $0.40 |
| Gemini 2.5 Flash | $0.30 | $0.03 | $2.50 |
| Gemini 2.5 Pro | $2.50 | $0.25 | $15.00 |
| Gemini 3 Flash | $0.50 | $0.05 | $3.00 |
| Gemini 3 Pro | $4.00 | $0.40 | $18.00 |

Pros:

  • Cheapest options available (2.0 Flash Lite: $0.075 input)
  • Excellent multimodal (vision) support
  • Strong reasoning at lower cost than GPT
  • Good tool calling support
  • Huge context windows

Cons:

  • Some models still in preview (Gemini 3)
  • Less ecosystem tooling than OpenAI

DeepSeek

| Model | Input/1M | Output/1M |
|---|---|---|
| DeepSeek V3 | $0.77 | $0.77 |
| DeepSeek V3.2 | $0.30 | $0.45 |

Pros:

  • Excellent value for quality
  • Strong coding abilities
  • Good reasoning performance
  • Competitive with GPT-4 class models

Cons:

  • No cached input pricing
  • Fewer model variants
  • Limited provider options (Baseten only)

Kimi K2

| Model | Input/1M | Output/1M |
|---|---|---|
| Kimi K2 Instruct | $0.60 | $2.50 |

Pros:

  • Good reasoning capabilities
  • Competitive pricing
  • Strong at complex tasks

Cons:

  • Single model option
  • Less established than OpenAI/Google
  • No cached input pricing

GPT-OSS 120B (Open Source)

| Provider | Input/1M | Cached/1M | Output/1M |
|---|---|---|---|
| Groq | $0.15 | $0.075 | $0.60 |
| Cerebras | $0.35 | N/A | $0.75 |

Pros:

  • Open source model
  • Very low cost via Groq
  • Fast inference (especially Groq)
  • No vendor lock-in

Cons:

  • Less capable than proprietary models
  • Limited to specific providers

Plugin-Based LLM Options (Bring Your Own Key)

Anthropic Claude

| Model | Input/1M | Output/1M | Notes |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best balanced |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast & cheap |
| Claude 3 Opus | $15.00 | $75.00 | Most capable |

Pros:

  • Excellent reasoning and instruction following
  • Strong safety features
  • Great at nuanced conversations
  • Parallel tool calls support

Cons:

  • Not available via LiveKit Inference
  • Higher output pricing
  • Python plugin only

Groq (Direct)

| Model | Input/1M | Output/1M |
|---|---|---|
| Llama 3.3 70B | $0.59 | $0.79 |
| Llama 3.1 8B | $0.05 | $0.08 |
| Mixtral 8x7B | $0.24 | $0.24 |

Pros:

  • Fastest inference (LPU hardware)
  • Open source models
  • Very low latency for voice AI
  • Great for real-time applications

Cons:

  • Limited model selection
  • Open source models less capable
  • Rate limits on free tier

Other Plugin Options

| Provider | Best For |
|---|---|
| Azure OpenAI | Enterprise, compliance, regional deployment |
| Amazon Bedrock | AWS ecosystem, Claude access |
| Mistral AI | European hosting, open weights |
| Together AI | Wide model selection, fine-tuning |
| Fireworks | Fast inference, function calling |
| Ollama | Local/self-hosted, privacy |
| Perplexity | Search-augmented responses |

LLM Recommendation Guide

| Use Case | Recommendation |
|---|---|
| Lowest cost | Gemini 2.0 Flash Lite ($0.075/$0.30) |
| Best quality | GPT-5.2 or Claude 3.5 Sonnet |
| Best value | DeepSeek V3.2 or Gemini 2.5 Flash |
| Fastest inference | Groq (Llama models) |
| Voice AI optimized | GPT-4.1 mini or Gemini 2.5 Flash |
| Complex reasoning | GPT-5 or Claude 3 Opus |
| Budget + good quality | GPT-4o mini or GPT-OSS 120B |
| Enterprise/compliance | Azure OpenAI |
| Self-hosted | Ollama |

LLM Pricing Summary (per 1M tokens, Input/Output)

BUDGET TIER:
GPT-5 nano:             $0.05  / $0.40  (cheapest input)
Gemini 2.0 Flash Lite:  $0.075 / $0.30  (cheapest output)
GPT-4.1 nano:           $0.10  / $0.40
GPT-OSS 120B (Groq):    $0.15  / $0.60
GPT-4o mini:            $0.15  / $0.60

MID TIER:
DeepSeek V3.2:          $0.30  / $0.45
Gemini 2.5 Flash:       $0.30  / $2.50
GPT-4.1 mini:           $0.40  / $1.60
Kimi K2:                $0.60  / $2.50

PREMIUM TIER:
GPT-5:                  $1.25  / $10.00
GPT-4.1:                $2.00  / $8.00
Gemini 2.5 Pro:         $2.50  / $15.00
GPT-4o:                 $2.50  / $10.00
Claude 3.5 Sonnet:      $3.00  / $15.00
Gemini 3 Pro:           $4.00  / $18.00

LLM Cost per Hour of Voice Conversation

Assuming ~20K input tokens + ~5K output tokens per hour:

| Model | Est. Cost/hour |
|---|---|
| Gemini 2.0 Flash Lite | ~$0.003 |
| GPT-5 nano | ~$0.003 |
| GPT-OSS 120B (Groq) | ~$0.006 |
| GPT-4o mini | ~$0.006 |
| DeepSeek V3.2 | ~$0.008 |
| GPT-4.1 mini | ~$0.016 |
| Gemini 2.5 Flash | ~$0.019 |
| GPT-4o | ~$0.10 |
| Claude 3.5 Sonnet | ~$0.14 |
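These per-hour estimates come straight from the ~20K input / ~5K output token assumption; the arithmetic as a sketch:

```python
INPUT_TOKENS_PER_HOUR = 20_000   # assumption stated above
OUTPUT_TOKENS_PER_HOUR = 5_000

def llm_cost_per_hour(input_per_m: float, output_per_m: float) -> float:
    """Estimated LLM cost in USD per conversation-hour."""
    return (INPUT_TOKENS_PER_HOUR * input_per_m
            + OUTPUT_TOKENS_PER_HOUR * output_per_m) / 1_000_000

print(llm_cost_per_hour(2.50, 10.00))   # GPT-4o → 0.1
print(llm_cost_per_hour(0.075, 0.30))   # Gemini 2.0 Flash Lite → 0.003
```

Note how output pricing dominates for "thinking-heavy" models even though output volume is a quarter of input volume.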

Realtime Model Options

Realtime models process speech directly (speech-to-speech), bypassing the traditional STT→LLM→TTS pipeline. They understand emotional context and verbal cues better than text-based pipelines.

All realtime models are plugin-based (bring your own API key).

OpenAI Realtime API

| Model | Audio Input/1M | Audio Output/1M | Text Input/1M | Text Output/1M |
|---|---|---|---|---|
| gpt-realtime | $32 | $64 | $5 | $20 |
| gpt-realtime-mini | ~$10 | ~$20 | $0.60 | $2.40 |
| gpt-4o-realtime (legacy) | $100 | $200 | $5 | $20 |

Per-minute estimate: ~$0.06 input + $0.24 output = ~$0.30/min

Pros:

  • Most mature realtime API
  • Excellent voice quality and expressiveness
  • Multiple voices available (alloy, marin, etc.)
  • Semantic VAD for intelligent turn detection
  • Text-only mode for use with separate TTS
  • Vision support (images/video input)
  • Azure OpenAI also available

Cons:

  • Most expensive realtime option
  • Delayed transcriptions (not real-time)
  • History loading can cause text-only responses
  • Limited to OpenAI ecosystem

Best for: Premium voice experiences, complex reasoning tasks

Google Gemini Live API

| Model | Input (Text)/1M | Output (Text)/1M | Audio Token Rate |
|---|---|---|---|
| Gemini 2.5 Flash | $0.15 | $0.60 | ~1,500 tokens/min |
| Gemini 2.0 Flash | $0.10 | $0.40 | ~1,500 tokens/min |

Per-minute estimate: ~$0.03-0.05/min (significantly cheaper than OpenAI)

Pros:

  • Much cheaper than OpenAI Realtime
  • Thinking mode support (Gemini 2.5)
  • Affective dialog (emotional responses)
  • Proactive audio (model can choose not to respond)
  • Multiple voices (Puck, Charon, etc.)
  • Text-only mode available
  • Vertex AI or Google AI API
  • Free tier available during preview

Cons:

  • Newer, less mature than OpenAI
  • Built-in VAD only (no semantic VAD)
  • Fewer voice options than OpenAI
  • Preview features may change

Best for: Cost-effective realtime voice, Google Cloud users

Amazon Nova Sonic

| Model | Speech Input/1K | Speech Output/1K | Text Input/1K | Text Output/1K |
|---|---|---|---|---|
| Nova Sonic | $0.0034 | $0.0136 | $0.00006 | $0.00024 |
| Nova 2 Sonic | similar | similar | similar | similar |

Per-minute estimate: ~$0.04-0.06/min (~80% cheaper than OpenAI)

Pros:

  • Best price/performance ratio
  • Very low latency
  • 1M token context window (Nova 2)
  • Multiple languages (EN, ES, PT, HI)
  • Polyglot voices
  • AWS ecosystem integration
  • Natural conversation flow

Cons:

  • Python only (no Node.js)
  • AWS-only (Bedrock)
  • Fewer customization options
  • VAD-based turn detection only
  • Newer platform

Best for: AWS users, cost-sensitive production deployments

xAI Grok Voice Agent API

Pros:

  • OpenAI Realtime API compatible
  • Built-in X (Twitter) search tool
  • Web search capabilities
  • File/knowledge base search
  • Unique personality options

Cons:

  • Python only
  • Newer platform
  • Limited documentation
  • Smaller ecosystem

Best for: Apps needing X/social media integration

Ultravox

Pros:

  • All-in-one STT+LLM+TTS
  • Simple integration
  • Quick setup
  • Good for prototyping

Cons:

  • Python only
  • Less control over individual components
  • Smaller community
  • Limited customization

Best for: Rapid prototyping, simple voice agents

Realtime Feature Comparison

| Feature | OpenAI | Gemini Live | Nova Sonic | xAI Grok | Ultravox |
|---|---|---|---|---|---|
| Python | ✓ | ✓ | ✓ | ✓ | ✓ |
| Node.js | ✓ | ✓ | ✗ | ✗ | ✗ |
| Semantic VAD | ✓ | ✗ | ✗ | ✗ | ✗ |
| Text-only mode | ✓ | ✓ | ✗ | ✗ | ✗ |
| Thinking/Reasoning | ✗ | ✓ | ✗ | ✗ | ✗ |
| Vision input | ✓ | ✓ | ✗ | ✗ | ✗ |
| Tool calling | ✓ | ✓ | ✓ | ✓ | ✓ |
| Affective dialog | ✗ | ✓ | ✗ | ✗ | ✗ |

Realtime Recommendation Guide

| Use Case | Recommendation |
|---|---|
| Lowest cost | Nova Sonic or Gemini Live |
| Best quality | OpenAI Realtime |
| AWS ecosystem | Nova Sonic |
| Google Cloud | Gemini Live |
| Social/X integration | xAI Grok |
| Fastest setup | Ultravox |
| Production-ready | OpenAI or Nova Sonic |
| Node.js required | OpenAI or Gemini Live |

Realtime Cost Comparison (per 10-minute conversation)

Assuming 5 min user speech + 5 min agent speech:

| Provider | Est. Cost |
|---|---|
| Nova Sonic | ~$0.50 |
| Gemini Live | ~$0.40-0.50 |
| OpenAI gpt-realtime | ~$1.50 |
| OpenAI gpt-4o-realtime | ~$3.00 |
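Using the per-minute estimates given earlier, the 10-minute figures fall out directly (user speech billed as audio input, agent speech as audio output):

```python
def realtime_conversation_cost(input_per_min: float, output_per_min: float,
                               user_minutes: float, agent_minutes: float) -> float:
    """Estimated cost in USD of one realtime conversation."""
    return user_minutes * input_per_min + agent_minutes * output_per_min

# OpenAI gpt-realtime at ~$0.06/min audio in, ~$0.24/min audio out,
# for a 10-minute call split 5/5 between user and agent speech:
print(realtime_conversation_cost(0.06, 0.24, 5, 5))  # → ~1.5
```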

Limitations of All Realtime Models

  1. Delayed transcriptions - User transcripts often arrive after agent response
  2. No scripted speech - Can’t guarantee exact text output via say()
  3. History as text only - Loses emotional context when loading history
  4. Turn detection tradeoffs - Built-in VAD may not be as accurate as pipeline STT

Workaround: Use realtime model with modalities=["text"] + separate TTS for full control over speech output while keeping realtime speech understanding.
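A configuration sketch of that workaround with the OpenAI plugin (parameter names assume the current livekit-plugins-openai and Cartesia plugins; treat the model strings as illustrative, not definitive):

```python
from livekit.agents import AgentSession
from livekit.plugins import cartesia, openai

# The realtime model handles speech understanding but emits text only;
# a separate TTS regains full control over speech output (say(),
# scripted responses) while keeping realtime speech comprehension.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    tts=cartesia.TTS(model="sonic-2"),
)
```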


Recommended Stack for a Claude-Based Pipeline

If you’re using Claude (Anthropic) as your LLM—which doesn’t have a realtime speech model—you’ll use the traditional STT → LLM → TTS pipeline. Here’s the optimal configuration:

Best Overall Stack

| Component | Recommendation | Why |
|---|---|---|
| STT | Deepgram Flux | Fastest STT, semantic turn detection built-in |
| LLM | Claude 3.5 Sonnet | Best accuracy/speed balance |
| TTS | Cartesia Sonic-3 | Lowest time-to-first-byte |
| VAD | Silero | Fast, lightweight |
| Turn Detection | turn_detection="stt" | Uses Flux's semantic endpointing |
from livekit.agents import AgentSession
from livekit.plugins import anthropic, deepgram, cartesia, silero

session = AgentSession(
    stt=deepgram.STT(model="flux"),  # Flux for semantic endpointing; "nova-3" for multilingual
    llm=anthropic.LLM(
        model="claude-sonnet-4-20250514",
        temperature=0.7,
    ),
    tts=cartesia.TTS(
        model="sonic-english",
        voice="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
    ),
    vad=silero.VAD.load(),
    turn_detection="stt",  # Use Deepgram's semantic turn detection
)

Why These Choices?

STT: Deepgram Flux/Nova-3

  • ~100-200ms to first transcript
  • Industry-leading accuracy for English
  • Semantic turn detection (knows when user is done thinking, not just pausing)

LLM: Claude 3.5 Sonnet

  • ~200-400ms time-to-first-token
  • Excellent instruction following, nuanced responses
  • Parallel tool calls for complex workflows

TTS: Cartesia Sonic

  • ~50-100ms time-to-first-audio (fastest in class)
  • Natural, expressive voices
  • Emotion controls, speed adjustments

Latency Optimization Tips

  1. Use streaming everywhere - All components should stream
  2. Sentence-based TTS - Let TTS start on first sentence, not full response
  3. Turn detection - turn_detection="stt" reduces false triggers
  4. Keep system prompts short - Fewer input tokens = faster LLM response
  5. Regional deployment - Deploy agents close to your users

Cost Calculator

Per-Minute Costs

| Component | Rate | Per Minute |
|---|---|---|
| Deepgram Flux (STT) | $0.0077/min | $0.0077 |
| Claude 3.5 Sonnet (LLM) | $3 input / $15 output per 1M | ~$0.015-0.03 |
| Cartesia Sonic-3 (TTS) | $50/1M chars | ~$0.04 |
| Total | | ~$0.06-0.08/min |

Hourly Costs

| Scenario | STT | LLM | TTS | Total/Hour |
|---|---|---|---|---|
| Light conversation (50% talk time) | $0.23 | $0.50 | $1.20 | ~$1.93 |
| Active conversation (70% talk time) | $0.32 | $0.90 | $1.70 | ~$2.92 |
| Heavy conversation (90% talk time) | $0.42 | $1.20 | $2.20 | ~$3.82 |

Per-Conversation Estimates

| Duration | Est. Cost |
|---|---|
| 5 min call | $0.30 - $0.40 |
| 10 min call | $0.60 - $0.80 |
| 15 min call | $0.90 - $1.20 |
| 30 min call | $1.80 - $2.40 |
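These per-conversation numbers are just the ~$0.06-0.08/min pipeline rate multiplied out; as a sketch:

```python
# Low/high per-minute rates for the Deepgram + Claude + Cartesia stack (USD)
RATE_LOW, RATE_HIGH = 0.06, 0.08

def call_cost_range(minutes: float) -> tuple[float, float]:
    """(low, high) estimated cost in USD for a call of the given length."""
    return round(minutes * RATE_LOW, 2), round(minutes * RATE_HIGH, 2)

print(call_cost_range(5))   # → (0.3, 0.4)
print(call_cost_range(30))  # → (1.8, 2.4)
```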

Monthly Projections

| Usage | Hours/Month | Monthly Cost |
|---|---|---|
| Light (1 hr/day) | 30 hrs | ~$60-90 |
| Medium (4 hrs/day) | 120 hrs | ~$240-360 |
| Heavy (8 hrs/day) | 240 hrs | ~$480-720 |
| High volume (24/7) | 720 hrs | ~$1,440-2,160 |

Budget Alternative Stack

| Component | Budget Choice | Cost |
|---|---|---|
| STT | AssemblyAI | $0.0025/min |
| LLM | Claude 3.5 Haiku | $0.80/$4 per 1M |
| TTS | Deepgram Aura-2 | $30/1M chars |

This brings cost down to ~$0.80-1.20/hr with slightly higher latency.


Conclusion

Choosing the right voice AI stack depends on your priorities:

  • Lowest cost: AssemblyAI + Gemini Flash Lite + Inworld
  • Lowest latency: Deepgram Flux + Claude Haiku + Cartesia Sonic
  • Best quality: Deepgram Nova-3 + Claude Sonnet + ElevenLabs
  • Simplest setup: OpenAI Realtime or Gemini Live (single model)
  • AWS ecosystem: Nova Sonic (realtime) or Bedrock models

For most production applications targeting low latency and high accuracy with Claude, the Deepgram + Claude + Cartesia combination offers the best balance at ~$2-3/hour.


Last updated: January 2025. Pricing and features may change—always check the official documentation for the latest information.

This post is licensed under CC BY 4.0 by the author.