The Complete Guide to LiveKit Voice AI: STT, TTS, LLM & Realtime Models Compared
A comprehensive comparison of all Speech-to-Text, Text-to-Speech, LLM, and Realtime model options available in LiveKit Agents, including pricing, pros/cons, and recommendations.
Building voice AI applications requires choosing the right combination of models for speech recognition, language understanding, and speech synthesis. LiveKit Agents provides a unified framework for integrating these components, but with dozens of options available, selecting the right stack can be overwhelming.
This guide breaks down every option available in LiveKit, complete with pricing, pros and cons, and recommendations for different use cases.
Table of Contents
- Speech-to-Text (STT) Options
- Text-to-Speech (TTS) Options
- Large Language Model (LLM) Options
- Realtime Model Options
- Recommended Stack for Low Latency + High Accuracy
- Cost Calculator
Speech-to-Text (STT) Options
STT models transcribe spoken audio into text in real time. They’re the first component in the traditional voice AI pipeline.
LiveKit Inference (Managed)
These are available through LiveKit Cloud with automatic billing—no separate API keys needed.
Deepgram
| Model | Price/min | Languages |
|---|---|---|
| Flux | $0.0077 | English only |
| Nova-3 (mono) | $0.0077 | 8 languages |
| Nova-3 (multi) | $0.0092 | Multilingual |
| Nova-2 | $0.0058 | 33 languages |
| Nova-2 Medical/Phone | $0.0058 | English |
Pros:
- Excellent accuracy, especially Nova-3
- Built-in phrase endpointing for turn detection (Flux)
- Strong multilingual support
- Medical-specific model available
- Native streaming support
Cons:
- Higher cost than AssemblyAI
- Flux is English-only
AssemblyAI
| Model | Price/min | Languages |
|---|---|---|
| Universal-Streaming | $0.0025 | English |
| Universal-Streaming-Multilingual | $0.0025 | 6 languages |
Pros:
- Cheapest option at $0.0025/min
- Sophisticated semantic turn detection
- Good accuracy
- Extra params: `keyterms_prompt`, confidence thresholds
Cons:
- Limited language support (6 languages)
- Fewer model variants than Deepgram
Cartesia (Ink Whisper)
| Model | Price/min | Languages |
|---|---|---|
| Ink Whisper | $0.0030 ($0.0023 Scale) | 98 languages |
Pros:
- Widest language support (98 languages)
- Low cost
- Good for international apps
Cons:
- Newer, less proven than Deepgram/AssemblyAI
- No built-in turn detection
ElevenLabs (Scribe V2)
| Model | Price/min | Languages |
|---|---|---|
| Scribe V2 Realtime | $0.0105 | 41 languages |
Pros:
- Strong multilingual support (41 languages)
- Automatic language detection
- Good for pairing with ElevenLabs TTS
Cons:
- Most expensive option
- No built-in turn detection
Plugin-Based STT Options (Bring Your Own Key)
| Provider | Streaming | Turn Detection | Languages | Notes |
|---|---|---|---|---|
| OpenAI (gpt-4o-transcribe) | No* | No | 57+ | Very accurate, needs VAD/StreamAdapter |
| Groq (Whisper) | No* | No | 57+ | Fast inference, cheap via Groq pricing |
| Google Cloud | Yes | No | 125+ | Enterprise, requires GCP account |
| Azure Speech | Yes | No | 100+ | Enterprise, good accuracy |
*Requires VAD + StreamAdapter for streaming use
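The starred rows can still be used in a streaming pipeline by wrapping the non-streaming STT in LiveKit's `StreamAdapter`, which uses VAD to split incoming audio into utterances. A minimal sketch, assuming the `livekit-agents`, `livekit-plugins-openai`, and `livekit-plugins-silero` packages are installed (check the plugin docs for current class and parameter names):

```python
# Sketch: make a non-streaming STT (OpenAI Whisper-family) usable in a
# streaming pipeline. StreamAdapter buffers audio between VAD-detected
# speech boundaries and sends each utterance to the STT as one request.
from livekit.agents import stt
from livekit.plugins import openai, silero

whisper_stt = openai.STT(model="gpt-4o-transcribe")
streaming_stt = stt.StreamAdapter(
    stt=whisper_stt,
    vad=silero.VAD.load(),  # VAD decides where utterances begin and end
)
# streaming_stt can now be passed anywhere a streaming STT is expected,
# e.g. AgentSession(stt=streaming_stt, ...)
```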
STT Recommendation Guide
| Use Case | Recommendation |
|---|---|
| Lowest cost | AssemblyAI ($0.0025/min) |
| Best accuracy (English) | Deepgram Nova-3 or Flux |
| Most languages | Cartesia Ink Whisper (98 langs) |
| Turn detection built-in | Deepgram Flux or AssemblyAI |
| Enterprise/compliance | Azure or Google Cloud |
| Budget + multilingual | Cartesia ($0.003/min, 98 langs) |
STT Pricing Summary (per minute)
```
AssemblyAI:        $0.0025  (cheapest)
Cartesia:          $0.0030
Deepgram Nova-2:   $0.0058
Deepgram Nova-3:   $0.0077
ElevenLabs:        $0.0105  (most expensive)
```
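To compare these rates at your expected call volume, a quick back-of-the-envelope helper (rates hard-coded from the tables above):

```python
# Estimate monthly STT spend from minutes of transcribed audio per month.
STT_RATES_PER_MIN = {
    "assemblyai": 0.0025,
    "cartesia": 0.0030,
    "deepgram-nova-2": 0.0058,
    "deepgram-nova-3": 0.0077,
    "elevenlabs": 0.0105,
}

def stt_monthly_cost(provider: str, minutes_per_month: float) -> float:
    return STT_RATES_PER_MIN[provider] * minutes_per_month

# 10,000 minutes/month on AssemblyAI vs Deepgram Nova-3:
print(round(stt_monthly_cost("assemblyai", 10_000), 2))       # 25.0
print(round(stt_monthly_cost("deepgram-nova-3", 10_000), 2))  # 77.0
```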
Text-to-Speech (TTS) Options
TTS models convert text into natural-sounding speech. Quality, latency, and voice selection vary significantly between providers.
LiveKit Inference (Managed)
Inworld
| Model | Price/1M chars | Languages |
|---|---|---|
| Inworld TTS 1 | $5.00 | 12 languages |
| Inworld TTS 1 Max | $10.00 | 12 languages |
Pros:
- Cheapest option ($5/1M chars)
- Good multilingual support (12 languages)
- Natural, warm voices
- Designed for interactive/gaming use cases
Cons:
- Smaller voice library than ElevenLabs/Cartesia
- No custom voice cloning via Inference
- Less known in voice AI space
Deepgram (Aura)
| Model | Price/1M chars | Languages |
|---|---|---|
| Aura-1 | $15.00 | English only |
| Aura-2 | $30.00 ($27 Scale) | English, Spanish |
Pros:
- Low cost for English-only apps
- Good for pairing with Deepgram STT (same vendor)
- Professional, natural-sounding voices
- Fast latency
Cons:
- Very limited language support (English + Spanish only)
- Smaller voice selection
- No voice cloning
Rime
| Model | Price/1M chars | Languages |
|---|---|---|
| Mist V2 | $30.00 ($20 Scale) | 4 languages |
| Arcana V2 | $40.00 ($30 Scale) | 4 languages |
Pros:
- Good voice quality
- Distinctive voice personalities (Gen-Z, expressive)
- Good for younger/casual brand voices
Cons:
- Limited language support (en, es, fr, de only)
- Smaller voice library
- Mid-range pricing
Cartesia (Sonic)
| Model | Price/1M chars | Languages |
|---|---|---|
| Sonic / Sonic-2 / Sonic-3 / Turbo | $50.00 ($37.50 Scale) | 15-42 languages |
Pros:
- Excellent voice quality
- Strong multilingual support (Sonic-3: 42 languages)
- Emotion controls (excited, sad, etc.)
- Speed and volume adjustments
- Large voice library
- Lowest latency (time-to-first-byte)
Cons:
- Mid-high pricing
- No voice cloning via Inference
- Same price across all model variants
ElevenLabs
| Model | Price/1M chars | Languages |
|---|---|---|
| Flash v2/v2.5 | $150.00 ($60 Scale) | 1-32 languages |
| Turbo v2/v2.5 | $150.00 ($60 Scale) | 1-32 languages |
| Multilingual v2 | $300.00 ($120 Scale) | 29 languages |
Pros:
- Best voice quality (most human-like)
- Huge voice library
- Industry-leading expressiveness
- Voice cloning support (via plugin)
- Good multilingual support
Cons:
- Most expensive option by far
- No custom/cloned voices via Inference
- Flash vs Turbo pricing identical
Plugin-Based TTS Options (Bring Your Own Key)
| Provider | Voice Cloning | Languages | Notes |
|---|---|---|---|
| OpenAI | No | 57+ | Simple, good quality, limited voices |
| Azure Speech | Yes | 100+ | Enterprise, SSML support |
| Google Cloud | No | 40+ | WaveNet/Neural2 voices |
| Amazon Polly | No | 30+ | NTTS voices, cost-effective |
| Hume | No | English | Emotionally expressive AI |
| LMNT | Yes | English | Fast, low latency |
| PlayHT | Yes | 29+ | Good cloning, emotions |
TTS Recommendation Guide
| Use Case | Recommendation |
|---|---|
| Lowest cost | Inworld ($5/1M chars) |
| Best quality | ElevenLabs (premium) or Cartesia |
| English-only budget | Deepgram Aura-1 ($15/1M) |
| Most languages | Cartesia Sonic-3 (42 langs) |
| Voice cloning | ElevenLabs or PlayHT (plugin) |
| Gaming/interactive | Inworld |
| Casual/young brand | Rime Arcana |
| Enterprise/compliance | Azure or Google Cloud |
| Lowest latency | Cartesia Sonic |
TTS Pricing Summary (per 1M characters)
```
Inworld TTS 1:       $5.00   (cheapest)
Inworld TTS 1 Max:   $10.00
Deepgram Aura-1:     $15.00
Deepgram Aura-2:     $30.00
Rime Mist V2:        $30.00
Rime Arcana:         $40.00
Cartesia Sonic:      $50.00
ElevenLabs Turbo:    $150.00
ElevenLabs Multi:    $300.00 (most expensive)
```
TTS Cost per Hour of Speech Output
Assuming ~15,000 characters per hour of speech:
| Provider | Cost/hour |
|---|---|
| Inworld TTS 1 | ~$0.075 |
| Deepgram Aura-1 | ~$0.23 |
| Cartesia | ~$0.75 |
| ElevenLabs Turbo | ~$2.25 |
| ElevenLabs Multi | ~$4.50 |
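The per-hour figures follow directly from the per-character rates and the ~15,000 characters/hour assumption; the same arithmetic as a sketch:

```python
# Convert a per-1M-character TTS rate into cost per hour of speech output,
# using the ~15,000 characters/hour assumption from the table above.
CHARS_PER_HOUR = 15_000

def tts_cost_per_hour(rate_per_million_chars: float) -> float:
    return rate_per_million_chars * CHARS_PER_HOUR / 1_000_000

print(round(tts_cost_per_hour(5.00), 3))    # Inworld TTS 1 -> 0.075
print(round(tts_cost_per_hour(50.00), 2))   # Cartesia Sonic -> 0.75
print(round(tts_cost_per_hour(300.00), 2))  # ElevenLabs Multilingual -> 4.5
```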
Large Language Model (LLM) Options
The LLM is the brain of your voice agent, handling reasoning, responses, and tool orchestration.
LiveKit Inference (Managed)
OpenAI GPT Models
| Model | Input/1M | Cached/1M | Output/1M |
|---|---|---|---|
| GPT-4o | $2.50 | $1.25 | $10.00 |
| GPT-4o mini | $0.15 | $0.075 | $0.60 |
| GPT-4.1 | $2.00 | $0.50 | $8.00 |
| GPT-4.1 mini | $0.40 | $0.10 | $1.60 |
| GPT-4.1 nano | $0.10 | $0.025 | $0.40 |
| GPT-5 | $1.25 | $0.125 | $10.00 |
| GPT-5 mini | $0.25 | $0.025 | $2.00 |
| GPT-5 nano | $0.05 | $0.005 | $0.40 |
| GPT-5.1 | $1.25 | $0.125 | $10.00 |
| GPT-5.2 | $1.75 | $0.175 | $14.00 |
Pros:
- Industry standard, excellent tool calling
- Wide range of model sizes (nano to full)
- Strong reasoning capabilities (GPT-5+)
- Cached input discounts (50-75% off)
- Available via Azure or OpenAI endpoints
Cons:
- Premium pricing for flagship models
- Output tokens more expensive than input
Google Gemini
| Model | Input/1M | Cached/1M | Output/1M |
|---|---|---|---|
| Gemini 2.0 Flash Lite | $0.075 | N/A | $0.30 |
| Gemini 2.0 Flash | $0.10 | N/A | $0.40 |
| Gemini 2.5 Flash Lite | $0.10 | $0.01 | $0.40 |
| Gemini 2.5 Flash | $0.30 | $0.03 | $2.50 |
| Gemini 2.5 Pro | $2.50 | $0.25 | $15.00 |
| Gemini 3 Flash | $0.50 | $0.05 | $3.00 |
| Gemini 3 Pro | $4.00 | $0.40 | $18.00 |
Pros:
- Cheapest options available (2.0 Flash Lite: $0.075 input)
- Excellent multimodal (vision) support
- Strong reasoning at lower cost than GPT
- Good tool calling support
- Huge context windows
Cons:
- Some models still in preview (Gemini 3)
- Less ecosystem tooling than OpenAI
DeepSeek
| Model | Input/1M | Output/1M |
|---|---|---|
| DeepSeek V3 | $0.77 | $0.77 |
| DeepSeek V3.2 | $0.30 | $0.45 |
Pros:
- Excellent value for quality
- Strong coding abilities
- Good reasoning performance
- Competitive with GPT-4 class models
Cons:
- No cached input pricing
- Fewer model variants
- Limited provider options (Baseten only)
Kimi K2
| Model | Input/1M | Output/1M |
|---|---|---|
| Kimi K2 Instruct | $0.60 | $2.50 |
Pros:
- Good reasoning capabilities
- Competitive pricing
- Strong at complex tasks
Cons:
- Single model option
- Less established than OpenAI/Google
- No cached input pricing
GPT-OSS 120B (Open Source)
| Provider | Input/1M | Cached/1M | Output/1M |
|---|---|---|---|
| Groq | $0.15 | $0.075 | $0.60 |
| Cerebras | $0.35 | N/A | $0.75 |
Pros:
- Open source model
- Very low cost via Groq
- Fast inference (especially Groq)
- No vendor lock-in
Cons:
- Less capable than proprietary models
- Limited to specific providers
Plugin-Based LLM Options (Bring Your Own Key)
Anthropic Claude
| Model | Input/1M | Output/1M | Notes |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best balanced |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast & cheap |
| Claude 3 Opus | $15.00 | $75.00 | Most capable |
Pros:
- Excellent reasoning and instruction following
- Strong safety features
- Great at nuanced conversations
- Parallel tool calls support
Cons:
- Not available via LiveKit Inference
- Higher output pricing
- Python plugin only
Groq (Direct)
| Model | Input/1M | Output/1M |
|---|---|---|
| Llama 3.3 70B | $0.59 | $0.79 |
| Llama 3.1 8B | $0.05 | $0.08 |
| Mixtral 8x7B | $0.24 | $0.24 |
Pros:
- Fastest inference (LPU hardware)
- Open source models
- Very low latency for voice AI
- Great for real-time applications
Cons:
- Limited model selection
- Open source models less capable
- Rate limits on free tier
Other Plugin Options
| Provider | Best For |
|---|---|
| Azure OpenAI | Enterprise, compliance, regional deployment |
| Amazon Bedrock | AWS ecosystem, Claude access |
| Mistral AI | European hosting, open weights |
| Together AI | Wide model selection, fine-tuning |
| Fireworks | Fast inference, function calling |
| Ollama | Local/self-hosted, privacy |
| Perplexity | Search-augmented responses |
LLM Recommendation Guide
| Use Case | Recommendation |
|---|---|
| Lowest cost | Gemini 2.0 Flash Lite ($0.075/$0.30) |
| Best quality | GPT-5.2 or Claude 3.5 Sonnet |
| Best value | DeepSeek V3.2 or Gemini 2.5 Flash |
| Fastest inference | Groq (Llama models) |
| Voice AI optimized | GPT-4.1 mini or Gemini 2.5 Flash |
| Complex reasoning | GPT-5 or Claude 3 Opus |
| Budget + good quality | GPT-4o mini or GPT-OSS 120B |
| Enterprise/compliance | Azure OpenAI |
| Self-hosted | Ollama |
LLM Pricing Summary (per 1M tokens, Input/Output)
```
BUDGET TIER:
  GPT-5 nano:            $0.05  / $0.40   (cheapest input)
  Gemini 2.0 Flash Lite: $0.075 / $0.30   (cheapest output)
  GPT-4.1 nano:          $0.10  / $0.40
  GPT-OSS 120B (Groq):   $0.15  / $0.60
  GPT-4o mini:           $0.15  / $0.60

MID TIER:
  DeepSeek V3.2:         $0.30  / $0.45
  Gemini 2.5 Flash:      $0.30  / $2.50
  GPT-4.1 mini:          $0.40  / $1.60
  Kimi K2:               $0.60  / $2.50

PREMIUM TIER:
  GPT-5:                 $1.25  / $10.00
  GPT-4.1:               $2.00  / $8.00
  Gemini 2.5 Pro:        $2.50  / $15.00
  GPT-4o:                $2.50  / $10.00
  Claude 3.5 Sonnet:     $3.00  / $15.00
  Gemini 3 Pro:          $4.00  / $18.00
```
LLM Cost per Hour of Voice Conversation
Assuming ~20K input tokens + ~5K output tokens per hour:
| Model | Est. Cost/hour |
|---|---|
| Gemini 2.0 Flash Lite | ~$0.003 |
| GPT-5 nano | ~$0.003 |
| GPT-OSS 120B (Groq) | ~$0.006 |
| GPT-4o mini | ~$0.006 |
| DeepSeek V3.2 | ~$0.008 |
| GPT-4.1 mini | ~$0.016 |
| Gemini 2.5 Flash | ~$0.019 |
| GPT-4o | ~$0.10 |
| Claude 3.5 Sonnet | ~$0.14 |
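These per-hour figures follow from the token prices; a helper that reproduces them under the same ~20K input / ~5K output tokens-per-hour assumption:

```python
# Per-hour LLM cost from token prices (USD per 1M tokens), using the
# ~20K input / ~5K output tokens-per-hour assumption from the table above.
def llm_cost_per_hour(input_per_m: float, output_per_m: float,
                      in_tokens: int = 20_000, out_tokens: int = 5_000) -> float:
    return (input_per_m * in_tokens + output_per_m * out_tokens) / 1_000_000

print(round(llm_cost_per_hour(0.15, 0.60), 3))   # GPT-4o mini -> 0.006
print(round(llm_cost_per_hour(2.50, 10.00), 2))  # GPT-4o -> 0.1
print(round(llm_cost_per_hour(3.00, 15.00), 3))  # Claude 3.5 Sonnet -> 0.135
```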
Realtime Model Options
Realtime models process speech directly (speech-to-speech), bypassing the traditional STT→LLM→TTS pipeline. They understand emotional context and verbal cues better than text-based pipelines.
All realtime models are plugin-based (bring your own API key).
OpenAI Realtime API
| Model | Audio Input/1M | Audio Output/1M | Text Input/1M | Text Output/1M |
|---|---|---|---|---|
| gpt-realtime | $32 | $64 | $5 | $20 |
| gpt-realtime-mini | ~$10 | ~$20 | $0.60 | $2.40 |
| gpt-4o-realtime (legacy) | $100 | $200 | $5 | $20 |
Per-minute estimate: ~$0.06 input + $0.24 output = ~$0.30/min
Pros:
- Most mature realtime API
- Excellent voice quality and expressiveness
- Multiple voices available (alloy, marin, etc.)
- Semantic VAD for intelligent turn detection
- Text-only mode for use with separate TTS
- Vision support (images/video input)
- Azure OpenAI also available
Cons:
- Most expensive realtime option
- Delayed transcriptions (not real-time)
- History loading can cause text-only responses
- Limited to OpenAI ecosystem
Best for: Premium voice experiences, complex reasoning tasks
Google Gemini Live API
| Model | Input (Text)/1M | Output (Text)/1M | Audio |
|---|---|---|---|
| Gemini 2.5 Flash | $0.15 | $0.60 | ~1,500 tokens/min |
| Gemini 2.0 Flash | $0.10 | $0.40 | ~1,500 tokens/min |
Per-minute estimate: ~$0.03-0.05/min (significantly cheaper than OpenAI)
Pros:
- Much cheaper than OpenAI Realtime
- Thinking mode support (Gemini 2.5)
- Affective dialog (emotional responses)
- Proactive audio (model can choose not to respond)
- Multiple voices (Puck, Charon, etc.)
- Text-only mode available
- Vertex AI or Google AI API
- Free tier available during preview
Cons:
- Newer, less mature than OpenAI
- Built-in VAD only (no semantic VAD)
- Fewer voice options than OpenAI
- Preview features may change
Best for: Cost-effective realtime voice, Google Cloud users
Amazon Nova Sonic
| Model | Speech Input/1K | Speech Output/1K | Text Input/1K | Text Output/1K |
|---|---|---|---|---|
| Nova Sonic | $0.0034 | $0.0136 | $0.00006 | $0.00024 |
| Nova 2 Sonic | Similar | Similar | Similar | Similar |
Per-minute estimate: ~$0.04-0.06/min (~80% cheaper than OpenAI)
Pros:
- Best price/performance ratio
- Very low latency
- 1M token context window (Nova 2)
- Multiple languages (EN, ES, PT, HI)
- Polyglot voices
- AWS ecosystem integration
- Natural conversation flow
Cons:
- Python only (no Node.js)
- AWS-only (Bedrock)
- Fewer customization options
- VAD-based turn detection only
- Newer platform
Best for: AWS users, cost-sensitive production deployments
xAI Grok Voice Agent API
Pros:
- OpenAI Realtime API compatible
- Built-in X (Twitter) search tool
- Web search capabilities
- File/knowledge base search
- Unique personality options
Cons:
- Python only
- Newer platform
- Limited documentation
- Smaller ecosystem
Best for: Apps needing X/social media integration
Ultravox
Pros:
- All-in-one STT+LLM+TTS
- Simple integration
- Quick setup
- Good for prototyping
Cons:
- Python only
- Less control over individual components
- Smaller community
- Limited customization
Best for: Rapid prototyping, simple voice agents
Realtime Feature Comparison
| Feature | OpenAI | Gemini Live | Nova Sonic | xAI Grok | Ultravox |
|---|---|---|---|---|---|
| Python | ✓ | ✓ | ✓ | ✓ | ✓ |
| Node.js | ✓ | ✓ | — | — | — |
| Semantic VAD | ✓ | — | — | ✓ | — |
| Text-only mode | ✓ | ✓ | — | — | — |
| Thinking/Reasoning | — | ✓ | — | — | — |
| Vision input | ✓ | ✓ | — | — | — |
| Tool calling | ✓ | ✓ | ✓ | ✓ | ✓ |
| Affective dialog | — | ✓ | — | — | — |
Realtime Recommendation Guide
| Use Case | Recommendation |
|---|---|
| Lowest cost | Nova Sonic or Gemini Live |
| Best quality | OpenAI Realtime |
| AWS ecosystem | Nova Sonic |
| Google Cloud | Gemini Live |
| Social/X integration | xAI Grok |
| Fastest setup | Ultravox |
| Production-ready | OpenAI or Nova Sonic |
| Node.js required | OpenAI or Gemini Live |
Realtime Cost Comparison (per 10-minute conversation)
Assuming 5 min user speech + 5 min agent speech:
| Provider | Est. Cost |
|---|---|
| Nova Sonic | ~$0.50 |
| Gemini Live | ~$0.40-0.50 |
| OpenAI gpt-realtime | ~$1.50 |
| OpenAI gpt-4o-realtime | ~$3.00 |
Limitations of All Realtime Models
- Delayed transcriptions - User transcripts often arrive after the agent's response
- No scripted speech - Can't guarantee exact text output via `say()`
- History as text only - Loses emotional context when loading history
- Turn detection tradeoffs - Built-in VAD may not be as accurate as pipeline STT

Workaround: Use a realtime model with `modalities=["text"]` plus a separate TTS for full control over speech output while keeping realtime speech understanding.
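In LiveKit Agents, that text-only workaround looks roughly like this; a hedged sketch assuming the `openai` and `cartesia` plugins (check the current plugin docs for exact `RealtimeModel` parameters):

```python
# Sketch: realtime model for speech understanding, separate TTS for output.
# modalities=["text"] makes the realtime model emit text instead of audio;
# the pipeline then synthesizes that text with a controllable TTS voice.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    tts=cartesia.TTS(model="sonic-english"),
)
```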
Recommended Stack for Low Latency + High Accuracy
If you’re using Claude (Anthropic) as your LLM—which doesn’t have a realtime speech model—you’ll use the traditional STT → LLM → TTS pipeline. Here’s the optimal configuration:
Best Overall Stack
| Component | Recommendation | Why |
|---|---|---|
| STT | Deepgram Flux | Fastest STT, semantic turn detection built-in |
| LLM | Claude 3.5 Sonnet | Best accuracy/speed balance |
| TTS | Cartesia Sonic-3 | Lowest time-to-first-byte |
| VAD | Silero | Fast, lightweight |
| Turn Detection | `turn_detection="stt"` | Uses Flux's semantic endpointing |
```python
from livekit.agents import AgentSession
from livekit.plugins import anthropic, cartesia, deepgram, silero

session = AgentSession(
    stt=deepgram.STT(model="nova-3"),  # or "flux" for English
    llm=anthropic.LLM(
        model="claude-sonnet-4-20250514",
        temperature=0.7,
    ),
    tts=cartesia.TTS(
        model="sonic-english",
        voice="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
    ),
    vad=silero.VAD.load(),
    turn_detection="stt",  # use Deepgram's semantic turn detection
)
```
Why These Choices?
STT: Deepgram Flux/Nova-3
- ~100-200ms to first transcript
- Industry-leading accuracy for English
- Semantic turn detection (knows when user is done thinking, not just pausing)
LLM: Claude 3.5 Sonnet
- ~200-400ms time-to-first-token
- Excellent instruction following, nuanced responses
- Parallel tool calls for complex workflows
TTS: Cartesia Sonic
- ~50-100ms time-to-first-audio (fastest in class)
- Natural, expressive voices
- Emotion controls, speed adjustments
Latency Optimization Tips
- Use streaming everywhere - All components should stream
- Sentence-based TTS - Let TTS start on first sentence, not full response
- Turn detection - `turn_detection="stt"` reduces false triggers
- Keep system prompts short - Fewer input tokens = faster LLM response
- Regional deployment - Deploy agents close to your users
Cost Calculator
Recommended Stack Cost (Deepgram Flux + Claude 3.5 Sonnet + Cartesia)
Per-Minute Costs
| Component | Rate | Per Minute |
|---|---|---|
| Deepgram Flux (STT) | $0.0077/min | $0.0077 |
| Claude 3.5 Sonnet (LLM) | $3 input / $15 output per 1M | ~$0.015-0.03 |
| Cartesia Sonic-3 (TTS) | $50/1M chars | ~$0.04 |
| Total | | ~$0.06-0.08/min |
Hourly Costs
| Scenario | STT | LLM | TTS | Total/Hour |
|---|---|---|---|---|
| Light conversation (50% talk time) | $0.23 | $0.50 | $1.20 | ~$1.93 |
| Active conversation (70% talk time) | $0.32 | $0.90 | $1.70 | ~$2.92 |
| Heavy conversation (90% talk time) | $0.42 | $1.20 | $2.20 | ~$3.82 |
Per-Conversation Estimates
| Duration | Est. Cost |
|---|---|
| 5 min call | $0.30 - $0.40 |
| 10 min call | $0.60 - $0.80 |
| 15 min call | $0.90 - $1.20 |
| 30 min call | $1.80 - $2.40 |
Monthly Projections
| Usage | Hours/Month | Monthly Cost |
|---|---|---|
| Light (1 hr/day) | 30 hrs | ~$60-90 |
| Medium (4 hrs/day) | 120 hrs | ~$240-360 |
| Heavy (8 hrs/day) | 240 hrs | ~$480-720 |
| High volume (24/7) | 720 hrs | ~$1,440-2,160 |
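The per-conversation and monthly tables are straight multiplication from the per-minute and per-hour figures; a sketch for plugging in your own volumes (the ~$2-3/hour blended rate comes from the hourly table above):

```python
# Reproduce the per-conversation and monthly estimates for the
# recommended stack (~$0.06-0.08/min, ~$2-3/hr blended).
def call_cost(minutes: float, per_min: float) -> float:
    return minutes * per_min

def monthly_cost(hours: float, per_hour: float) -> float:
    return hours * per_hour

low, high = 0.06, 0.08  # $/min during active conversation
print(round(call_cost(5, low), 2), round(call_cost(5, high), 2))  # 0.3 0.4
print(monthly_cost(120, 2.0), monthly_cost(120, 3.0))             # 240.0 360.0
```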
Budget Alternative Stack
| Component | Budget Choice | Cost |
|---|---|---|
| STT | AssemblyAI | $0.0025/min |
| LLM | Claude 3.5 Haiku | $0.80/$4 per 1M |
| TTS | Deepgram Aura-2 | $30/1M chars |
This brings cost down to ~$0.80-1.20/hr with slightly higher latency.
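Wired up in LiveKit Agents, the budget stack looks something like the sketch below. It assumes the `assemblyai`, `anthropic`, `deepgram`, and `silero` plugins; the model and voice identifiers are illustrative examples, not verified values, so check each provider's current model list:

```python
# Sketch of the budget alternative stack; identifiers below are examples.
from livekit.agents import AgentSession
from livekit.plugins import anthropic, assemblyai, deepgram, silero

session = AgentSession(
    stt=assemblyai.STT(),  # Universal-Streaming, $0.0025/min
    llm=anthropic.LLM(model="claude-3-5-haiku-20241022"),
    tts=deepgram.TTS(model="aura-2-thalia-en"),  # example Aura-2 voice
    vad=silero.VAD.load(),
)
```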
Conclusion
Choosing the right voice AI stack depends on your priorities:
- Lowest cost: AssemblyAI + Gemini Flash Lite + Inworld
- Lowest latency: Deepgram Flux + Claude Haiku + Cartesia Sonic
- Best quality: Deepgram Nova-3 + Claude Sonnet + ElevenLabs
- Simplest setup: OpenAI Realtime or Gemini Live (single model)
- AWS ecosystem: Nova Sonic (realtime) or Bedrock models
For most production applications targeting low latency and high accuracy with Claude, the Deepgram + Claude + Cartesia combination offers the best balance at ~$2-3/hour.
Last updated: January 2025. Pricing and features may change—always check the official documentation for the latest information.