A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.
Voice AI has moved from research demos into shipping products in under three years. The modern stack is converging on a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.
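The STT → LLM → TTS pattern above can be sketched with stub components. This is an illustrative toy, not any framework's real API: the point is that TTS consumes the LLM's token stream as it arrives, so synthesis overlaps generation instead of waiting for the full reply.

```python
import asyncio

# Illustrative stubs standing in for real STT, LLM, and TTS services.
# All names, strings, and timings here are assumptions for the sketch.

async def stt_stream():
    # Pretend the STT service emits growing partial transcripts as the user speaks.
    for partial in ["book a", "book a table", "book a table for two"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield partial

async def llm_stream(prompt):
    # Pretend the LLM streams its reply token by token.
    for token in ["Sure,", " booking", " a", " table", " for", " two."]:
        await asyncio.sleep(0)
        yield token

async def tts_stream(text_chunks):
    # Synthesize audio per text chunk as soon as it arrives, rather than
    # waiting for the complete reply -- this overlap is the whole point.
    async for chunk in text_chunks:
        yield f"<audio:{chunk.strip()}>"

async def run_pipeline():
    final_transcript = None
    async for partial in stt_stream():
        final_transcript = partial  # a real agent would also act on partials
    audio = [frame async for frame in tts_stream(llm_stream(final_transcript))]
    return final_transcript, audio

transcript, audio = asyncio.run(run_pipeline())
print(transcript)            # book a table for two
print(len(audio), "frames")  # one synthesized frame per LLM token
```

Real frameworks add barge-in handling, partial-transcript interruption, and backpressure on top of this basic chaining.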
Resources are tagged Beginner, Intermediate, or Advanced. Free official docs and vendor-neutral guides are preferred; resources authored by a commercial party are flagged as such.
How to use this list
Read top-to-bottom if you're brand new. The recommended path:
Foundations → understand the pipeline and latency budget
Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
Transport & telephony → connect to a real phone number
Evaluation, production, ethics → make it safe enough to ship
Table of contents
1. Foundational concepts and learning paths
Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.
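The latency budget is just arithmetic: every stage between the user going silent and the agent's first audible syllable adds up. The per-stage numbers below are illustrative assumptions, not measured vendor figures, but they show how quickly a sub-second target gets eaten.

```python
# Illustrative per-stage latencies in milliseconds. The specific numbers are
# assumptions chosen for the arithmetic, not benchmarks of any provider.
budget_ms = {
    "endpointing (silence + turn detection)": 300,
    "STT final transcript": 100,
    "LLM time-to-first-token": 250,
    "TTS time-to-first-byte": 150,
    "network + playout buffer": 100,
}

total = sum(budget_ms.values())
print(f"voice-to-voice: {total} ms")  # voice-to-voice: 900 ms
```

With these assumed numbers the pipeline already misses a common ~800 ms voice-to-voice target, which is why streaming every stage (rather than shaving any single one) is the usual fix.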
2. Frameworks and orchestration platforms
The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.
Open-source frameworks
LiveKit Agents Voice AI Quickstart Working assistant in under 10 minutes via Python or TypeScript; runs on top of WebRTC. Beginner
Pipecat Quickstart Scaffolds a Deepgram + OpenAI + Cartesia pipeline you can talk to in the browser in 5 minutes. Beginner
Ultravox (fixie-ai/ultravox) Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. Advanced
Managed platforms
Realtime / speech-to-speech APIs
OpenAI Realtime API Guide Official guide to gpt-realtime over WebRTC, WebSockets, or SIP. Intermediate
Google Gemini Live API Overview Low-latency, bidirectional voice + vision agents with barge-in and tool use. Intermediate
Twilio ConversationRelay WebSocket bridge that handles STT/TTS so you can focus on LLM logic; works with any LLM. Intermediate
Vendor-neutral comparisons
3. Speech-to-text (STT)
Pick one streaming STT provider and learn it deeply before shopping around. Deepgram, AssemblyAI, and the Whisper derivatives cover most use cases.
Commercial APIs
Open source
openai/whisper The original repo and the de facto starting point for any DIY ASR project. Beginner
SYSTRAN/faster-whisper CTranslate2 reimplementation up to 4× faster with INT8 quantization; the recommended option for self-hosted Whisper. Intermediate
NVIDIA NeMo (Parakeet / Canary) Top-of-leaderboard open ASR models with streaming inference recipes. Advanced
Moonshine Tiny on-device ASR (~190 MB) optimized for live streaming on edge devices. Intermediate
Benchmarks and explainers
4. Text-to-speech (TTS)
Latency, not raw quality, is what kills voice agents; prioritize providers that offer true streaming with first-byte latency under 200 ms.
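Why first-byte matters more than total synthesis time: a toy comparison of batch vs. streaming synthesis. The 30 ms per-chunk cost is an assumed number chosen to make the timing visible; real engines differ.

```python
import time

# Toy model: each text chunk costs a fixed amount to synthesize.
# CHUNK_COST_S is an assumption for illustration, not a real engine's figure.
CHUNK_COST_S = 0.03
text_chunks = ["Your order", " has shipped", " and arrives", " Tuesday."]

def batch_tts(chunks):
    time.sleep(CHUNK_COST_S * len(chunks))  # synthesize the whole utterance first
    return [f"<audio:{c}>" for c in chunks]

def streaming_tts(chunks):
    for c in chunks:
        time.sleep(CHUNK_COST_S)            # synthesize one chunk at a time
        yield f"<audio:{c}>"

start = time.perf_counter()
batch_tts(text_chunks)
batch_first_byte = time.perf_counter() - start   # ~120 ms: full utterance

start = time.perf_counter()
next(streaming_tts(text_chunks))
stream_first_byte = time.perf_counter() - start  # ~30 ms: first chunk only
```

The listener starts hearing audio after the first chunk either way, so the streaming path cuts perceived latency by roughly the length of everything after chunk one.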
Commercial APIs
Open source
Coqui TTS (idiap fork) Maintained fork of Coqui-TTS / XTTS v2; the most battle-tested OSS TTS toolkit. Intermediate
Piper (OHF-Voice/piper1-gpl) Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. Beginner
Kokoro 82M Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. Beginner
F5-TTS Diffusion-transformer TTS with high-quality zero-shot voice cloning. Intermediate
Orpheus-TTS Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. Intermediate
Sesame CSM Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. Advanced
Streaming and ethics
5. LLMs for voice and real-time AI
A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms time-to-first-token (TTFT) changes how the conversation feels entirely.
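TTFT is easy to measure yourself: time from issuing the request to receiving the first streamed token. A minimal, provider-agnostic sketch; `fake_llm` is a stand-in for any client that yields tokens as they arrive over the wire.

```python
import time

def first_token_latency(token_stream):
    """Return (seconds until first token, full concatenated reply)."""
    start = time.perf_counter()
    tokens = iter(token_stream)
    first = next(tokens)                 # blocks until the first token arrives
    ttft = time.perf_counter() - start
    return ttft, first + "".join(tokens)

def fake_llm():
    # Stand-in for a streaming LLM response (an assumption for the sketch):
    # pretend it takes 50 ms before the first token shows up.
    time.sleep(0.05)
    yield "Hello"
    yield ", caller!"

ttft, reply = first_token_latency(fake_llm())
print(f"TTFT: {ttft * 1000:.0f} ms, reply: {reply!r}")
```

Run the same wrapper against each provider you are comparing; median and p95 TTFT over many calls matter more than a single sample.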
Low-latency inference
Groq LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. ???? Beginner
Cerebras Inference Wafer-scale chip inference with very high throughput on Llama models. ???? Beginner
SambaNova Cloud Reconfigurable Dataflow inference; stable throughput at low latency. ???? Beginner
Speech-to-speech models
OpenAI Realtime API guide Flagship S2S product with WebRTC/WebSocket transport. Intermediate
Google Gemini Live Real-time multimodal voice/video with barge-in and 70-language support. Intermediate
Moshi (kyutai-labs) Open-source full-duplex speech-text foundation model with ~200 ms latency; the premier OSS S2S model to study. Advanced
Voice-specific prompting and tools
6. Voice activity detection and turn-taking
Pure VAD is no longer enough; modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.
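The combination can be sketched as a gate: acoustic silence alone triggers a check, and a semantic end-of-utterance score decides whether the pause means "done talking" or "mid-thought". The thresholds and the keyword heuristic below are illustrative assumptions; real systems use a small trained model over the transcript (and often prosody).

```python
# Minimal endpointing sketch. Everything here is an illustrative stand-in
# for a trained end-of-utterance model, not a production heuristic.

def semantic_eou_score(transcript: str) -> float:
    # Stub: flag obviously unfinished phrasing by its trailing word.
    words = transcript.rstrip().lower().split()
    trailing = words[-1] if words else ""
    return 0.1 if trailing in {"and", "but", "to", "the", "um"} else 0.9

def should_respond(silence_ms: float, transcript: str,
                   min_silence_ms: float = 200,
                   eou_threshold: float = 0.5) -> bool:
    """Speak only when the user is silent AND the utterance looks complete."""
    if silence_ms < min_silence_ms:
        return False                      # acoustic gate: still talking
    return semantic_eou_score(transcript) >= eou_threshold

print(should_respond(400, "I'd like to change my flight"))        # True
print(should_respond(400, "I'd like to change my flight to um"))  # False: mid-thought
print(should_respond(100, "I'd like to change my flight"))        # False: too soon
```

The payoff is fewer interruptions after filler words and trailing conjunctions, without forcing a long fixed silence timeout on every turn.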
7. WebRTC fundamentals
WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.
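For orientation, this is the shape of a typical WebRTC ICE configuration, written as a plain Python dict mirroring the standard RTCConfiguration structure. The server URLs and credentials are placeholders, not real endpoints.

```python
# Shape of a typical RTCConfiguration, as a plain dict.
# All URLs and credentials below are placeholder assumptions.
ice_config = {
    "iceServers": [
        # STUN: lets a client discover its public address. Cheap; tried first.
        {"urls": ["stun:stun.example.com:3478"]},
        # TURN: relays media when direct and STUN-assisted paths fail. Costs
        # server bandwidth, but is the only option behind symmetric NATs and
        # strict firewalls, so production deployments always provision it.
        {
            "urls": ["turn:turn.example.com:3478?transport=udp"],
            "username": "demo-user",
            "credential": "demo-secret",
        },
    ],
}
```

ICE then gathers candidates from each server and probes pairs until a path connects; an SFU sits behind this to fan media out to multiple participants.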
8. Telephony and SIP
The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.
9. Tutorials and hands-on projects
Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.
10. GitHub starter repos and awesome lists
Clone these instead of writing boilerplate from scratch.
livekit/agents The flagship open-source Python/Node framework for production voice agents. Beginner → Advanced
pipecat-ai/pipecat Vendor-neutral framework with 40+ STT/LLM/TTS service plugins. Beginner → Advanced
livekit-examples/agent-starter-python Production-ready starter with Dockerfile, eval suite, turn detector, and core plugins. Beginner
livekit-examples (org) Official collection of LiveKit Python/React/Swift/Android starters. Beginner
pipecat-ai/pipecat-examples Sample apps for push-to-talk, websocket, telephony, and multimodal use cases. Beginner → Advanced
elevenlabs/elevenlabs-examples Runnable Next.js and Python examples for TTS, STT, and real-time agents. Beginner
vocodedev/vocode-core Open-source modular framework for voice-LLM agents on phone, Zoom, or system audio. Intermediate (less actively maintained than LiveKit/Pipecat)
kwindla/macos-local-voice-agents Pipecat example hitting sub-800 ms voice-to-voice latency entirely on M-series Macs. Intermediate
zzw922cn/awesome-speech-recognition-speech-synthesis-papers Comprehensive curated index of ASR, TTS, voice conversion, and speech-LLM papers. Intermediate
wildminder/awesome-ai-voice Up-to-date 2025–2026 list of open-source TTS and voice-cloning models.
CorentinJ/Real-Time-Voice-Cloning Classic 5-second voice cloning project for understanding TTS fundamentals. Intermediate
11. Datasets and benchmarks
You'll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.
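Most ASR benchmarks you'll read report word error rate (WER). It is worth implementing once so the numbers stop being abstract: WER is word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") + one substitution ("lights" -> "light") over 5 words:
print(wer("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```

Note that WER depends heavily on text normalization (casing, punctuation, number spelling), which is exactly where dataset conventions leak into reported scores.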
12. Beginner-accessible research papers
These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first; they're unusually approachable.
13. Evaluation and testing
You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: the same scenario can pass on one run and fail on the next, so simulation and statistics matter more than fixed test cases.
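Concretely, that means running each scenario many times and reporting a pass rate instead of a single pass/fail. The sketch below simulates this; the 0.9 pass probability is an illustrative assumption standing in for a real simulated-conversation check.

```python
import random

# Simulated eval: rerun the "same" scenario many times because a voice
# agent's output varies run to run. run_scenario is a stand-in (assumption)
# for a real simulated-conversation judge.
random.seed(7)  # fixed seed so the demo is reproducible

def run_scenario() -> bool:
    return random.random() < 0.9  # pretend the agent passes ~90% of runs

N = 200
passes = sum(run_scenario() for _ in range(N))
rate = passes / N
# A single run would report pass OR fail; the rate is the meaningful number.
print(f"pass rate over {N} runs: {rate:.0%}")
```

From here, confidence intervals on the rate (and comparing rates before/after a prompt change) are what separate a real regression from run-to-run noise.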
14. Production, deployment, and scaling
Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.
15. Ethics, safety, and regulation
If you're shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.
16. Blogs and newsletters
Subscribe to two or three to stay current; the field moves quickly.
17. Podcasts
18. Communities
19. Conferences and events
20. Hackathons and competitions
Suggested learning path
Week 1, Foundations: Read the LiveKit pipeline post and the Voice AI Illustrated Primer (sections 1, 7).
Week 2, First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 9).
Week 3, Components: Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
Week 4, Turn-taking and telephony: Add Silero VAD and a turn detector; connect a SIP trunk (sections 6, 8).
Week 5, Production: Add evaluation and observability, and read the FCC/EU AI Act material (sections 13, 14, 15).
Ongoing: Subscribe to two newsletters and join a voice AI community on LinkedIn (sections 16, 17, 18).
Contributing
Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals.