
Voice AI Stack (ASR, STT, LLM, TTS)

The voice AI stack is a pipeline of four components: ASR/STT (speech to text), NLU/LLM (language understanding), TTS (text to speech), and the orchestration layer that glues them together in real time.

Definition

The voice AI stack is the pipeline that turns a spoken conversation into an AI-powered one. It has four parts: ASR/STT converts the user's speech to text; the LLM (large language model) understands the intent, decides the response, and calls tools; TTS converts the LLM's text response back to speech; and the orchestration layer glues it all together in real time, handling turn detection, interruptions, barge-in, and the sub-800ms latency budget required for natural conversation.
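To make the latency budget concrete, here is a minimal sketch that sums the per-stage targets named in this article. The stage names and the 800ms ceiling come from the text; the observation about streaming is why real stacks overlap stages rather than running them strictly in sequence.

```python
# Per-stage latency targets from this article (worst case, sequential).
STAGE_TARGETS_MS = {
    "asr_stt": 200,          # speech -> text
    "llm_first_token": 400,  # text -> first response token
    "tts_first_byte": 300,   # text -> first audio byte
}

def total_budget_ms(stages: dict) -> int:
    """Worst-case end-to-end latency if the stages do not overlap."""
    return sum(stages.values())

# Run strictly in sequence, these targets already sum to 900ms,
# which is over the 800ms conversational budget -- hence the need
# for streaming: TTS starts on the LLM's first sentence instead of
# waiting for the full reply.
print(total_budget_ms(STAGE_TARGETS_MS))
```

Note that the stack only fits the budget because the stages stream into one another; the sequential sum is the ceiling you are engineering away from, not the expected latency.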

ASR / STT — Automatic Speech Recognition

Also called Speech-to-Text. Converts audio into text tokens as the user speaks. Leading providers for Indian languages include Sarvam Saaras and Deepgram Nova. Latency target: <200ms.
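Streaming STT providers typically deliver interim (partial) transcripts while the user is still speaking, followed by a final result. The sketch below illustrates that consumption pattern; `transcribe_stream` is a hypothetical stand-in, not any particular provider's SDK.

```python
from typing import Iterator

def transcribe_stream(audio_chunks: Iterator[bytes]) -> Iterator[dict]:
    """Hypothetical stand-in for a streaming STT feed: yields interim
    results as audio arrives, then one final transcript."""
    words = []
    for i, _chunk in enumerate(audio_chunks):
        words.append(f"word{i}")  # pretend each chunk decodes to a word
        yield {"text": " ".join(words), "is_final": False}
    yield {"text": " ".join(words), "is_final": True}

def final_transcript(audio_chunks) -> str:
    """Consume interim results, return only the final transcript."""
    for result in transcribe_stream(audio_chunks):
        if result["is_final"]:
            return result["text"]
    return ""
```

The interim results matter for latency: downstream stages can begin work on a stable prefix before the final transcript lands.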

LLM — Large Language Model

The reasoning brain. Takes the transcribed text, applies your system prompt, retrieves context from your knowledge base (RAG), decides what to say, and optionally calls tools. Examples: GPT-4o, Claude Sonnet, Groq GPT-OSS, Sarvam-M. Latency target: <400ms to first token.
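The decide-or-call-a-tool step can be sketched as a small dispatch loop. Everything here is illustrative: `fake_llm` stands in for a real model API, and the `lookup_order` tool and its return values are invented for the example; only the registry-and-dispatch pattern is the point.

```python
# Hypothetical tool registry: name -> callable.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def fake_llm(system_prompt: str, user_text: str) -> dict:
    """Stand-in for a real model call. A real LLM decides whether to
    reply directly or emit a tool call; here it's hard-coded."""
    if "order" in user_text:
        return {"tool": "lookup_order", "args": {"order_id": "A123"}}
    return {"reply": "How can I help?"}

def run_turn(user_text: str) -> str:
    """One conversational turn: LLM decides, tools run, text comes back."""
    decision = fake_llm("You are a support agent.", user_text)
    if "tool" in decision:
        result = TOOLS[decision["tool"]](**decision["args"])
        return f"Your order {result['order_id']} is {result['status']}."
    return decision["reply"]
```

In a real stack the tool result is fed back to the model for a second generation pass; the sketch collapses that into a template for brevity.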

TTS — Text-to-Speech

Converts the LLM's text reply into audio. Leading providers: ElevenLabs, Cartesia, Sarvam Bulbul. Streaming TTS starts playing audio before the full text is generated. Latency target: <300ms to first audio byte.
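The "starts playing before the full text is generated" point is usually implemented by chunking the LLM token stream into sentences and synthesizing each one as soon as it completes. A minimal sketch, with `synthesize` as a stand-in for a real TTS call:

```python
def sentences_from_tokens(tokens):
    """Group a token stream into sentences ready for TTS."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((".", "!", "?")):
            yield " ".join(buf)
            buf = []
    if buf:  # flush any trailing partial sentence
        yield " ".join(buf)

def stream_tts(tokens, synthesize):
    """Synthesize sentence by sentence: audio for sentence 1 can play
    while the LLM is still generating sentences 2 and onward."""
    return [synthesize(s) for s in sentences_from_tokens(tokens)]
```

This is why the TTS latency target is measured to the first audio byte rather than to the end of synthesis.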

Orchestration layer

This is where voice AI gets hard. The orchestration layer handles voice activity detection (is the user still speaking?), turn detection (when to start responding), barge-in (when the user interrupts mid-reply), and the concurrency and state management of a real-time conversation. ThinnestAI provides this layer, built on LiveKit's real-time engine underneath.
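The turn-taking and barge-in logic above can be sketched as a toy state machine: VAD events drive turns, and user speech while the agent is talking triggers barge-in. The event names are illustrative, not any particular SDK's API.

```python
class TurnManager:
    """Toy turn-taking state machine driven by VAD events."""

    def __init__(self):
        self.state = "listening"

    def on_event(self, event: str) -> str:
        if self.state == "listening" and event == "user_stopped_speaking":
            self.state = "responding"   # turn detection: the agent's turn
        elif self.state == "responding" and event == "user_started_speaking":
            self.state = "listening"    # barge-in: cut TTS, yield the turn
        elif self.state == "responding" and event == "agent_finished":
            self.state = "listening"    # normal end of the agent's turn
        return self.state
```

Production orchestrators add timers (end-of-turn silence thresholds), partial-transcript handling, and cancellation of in-flight LLM and TTS work on barge-in, but the state transitions are the core of it.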