
Voice AI Stack (ASR, STT, LLM, TTS)

The voice AI stack is a pipeline of four components: ASR/STT (speech to text), NLU/LLM (language understanding), TTS (text to speech), and the orchestration layer that glues them together in real time.

Definition

The voice AI stack is the pipeline that turns a spoken conversation into an AI-powered one. It has four parts: ASR/STT converts the user's speech to text; the LLM (large language model) understands the intent, decides the response, and calls tools; TTS converts the LLM's text response back to speech; and the orchestration layer glues it all together in real time, handling turn detection, interruptions, barge-in, and the sub-800ms latency budget required for natural conversation.
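To make the latency budget concrete, here is a minimal sketch that sums the per-stage targets named in this article. The stage names and the 800ms ceiling come from the text; the observation about streaming is why real stacks overlap stages rather than running them strictly in sequence.

```python
# Per-stage latency targets from this article (worst case, sequential).
STAGE_TARGETS_MS = {
    "asr_stt": 200,          # speech -> text
    "llm_first_token": 400,  # text -> first response token
    "tts_first_byte": 300,   # text -> first audio byte
}

def total_budget_ms(stages: dict) -> int:
    """Worst-case end-to-end latency if the stages do not overlap."""
    return sum(stages.values())

# Run strictly in sequence, these targets already sum to 900ms,
# which is over the 800ms conversational budget -- hence the need
# for streaming: TTS starts on the LLM's first sentence instead of
# waiting for the full reply.
print(total_budget_ms(STAGE_TARGETS_MS))
```

Note that the stack only fits the budget because the stages stream into one another; the sequential sum is the ceiling you are engineering away from, not the expected latency.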

ASR / STT — Automatic Speech Recognition

Also called Speech-to-Text. Converts audio into text tokens as the user speaks. Leading providers for Indian languages include Sarvam Saaras and Deepgram Nova. Latency target: <200ms.
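Streaming STT providers typically deliver interim (partial) transcripts while the user is still speaking, followed by a final result. The sketch below illustrates that consumption pattern; `transcribe_stream` is a hypothetical stand-in, not any particular provider's SDK.

```python
from typing import Iterator

def transcribe_stream(audio_chunks: Iterator[bytes]) -> Iterator[dict]:
    """Hypothetical stand-in for a streaming STT feed: yields interim
    results as audio arrives, then one final transcript."""
    words = []
    for i, _chunk in enumerate(audio_chunks):
        words.append(f"word{i}")  # pretend each chunk decodes to a word
        yield {"text": " ".join(words), "is_final": False}
    yield {"text": " ".join(words), "is_final": True}

def final_transcript(audio_chunks) -> str:
    """Consume interim results, return only the final transcript."""
    for result in transcribe_stream(audio_chunks):
        if result["is_final"]:
            return result["text"]
    return ""
```

The interim results matter for latency: downstream stages can begin work on a stable prefix before the final transcript lands.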

LLM — Large Language Model

The reasoning brain. Takes the transcribed text, applies your system prompt, retrieves context from your knowledge base (RAG), decides what to say, and optionally calls tools. Examples: GPT-4o, Claude Sonnet, Groq GPT-OSS, Sarvam-M. Latency target: <400ms to first token.
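The decide-or-call-a-tool step can be sketched as a small dispatch loop. Everything here is illustrative: `fake_llm` stands in for a real model API, and the `lookup_order` tool and its return values are invented for the example; only the registry-and-dispatch pattern is the point.

```python
# Hypothetical tool registry: name -> callable.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def fake_llm(system_prompt: str, user_text: str) -> dict:
    """Stand-in for a real model call. A real LLM decides whether to
    reply directly or emit a tool call; here it's hard-coded."""
    if "order" in user_text:
        return {"tool": "lookup_order", "args": {"order_id": "A123"}}
    return {"reply": "How can I help?"}

def run_turn(user_text: str) -> str:
    """One conversational turn: LLM decides, tools run, text comes back."""
    decision = fake_llm("You are a support agent.", user_text)
    if "tool" in decision:
        result = TOOLS[decision["tool"]](**decision["args"])
        return f"Your order {result['order_id']} is {result['status']}."
    return decision["reply"]
```

In a real stack the tool result is fed back to the model for a second generation pass; the sketch collapses that into a template for brevity.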

TTS — Text-to-Speech

Converts the LLM's text reply into audio. Leading providers: ElevenLabs, Cartesia, Sarvam Bulbul. Streaming TTS starts playing audio before the full text is generated. Latency target: <300ms to first audio byte.
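The "starts playing before the full text is generated" point is usually implemented by chunking the LLM token stream into sentences and synthesizing each one as soon as it completes. A minimal sketch, with `synthesize` as a stand-in for a real TTS call:

```python
def sentences_from_tokens(tokens):
    """Group a token stream into sentences ready for TTS."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((".", "!", "?")):
            yield " ".join(buf)
            buf = []
    if buf:  # flush any trailing partial sentence
        yield " ".join(buf)

def stream_tts(tokens, synthesize):
    """Synthesize sentence by sentence: audio for sentence 1 can play
    while the LLM is still generating sentences 2 and onward."""
    return [synthesize(s) for s in sentences_from_tokens(tokens)]
```

This is why the TTS latency target is measured to the first audio byte rather than to the end of synthesis.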

Orchestration layer

This is where voice AI gets hard. The orchestration layer handles voice activity detection (is the user still speaking?), turn detection (when to start responding), barge-in (when the user interrupts mid-reply), and the concurrency and state management of a real-time conversation. ThinnestAI provides this layer, built on LiveKit's real-time engine underneath.
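The turn-taking and barge-in logic above can be sketched as a toy state machine: VAD events drive turns, and user speech while the agent is talking triggers barge-in. The event names are illustrative, not any particular SDK's API.

```python
class TurnManager:
    """Toy turn-taking state machine driven by VAD events."""

    def __init__(self):
        self.state = "listening"

    def on_event(self, event: str) -> str:
        if self.state == "listening" and event == "user_stopped_speaking":
            self.state = "responding"   # turn detection: the agent's turn
        elif self.state == "responding" and event == "user_started_speaking":
            self.state = "listening"    # barge-in: cut TTS, yield the turn
        elif self.state == "responding" and event == "agent_finished":
            self.state = "listening"    # normal end of the agent's turn
        return self.state
```

Production orchestrators add timers (end-of-turn silence thresholds), partial-transcript handling, and cancellation of in-flight LLM and TTS work on barge-in, but the state transitions are the core of it.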