What is voice AI latency?

Voice AI Latency

Voice AI latency is the time between the user finishing a sentence and hearing the agent respond — the most important quality metric for voice AI.

Definition

Voice AI latency is the end-to-end delay between the moment a user finishes speaking and the moment they hear the voice AI agent begin to reply. For natural-feeling conversation the target is 500–800ms total, split across speech-to-text (STT, ~100–200ms), LLM time-to-first-token (~150–400ms), text-to-speech (TTS) time-to-first-audio (~150–300ms), and network round-trip. Those three stage ranges already add up to 400–900ms before any network time, so meeting the target means keeping every stage near the low end of its range. Above 1.2 seconds conversation feels robotic; above 2 seconds the agent feels broken.
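The budget arithmetic can be checked with a short script. This is only a sketch: the stage ranges are the ones quoted above, while the network figure and the "stall" label for the 800–1200ms band are assumptions added for illustration.

```python
# Stage ranges in milliseconds, as quoted in the definition.
STT_MS = (100, 200)
LLM_TTFT_MS = (150, 400)
TTS_TTFA_MS = (150, 300)
# Assumption: the text gives no number for network round-trip.
NETWORK_MS = (50, 100)


def total_range(*stages):
    """Sum the best-case and worst-case latency across all stages."""
    low = sum(stage[0] for stage in stages)
    high = sum(stage[1] for stage in stages)
    return low, high


def classify(latency_ms):
    """Map an end-to-end latency to the bands described in the text."""
    if latency_ms <= 800:
        return "natural"
    if latency_ms <= 1200:
        return "stall"  # assumed label for a noticeable pause
    if latency_ms <= 2000:
        return "robotic"
    return "broken"


low, high = total_range(STT_MS, LLM_TTFT_MS, TTS_TTFA_MS, NETWORK_MS)
print(f"best case:  {low} ms ({classify(low)})")
print(f"worst case: {high} ms ({classify(high)})")
```

With these assumed network numbers the best case lands inside the target and the worst case does not, which is why each stage has to be tuned rather than just one.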

Why it matters

Humans perceive a pause of 800ms+ as a stall. A voice AI agent with 1.5s latency is usable but feels obviously artificial. A voice AI agent with 400ms latency feels like talking to a sharp human — which is the whole point of voice AI in the first place.

How to reduce latency

Streaming STT: begin transcribing while the user is still speaking, not after.
Streaming LLM: use an LLM that starts emitting tokens within 150ms. Groq-hosted models are often cited as among the fastest on this measure.
Streaming TTS: pipe tokens into TTS as the LLM generates them, not after. TTFA (time-to-first-audio) <300ms is achievable with providers like ElevenLabs Turbo and Cartesia Sonic.
Regional deployment: host as close to your users as possible. For users in India, a Mumbai region is the usual default.
Provider routing: pick the fastest provider for each layer of the stack per workload.
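The streaming TTS step above can be illustrated with a small chunker that hands text to synthesis as soon as a speakable phrase is ready, instead of waiting for the full LLM reply. This is a minimal sketch, not any provider's API: `chunk_for_tts`, `fake_llm_stream` and the `max_chars` limit are hypothetical names and values, and the `print` call stands in for a real TTS request.

```python
import re
from typing import Iterable, Iterator

# Flush a chunk when the buffer ends in clause or sentence punctuation.
_BOUNDARY = re.compile(r"[.!?,;:]\s*$")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 80) -> Iterator[str]:
    """Group streamed LLM tokens into speakable chunks.

    A chunk is emitted as soon as it ends in punctuation or reaches
    max_chars, so synthesis can begin before the LLM has finished.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if _BOUNDARY.search(buffer) or len(buffer) >= max_chars:
            yield buffer.strip()
            buffer = ""
    # Emit whatever is left once the token stream ends.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    yield from ["Sure", ",", " I", " can", " help", ".",
                " What", " is", " your", " order", " number", "?"]


for chunk in chunk_for_tts(fake_llm_stream()):
    # In a real pipeline this chunk would be sent to the TTS provider.
    print("send to TTS:", chunk)
```

Flushing on commas as well as full stops trades slightly choppier prosody for an earlier first audio chunk; whether that trade is worthwhile depends on the TTS voice in use.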

More definitions

Voice AI Agent

A voice AI agent is an AI system that holds real-time spoken conversations via phone, web or SIP — combining speech recognition, an LLM and speech synthesis.

Voice AI

Voice AI is the umbrella term for AI that understands and generates human speech in real time — powering voice assistants, phone agents and translation.

Conversational AI

Conversational AI is the category of AI that interacts with humans in natural language across chat, voice, email and messaging — using NLU, LLMs and tools.

IVR vs Voice AI

IVR is a rigid scripted tree (press 1 for sales). Voice AI is a natural-language agent that understands free-form speech, reasons and calls tools.

BYOK (Bring Your Own Key)

BYOK lets you bring your own LLM, STT and TTS API keys — the voice AI platform routes usage through your accounts instead of bundling provider costs.

BYON (Bring Your Own Number)

BYON lets you bring your own phone number — via Twilio, Vobiz or Exotel — and connect it to the voice AI platform via SIP instead of renting one.
