Frequently Asked Questions

What's the latency improvement vs cascaded?

Cascaded voice agents typically run at 600–900 ms end-to-end first-token latency once STT, LLM, and TTS round-trips stack up. Gemini Live in S2S mode collapses those into one connection, with typical first-token latencies in the 200–400 ms range depending on network and model variant. The bigger win in practice is turn-handling: native VAD and end-of-turn detection feel noticeably more natural than even a well-tuned cascaded stack.

Can I use my own Google Cloud project?

Yes. Toggle Vertex AI in the Hosting card and Gemini Live routes through your GCP project instead of our managed key. You'll need GOOGLE_APPLICATION_CREDENTIALS (a service-account JSON path) set on the deployment. Project and region are optional overrides if your service account file doesn't already pin them.

What's the difference between BYO Gemini API key and Vertex AI?

BYO Gemini API key bills through the AI Studio path: paste an AIza… key and you pay Google directly via your AI Studio account. Vertex AI routes the same model through your GCP project, which gives you GCP's data-residency and IAM controls plus enterprise billing. Both zero out our model-side bill, so we charge only the platform fee. Pick BYO key for solo / small-team setups; Vertex for regulated or enterprise deployments.

Why does my agent stay silent on Gemini 3.1?

Gemini 3.1 Flash Live Preview doesn't support generate_reply() — the agent can't initiate the first turn. The caller has to speak first. We log a warning when this happens. Switch to Gemini 2.5 Flash for agent-initiated greetings, or enable half-cascade (Use custom TTS in the S2S tab) so the cascaded TTS path plays the greeting.

Can I use Gemini for listening and a different TTS for voice?

Yes — that's the half-cascade mode. Enable Use custom TTS in the S2S tab. Gemini Live runs in TEXT modality (listens + reasons) and the reply plays through whatever TTS plugin you have configured on the agent (Cartesia, ElevenLabs, Sarvam, Aero TTS, etc). Note that native-audio Gemini models (the 2.5 default) reject TEXT modality — the toggle is disabled there. Switch to Gemini 3.1 Flash to enable half-cascade.

Gemini Live in Your Voice Agents: Speech-to-Speech the Way It Should Be

Thinnest AI Team

May 11, 2026 • 5 min read

The pipeline tax on voice AI

The classic voice agent runs three network hops back-to-back: STT transcribes the caller, an LLM reasons about a reply, and TTS synthesises audio. Each hop is its own round trip and its own failure mode. Even with the fastest providers, you're stacking 600–900 ms of latency before the agent has spoken a single word.

Speech-to-Speech collapses that into one model. The caller's audio goes in, the agent's audio comes out, and the model handles everything in between — including the bits the cascaded stack treats as plumbing: voice activity detection, end-of-turn prediction, interruption handling, even the choice of when to stay silent.
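
To make the stacking concrete, here is a minimal sketch of the arithmetic. The per-hop figures are illustrative placeholders chosen for the example, not measurements from thinnestAI or any provider; the only point is that sequential hops add up, while a single connection pays one round trip.

```python
# Illustrative latencies in milliseconds. These are placeholder values
# for the sketch, not measured numbers.
CASCADED_HOPS_MS = {"stt": 250, "llm": 300, "tts": 200}
S2S_SINGLE_HOP_MS = 300


def cascaded_first_audio_ms(hops: dict) -> int:
    """Sequential hops add: each stage waits for the one before it."""
    return sum(hops.values())


def latency_saved_ms(hops: dict, s2s_ms: int) -> int:
    """Difference between the stacked pipeline and one S2S connection."""
    return cascaded_first_audio_ms(hops) - s2s_ms


if __name__ == "__main__":
    print(cascaded_first_audio_ms(CASCADED_HOPS_MS))              # 750
    print(latency_saved_ms(CASCADED_HOPS_MS, S2S_SINGLE_HOP_MS))  # 450
```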

On thinnestAI, that toggle is now live for Google Gemini Live.

What "Speech-to-Speech with Gemini Live" actually means

In your agent's Voice Configuration screen there's a header toggle: Cascaded ↔ Speech-to-Speech. Flip it to S2S and a new tab appears between TTS and Detection. The cascaded STT and TTS tabs go away — in pure S2S, Gemini Live owns those layers.

Inside the S2S tab you pick:

A realtime model — Gemini 2.5 Flash (the recommended native-audio default) or Gemini 3.1 Flash (experimental, lowest-latency, single-turn).
A voice from the 30-voice catalog — Puck, Kore, Charon, Sulafat, Achird, Despina, and 24 more, grouped by character (Popular, Bright & Lively, Warm & Friendly, Smooth & Even, Clear & Informative, Firm & Mature).
Sampling — temperature (0.0–2.0) and a max-output-tokens cap.
Conversational tuning — two switches available on Gemini 2.5: Affective Dialog adjusts the agent's tone to match the caller's emotional state (great for support / empathy use cases), and Proactivity lets the model stay silent on background chatter instead of replying to every cough.
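
As a rough mental model, the choices above can be pictured as one settings object with a few validation rules. The sketch below is hypothetical: the class name, field names, model identifiers, and the default temperature are assumptions made for illustration, not thinnestAI's actual schema. Only the constraints (temperature range, Gemini 2.5-only switches) come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model identifiers for this sketch; the real IDs may differ.
GEMINI_25 = "gemini-2.5-flash"
GEMINI_31 = "gemini-3.1-flash"


@dataclass
class S2SSettings:
    model: str = GEMINI_25
    voice: str = "Puck"
    temperature: float = 0.8            # arbitrary placeholder default
    max_output_tokens: Optional[int] = None
    affective_dialog: bool = False
    proactivity: bool = False

    def validate(self) -> list:
        """Return a list of problems; an empty list means the settings pass."""
        problems = []
        if not 0.0 <= self.temperature <= 2.0:
            problems.append("temperature must be between 0.0 and 2.0")
        if self.max_output_tokens is not None and self.max_output_tokens <= 0:
            problems.append("max_output_tokens must be positive")
        if (self.affective_dialog or self.proactivity) and self.model != GEMINI_25:
            problems.append("affective dialog and proactivity need Gemini 2.5")
        return problems
```

For example, `S2SSettings(model=GEMINI_31, proactivity=True).validate()` returns one problem, because the conversational-tuning switches are only offered on Gemini 2.5.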

Three hosting modes — including BYOK

One of the things builders kept asking for was a way to use Gemini Live without routing payment through us. The Hosting card has three mutually-exclusive options:

Platform (default) — our managed Gemini API key, billed through your thinnestAI plan.
BYO Gemini API key — paste a key from aistudio.google.com/app/apikey and Gemini Live usage bills to your Google account. We charge only the platform fee.
Vertex AI — routes through your own GCP project for enterprise / data-residency. Requires GOOGLE_APPLICATION_CREDENTIALS on the deployment; project and region are optional overrides.

The BYO key field validates the AIza… prefix in-browser, so a misplaced OpenAI key can't accidentally end up on the Gemini side. Keys are stored encrypted and never logged.
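
A minimal sketch of what these checks might look like, written in Python for readability (the real validation runs in the browser). The enum, function names, and error strings are hypothetical; only the AIza prefix and the GOOGLE_APPLICATION_CREDENTIALS requirement come from the text above. A prefix check is a guard against pasting the wrong provider's key, not proof that a key is valid.

```python
import os
from enum import Enum


class Hosting(Enum):
    PLATFORM = "platform"
    BYO_KEY = "byo_key"
    VERTEX = "vertex"


def looks_like_gemini_key(key: str) -> bool:
    """Cheap prefix check only; it does not prove the key works."""
    return key.strip().startswith("AIza")


def hosting_problems(mode: Hosting, api_key: str = "", env=None) -> list:
    """Return configuration problems for the chosen hosting mode."""
    env = os.environ if env is None else env
    problems = []
    if mode is Hosting.BYO_KEY and not looks_like_gemini_key(api_key):
        problems.append("BYO key must start with 'AIza'")
    if mode is Hosting.VERTEX and not env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    return problems
```

Passing `env` explicitly keeps the function easy to test; in a deployment you would leave it unset so it reads the process environment.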

Half-cascade — keep Gemini for listening, swap your own TTS

You don't have to commit to Gemini's voices. The S2S tab has a Use custom TTS toggle that runs Gemini Live in TEXT modality and pipes the reply through your selected cascaded TTS plugin — Cartesia, ElevenLabs, Sarvam, Aero TTS, anything we support.

This is the "half-cascade" pattern: Gemini's reasoning + endpointing on the input side, your brand voice on the output side. It works on Gemini 3.1 Flash and on non-native-audio variants. The native-audio Gemini 2.5 default rejects TEXT modality at the API level, so we disable the toggle there; switch models and it lights back up.
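
The rule behind the toggle can be sketched in a few lines. The function names are hypothetical and the `native_audio` flag is an assumed property of the selected model; the logic simply restates the constraint described above.

```python
def half_cascade_available(native_audio: bool) -> bool:
    """Native-audio models reject TEXT modality, so the toggle is disabled."""
    return not native_audio


def response_modality(native_audio: bool, use_custom_tts: bool) -> str:
    """Pick the output modality for the realtime session."""
    if use_custom_tts:
        if not half_cascade_available(native_audio):
            raise ValueError("native-audio models cannot run in TEXT modality")
        return "TEXT"   # Gemini returns text; the cascaded TTS plugin speaks it
    return "AUDIO"      # pure S2S: Gemini speaks the reply itself
```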

Thinking mode — chain-of-thought when you want it

Native-audio Gemini models think before they speak. By default that chain-of-thought stays internal — your transcripts stay clean. There's a one-line toggle to forward thoughts as transcript text when you're debugging reasoning. Off by default; one reset button restores defaults across the card.
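
Conceptually, the toggle is a filter on what reaches the transcript. The sketch below assumes reasoning arrives as parts flagged with a boolean; the `Part` class here is a stand-in written for this example, not the platform's or Google's actual type.

```python
from dataclasses import dataclass


@dataclass
class Part:
    text: str
    thought: bool = False  # True when the part is model reasoning


def transcript_text(parts: list, forward_thoughts: bool = False) -> str:
    """Join the parts that belong in the transcript."""
    kept = [p.text for p in parts if forward_thoughts or not p.thought]
    return " ".join(kept)


if __name__ == "__main__":
    parts = [Part("Caller wants Tuesday.", thought=True), Part("Tuesday at 3 pm works.")]
    print(transcript_text(parts))                         # Tuesday at 3 pm works.
    print(transcript_text(parts, forward_thoughts=True))  # Caller wants Tuesday. Tuesday at 3 pm works.
```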

Custom turn detection — bring your own STT on the input side

Gemini Live's built-in VAD is good, but not always the right tool. Callers in noisy households, contact centres with background chatter, agents that need a specific STT's accent coverage — for those cases there's a new Custom Turn Detection switch in the S2S tab.

Flip it on and three things happen:

Gemini Live's automatic activity detection is disabled at construction (we set realtime_input_config.automatic_activity_detection.disabled=True).
The cascaded STT tab reappears in the tab bar so you can pick Deepgram Nova-3, Sarvam, AssemblyAI, or any other STT plugin we support.
The Detection tab reappears so you can choose the turn-detection mode (LiveKit's MultilingualModel, STT-endpointing, or VAD-only) and tune the endpointing thresholds.

Gemini Live still speaks the reply on the output side. The input pipeline is yours.
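
In configuration terms, the switch boils down to one nested flag. The sketch below builds it as a plain dictionary that mirrors the field path quoted above; the function name is hypothetical, and a real integration would pass the equivalent typed config object to whatever SDK it uses.

```python
def realtime_session_config(custom_turn_detection: bool) -> dict:
    """Build the turn-detection part of a session config as a plain dict."""
    config = {}
    if custom_turn_detection:
        # With Gemini's own activity detection off, the external STT and
        # turn detector decide when the caller has finished speaking.
        config["realtime_input_config"] = {
            "automatic_activity_detection": {"disabled": True}
        }
    return config
```

Leaving the key out entirely when the switch is off keeps Gemini Live's built-in VAD in charge, which is the default behaviour described earlier.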

Cost breakdown — one row, transparent rates

When S2S is on, the cost-breakdown popup collapses STT + TTS + LLM into a single "Gemini Live" row. Pure native-audio at $3 / $12 per million audio in/out tokens works out to about ₹6.80 per minute on a typical call. With BYOK (Vertex or BYO key) we mark that row as Free — you pay Google directly and we only charge the platform fee.

Half-cascade restores the TTS row alongside Gemini Live so you can see exactly what each component costs.
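
The token arithmetic behind the Gemini Live row can be sketched as follows. The per-million rates are the ones quoted above; token counts per minute depend on the call and are not given here, so the function takes them as inputs and does not try to reproduce the per-minute rupee figure. Function names are hypothetical, and currency conversion is left out.

```python
USD_PER_M_AUDIO_IN = 3.0    # rate quoted above, per million input tokens
USD_PER_M_AUDIO_OUT = 12.0  # rate quoted above, per million output tokens


def gemini_live_cost_usd(tokens_in: int, tokens_out: int) -> float:
    """Token cost in USD for a given number of audio tokens."""
    total = tokens_in * USD_PER_M_AUDIO_IN + tokens_out * USD_PER_M_AUDIO_OUT
    return total / 1_000_000


def cost_row(tokens_in: int, tokens_out: int, byok: bool):
    """What the Gemini Live row shows: 'Free' under BYOK, else a USD amount."""
    if byok:
        return "Free"  # usage is billed by Google directly
    return round(gemini_live_cost_usd(tokens_in, tokens_out), 6)
```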

Greeting behaviour, by configuration

Pure S2S, Gemini 2.5 — Gemini Live speaks the configured greeting on its first turn. The greeting text is baked into the model's system instructions as an opening-line block, so the first response is unambiguous instead of "I'm sorry, I didn't get that."
Pure S2S, Gemini 3.1 — Gemini 3.1 doesn't support agent-initiated turns. The user must speak first. Switch to Gemini 2.5 or enable half-cascade if you need an agent-initiated greeting.
S2S half-cascade — the greeting plays through your cascaded TTS plugin, same as the classic flow.
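
The three cases above amount to a small decision function. This is a sketch with hypothetical names and return values; the model family strings are placeholders for however the selected model is identified.

```python
def greeting_path(model_family: str, half_cascade: bool) -> str:
    """Decide who delivers the opening line for an S2S agent."""
    if half_cascade:
        return "cascaded_tts"         # greeting plays through the TTS plugin
    if model_family == "2.5":
        return "gemini_live"          # greeting is spoken on the first turn
    if model_family == "3.1":
        return "caller_speaks_first"  # no agent-initiated turn available
    raise ValueError(f"unknown model family: {model_family!r}")
```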

When Gemini Live is the right call

You want the lowest turn-around latency available right now.
The call is a focused conversation — booking, FAQ, lead capture, simple support — rather than a heavy mid-call prompt-mutation workflow.
You're happy with Gemini's built-in voices, or you want to half-cascade onto your own TTS.
You have a Google Cloud relationship and want to consolidate spend on Vertex AI.

If you need agent handoffs, parallel tool calls, or mid-session prompt updates, stay on Gemini 2.5 — Gemini 3.1's experimental release disables those.

Try it in 90 seconds

Open any voice agent in Agent Studio → Voice Configuration.
Click Speech-to-Speech in the top-right of the header.
The S2S tab opens with the model sidebar on the left. Leave Gemini 2.5 Flash selected.
(Optional) Paste your AI Studio key into the Hosting card so usage bills to your Google account.
Save and click Try Voice Call.

You should hear the difference on the first turn.

Try Gemini Live S2S Free →

Free trial includes 5 voice minutes · No credit card required · BYOK supported on every plan.
