
Gemini Live in Your Voice Agents: Speech-to-Speech the Way It Should Be

Thinnest AI Team
May 11, 2026 5 min read

The pipeline tax on voice AI

The classic voice agent runs three networks back-to-back: STT transcribes the caller, an LLM reasons about a reply, and TTS synthesises audio. Each hop is its own round trip and its own failure mode. Even with the fastest providers, you're stacking 200–400 ms of latency before the model has spoken a single word.

Speech-to-Speech collapses that into one model. The caller's audio goes in, the agent's audio comes out, and the model handles everything in between — including the bits the cascaded stack treats as plumbing: voice activity detection, end-of-turn prediction, interruption handling, even the choice of when to stay silent.

On thinnestAI, Speech-to-Speech is now live for Google Gemini Live.

What "Speech-to-Speech with Gemini Live" actually means

In your agent's Voice Configuration screen there's a header toggle: Cascaded ↔ Speech-to-Speech. Flip it to S2S and a new tab appears between TTS and Detection. The cascaded STT and TTS tabs go away — in pure S2S, Gemini Live owns those layers.

Inside the S2S tab you pick:

  • A realtime model — Gemini 2.5 Flash (the recommended native-audio default) or Gemini 3.1 Flash (experimental, lowest-latency, single-turn).
  • A voice from the 30-voice catalog — Puck, Kore, Charon, Sulafat, Achird, Despina, and 24 more, grouped by character (Popular, Bright & Lively, Warm & Friendly, Smooth & Even, Clear & Informative, Firm & Mature).
  • Sampling — temperature (0.0–2.0) and a max-output-tokens cap.
  • Conversational tuning — two switches available on Gemini 2.5: Affective Dialog adjusts the agent's tone to match the caller's emotional state (great for support / empathy use cases), and Proactivity lets the model stay silent on background chatter instead of replying to every cough.
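The options in the S2S tab can be pictured as a small settings object with a few consistency checks. This is a minimal sketch, not thinnestAI's actual schema: the class, the field names, the model identifiers, and the default temperature are all assumptions made for illustration. Only the 0.0–2.0 temperature range and the Gemini 2.5-only switches come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical identifiers: the post names the models, not their API ids.
GEMINI_2_5 = "gemini-2.5-flash"
GEMINI_3_1 = "gemini-3.1-flash"


@dataclass
class S2SSettings:
    """Illustrative mirror of the S2S tab; every field name is an assumption."""
    model: str = GEMINI_2_5
    voice: str = "Puck"
    temperature: float = 1.0  # assumed default, not stated in the post
    max_output_tokens: Optional[int] = None
    affective_dialog: bool = False
    proactivity: bool = False

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty when consistent)."""
        problems = []
        if not 0.0 <= self.temperature <= 2.0:
            problems.append("temperature must be within 0.0-2.0")
        if self.max_output_tokens is not None and self.max_output_tokens <= 0:
            problems.append("max_output_tokens must be positive")
        # The two conversational-tuning switches are described as Gemini 2.5 only.
        if self.model != GEMINI_2_5 and (self.affective_dialog or self.proactivity):
            problems.append("Affective Dialog and Proactivity require Gemini 2.5")
        return problems
```

Calling `S2SSettings(model=GEMINI_3_1, proactivity=True).validate()` returns one problem, which is the kind of mismatch the UI avoids by only offering those switches on Gemini 2.5.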

Three hosting modes — including BYOK

One of the things builders kept asking for was a way to use Gemini Live without routing payment through us. The Hosting card has three mutually-exclusive options:

  • Platform (default) — our managed Gemini API key, billed through your thinnestAI plan.
  • BYO Gemini API key — paste a key from aistudio.google.com/app/apikey and Gemini Live usage bills to your Google account. We charge only the platform fee.
  • Vertex AI — routes through your own GCP project for enterprise / data-residency. Requires GOOGLE_APPLICATION_CREDENTIALS on the deployment; project and region are optional overrides.

The BYO key field validates the AIza… prefix in-browser, so a misplaced OpenAI key can't accidentally end up on the Gemini side. Keys are stored encrypted and never logged.
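The prefix check is simple enough to sketch. The version below is a hypothetical Python rendering of the rule described above (the real check runs in the browser); the enum and function names are assumptions. A prefix match only catches a key pasted into the wrong field. It does not prove the key is valid.

```python
from enum import Enum
from typing import Optional


class HostingMode(Enum):
    """The three mutually exclusive hosting options (names are illustrative)."""
    PLATFORM = "platform"
    BYO_KEY = "byo_key"
    VERTEX_AI = "vertex_ai"


def check_hosting(mode: HostingMode, api_key: Optional[str] = None) -> Optional[str]:
    """Return an error message, or None when the selection passes the check.

    Only BYO key mode needs a pasted key; the key is checked for the
    'AIza' prefix and nothing else.
    """
    if mode is HostingMode.BYO_KEY:
        if not api_key or not api_key.strip().startswith("AIza"):
            return "Expected a Gemini API key starting with 'AIza'"
    return None
```

A key with an `sk-` prefix, for example, is rejected before it is ever stored.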

Half-cascade — keep Gemini for listening, swap your own TTS

You don't have to commit to Gemini's voices. The S2S tab has a Use custom TTS toggle that runs Gemini Live in TEXT modality and pipes the reply through your selected cascaded TTS plugin — Cartesia, ElevenLabs, Sarvam, Aero TTS, anything we support.

This is the "half-cascade" pattern: Gemini's reasoning and endpointing on the input side, your brand voice on the output side. It works on Gemini 3.1 Flash and on non-native-audio variants. The native-audio Gemini 2.5 default rejects TEXT modality at the API level, so we disable the toggle there; switch models and it lights back up.
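The gating rule for the toggle reduces to one decision: which output modality to request from Gemini Live. A minimal sketch follows, assuming a hypothetical `native_audio` flag on the selected model; the function name is illustrative and not part of any SDK.

```python
def response_modality(use_custom_tts: bool, native_audio: bool) -> str:
    """Pick the output modality to request from Gemini Live.

    TEXT means the reply is handed to the cascaded TTS plugin
    (half-cascade); AUDIO means Gemini Live speaks the reply itself.
    """
    if use_custom_tts:
        if native_audio:
            # Native-audio models reject TEXT modality, so the UI
            # disables the toggle instead of reaching this branch.
            raise ValueError("custom TTS is unavailable on native-audio models")
        return "TEXT"
    return "AUDIO"
```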

Thinking mode — chain-of-thought when you want it

Native-audio Gemini models think before they speak. By default that chain-of-thought stays internal — your transcripts stay clean. There's a one-line toggle to forward thoughts as transcript text when you're debugging reasoning. Off by default; one reset button restores defaults across the card.

Custom turn detection — bring your own STT on the input side

Gemini Live's built-in VAD is good, but not always the right tool. Callers in noisy households, contact centres with background chatter, agents that need a specific STT's accent coverage — for those cases there's a new Custom Turn Detection switch in the S2S tab.

Flip it on and three things happen:

  • Gemini Live's automatic activity detection is disabled at construction (we set realtime_input_config.automatic_activity_detection.disabled=True).
  • The cascaded STT tab reappears in the tab bar so you can pick Deepgram Nova-3, Sarvam, AssemblyAI, or any other STT plugin we support.
  • The Detection tab reappears so you can choose the turn-detection mode (LiveKit's MultilingualModel, STT-endpointing, or VAD-only) and tune the endpointing thresholds.

Gemini Live still speaks the reply on the output side. The input pipeline is yours.
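As a rough picture of what the switch changes, the sketch below builds session options as a plain dictionary. The nesting under `realtime_input_config` follows the field path quoted above; the function name, the `turn_detection` key, and the mode labels are assumptions for illustration, not a real plugin API.

```python
from typing import Optional

# Illustrative labels for the three turn-detection modes named above.
TURN_DETECTION_MODES = ("multilingual_model", "stt_endpointing", "vad_only")


def build_session_options(custom_turn_detection: bool,
                          mode: Optional[str] = None) -> dict:
    """Return session options as a plain dict (a sketch, not a real SDK object)."""
    options: dict = {}
    if not custom_turn_detection:
        # Gemini Live keeps its built-in activity detection; nothing to add.
        return options
    if mode not in TURN_DETECTION_MODES:
        raise ValueError(f"unknown turn-detection mode: {mode!r}")
    # Disable Gemini Live's automatic activity detection, per the field path above.
    options["realtime_input_config"] = {
        "automatic_activity_detection": {"disabled": True}
    }
    # Hypothetical key recording which external mode drives end-of-turn.
    options["turn_detection"] = mode
    return options
```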

Cost breakdown — one row, transparent rates

When S2S is on, the cost-breakdown popup collapses STT + TTS + LLM into a single "Gemini Live" row. Pure native-audio at $3 / $12 per million audio in/out tokens works out to about ₹6.80 per minute on a typical call. With BYOK (Vertex or BYO key) we mark that row as Free — you pay Google directly and we only charge the platform fee.

Half-cascade restores the TTS row alongside Gemini Live so you can see exactly what each component costs.
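The per-token rates quoted above translate into a call cost once you know how many audio tokens a call consumed. The sketch below takes token counts as inputs because the post does not state tokens per minute or an exchange rate, so it makes no attempt to reproduce the per-minute rupee figure. The function names are illustrative.

```python
AUDIO_IN_USD_PER_MILLION = 3.0    # rate quoted in the post
AUDIO_OUT_USD_PER_MILLION = 12.0  # rate quoted in the post


def gemini_live_cost_usd(audio_in_tokens: int, audio_out_tokens: int) -> float:
    """Raw Gemini Live audio cost in USD for the given token counts."""
    total = (audio_in_tokens * AUDIO_IN_USD_PER_MILLION
             + audio_out_tokens * AUDIO_OUT_USD_PER_MILLION)
    return total / 1_000_000


def displayed_cost_usd(audio_in_tokens: int, audio_out_tokens: int,
                       byok: bool) -> float:
    """Cost shown in the 'Gemini Live' row; zero under BYOK (BYO key or Vertex)."""
    if byok:
        return 0.0
    return gemini_live_cost_usd(audio_in_tokens, audio_out_tokens)
```

Because output tokens cost four times as much as input tokens, calls where the agent does most of the talking cost more than calls of the same length where the caller does.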

Greeting behaviour, by configuration

  • Pure S2S, Gemini 2.5 — Gemini Live speaks the configured greeting on its first turn. The greeting text is baked into the model's system instructions as an opening-line block, so the first response is unambiguous instead of "I'm sorry, I didn't get that."
  • Pure S2S, Gemini 3.1 — Gemini 3.1 doesn't support agent-initiated turns. The user must speak first. Switch to Gemini 2.5 or enable half-cascade if you need an agent-initiated greeting.
  • S2S half-cascade — the greeting plays through your cascaded TTS plugin, same as the classic flow.
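The three cases above amount to a small lookup. Here is a hedged sketch; the function name, the model-family strings, and the return labels are invented for illustration.

```python
def greeting_source(model_family: str, half_cascade: bool) -> str:
    """Say who delivers the opening line for a given S2S configuration.

    model_family is "2.5" or "3.1" (illustrative labels).
    """
    if model_family not in ("2.5", "3.1"):
        raise ValueError(f"unknown model family: {model_family!r}")
    if half_cascade:
        # The greeting plays through the cascaded TTS plugin.
        return "cascaded_tts"
    if model_family == "2.5":
        # The greeting is part of the system instructions and spoken by Gemini Live.
        return "gemini_live"
    # Pure S2S on Gemini 3.1: no agent-initiated turn, so the caller starts.
    return "user_speaks_first"
```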

When Gemini Live is the right call

  • You want the lowest turn-around latency available right now.
  • The call is a focused conversation — booking, FAQ, lead capture, simple support — rather than a heavy mid-call prompt-mutation workflow.
  • You're happy with Gemini's built-in voices, or you want to half-cascade onto your own TTS.
  • You have a Google Cloud relationship and want to consolidate spend on Vertex AI.

If you need agent handoffs, parallel tool calls, or mid-session prompt updates, stay on Gemini 2.5 — Gemini 3.1's experimental release disables those.
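That rule can be checked up front when you pick a model. The sketch below uses hypothetical feature labels and function names, and it covers only the three capabilities named above.

```python
from typing import Iterable, Set

# Features the post says are disabled on the experimental Gemini 3.1 release.
DISABLED_ON_3_1 = frozenset(
    {"agent_handoffs", "parallel_tool_calls", "mid_session_prompt_updates"}
)


def blocked_features(model_family: str, required: Iterable[str]) -> Set[str]:
    """Return the required features that the chosen model family cannot provide."""
    if model_family == "3.1":
        return set(required) & DISABLED_ON_3_1
    return set()
```

An empty result means the chosen model covers everything the agent needs. Anything else is a reason to stay on Gemini 2.5.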

Try it in 90 seconds

  1. Open any voice agent in Agent Studio → Voice Configuration.
  2. Click Speech-to-Speech in the top-right of the header.
  3. The S2S tab opens with the model sidebar on the left. Leave Gemini 2.5 Flash selected.
  4. (Optional) Paste your AI Studio key into the Hosting card so usage bills to your Google account.
  5. Save and click Try Voice Call.

You should hear the difference on the first turn.


Free trial includes 5 voice minutes · No credit card required · BYOK supported on every plan.
