
GPT Realtime in Your Voice Agents: Semantic VAD, BYOK, and a Cheaper Mini

Thinnest AI Team
May 11, 2026 · 5 min read
GPT Realtime, end-to-end

The realtime model field, in 2026

It's worth admitting the obvious upfront: in 2026, there's more than one excellent realtime voice model. Gemini Live ships fast turn-taking and 30 native voices at a friendly price. OpenAI's GPT Realtime ships best-in-class instruction following, semantic-VAD turn detection, and the new marin / cedar voices on a native-audio pipeline. Builders shouldn't have to pick a platform that locks them into one camp.

Yesterday thinnestAI shipped Speech-to-Speech with Gemini Live. Today we're adding OpenAI Realtime as a first-class option in the same S2S tab — same one-toggle workflow, same BYOK plumbing, same half-cascade escape hatch, and one consolidated cost-breakdown row that reshapes itself based on the model you picked.

Two models, one sidebar

The S2S tab has a left-column model list that mirrors the STT and TTS tabs. Two OpenAI entries sit alongside the two Gemini ones:

  • GPT Realtime (GA) — OpenAI's production realtime model on the new native-audio pipeline. Semantic-VAD turn detection by default, full tool calling, the marin and cedar GA voices tuned for this pipeline.
  • GPT Realtime Mini — roughly 60% cheaper for high-volume or cost-sensitive voice flows. Same shape as the full model.

Click a model and the right-hand settings panel reshapes itself: Gemini-only cards (Vertex AI hosting, Affective Dialog, Proactivity) collapse, the OpenAI Turn Detection card appears, and the voice catalog swaps from 30 Gemini voices to OpenAI's 10. The temperature slider re-clamps to OpenAI's accepted 0.6–1.2 band so you can't accidentally save an out-of-range value.
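That re-clamp is a one-liner. A minimal sketch, where the helper name is ours and the band comes from the text above:

```python
# Clamp a temperature setting into OpenAI Realtime's accepted band (0.6-1.2),
# so switching models can't leave an out-of-range value in the saved config.
OPENAI_TEMP_MIN, OPENAI_TEMP_MAX = 0.6, 1.2

def clamp_temperature(value: float) -> float:
    """Return `value` pulled into [OPENAI_TEMP_MIN, OPENAI_TEMP_MAX]."""
    return max(OPENAI_TEMP_MIN, min(OPENAI_TEMP_MAX, value))
```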

Voices — marin, cedar, and the legacy eight

OpenAI ships two GA voices alongside gpt-realtime's native-audio launch:

  • marin — natural, conversational. Default.
  • cedar — warm, grounded. Good for support / empathy use cases.

Plus the eight legacy voices inherited from the earlier Realtime previews: alloy, ash, ballad, coral, echo, sage, shimmer, verse. The picker is a single searchable dropdown grouped into "GA voices" and "Legacy voices" so the new defaults are easy to find.
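The grouping behind that picker is simple to sketch. The helper and its substring filtering are ours; the voice lists are the ones above:

```python
# Sketch: group OpenAI Realtime voices into the picker's two sections,
# optionally filtered by a search query (case-insensitive substring match).
GA_VOICES = ["marin", "cedar"]
LEGACY_VOICES = ["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse"]

def voice_picker_groups(query: str = "") -> dict[str, list[str]]:
    """Return the dropdown's two groups, filtered by `query`."""
    q = query.lower()
    return {
        "GA voices": [v for v in GA_VOICES if q in v],
        "Legacy voices": [v for v in LEGACY_VOICES if q in v],
    }
```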

Turn Detection — semantic, server, or plugin default

OpenAI Realtime ships two turn-detection modes. Our new Turn Detection card surfaces both, plus a "plugin default" pill for when you want to defer to the LiveKit + OpenAI tuned default:

  • Semantic VAD (default) — uses a classifier over the caller's words to decide when they're done speaking. Far less likely to chunk mid-sentence than silence-based VAD. An eagerness pill controls how aggressively the model commits to end-of-turn: auto (the tuned default), low (lets callers take their time), medium, or high (chunks audio as soon as possible). Pick low for elderly or hesitant callers; high for high-throughput dialer-style flows where you need snappy turns.
  • Server VAD — the classic silence-based mode. Three sliders: threshold (energy bar for "this is speech"), prefix padding (ms of audio kept before detected speech, so plosives and soft starts aren't clipped), and silence duration (ms of silence required to mark end-of-turn).
  • Plugin default — we don't send a turn-detection config at all. The LiveKit plugin uses its built-in default. Useful as a baseline when you're A/B testing.

Two more switches apply to both modes: auto-generate reply (the model speaks as soon as turn detection fires) and allow interruption (caller speech during the agent's reply cuts the response). Both default on. A single reset button on the card restores LiveKit + OpenAI's recommended defaults.
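In config terms, the card maps onto a single turn-detection payload. A sketch, assuming OpenAI Realtime's session field names (`semantic_vad`, `server_vad`, `eagerness`, `prefix_padding_ms`, `silence_duration_ms`); the helper and its defaults are ours:

```python
# Sketch: build the turn-detection payload for each mode on the card.
from typing import Optional

def turn_detection_config(
    mode: str,                      # "semantic", "server", or "plugin_default"
    eagerness: str = "auto",        # semantic VAD: auto | low | medium | high
    threshold: float = 0.5,         # server VAD: energy bar for "this is speech"
    prefix_padding_ms: int = 300,   # server VAD: audio kept before detected speech
    silence_duration_ms: int = 500, # server VAD: silence required to end the turn
    auto_generate_reply: bool = True,
    allow_interruption: bool = True,
) -> Optional[dict]:
    common = {
        "create_response": auto_generate_reply,   # speak as soon as detection fires
        "interrupt_response": allow_interruption, # caller speech cuts the reply
    }
    if mode == "semantic":
        return {"type": "semantic_vad", "eagerness": eagerness, **common}
    if mode == "server":
        return {
            "type": "server_vad",
            "threshold": threshold,
            "prefix_padding_ms": prefix_padding_ms,
            "silence_duration_ms": silence_duration_ms,
            **common,
        }
    return None  # plugin default: send no turn-detection config at all
```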

BYOK with your own OpenAI key

As with Gemini, OpenAI gets a Bring-Your-Own-Key path. Paste a key from platform.openai.com/api-keys into the Hosting card and Realtime API usage bills to your OpenAI account — we charge only the platform fee.

The BYO key field validates the sk-… prefix in-browser. If you paste an AI Studio Gemini key (AIza…) by mistake, a warning surfaces immediately. Keys are stored encrypted and never logged.
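The prefix check itself fits in a few lines. A sketch, with the function name and return labels ours:

```python
# Sketch: in-browser-style key validation. Accept OpenAI secret keys,
# warn immediately if a Gemini (AI Studio) key was pasted by mistake.
def classify_api_key(key: str) -> str:
    if key.startswith("sk-"):
        return "ok"
    if key.startswith("AIza"):
        return "warn_gemini_key"
    return "invalid"
```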

Vertex AI hosting is a Gemini-only option, so it's hidden when an OpenAI model is selected. Switch back to Gemini and it returns.

Half-cascade with GPT Realtime

The half-cascade pattern (realtime model listens + reasons, your TTS speaks) works the same on OpenAI as on Gemini. Toggle Use custom TTS in the S2S tab and GPT Realtime runs in ["text"] modality, with the reply played through whatever cascaded TTS plugin you have configured on the agent — Cartesia, ElevenLabs, Sarvam, Aero TTS.

Unlike Gemini's native-audio models, both GPT Realtime variants accept TEXT modality, so the half-cascade toggle stays available on every OpenAI model.
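In config terms, the toggle just flips the session's modality. A minimal sketch, with the helper name ours:

```python
# Sketch: which modality the GPT Realtime session runs in.
def session_modalities(use_custom_tts: bool) -> list[str]:
    # Half-cascade: the realtime model listens and reasons in text,
    # and the cascaded TTS plugin speaks the reply.
    return ["text"] if use_custom_tts else ["audio"]
```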

Cost breakdown — provider-aware row

In S2S mode the cost-breakdown popup collapses STT + TTS + LLM into a single row. The row label tracks the selected provider:

  • GPT Realtime at $32 / $64 per million audio in/out tokens — roughly ₹9.05 per minute on a typical call.
  • GPT Realtime Mini at $13 / $26 per million — roughly ₹3.65 per minute.
  • BYOK (with your own OpenAI key) zeroes that row on our side. You pay OpenAI directly; we only charge the platform fee.
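The per-minute math behind those figures is straightforward. In the sketch below, the audio-token throughput and the USD→INR rate you pass in are illustrative placeholders, not our billing assumptions:

```python
# Sketch: per-minute cost from per-million-token prices. The token throughput
# and USD->INR rate are caller-supplied assumptions, not platform figures.
def cost_per_minute_inr(price_in_usd_per_m: float, price_out_usd_per_m: float,
                        tokens_in_per_min: float, tokens_out_per_min: float,
                        usd_to_inr: float, byok: bool = False) -> float:
    if byok:
        # BYOK: usage bills to your OpenAI account; only the platform fee applies.
        return 0.0
    usd = (tokens_in_per_min * price_in_usd_per_m
           + tokens_out_per_min * price_out_usd_per_m) / 1_000_000
    return usd * usd_to_inr
```

With equal in/out throughput, Mini's $13 / $26 rates land at roughly 40% of the full model's cost, which is where the "roughly 60% cheaper" figure comes from.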

The half-cascade path restores the TTS row alongside the realtime row so the breakdown shows what each component contributes.

Greeting behaviour, by configuration

  • Pure S2S, GPT Realtime — the agent speaks the configured greeting on its first turn. The greeting text rides in on the initial generate_reply trigger (the LiveKit OpenAI plugin doesn't accept instructions at construction, so we inline the opening line in the trigger payload to keep behaviour consistent with Gemini).
  • S2S half-cascade — the greeting plays through your cascaded TTS plugin via session.say(), same as the classic flow.
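The two paths reduce to a small dispatch. A sketch — the payload shape below is illustrative, not the LiveKit plugin's actual API:

```python
# Sketch: how the configured greeting reaches the caller on the first turn.
def initial_turn(greeting: str, half_cascade: bool) -> dict:
    if half_cascade:
        # The cascaded TTS plugin speaks the opening line directly.
        return {"method": "session.say", "text": greeting}
    # Pure S2S: inline the opening line in the generate_reply trigger,
    # since the plugin doesn't accept instructions at construction time.
    return {"method": "generate_reply",
            "instructions": f"Open the call by saying: {greeting}"}
```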

When GPT Realtime is the right call

  • You want OpenAI-grade instruction following on every turn.
  • Your callers benefit from semantic VAD — conversation-heavy support, sales discovery, anything where chunking mid-sentence hurts.
  • You have an OpenAI Enterprise / Pro relationship and want consolidated billing.
  • You want the production native-audio pipeline (marin / cedar) rather than the older preview voices.

If cost-per-minute is the top constraint, look at GPT Realtime Mini (~₹3.65/min uncached) or Gemini Live (~₹6.80/min). If you need agent handoffs and parallel tool calls, every option on the S2S tab supports them (except experimental Gemini 3.1, which we flag explicitly).

Try it in 90 seconds

  1. Open any voice agent in Agent Studio → Voice Configuration.
  2. Click Speech-to-Speech in the top-right of the header.
  3. The S2S tab opens with the model sidebar on the left. Click GPT Realtime under "OpenAI Realtime".
  4. Leave Voice on marin (or pick cedar for a warmer feel).
  5. (Optional) Paste your OpenAI key into the Hosting card so usage bills to your OpenAI account.
  6. (Optional) Tune the Turn Detection card — Semantic VAD / eagerness=low for hesitant callers; Server VAD with a higher threshold for noisy environments.
  7. Save and click Try Voice Call.

Try GPT Realtime S2S Free →

Free trial includes 5 voice minutes · No credit card required · BYOK supported on every plan.

