Overview

GeminiLiveLLMService enables natural, real-time conversations with Google’s Gemini model. It provides built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences with multimodal capabilities including audio, video, and text processing.
Want to start building? Check out our Gemini Live Guide.

Installation

To use Gemini Live services, install the required dependencies:
pip install "pipecat-ai[google]"

Prerequisites

Google AI Setup

Before using Gemini Live services, you need:
  1. Google Account: Set up at Google AI Studio
  2. API Key: Generate a Gemini API key from AI Studio
  3. Model Access: Ensure access to Gemini Live models
  4. Multimodal Configuration: Set up audio, video, and text modalities

Required Environment Variables

  • GOOGLE_API_KEY: Your Google Gemini API key for authentication

Key Features

  • Multimodal Processing: Handle audio, video, and text inputs simultaneously
  • Real-time Streaming: Low-latency audio and video processing
  • Voice Activity Detection: Automatic speech detection and turn management
  • Function Calling: Advanced tool integration and API calling capabilities
  • Context Management: Intelligent conversation history and system instruction handling

Configuration

GeminiLiveLLMService

api_key
str
required
Google AI API key for authentication.
model
str
deprecated
Gemini model identifier to use. Deprecated in v0.0.105. Use settings=GeminiLiveLLMService.Settings(model=...) instead.
voice_id
str
default:"Charon"
deprecated
TTS voice identifier for audio responses. Deprecated in v0.0.105. Use settings=GeminiLiveLLMService.Settings(voice=...) instead.
system_instruction
str
default:"None"
System prompt for the model. Can also be provided via the LLM context.
tools
List[dict] | ToolsSchema
default:"None"
Tools/functions available to the model. Can also be provided via the LLM context.
params
InputParams
default:"InputParams()"
deprecated
Runtime-configurable generation and session settings. See InputParams below. Deprecated in v0.0.105. Use settings=GeminiLiveLLMService.Settings(...) instead.
settings
GeminiLiveLLMService.Settings
default:"None"
Runtime-configurable settings. See Settings below.
start_audio_paused
bool
default:"False"
Whether to start with audio input paused.
start_video_paused
bool
default:"False"
Whether to start with video input paused.
inference_on_context_initialization
bool
default:"True"
Whether to generate a response when context is first set. Set to False to wait for user input before the model responds.
http_options
HttpOptions
default:"None"
HTTP options for the Google API client. Use this to set API version (e.g. HttpOptions(api_version="v1alpha")) or other request options.
file_api_base_url
str
Base URL for the Gemini File API.

Settings

Runtime-configurable settings passed via the settings constructor argument using GeminiLiveLLMService.Settings(...). These can be updated mid-conversation with LLMUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `NOT_GIVEN` | Model identifier. (Inherited from base settings.) |
| `system_instruction` | `str` | `NOT_GIVEN` | System instruction/prompt. (Inherited from base settings.) |
| `temperature` | `float` | `NOT_GIVEN` | Sampling temperature (0.0-2.0). (Inherited from base settings.) |
| `max_tokens` | `int` | `NOT_GIVEN` | Maximum tokens to generate. (Inherited from base settings.) |
| `top_k` | `int` | `NOT_GIVEN` | Top-k sampling parameter. (Inherited from base settings.) |
| `top_p` | `float` | `NOT_GIVEN` | Top-p (nucleus) sampling parameter (0.0-1.0). (Inherited from base settings.) |
| `frequency_penalty` | `float` | `NOT_GIVEN` | Frequency penalty for generation (0.0-2.0). (Inherited from base settings.) |
| `presence_penalty` | `float` | `NOT_GIVEN` | Presence penalty for generation (0.0-2.0). (Inherited from base settings.) |
| `voice` | `str` | `NOT_GIVEN` | TTS voice identifier (e.g. `"Charon"`, `"Puck"`). |
| `modalities` | `GeminiModalities` | `NOT_GIVEN` | Response modality: `GeminiModalities.AUDIO` or `GeminiModalities.TEXT`. |
| `language` | `Language \| str` | `NOT_GIVEN` | Language for generation and transcription. |
| `media_resolution` | `GeminiMediaResolution` | `NOT_GIVEN` | Media resolution for video input: `UNSPECIFIED`, `LOW`, `MEDIUM`, or `HIGH`. |
| `vad` | `GeminiVADParams` | `NOT_GIVEN` | Voice activity detection parameters. See GeminiVADParams below. |
| `context_window_compression` | `ContextWindowCompressionParams \| dict` | `NOT_GIVEN` | Context window compression settings. |
| `thinking` | `ThinkingConfig \| dict` | `NOT_GIVEN` | Thinking/reasoning configuration. Requires a model that supports it. |
| `enable_affective_dialog` | `bool` | `NOT_GIVEN` | Enable affective dialog for expression and tone adaptation. |
| `proactivity` | `ProactivityConfig \| dict` | `NOT_GIVEN` | Proactivity settings for model behavior. |

Parameters left as NOT_GIVEN are omitted, letting the service use its own defaults (e.g. "models/gemini-2.5-flash-native-audio-preview-12-2025" for model, "Charon" for voice, 4096 for max_tokens). Only parameters that are explicitly set are included.
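As a sketch of mid-conversation updates, the snippet below queues an LLMUpdateSettingsFrame into a running pipeline task. The `settings` mapping uses the field names from the table above; the surrounding `task` object and the exact point where you queue the frame depend on your pipeline setup.

```python
from pipecat.frames.frames import LLMUpdateSettingsFrame


async def switch_voice_and_cool_down(task):
    # `task` is assumed to be a running PipelineTask. Queueing the frame
    # applies the new values to the live session without reconnecting.
    await task.queue_frames(
        [LLMUpdateSettingsFrame(settings={"temperature": 0.3, "voice": "Puck"})]
    )
```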

GeminiVADParams

Voice activity detection configuration passed via the vad Settings field:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `disabled` | `bool` | `None` | Whether to disable server-side VAD entirely. |
| `start_sensitivity` | `StartSensitivity` | `None` | Sensitivity for speech start detection. |
| `end_sensitivity` | `EndSensitivity` | `None` | Sensitivity for speech end detection. |
| `prefix_padding_ms` | `int` | `None` | Padding before speech starts, in milliseconds. |
| `silence_duration_ms` | `int` | `None` | Silence duration threshold in milliseconds to detect speech end. |
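A minimal sketch of tuning VAD for a patient turn-taking style, assuming the `StartSensitivity` and `EndSensitivity` enums come from `google.genai.types` (they are part of the Gemini Live API types):

```python
from google.genai.types import EndSensitivity, StartSensitivity

from pipecat.services.google.gemini_live import GeminiVADParams

# React quickly when the user starts speaking, but wait longer
# before deciding that they have finished their turn.
vad = GeminiVADParams(
    start_sensitivity=StartSensitivity.START_SENSITIVITY_HIGH,
    end_sensitivity=EndSensitivity.END_SENSITIVITY_LOW,
    silence_duration_ms=800,
)
```

Pass the resulting object as the `vad` field of `GeminiLiveLLMService.Settings(...)`.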

ContextWindowCompressionParams

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `bool` | `False` | Whether context window compression is enabled. |
| `trigger_tokens` | `int` | `None` | Token count that triggers compression. `None` uses the default (80% of the context window). |
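For example, to enable compression with an explicit trigger point (the 16,000-token threshold here is an arbitrary illustration, not a recommended value):

```python
from pipecat.services.google.gemini_live import (
    ContextWindowCompressionParams,
    GeminiLiveLLMService,
)

settings = GeminiLiveLLMService.Settings(
    context_window_compression=ContextWindowCompressionParams(
        enabled=True,
        trigger_tokens=16000,  # compress once roughly 16k tokens accumulate
    ),
)
```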

Usage

Basic Setup

import os
from pipecat.services.google.gemini_live import GeminiLiveLLMService

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    settings=GeminiLiveLLMService.Settings(
        voice="Charon",
        system_instruction="You are a helpful assistant.",
    ),
)

With Settings

import os

from pipecat.services.google.gemini_live import (
    GeminiLiveLLMService,
    GeminiVADParams,
    ContextWindowCompressionParams,
)

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    settings=GeminiLiveLLMService.Settings(
        model="models/gemini-2.5-flash-native-audio-preview-12-2025",
        system_instruction="You are a helpful assistant.",
        voice="Puck",
        temperature=0.7,
        max_tokens=2048,
        language="en-US",
        vad=GeminiVADParams(
            silence_duration_ms=500,
        ),
        context_window_compression=ContextWindowCompressionParams(enabled=True),
    ),
)

Text-Only Mode

import os

from pipecat.services.google.gemini_live import (
    GeminiLiveLLMService,
    GeminiModalities,
)

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    settings=GeminiLiveLLMService.Settings(
        system_instruction="You are a helpful assistant.",
        modalities=GeminiModalities.TEXT,
    ),
)

With Thinking Enabled

import os

from google.genai.types import ThinkingConfig

from pipecat.services.google.gemini_live import GeminiLiveLLMService

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    settings=GeminiLiveLLMService.Settings(
        model="models/gemini-2.5-flash-native-audio-preview-12-2025",
        system_instruction="You are a helpful assistant.",
        thinking=ThinkingConfig(include_thoughts=True),
    ),
)
The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.
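Function Calling

Function calling is listed among the key features above; the sketch below wires up a tool using Pipecat's generic function-calling interface (FunctionSchema, ToolsSchema, and llm.register_function). The weather tool itself is hypothetical, and the exact handler signature may vary across Pipecat versions.

```python
import os

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.services.google.gemini_live import GeminiLiveLLMService
from pipecat.services.llm_service import FunctionCallParams

# Hypothetical tool definition; replace the body with a real API call.
weather_function = FunctionSchema(
    name="get_weather",
    description="Get the current weather for a location.",
    properties={"location": {"type": "string"}},
    required=["location"],
)


async def fetch_weather(params: FunctionCallParams):
    location = params.arguments["location"]
    # Return the tool result to the model via the provided callback.
    await params.result_callback({"location": location, "conditions": "sunny"})


llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    tools=ToolsSchema(standard_tools=[weather_function]),
)
llm.register_function("get_weather", fetch_weather)
```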

Notes

  • System instruction precedence: If a system instruction is provided both at init time and in the LLM context, the context-provided value takes precedence.
  • Tools precedence: Similarly, tools provided in the context override tools provided at init time.
  • Transcription aggregation: Gemini Live sends user transcriptions in small chunks. The service aggregates them into complete sentences using end-of-sentence detection with a 0.5-second timeout fallback.
  • Session resumption: The service automatically handles session resumption on reconnection using session resumption handles.
  • Connection resilience: The service will attempt up to 3 consecutive reconnections before treating a connection failure as fatal.
  • Video frame rate: Video frames are throttled to a maximum of one per second.
  • Affective dialog and proactivity: These features require both a supporting model and API version (v1alpha).