Overview
OpenAI provides two STT service implementations:
- OpenAISTTService (HTTP) — VAD-segmented speech recognition using OpenAI’s transcription API, supporting GPT-4o transcription and Whisper models.
- OpenAIRealtimeSTTService (WebSocket) — Real-time streaming speech-to-text using OpenAI’s Realtime API transcription sessions, with support for local VAD and server-side VAD modes.
OpenAI STT API Reference
Pipecat’s API methods for OpenAI STT integration
Example Implementation
Complete example with OpenAI ecosystem integration
OpenAI Documentation
Official OpenAI transcription documentation and features
OpenAI Platform
Access API keys and transcription models
Installation
To use OpenAI services, install Pipecat with its OpenAI dependencies.

Prerequisites
OpenAI Account Setup
Before using OpenAI STT services, you need:

- OpenAI Account: Sign up at OpenAI Platform
- API Key: Generate an API key from your account dashboard
- Model Access: Ensure access to GPT-4o transcription and Whisper models
Required Environment Variables
OPENAI_API_KEY: Your OpenAI API key for authentication
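A typical environment setup might look like the following. The openai package extra is an assumption based on Pipecat’s usual install extras; check it against your installed version.

```shell
# Install Pipecat with OpenAI support (extra name assumed)
pip install "pipecat-ai[openai]"

# Make the API key available to the service
export OPENAI_API_KEY=your-api-key
```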
OpenAISTTService
Uses VAD-based audio segmentation with HTTP transcription requests. Records speech segments detected by local VAD and sends them to OpenAI’s transcription API.

Constructor parameters:

- Model: Transcription model to use. Options include "gpt-4o-transcribe", "gpt-4o-mini-transcribe", and "whisper-1". Deprecated in v0.0.105; use settings=OpenAISTTService.Settings(...) instead.
- API key: OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.
- Base URL: API base URL. Override for custom or proxied deployments.
- Language: Language of the audio input. Deprecated in v0.0.105; use settings=OpenAISTTService.Settings(...) instead.
- Prompt: Optional text to guide the model’s style or continue a previous segment. Deprecated in v0.0.105; use settings=OpenAISTTService.Settings(...) instead.
- Temperature: Sampling temperature between 0 and 1. Lower values produce more deterministic results. Deprecated in v0.0.105; use settings=OpenAISTTService.Settings(...) instead.
- Settings: Runtime-configurable settings for the STT service. See Settings below.
- P99 latency: P99 latency from speech end to final transcript, in seconds. Override for your deployment.
- Allow empty frames: If true, allow empty TranscriptionFrame frames to be pushed downstream instead of discarding them. This is intended for situations where VAD fires even though the user did not speak; in these cases it is useful to know that nothing was transcribed, so that the agent can resume speaking instead of waiting longer for a transcription.

Settings
Runtime-configurable settings are passed via the settings constructor argument using OpenAISTTService.Settings(...). They can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "gpt-4o-transcribe" | Transcription model to use. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN | Language of the audio input. (Inherited from base STT settings.) |
| prompt | str | None | Optional text to guide the model’s style or continue a previous segment. |
| temperature | float | None | Sampling temperature between 0 and 1. |
Usage
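A minimal configuration sketch. The import paths and pipeline wiring are assumptions based on Pipecat’s usual service layout, so check them against your installed version:

```python
import os

from pipecat.services.openai.stt import OpenAISTTService  # assumed import path
from pipecat.transcriptions.language import Language      # assumed import path

stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAISTTService.Settings(
        model="gpt-4o-transcribe",
        language=Language.EN,
    ),
)

# Place the service after a VAD-equipped transport input in the pipeline,
# e.g. Pipeline([transport.input(), stt, ...]).
```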
Notes
- Segmented transcription: Processes complete audio segments (after VAD detects silence) via HTTP. Only produces final transcriptions, not interim results.
- No connection events: Uses per-request HTTP calls, so it does not expose WebSocket connection events.
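The segmented flow can be sketched as follows. The helper names are stand-ins, not Pipecat APIs: a local VAD yields complete speech segments, and each segment becomes one transcription request that produces a single final transcript.

```python
import asyncio

async def transcribe_segment(audio: bytes) -> str:
    # Placeholder for the HTTP call to the transcription endpoint.
    return f"<transcript of {len(audio)} bytes>"

async def transcribe_all(segments: list[bytes]) -> list[str]:
    finals = []
    for segment in segments:  # one entry per VAD-detected utterance
        # Final results only; no interim transcriptions are produced.
        finals.append(await transcribe_segment(segment))
    return finals

results = asyncio.run(transcribe_all([b"\x00" * 320, b"\x00" * 640]))
print(results)
```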
OpenAIRealtimeSTTService
Real-time streaming speech-to-text using OpenAI’s Realtime API WebSocket transcription sessions. Audio is streamed continuously over a WebSocket connection for lower latency compared to HTTP-based transcription.

Constructor parameters:

- API key: OpenAI API key for authentication.
- Model: Transcription model. Supported values are "gpt-4o-transcribe" and "gpt-4o-mini-transcribe". Deprecated in v0.0.105; use settings=OpenAIRealtimeSTTService.Settings(...) instead.
- Base URL: WebSocket base URL for the Realtime API.
- Language: Language of the audio input. Deprecated in v0.0.105; use settings=OpenAIRealtimeSTTService.Settings(...) instead.
- Prompt: Optional prompt text to guide transcription style or provide keyword hints. Deprecated in v0.0.105; use settings=OpenAIRealtimeSTTService.Settings(...) instead.
- Settings: Runtime-configurable settings for the Realtime STT service. See Settings below.
- Turn detection: Server-side VAD configuration. Defaults to False (disabled), which relies on a local VAD processor in the pipeline. Pass None to use server defaults (server_vad), or a dict with custom settings (e.g. {"type": "server_vad", "threshold": 0.5}).
- Noise reduction: Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.
- Interrupt on speech: Whether to interrupt bot output when speech is detected by server-side VAD. Only applies when turn detection is enabled.
- P99 latency: P99 latency from speech end to final transcript, in seconds. Override for your deployment.
Settings
Runtime-configurable settings are passed via the settings constructor argument using OpenAIRealtimeSTTService.Settings(...). They can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "gpt-4o-transcribe" | Transcription model to use. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN | Language of the audio input. (Inherited from base STT settings.) |
| prompt | str | None | Optional prompt text to guide transcription style or keyword hints. |
Usage
With Local VAD
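A local-VAD configuration sketch. The import path is an assumption; turn_detection defaults to False, so no argument is needed for this mode:

```python
import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService  # assumed path

# Local VAD mode: a VAD processor elsewhere in the pipeline decides when
# audio is committed for transcription.
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIRealtimeSTTService.Settings(
        model="gpt-4o-transcribe",
    ),
)
```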
With Server-Side VAD
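A server-side VAD sketch, using the turn_detection values described above (import path assumed):

```python
import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService  # assumed path

# Server-side VAD: pass None for server defaults (server_vad), or a dict
# with custom settings. Do not add a separate VAD processor to the
# pipeline in this mode.
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    turn_detection={"type": "server_vad", "threshold": 0.5},
)
```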
Notes
- Local VAD vs Server-side VAD: Defaults to local VAD mode (turn_detection=False), where a local VAD processor in the pipeline controls when audio is committed for transcription. Set turn_detection=None for server-side VAD, but do not use a separate VAD processor in the pipeline in that mode.
- Automatic resampling: Automatically resamples audio to 24 kHz as required by the Realtime API, regardless of the pipeline’s sample rate.
- Interim transcriptions: Produces interim transcriptions via delta events for real-time feedback.
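Folding delta events into interim text can be sketched as follows. The event type names follow the Realtime API’s transcription events; the surrounding loop is illustrative, not Pipecat’s implementation:

```python
# Simulated events from a Realtime transcription session.
events = [
    {"type": "conversation.item.input_audio_transcription.delta", "delta": "Hello "},
    {"type": "conversation.item.input_audio_transcription.delta", "delta": "world"},
    {"type": "conversation.item.input_audio_transcription.completed", "transcript": "Hello world"},
]

interim = ""
final = None
for event in events:
    if event["type"].endswith(".delta"):
        interim += event["delta"]    # interim transcription for live feedback
    elif event["type"].endswith(".completed"):
        final = event["transcript"]  # final transcript for the segment

print(interim)
print(final)
```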
Event Handlers
Supports the standard service connection events:

| Event | Description |
|---|---|
| on_connected | Connected to the OpenAI Realtime WebSocket |
| on_disconnected | Disconnected from the OpenAI Realtime WebSocket |
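Handlers can be attached with Pipecat’s event-handler decorator. The callback signature shown here is an assumption; check it against your installed version:

```python
# Assumes `stt` is an already-constructed OpenAIRealtimeSTTService.
@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to OpenAI Realtime WebSocket")

@stt.event_handler("on_disconnected")
async def on_disconnected(service):
    print("Disconnected from OpenAI Realtime WebSocket")
```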