Overview
ElevenLabs provides two STT service implementations:

- `ElevenLabsSTTService` (HTTP) — File-based transcription using ElevenLabs’ Speech-to-Text API with segmented audio processing. Uploads audio files and receives transcription results directly.
- `ElevenLabsRealtimeSTTService` (WebSocket) — Real-time streaming transcription with ultra-low latency, supporting both partial (interim) and committed (final) transcripts with manual or VAD-based commit strategies.
ElevenLabs STT API Reference
Pipecat’s API methods for ElevenLabs STT integration
Example Implementation
Complete example with ElevenLabs STT and TTS
ElevenLabs Documentation
Official ElevenLabs STT API documentation
ElevenLabs Platform
Access API keys and speech-to-text models
Installation
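The install command is not shown on this page; a minimal sketch, assuming Pipecat’s standard extras naming (`elevenlabs`):

```shell
# The "elevenlabs" extra name follows Pipecat's per-vendor convention.
pip install "pipecat-ai[elevenlabs]"
```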
To use ElevenLabs STT services, install the required dependencies.

Prerequisites
ElevenLabs Account Setup
Before using ElevenLabs STT services, you need:
- ElevenLabs Account: Sign up at ElevenLabs Platform
- API Key: Generate an API key from your account dashboard
- Model Access: Ensure access to the Scribe v2 transcription model (default: `scribe_v2`)
Required Environment Variables
ELEVENLABS_API_KEY: Your ElevenLabs API key for authentication
ElevenLabsSTTService
Constructor parameters:

- `api_key`: ElevenLabs API key for authentication.
- `aiohttp_session`: An aiohttp session for HTTP requests. You must create and manage this yourself.
- `base_url`: Base URL for the ElevenLabs API.
- `model`: Model ID for transcription. Deprecated in v0.0.105; use `settings=ElevenLabsSTTService.Settings(...)` instead.
- `sample_rate`: Audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
- `settings`: Runtime-configurable settings for the STT service. See Settings below.
- `params`: Configuration parameters for the STT service. Deprecated in v0.0.105; use `settings=ElevenLabsSTTService.Settings(...)` instead.
- Latency override: P99 latency from speech end to final transcript in seconds. Override for your deployment.
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `ElevenLabsSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `None` | Model ID for transcription. (Inherited from base STT settings.) |
| `language` | `Language \| str` | `None` | Target language for transcription. (Inherited from base STT settings.) |
| `tag_audio_events` | `bool` | `True` | Include audio events like `(laughter)`, `(coughing)` in transcription. |
Usage
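A minimal construction sketch; the import path `pipecat.services.elevenlabs.stt` is assumed from Pipecat’s per-vendor service layout:

```python
import asyncio
import os

import aiohttp

# Import path assumed from Pipecat's per-vendor service layout.
from pipecat.services.elevenlabs.stt import ElevenLabsSTTService


async def main():
    # You create and manage the aiohttp session yourself.
    async with aiohttp.ClientSession() as session:
        stt = ElevenLabsSTTService(
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            aiohttp_session=session,
        )
        # Place `stt` between your transport input and context
        # aggregator in a Pipeline(...).


if __name__ == "__main__":
    asyncio.run(main())
```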
With Language and Audio Events
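A sketch of the settings-based configuration. The `Language` import path is taken from Pipecat’s `pipecat.transcriptions.language` module; French is an arbitrary example:

```python
import os

import aiohttp

from pipecat.services.elevenlabs.stt import ElevenLabsSTTService
from pipecat.transcriptions.language import Language


async def build_stt(session: aiohttp.ClientSession) -> ElevenLabsSTTService:
    return ElevenLabsSTTService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        aiohttp_session=session,
        settings=ElevenLabsSTTService.Settings(
            language=Language.FR,    # transcribe French
            tag_audio_events=False,  # drop (laughter), (coughing) tags
        ),
    )
```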
Notes
- The HTTP service uploads complete audio segments and is best for VAD-segmented transcription.
- Does not have connection events since it uses per-request HTTP calls.
ElevenLabsRealtimeSTTService
Constructor parameters:

- `api_key`: ElevenLabs API key for authentication.
- `base_url`: Base URL for the ElevenLabs WebSocket API.
- `model`: Model ID for real-time transcription. Deprecated in v0.0.105; use `settings=ElevenLabsRealtimeSTTService.Settings(...)` instead.
- `sample_rate`: Audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
- `settings`: Runtime-configurable settings for the Realtime STT service. See Settings below.
- `commit_strategy`: How to segment speech. `CommitStrategy.MANUAL` uses Pipecat’s VAD to control when transcript segments are committed; `CommitStrategy.VAD` uses ElevenLabs’ built-in VAD for segment boundaries.
- Word timestamps: Whether to include word-level timestamps in transcripts.
- Logging: Whether to enable logging on ElevenLabs’ side.
- Language detection: Whether to include language detection in transcripts.
- `params`: Configuration parameters for the STT service. Deprecated in v0.0.105; use `settings=ElevenLabsRealtimeSTTService.Settings(...)` instead.
- Latency override: P99 latency from speech end to final transcript in seconds. Override for your deployment.
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `ElevenLabsRealtimeSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `None` | Model ID for transcription. (Inherited from base STT settings.) |
| `language` | `Language \| str` | `None` | Language for speech recognition. (Inherited from base STT settings.) |
| `vad_silence_threshold_secs` | `float` | `None` | Seconds of silence before VAD commits (0.3–3.0). Only used with VAD commit strategy. |
| `vad_threshold` | `float` | `None` | VAD sensitivity (0.1–0.9, lower is more sensitive). Only used with VAD commit strategy. |
| `min_speech_duration_ms` | `int` | `None` | Minimum speech duration for VAD (50–2000 ms). Only used with VAD commit strategy. |
| `min_silence_duration_ms` | `int` | `None` | Minimum silence duration for VAD (50–2000 ms). Only used with VAD commit strategy. |
Usage
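A minimal construction sketch; the import path is assumed from Pipecat’s per-vendor service layout:

```python
import os

# Import path assumed from Pipecat's per-vendor service layout.
from pipecat.services.elevenlabs.stt import ElevenLabsRealtimeSTTService

stt = ElevenLabsRealtimeSTTService(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)
# Defaults to the manual commit strategy: Pipecat's VAD decides
# when transcript segments are committed.
```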
With Timestamps and Custom Commit Strategy
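A sketch combining the VAD commit strategy with ElevenLabs-side VAD tuning. The `commit_strategy` name comes from the notes below, but the `CommitStrategy` import path and the `timestamps` flag name are assumptions, not confirmed API:

```python
import os

# CommitStrategy import path is an assumption; check your Pipecat version.
from pipecat.services.elevenlabs.stt import (
    CommitStrategy,
    ElevenLabsRealtimeSTTService,
)

stt = ElevenLabsRealtimeSTTService(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    commit_strategy=CommitStrategy.VAD,  # let ElevenLabs segment speech
    timestamps=True,  # hypothetical flag name for word-level timestamps
    settings=ElevenLabsRealtimeSTTService.Settings(
        vad_silence_threshold_secs=0.5,  # commit after 0.5 s of silence
        vad_threshold=0.4,               # lower = more sensitive VAD
    ),
)
```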
Notes
- Commit strategies: Defaults to the manual commit strategy, where Pipecat’s VAD controls when transcription segments are committed. Set `commit_strategy=CommitStrategy.VAD` to let ElevenLabs handle segment boundaries. When using the `MANUAL` commit strategy, transcription frames are marked as finalized (`TranscriptionFrame.finalized=True`).
- Keepalive: Sends silent audio chunks as keepalive to prevent idle disconnections (keepalive interval: 5s, timeout: 10s).
- Auto-reconnect: Automatically reconnects if the WebSocket connection is closed when new audio arrives.
Event Handlers
Supports the standard service connection events:

| Event | Description |
|---|---|
| `on_connected` | Connected to ElevenLabs Realtime STT WebSocket |
| `on_disconnected` | Disconnected from ElevenLabs Realtime STT WebSocket |
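These events can be observed with Pipecat’s standard `event_handler` decorator; the service import path is assumed as above:

```python
import os

from pipecat.services.elevenlabs.stt import ElevenLabsRealtimeSTTService

stt = ElevenLabsRealtimeSTTService(api_key=os.getenv("ELEVENLABS_API_KEY"))


@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to ElevenLabs Realtime STT")


@stt.event_handler("on_disconnected")
async def on_disconnected(service):
    # The service auto-reconnects when new audio arrives.
    print("Disconnected from ElevenLabs Realtime STT")
```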