Overview
WhisperSTTService provides offline speech recognition using OpenAI's Whisper models running locally. It supports multiple model sizes and hardware acceleration options (CPU, CUDA, and Apple Silicon via MLX), enabling privacy-focused transcription without external API calls.
- Whisper STT API Reference: Pipecat's API methods for Whisper STT integration
- Standard Whisper Example: complete example with standard Whisper
- Whisper Documentation: OpenAI's Whisper research paper and model details
- MLX Whisper Example: Apple Silicon optimized example
Installation
Choose your installation based on your hardware:

- Standard Whisper (CPU/CUDA)
- MLX Whisper (Apple Silicon)
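The installs can be sketched as follows; the extras names (`whisper`, `mlx-whisper`) follow Pipecat's optional-dependency convention and may differ in your release, so check the package metadata:

```shell
# Standard Whisper (CPU/CUDA), built on Faster Whisper
pip install "pipecat-ai[whisper]"

# MLX Whisper (Apple Silicon, macOS only)
pip install "pipecat-ai[mlx-whisper]"
```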
Prerequisites
Local Model Setup
Before using Whisper STT services, you need:

- Model Selection: Choose an appropriate Whisper model size (tiny, base, small, medium, large)
- Hardware Configuration: Set up CPU, CUDA, or Apple Silicon acceleration
- Storage Space: Ensure sufficient disk space for model downloads
Configuration Options
- Model Size: Balance between accuracy and performance based on your hardware
- Hardware Acceleration: Configure CUDA for NVIDIA GPUs or MLX for Apple Silicon
- Language Support: Whisper supports 99+ languages out of the box
Configuration
WhisperSTTService
Uses Faster Whisper for efficient local transcription on CPU or CUDA devices. Constructor parameters:

- `model`: Whisper model to use. Can be a `Model` enum value or a string. Available models: TINY, BASE, SMALL, MEDIUM, LARGE (large-v3), LARGE_V3_TURBO, DISTIL_LARGE_V2, DISTIL_MEDIUM_EN (English-only). Deprecated in v0.0.105; use `settings=WhisperSTTService.Settings(...)` instead.
- `device`: Device for inference. Options: `"cpu"`, `"cuda"`, or `"auto"` (auto-detect).
- `compute_type`: Compute type for inference. Options include `"default"`, `"int8"`, `"int8_float16"`, `"float16"`, etc.
- `no_speech_prob`: Probability threshold for filtering out non-speech segments. Segments with a no-speech probability above this value are excluded. Deprecated in v0.0.105; use `settings=WhisperSTTService.Settings(...)` instead.
- `language`: Default language for transcription. Deprecated in v0.0.105; use `settings=WhisperSTTService.Settings(...)` instead.
- `settings`: Runtime-configurable settings for the STT service. See WhisperSTTService Settings below.
WhisperSTTServiceMLX
Optimized for Apple Silicon using MLX Whisper. Models are loaded on demand. Constructor parameters:

- `model`: MLX Whisper model to use. Can be an `MLXModel` enum value or a string. Available models: TINY, MEDIUM, LARGE_V3, LARGE_V3_TURBO, DISTIL_LARGE_V3, LARGE_V3_TURBO_Q4 (quantized). Deprecated in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead.
- `no_speech_prob`: Probability threshold for filtering out non-speech segments. Deprecated in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead.
- `language`: Default language for transcription. Deprecated in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead.
- `temperature`: Sampling temperature. Lower values produce more deterministic results. Deprecated in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead.
- `settings`: Runtime-configurable settings for the MLX STT service. See WhisperSTTServiceMLX Settings below.
WhisperSTTService Settings
Runtime-configurable settings passed via the `settings` constructor argument using `WhisperSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | Model.DISTIL_MEDIUM_EN | Whisper model to use. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN | Default language for transcription. (Inherited from base STT settings.) |
| no_speech_prob | float | 0.4 | Probability threshold for filtering out non-speech segments. |
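Updating a setting mid-conversation can be sketched as follows. This is an illustrative fragment, not a full pipeline: `task` stands in for an already-running `PipelineTask`, and the exact `STTUpdateSettingsFrame` payload shape should be checked against your Pipecat version:

```python
from pipecat.frames.frames import STTUpdateSettingsFrame

# Tighten no-speech filtering while the pipeline is running by
# pushing an update frame into the (hypothetical) pipeline task.
await task.queue_frame(
    STTUpdateSettingsFrame(settings={"no_speech_prob": 0.3})
)
```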
WhisperSTTServiceMLX Settings
Runtime-configurable settings passed via the `settings` constructor argument using `WhisperSTTServiceMLX.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | MLXModel.TINY | MLX Whisper model to use. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN | Default language for transcription. (Inherited from base STT settings.) |
| no_speech_prob | float | 0.6 | Probability threshold for filtering out non-speech segments. |
| temperature | float | 0.0 | Sampling temperature. Lower values are more deterministic. |
| engine | str | "mlx" | Whisper engine identifier. |
Usage
Basic Faster Whisper Setup
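A minimal sketch of constructing the service with the `Settings` style described above. The import paths (`pipecat.services.whisper.stt`, `pipecat.transcriptions.language`) follow recent Pipecat releases and may differ in your version:

```python
from pipecat.services.whisper.stt import Model, WhisperSTTService
from pipecat.transcriptions.language import Language

# Runs fully offline; the model is downloaded from the
# Hugging Face hub on first use.
stt = WhisperSTTService(
    device="auto",  # auto-detect CPU vs. CUDA
    settings=WhisperSTTService.Settings(
        model=Model.DISTIL_MEDIUM_EN,  # English-only distilled model (default)
        language=Language.EN,
        no_speech_prob=0.4,
    ),
)
```

The service then slots into a pipeline like any other STT processor, e.g. `Pipeline([transport.input(), stt, ...])`.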
With CUDA Acceleration
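A sketch of the same service targeting an NVIDIA GPU, assuming the import path above and a CUDA-enabled install of Faster Whisper / CTranslate2:

```python
from pipecat.services.whisper.stt import Model, WhisperSTTService

# float16 compute on the GPU trades a little precision for
# substantially faster inference on larger models.
stt = WhisperSTTService(
    device="cuda",
    compute_type="float16",
    settings=WhisperSTTService.Settings(model=Model.LARGE),  # large-v3
)
```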
With Custom Language
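A sketch of selecting a non-English default language; use a multilingual model (not the English-only DISTIL_MEDIUM_EN) when doing so. Import paths as above, subject to your Pipecat version:

```python
from pipecat.services.whisper.stt import Model, WhisperSTTService
from pipecat.transcriptions.language import Language

# Multilingual model with French as the default transcription language.
stt = WhisperSTTService(
    device="auto",
    settings=WhisperSTTService.Settings(
        model=Model.LARGE_V3_TURBO,
        language=Language.FR,
    ),
)
```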
MLX Whisper on Apple Silicon
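A sketch of the MLX variant using the quantized model mentioned in the Notes below; the import path is an assumption that should be checked against your Pipecat version:

```python
from pipecat.services.whisper.stt import MLXModel, WhisperSTTServiceMLX

# 4-bit quantized large-v3-turbo: lower memory use on Apple
# Silicon with minimal quality loss. The model is loaded on demand.
stt = WhisperSTTServiceMLX(
    settings=WhisperSTTServiceMLX.Settings(
        model=MLXModel.LARGE_V3_TURBO_Q4,
    ),
)
```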
Notes
- First run downloads: If the selected model hasn’t been downloaded previously, the first run will download it from the Hugging Face model hub. This may take significant time depending on model size.
- Segmented transcription: Both `WhisperSTTService` and `WhisperSTTServiceMLX` extend `SegmentedSTTService`, meaning they process complete audio segments after VAD detects that the user has stopped speaking.
- No-speech filtering: The `no_speech_prob` threshold helps filter out hallucinations. Increase it to be more permissive; decrease it to filter more aggressively.
- MLX quantization: The `LARGE_V3_TURBO_Q4` model provides reduced memory usage with minimal quality loss on Apple Silicon.
- Language support: Whisper supports 99+ languages. Use the `Language` enum for type-safe language selection.