Overview

WhisperSTTService provides offline speech recognition using OpenAI's Whisper models running locally. It supports multiple model sizes and hardware acceleration options, including CPU, CUDA, and Apple Silicon (MLX), for privacy-focused transcription without external API calls.

Installation

Choose your installation based on your hardware:

Standard Whisper (CPU/CUDA)

```shell
pip install "pipecat-ai[whisper]"
```

MLX Whisper (Apple Silicon)

```shell
pip install "pipecat-ai[mlx-whisper]"
```

Prerequisites

Local Model Setup

Before using Whisper STT services, you need:
  1. Model Selection: Choose appropriate Whisper model size (tiny, base, small, medium, large)
  2. Hardware Configuration: Set up CPU, CUDA, or Apple Silicon acceleration
  3. Storage Space: Ensure sufficient disk space for model downloads
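As a quick pre-flight check for step 3, you can verify free disk space with the standard library before picking a model size. This is a sketch; the size thresholds are rough estimates (large checkpoints are a few GB), not official figures:

```python
import shutil

def has_space_for_model(path: str = ".", required_gb: float = 3.0) -> bool:
    """Return True if the filesystem containing `path` has at least
    `required_gb` gigabytes free (large Whisper checkpoints are a few GB)."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

# Example: prefer a large model when space allows, else fall back.
model_size = "large-v3" if has_space_for_model(required_gb=4.0) else "base"
```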

Configuration Options

  • Model Size: Balance between accuracy and performance based on your hardware
  • Hardware Acceleration: Configure CUDA for NVIDIA GPUs or MLX for Apple Silicon
  • Language Support: Whisper supports 99+ languages out of the box

No API keys are required: Whisper runs entirely locally for complete privacy.

Configuration

WhisperSTTService

Uses Faster Whisper for efficient local transcription on CPU or CUDA devices.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str \| Model` | `Model.DISTIL_MEDIUM_EN` | Whisper model to use. Can be a `Model` enum value or a string. Available models: `TINY`, `BASE`, `SMALL`, `MEDIUM`, `LARGE` (large-v3), `LARGE_V3_TURBO`, `DISTIL_LARGE_V2`, `DISTIL_MEDIUM_EN` (English-only). **Deprecated** in v0.0.105; use `settings=WhisperSTTService.Settings(...)` instead. |
| `device` | `str` | `"auto"` | Device for inference. Options: `"cpu"`, `"cuda"`, or `"auto"` (auto-detect). |
| `compute_type` | `str` | `"default"` | Compute type for inference. Options include `"default"`, `"int8"`, `"int8_float16"`, `"float16"`, etc. |
| `no_speech_prob` | `float` | `0.4` | Probability threshold for filtering out non-speech segments; segments with a no-speech probability above this value are excluded. **Deprecated** in v0.0.105; use `settings=WhisperSTTService.Settings(...)` instead. |
| `language` | `Language` | `Language.EN` | Default language for transcription. **Deprecated** in v0.0.105; use `settings=WhisperSTTService.Settings(...)` instead. |
| `settings` | `WhisperSTTService.Settings` | `None` | Runtime-configurable settings for the STT service. See WhisperSTTService Settings below. |

WhisperSTTServiceMLX

Optimized for Apple Silicon using MLX Whisper. Models are loaded on demand.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str \| MLXModel` | `MLXModel.TINY` | MLX Whisper model to use. Can be an `MLXModel` enum value or a string. Available models: `TINY`, `MEDIUM`, `LARGE_V3`, `LARGE_V3_TURBO`, `DISTIL_LARGE_V3`, `LARGE_V3_TURBO_Q4` (quantized). **Deprecated** in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead. |
| `no_speech_prob` | `float` | `0.6` | Probability threshold for filtering out non-speech segments. **Deprecated** in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead. |
| `language` | `Language` | `Language.EN` | Default language for transcription. **Deprecated** in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead. |
| `temperature` | `float` | `0.0` | Sampling temperature; lower values produce more deterministic results. **Deprecated** in v0.0.105; use `settings=WhisperSTTServiceMLX.Settings(...)` instead. |
| `settings` | `WhisperSTTServiceMLX.Settings` | `None` | Runtime-configurable settings for the MLX STT service. See WhisperSTTServiceMLX Settings below. |

WhisperSTTService Settings

Runtime-configurable settings passed via the settings constructor argument using WhisperSTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `Model.DISTIL_MEDIUM_EN` | Whisper model to use. (Inherited from base STT settings.) |
| `language` | `Language \| str` | `Language.EN` | Default language for transcription. (Inherited from base STT settings.) |
| `no_speech_prob` | `float` | `0.4` | Probability threshold for filtering out non-speech segments. |

WhisperSTTServiceMLX Settings

Runtime-configurable settings passed via the settings constructor argument using WhisperSTTServiceMLX.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `MLXModel.TINY` | MLX Whisper model to use. (Inherited from base STT settings.) |
| `language` | `Language \| str` | `Language.EN` | Default language for transcription. (Inherited from base STT settings.) |
| `no_speech_prob` | `float` | `0.6` | Probability threshold for filtering out non-speech segments. |
| `temperature` | `float` | `0.0` | Sampling temperature. Lower values are more deterministic. |
| `engine` | `str` | `"mlx"` | Whisper engine identifier. |

Usage

Basic Faster Whisper Setup

```python
from pipecat.services.whisper.stt import WhisperSTTService

stt = WhisperSTTService(
    settings=WhisperSTTService.Settings(
        model="base",
    ),
)
```

With CUDA Acceleration

```python
from pipecat.services.whisper.stt import WhisperSTTService, Model

stt = WhisperSTTService(
    device="cuda",
    compute_type="float16",
    settings=WhisperSTTService.Settings(
        model=Model.LARGE,
    ),
)
```

With Custom Language

```python
from pipecat.services.whisper.stt import WhisperSTTService, Model
from pipecat.transcriptions.language import Language

stt = WhisperSTTService(
    settings=WhisperSTTService.Settings(
        model=Model.MEDIUM,
        language=Language.FR,
        no_speech_prob=0.5,
    ),
)
```

MLX Whisper on Apple Silicon

```python
from pipecat.services.whisper.stt import WhisperSTTServiceMLX, MLXModel
from pipecat.transcriptions.language import Language

stt = WhisperSTTServiceMLX(
    settings=WhisperSTTServiceMLX.Settings(
        model=MLXModel.LARGE_V3_TURBO,
        language=Language.EN,
        temperature=0.0,
    ),
)
```

The `InputParams` / `params=` pattern is deprecated as of v0.0.105. Use `Settings` / `settings=` instead. See the Service Settings guide for migration details.

Notes

  • First run downloads: If the selected model hasn’t been downloaded previously, the first run will download it from the Hugging Face model hub. This may take significant time depending on model size.
  • Segmented transcription: Both WhisperSTTService and WhisperSTTServiceMLX extend SegmentedSTTService, meaning they process complete audio segments after VAD detects the user has stopped speaking.
  • No-speech filtering: The no_speech_prob threshold helps filter out hallucinations. Increase it to be more permissive, decrease it to filter more aggressively.
  • MLX quantization: The LARGE_V3_TURBO_Q4 model provides reduced memory usage with minimal quality loss on Apple Silicon.
  • Language support: Whisper supports 99+ languages. Use the Language enum for type-safe language selection.
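To make the no-speech filtering above concrete, its gist can be sketched as threshold-based segment filtering. This is illustrative only, not the actual pipecat implementation; the field names mirror Whisper's per-segment output:

```python
# Illustrative sketch of no-speech filtering: segments whose no-speech
# probability exceeds the threshold are dropped as likely hallucinations.
def keep_speech(segments, no_speech_prob=0.4):
    """Keep only segments at or below the no-speech probability threshold."""
    return [s["text"] for s in segments if s["no_speech_prob"] <= no_speech_prob]

segments = [
    {"text": "hello there", "no_speech_prob": 0.05},
    {"text": "(breathing)", "no_speech_prob": 0.82},
]
print(keep_speech(segments))                      # only real speech survives
print(keep_speech(segments, no_speech_prob=0.9))  # higher threshold is more permissive
```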