Overview
AssemblyAISTTService provides real-time speech recognition using AssemblyAI’s WebSocket API. It supports interim results, end-of-turn detection, and configurable audio processing parameters for accurate transcription in conversational AI applications.
- AssemblyAI STT API Reference: Pipecat’s API methods for AssemblyAI STT integration
- Example Implementation: Example with AssemblyAI built-in turn detection
- Universal-3 Pro Streaming: U3 Pro streaming documentation and features
- U3 Pro API Reference: Complete U3 Pro streaming API reference
- AssemblyAI Console: Access API keys and transcription features
Installation
To use AssemblyAI services, install the required dependency:

Prerequisites
AssemblyAI Account Setup
Before using AssemblyAI STT services, you need:
- AssemblyAI Account: Sign up at the AssemblyAI Console
- API Key: Generate an API key from your dashboard
- Model Selection: Choose from available transcription models and features
Required Environment Variables
ASSEMBLYAI_API_KEY: Your AssemblyAI API key for authentication
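For example, in your shell profile or a `.env` file (the key value below is a placeholder):

```shell
# Replace the placeholder with your actual key from the AssemblyAI Console
export ASSEMBLYAI_API_KEY="your-assemblyai-api-key"
```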
Configuration
AssemblyAISTTService
AssemblyAI API key for authentication.
Language code for transcription. AssemblyAI currently supports English.
Deprecated in v0.0.105. Use settings=AssemblyAISTTService.Settings(...) instead.

WebSocket endpoint URL. Override for custom or proxied deployments.
Audio sample rate in Hz.
Audio encoding format.
Connection configuration parameters. Deprecated in v0.0.105. Use settings=AssemblyAISTTService.Settings(...) instead. See AssemblyAIConnectionParams below for the field mapping.

Controls the turn detection mode. When True (Pipecat mode, default): forces AssemblyAI to return finals ASAP so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done; VAD stop sends ForceEndpoint as a ceiling; no UserStarted/StoppedSpeakingFrame is emitted from STT. When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using built-in turn detection, uses AssemblyAI API defaults for all parameters unless explicitly set, and emits UserStarted/StoppedSpeakingFrame from STT.

Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection.

Optional format string for speaker labels when diarization is enabled. Use {speaker} for the speaker label and {text} for the transcript text. Examples: "<{speaker}>{text}</{speaker}>" or "{speaker}: {text}". If None, transcript text is not modified.

Runtime-configurable settings for the STT service. See Settings below.
P99 latency from speech end to final transcript in seconds. Override for your
deployment.
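To illustrate the speaker_format template described above, here is a plain-Python sketch of how such a format string is applied to a final transcript segment (this is not the service’s internal code):

```python
def apply_speaker_format(speaker_format, speaker, text):
    """Apply a speaker_format template to a final transcript segment."""
    if speaker_format is None:
        return text  # if None, transcript text is not modified
    return speaker_format.format(speaker=speaker, text=text)

print(apply_speaker_format("{speaker}: {text}", "Speaker A", "Hello there."))
# Speaker A: Hello there.
print(apply_speaker_format("<{speaker}>{text}</{speaker}>", "Speaker B", "Hi."))
# <Speaker B>Hi.</Speaker B>
```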
AssemblyAIConnectionParams
Connection-level parameters previously passed via the connection_params constructor argument.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sample_rate | int | 16000 | Audio sample rate in Hz. |
| encoding | Literal | "pcm_s16le" | Audio encoding format. Options: "pcm_s16le", "pcm_mulaw". |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| min_end_of_turn_silence_when_confident | int | None | DEPRECATED. Use min_turn_silence instead. Will be removed in a future version. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. JSON-serialized before sending. |
| prompt | str | None | BETA: Optional text prompt to guide transcription. Only used when speech_model is "u3-rt-pro". Cannot be used with keyterms_prompt. AssemblyAI suggests starting with no prompt; see AssemblyAI prompting best practices for guidance. |
| speech_model | Literal | "u3-rt-pro" | Speech model to use. Options: "universal-streaming-english", "universal-streaming-multilingual", "u3-rt-pro". Defaults to "u3-rt-pro" if not specified. |
| language_detection | bool | None | Enable automatic language detection. Only applicable to universal-streaming-multilingual; Turn messages include language information. |
| format_turns | bool | True | Whether to format transcript turns. Only applicable to universal-streaming-english and universal-streaming-multilingual; for u3-rt-pro, formatting is automatic and built-in. |
| speaker_labels | bool | None | Enable speaker diarization. Final transcripts include a speaker field (e.g., “Speaker A”, “Speaker B”). |
| vad_threshold | float | None | Confidence threshold (0.0 to 1.0) for classifying audio frames as silence; frames with VAD confidence below this value are considered silent. Only applicable to u3-rt-pro. Increase in noisy environments to reduce false speech detection. Defaults to None (not sent); the API default is 0.3. For best performance with an external VAD (e.g., Silero), align this value with your VAD’s activation threshold. |
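As an illustration of the note that keyterms_prompt is JSON-serialized and that unset (None) fields are not sent, here is a plain-Python sketch; the exact wire format used by the service is internal to Pipecat and is an assumption here:

```python
import json

# Connection parameters as they might be assembled before sending.
params = {
    "sample_rate": 16000,
    "encoding": "pcm_s16le",
    "speech_model": "u3-rt-pro",
    "keyterms_prompt": ["Pipecat", "AssemblyAI"],
    "end_of_turn_confidence_threshold": None,  # unset, should not be sent
}

# Drop unset fields; JSON-serialize keyterms_prompt per the table above.
wire = {k: json.dumps(v) if k == "keyterms_prompt" else v
        for k, v in params.items() if v is not None}
print(wire["keyterms_prompt"])  # ["Pipecat", "AssemblyAI"]
```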
Settings
Runtime-configurable settings passed via the settings constructor argument using AssemblyAISTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN | Language for speech recognition. (Inherited from base STT settings.) |
| formatted_finals | bool | True | Whether to enable transcript formatting. |
| word_finalization_max_wait_time | int | None | Maximum time to wait for word finalization, in milliseconds. |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. |
| prompt | str | None | Optional text prompt to guide transcription (u3-rt-pro only). |
| language_detection | bool | None | Enable automatic language detection. |
| format_turns | bool | True | Whether to format transcript turns. |
| speaker_labels | bool | None | Enable speaker diarization. |
| vad_threshold | float | None | VAD confidence threshold (0.0–1.0) for classifying audio frames as silence. |
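Putting the table together, constructing a Settings object and later updating a subset at runtime might look like the following configuration sketch. The import paths follow Pipecat’s service layout and the field values are hypothetical:

```python
from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Initial settings passed at construction time (values are examples only).
settings = AssemblyAISTTService.Settings(
    end_of_turn_confidence_threshold=0.7,
    min_turn_silence=160,
    keyterms_prompt=["Pipecat", "AssemblyAI"],
)

# Later, update a subset of settings mid-conversation without reconnecting
# by pushing an STTUpdateSettingsFrame through the pipeline.
update = STTUpdateSettingsFrame(settings={"max_turn_silence": 2400})
```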
Usage
Basic Setup
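A minimal construction might look like this (a sketch; the import path is assumed from Pipecat’s service layout):

```python
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Uses the default u3-rt-pro model and Pipecat-mode turn detection.
stt = AssemblyAISTTService(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
```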
With Custom Settings
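Custom settings are passed via the settings argument; a configuration sketch (threshold value is a hypothetical example):

```python
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        keyterms_prompt=["Pipecat", "AssemblyAI"],  # guide transcription
        end_of_turn_confidence_threshold=0.7,       # example value
    ),
)
```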
With AssemblyAI Built-in Turn Detection
AssemblyAI’s u3-rt-pro model supports built-in turn detection for more natural conversation flow:

With Speaker Diarization
Enable speaker identification for multi-party conversations:

Notes
- u3-rt-pro model: The default model is now u3-rt-pro, which provides the best performance and supports built-in turn detection.
- Turn detection modes:
  - Pipecat mode (vad_force_turn_endpoint=True, default): Forces AssemblyAI to return finals ASAP so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. The service sends a ForceEndpoint message when VAD detects the user has stopped speaking.
  - AssemblyAI mode (vad_force_turn_endpoint=False, u3-rt-pro only): AssemblyAI’s model controls turn endings using built-in turn detection. The service emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame based on AssemblyAI’s detection.
- Speaker diarization: Enable speaker_labels=True in Settings to automatically identify different speakers. Final transcripts will include a speaker field (e.g., “Speaker A”, “Speaker B”). Use the speaker_format parameter to format transcripts with speaker labels.
- Language detection: When using universal-streaming-multilingual with language_detection=True, Turn messages include language_code and language_confidence fields for automatic language detection.
- Prompting: The prompt parameter (u3-rt-pro only) allows you to guide transcription for specific names, terms, or domain vocabulary. This is a beta feature; AssemblyAI recommends testing without a prompt first. It cannot be used with keyterms_prompt.
- Dynamic settings updates: You can update keyterms_prompt, prompt, min_turn_silence, and max_turn_silence at runtime using STTUpdateSettingsFrame without reconnecting.
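The two turn detection modes above can be sketched side by side; this is a configuration sketch, with the import path assumed from Pipecat’s service layout:

```python
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

api_key = os.getenv("ASSEMBLYAI_API_KEY")

# Pipecat mode (default): Pipecat's turn detection (e.g., Smart Turn)
# decides when the user is done; VAD stop sends ForceEndpoint as a ceiling.
stt_pipecat_mode = AssemblyAISTTService(
    api_key=api_key,
    vad_force_turn_endpoint=True,
)

# AssemblyAI mode (u3-rt-pro only): AssemblyAI's built-in turn detection
# controls turn endings, and the service emits
# UserStartedSpeakingFrame/UserStoppedSpeakingFrame.
stt_assemblyai_mode = AssemblyAISTTService(
    api_key=api_key,
    vad_force_turn_endpoint=False,
)
```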
Event Handlers
AssemblyAI STT supports the standard service connection events:

| Event | Description |
|---|---|
| on_connected | Connected to AssemblyAI WebSocket |
| on_disconnected | Disconnected from AssemblyAI WebSocket |
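Handlers are registered with Pipecat’s event_handler decorator; a sketch (the handler signature is an assumption based on Pipecat conventions):

```python
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(api_key=os.getenv("ASSEMBLYAI_API_KEY"))


@stt.event_handler("on_connected")
async def on_connected(service):
    # Called when the WebSocket connection to AssemblyAI is established.
    print("Connected to AssemblyAI WebSocket")


@stt.event_handler("on_disconnected")
async def on_disconnected(service):
    print("Disconnected from AssemblyAI WebSocket")
```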