Overview
NVIDIA Riva provides two STT service implementations:

- `NvidiaSTTService`: Real-time streaming transcription using Parakeet models with interim results and continuous audio processing.
- `NvidiaSegmentedSTTService`: Segmented transcription using Canary models with advanced language support, word boosting, and enterprise-grade accuracy.
- NVIDIA Riva STT API Reference: Pipecat's API methods for NVIDIA Riva STT integration
- Example Implementation: Complete example with NVIDIA services integration
- NVIDIA Riva Documentation: Official NVIDIA Riva ASR documentation
- NVIDIA Developer Portal: Access API keys and Riva services
Installation
To use NVIDIA Riva services, install the required dependency.

Prerequisites
NVIDIA Riva Setup
Before using NVIDIA Riva STT services, you need:

- NVIDIA Developer Account: Sign up at the NVIDIA Developer Portal
- API Key: Generate an NVIDIA API key for Riva services
- Model Selection: Choose between Parakeet (streaming) and Canary (segmented) models
Required Environment Variables
- `NVIDIA_API_KEY`: Your NVIDIA API key for authentication
NvidiaSTTService
Real-time streaming transcription using NVIDIA Riva's Parakeet models.

Constructor parameters:

- NVIDIA API key for authentication.
- NVIDIA Riva server address.
- Mapping containing `function_id` and `model_name` for the ASR model.
- Audio sample rate in Hz. When `None`, uses the pipeline's configured sample rate.
- Additional configuration parameters. Deprecated in v0.0.105; use `settings=NvidiaSTTService.Settings(...)` instead.
- Whether to use SSL for the gRPC connection.
- P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `None` | STT model identifier. (Inherited from base STT settings.) |
| `language` | `Language \| str` | `Language.EN_US` | Target language for transcription. (Inherited from base STT settings.) |
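As an illustration of the runtime-update path, the sketch below shows a settings payload whose keys mirror the table above. The frame import path and queueing call are assumptions and are commented out because they require a running Pipecat pipeline.

```python
# Sketch: updating STT settings mid-conversation. The payload keys mirror the
# Settings table above; the import path below is an assumption, so verify it
# against your installed Pipecat version.
new_settings = {"language": "en-US"}

# In a running pipeline you would push something like:
#
#   from pipecat.frames.frames import STTUpdateSettingsFrame
#   await task.queue_frames([STTUpdateSettingsFrame(settings=new_settings)])
```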
Usage
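The original usage snippet was not preserved here; the following is a minimal sketch of the constructor arguments described above. The model name and the commented import path are assumptions to be verified against your Pipecat version, and the function ID placeholder must be filled from your NVIDIA account.

```python
import os

# Sketch: constructor arguments for NvidiaSTTService, based on the parameter
# descriptions above. "parakeet-ctc-1.1b-asr" is an assumed model name.
stt_kwargs = {
    "api_key": os.getenv("NVIDIA_API_KEY"),
    "model_function_map": {
        "function_id": "<your-function-id>",    # from the NVIDIA Developer Portal
        "model_name": "parakeet-ctc-1.1b-asr",  # assumed Parakeet model name
    },
}

# With the Riva dependency installed, construction would look like
# (import path assumed):
#
#   from pipecat.services.riva.stt import NvidiaSTTService
#   stt = NvidiaSTTService(**stt_kwargs)
```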
Notes
- Model cannot be changed after initialization: Use the `model_function_map` parameter in the constructor to specify the model and function ID.
- Streaming: Provides real-time interim and final results through continuous audio streaming.
NvidiaSegmentedSTTService
Batch/segmented transcription using NVIDIA Riva's Canary models. Processes complete audio segments after VAD detects speech boundaries.

Constructor parameters:

- NVIDIA API key for authentication.
- NVIDIA Riva server address.
- Mapping containing `function_id` and `model_name` for the ASR model.
- Audio sample rate in Hz. When `None`, uses the pipeline's configured sample rate.
- Additional configuration parameters. Deprecated in v0.0.105; use `settings=NvidiaSegmentedSTTService.Settings(...)` instead.
- Runtime-configurable settings. See Settings below.
- Whether to use SSL for the gRPC connection.
- P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSegmentedSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `None` | STT model identifier. (Inherited from base STT settings.) |
| `language` | `Language \| str` | `Language.EN_US` | Target language for transcription. (Inherited from base STT settings.) |
| `profanity_filter` | `bool` | `False` | Whether to filter profanity from results. |
| `automatic_punctuation` | `bool` | `True` | Whether to add automatic punctuation. |
| `verbatim_transcripts` | `bool` | `False` | Whether to return verbatim transcripts. |
| `boosted_lm_words` | `list[str]` | `None` | List of words to boost in the language model. |
| `boosted_lm_score` | `float` | `4.0` | Score boost for specified words. |
Usage
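The original usage snippet was not preserved here; the sketch below shows a segmented-service configuration with word boosting, based on the Settings table above. The model name and the commented import path are assumptions; the function ID placeholder must come from your NVIDIA account.

```python
import os

# Sketch: NvidiaSegmentedSTTService configuration with word boosting, based on
# the Settings table above. "canary-1b-asr" is an assumed Canary model name.
settings = {
    "language": "en-US",
    "profanity_filter": False,
    "automatic_punctuation": True,
    "boosted_lm_words": ["Pipecat", "Riva", "Parakeet"],  # domain terms to boost
    "boosted_lm_score": 4.0,
}

stt_kwargs = {
    "api_key": os.getenv("NVIDIA_API_KEY"),
    "model_function_map": {
        "function_id": "<your-function-id>",  # from the NVIDIA Developer Portal
        "model_name": "canary-1b-asr",        # assumed Canary model name
    },
}

# With the Riva dependency installed, construction would look like
# (import path assumed):
#
#   from pipecat.services.riva.stt import NvidiaSegmentedSTTService
#   stt = NvidiaSegmentedSTTService(
#       **stt_kwargs,
#       settings=NvidiaSegmentedSTTService.Settings(**settings),
#   )
```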
Notes
- Model cannot be changed after initialization: Use the `model_function_map` parameter in the constructor to specify the model and function ID.
- Segmented processing: Processes complete audio segments for higher accuracy compared to streaming.
- Language support: Supports Arabic, English (US/GB), French, German, Hindi, Italian, Japanese, Korean, Portuguese (BR), Russian, and Spanish (ES/US).
- Word boosting: Use `boosted_lm_words` and `boosted_lm_score` to improve recognition of domain-specific terms.