
Overview

Azure Cognitive Services provides high-quality text-to-speech synthesis with two service implementations: AzureTTSService (WebSocket-based) for real-time streaming with low latency, and AzureHttpTTSService (HTTP-based) for batch synthesis. AzureTTSService is recommended for interactive applications requiring streaming capabilities.

Installation

To use Azure services, install the required dependencies:
pip install "pipecat-ai[azure]"

Prerequisites

Azure Account Setup

Before using Azure TTS services, you need:
  1. Azure Account: Sign up at Azure Portal
  2. Speech Service: Create a Speech resource in your Azure subscription
  3. API Key and Region: Get your subscription key and service region
  4. Voice Selection: Choose from available voices in the Voice Gallery

Required Environment Variables

  • AZURE_SPEECH_API_KEY: Your Azure Speech service API key
  • AZURE_SPEECH_REGION: Your Azure Speech service region (e.g., "eastus")
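
These can be exported in the shell before launching your app (the values below are placeholders; substitute your own key and region):

```shell
# Placeholder credentials -- replace with your own subscription key and region
export AZURE_SPEECH_API_KEY="your-subscription-key"
export AZURE_SPEECH_REGION="eastus"
```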

Configuration

AzureTTSService

api_key
str
required
Azure Cognitive Services subscription key.
region
str
required
Azure region identifier (e.g., "eastus", "westus2").
voice
str
default:"en-US-SaraNeural"
deprecated
Voice name to use for synthesis. Deprecated in v0.0.105. Use settings=AzureTTSService.Settings(voice=...) instead.
sample_rate
int
default:"None"
Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
text_aggregation_mode
TextAggregationMode
default:"TextAggregationMode.SENTENCE"
Controls how incoming text is aggregated before synthesis. SENTENCE (default) buffers text until sentence boundaries, producing more natural speech. TOKEN streams tokens directly for lower latency. Import from pipecat.services.tts_service.
aggregate_sentences
bool
default:"None"
deprecated
Deprecated in v0.0.104. Use text_aggregation_mode instead.
params
InputParams
default:"None"
deprecated
Deprecated in v0.0.105. Use settings=AzureTTSService.Settings(...) instead.
settings
AzureTTSService.Settings
default:"None"
Runtime-configurable settings. See Settings below.
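
The difference between the two aggregation modes can be sketched in plain Python. This is a simplified illustration, not pipecat's actual aggregator: SENTENCE-style behavior buffers streamed tokens until a sentence boundary before handing text to the synthesizer, while TOKEN-style behavior would forward each chunk immediately.

```python
# Simplified sketch of SENTENCE-style aggregation: buffer streamed text
# chunks and emit complete sentences at punctuation boundaries.
# (Illustrative only -- pipecat's real aggregator handles more cases.)

SENTENCE_END = (".", "!", "?")

def aggregate_sentences(chunks):
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit everything up to (and including) the last sentence boundary.
        last = max(buffer.rfind(p) for p in SENTENCE_END)
        if last != -1:
            yield buffer[: last + 1].strip()
            buffer = buffer[last + 1 :]
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence

chunks = ["Hello the", "re! How are", " you today?"]
print(list(aggregate_sentences(chunks)))
```

Buffering to sentence boundaries trades a little latency for natural prosody, which is why SENTENCE is the default.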

AzureHttpTTSService

The HTTP service accepts the same parameters as the streaming service except text_aggregation_mode and aggregate_sentences:
api_key
str
required
Azure Cognitive Services subscription key.
region
str
required
Azure region identifier.
voice
str
default:"en-US-SaraNeural"
deprecated
Voice name to use for synthesis. Deprecated in v0.0.105. Use settings=AzureHttpTTSService.Settings(voice=...) instead.
sample_rate
int
default:"None"
Output audio sample rate in Hz.
params
InputParams
default:"None"
deprecated
Deprecated in v0.0.105. Use settings=AzureHttpTTSService.Settings(...) instead.
settings
AzureHttpTTSService.Settings
default:"None"
Runtime-configurable settings. See Settings below.

Settings

Runtime-configurable settings passed via the settings constructor argument using AzureTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
model
str
default:"None"
Model identifier. (Inherited.)
voice
str
default:"None"
Voice identifier. (Inherited.)
language
Language | str
default:"None"
Language for synthesis. (Inherited.)
emphasis
str
default:"NOT_GIVEN"
Emphasis level for SSML.
pitch
str
default:"NOT_GIVEN"
Pitch adjustment.
rate
str
default:"NOT_GIVEN"
Speaking rate.
role
str
default:"NOT_GIVEN"
Role for SSML.
style
str
default:"NOT_GIVEN"
Speaking style.
style_degree
str
default:"NOT_GIVEN"
Degree of the speaking style.
volume
str
default:"NOT_GIVEN"
Volume level.
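
The mid-conversation update behavior can be illustrated with a small stand-in. The class below is a hypothetical stand-in for pipecat's TTSUpdateSettingsFrame, showing the expected merge semantics: a partial settings payload overrides only the keys it carries.

```python
from dataclasses import dataclass, field

@dataclass
class UpdateSettingsFrame:
    """Hypothetical stand-in for pipecat's TTSUpdateSettingsFrame."""
    settings: dict = field(default_factory=dict)

def apply_update(current: dict, frame: UpdateSettingsFrame) -> dict:
    # Keys present in the frame override the current values; everything
    # else is left untouched.
    merged = dict(current)
    merged.update(frame.settings)
    return merged

current = {"voice": "en-US-SaraNeural", "style": None, "rate": None}
updated = apply_update(current, UpdateSettingsFrame({"style": "cheerful", "rate": "1.1"}))
```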

Usage

Basic Setup

import os

from pipecat.services.azure import AzureTTSService

tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    settings=AzureTTSService.Settings(
        voice="en-US-SaraNeural",
    ),
)

With Voice Customization

import os

from pipecat.services.azure import AzureTTSService
from pipecat.transcriptions.language import Language

tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region="eastus",
    settings=AzureTTSService.Settings(
        voice="en-US-JennyMultilingualNeural",
        language=Language.EN_US,
        style="cheerful",
        style_degree="1.5",
        rate="1.1",
    ),
)

HTTP Service

import os

from pipecat.services.azure import AzureHttpTTSService

tts = AzureHttpTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    settings=AzureHttpTTSService.Settings(
        voice="en-US-SaraNeural",
    ),
)
The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.

Notes

  • Streaming vs HTTP: The streaming service (AzureTTSService) provides word-level timestamps and lower latency, making it better for interactive conversations. The HTTP service is simpler but returns the complete audio at once.
  • SSML support: Both services automatically construct SSML from the Settings. Special characters in text are automatically escaped.
  • Word timestamps: AzureTTSService supports word-level timestamps for synchronized text display. CJK languages receive special handling to merge individual characters into meaningful word units.
  • 8kHz workaround: At 8kHz sample rates, Azure’s reported audio duration may not match word boundary offsets. The service uses word boundary offsets for timing in this case.
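
The SSML note above can be sketched as a minimal construction. This is a hedged approximation, not pipecat's actual SSML builder: settings map onto prosody and mstts:express-as attributes, and user text is XML-escaped so it cannot break the document.

```python
from xml.sax.saxutils import escape

def build_ssml(text, voice, style=None, rate=None, pitch=None):
    # Escape <, >, & so user text cannot break the SSML document.
    body = escape(text)
    if rate or pitch:
        attrs = ""
        if rate:
            attrs += f' rate="{rate}"'
        if pitch:
            attrs += f' pitch="{pitch}"'
        body = f"<prosody{attrs}>{body}</prosody>"
    if style:
        # Speaking styles use Azure's mstts extension namespace.
        body = f'<mstts:express-as style="{style}">{body}</mstts:express-as>'
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">{body}</voice></speak>'
    )

ssml = build_ssml("Profit < loss & growth", "en-US-SaraNeural",
                  style="cheerful", rate="1.1")
```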