Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.

Overview

XTTSService provides multilingual voice synthesis with voice cloning capabilities through a locally hosted streaming server. The service supports real-time streaming and custom voice training using Coqui’s XTTS-v2 model for cross-lingual text-to-speech.

Installation

XTTS requires a running streaming server. Start the server using Docker:
docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 \
  ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121

Prerequisites

XTTS Server Setup

Before using XTTSService, you need:
  1. Docker Environment: Set up Docker with GPU support for optimal performance
  2. XTTS Server: Run the XTTS streaming server container
  3. Voice Models: Configure voice models and cloning samples as needed

Required Configuration

  • Server URL: Configure the XTTS server endpoint (default: http://localhost:8000)
  • Voice Selection: Set up voice models or voice cloning samples
GPU acceleration is recommended for optimal performance. The server requires CUDA support for best results.
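
Because the service depends on the server already being up, it can help to wait for the container to answer before building a pipeline. The following is a minimal sketch, not part of Pipecat: probe_server and wait_until_ready are hypothetical helper names, and it assumes the /studio_speakers endpoint returns HTTP 200 once the model has loaded.

```python
import time
import urllib.request


def probe_server(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/studio_speakers answers with HTTP 200."""
    url = base_url.rstrip("/") + "/studio_speakers"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are connection
        # refusals and socket timeouts, so this covers "not ready yet".
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A typical call would be wait_until_ready(lambda: probe_server("http://localhost:8000")). The first start of the container may take a while because the model has to load, so the attempt count and delay here are arbitrary starting values.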

Configuration

XTTSService

voice_id
str
required
deprecated
ID of the studio speaker to use for synthesis. Deprecated in v0.0.105. Use settings=XTTSService.Settings(voice=...) instead.
base_url
str
required
Base URL of the XTTS streaming server (e.g. http://localhost:8000).
aiohttp_session
aiohttp.ClientSession
required
An aiohttp session for HTTP requests to the XTTS server.
language
Language
default:"Language.EN"
deprecated
Language for synthesis. Supports Czech, German, English, Spanish, French, Hindi, Hungarian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Turkish, and Chinese. Deprecated in v0.0.105. Use settings=XTTSService.Settings(language=...) instead.
settings
XTTSService.Settings
default:"None"
Runtime-configurable settings. See Settings below.
sample_rate
int
default:"None"
Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate. Audio is automatically resampled from XTTS’s native 24kHz output.
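
To make the resampling step concrete, the sketch below converts a mono buffer from 24 kHz to another rate by linear interpolation. It only illustrates the arithmetic and is not the resampler Pipecat uses; resample_linear is a hypothetical name, and a production resampler would also apply a low-pass filter when downsampling.

```python
def resample_linear(samples, source_rate=24000, target_rate=16000):
    """Resample a mono sequence by linear interpolation (no anti-alias filter)."""
    if not samples or source_rate == target_rate:
        return list(samples)
    # Output length scales with the ratio of the two rates.
    out_len = max(1, round(len(samples) * target_rate / source_rate))
    step = source_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        if left >= len(samples) - 1:
            # Past the last interpolable pair: hold the final sample.
            out.append(float(samples[-1]))
            continue
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[left + 1] * frac)
    return out
```

For example, a 20 ms chunk at 24 kHz holds 480 samples and becomes 320 samples at 16 kHz.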

Settings

Runtime-configurable settings passed via the settings constructor argument using XTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
Parameter   Type             Default   Description
model       str              None      Model identifier. (Inherited.)
voice       str              None      Voice identifier. (Inherited.)
language    Language | str   None      Language for synthesis. (Inherited.)

Usage

Basic Setup

import aiohttp
from pipecat.services.xtts import XTTSService

async with aiohttp.ClientSession() as session:
    tts = XTTSService(
        settings=XTTSService.Settings(
            voice="Ana Florence",
        ),
        base_url="http://localhost:8000",
        aiohttp_session=session,
    )

With Language Configuration

import aiohttp
from pipecat.services.xtts import XTTSService
from pipecat.transcriptions.language import Language

async with aiohttp.ClientSession() as session:
    tts = XTTSService(
        settings=XTTSService.Settings(
            voice="Ana Florence",
            language=Language.ES,
        ),
        base_url="http://localhost:8000",
        aiohttp_session=session,
    )
The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.
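
A language can be checked against the supported list before the service is constructed. The sketch below is illustrative only: to_xtts_language and XTTS_LANGUAGE_CODES are hypothetical names, the table covers just the sixteen languages listed under Configuration, and the "zh-cn" code for Chinese is an assumption about what the XTTS server expects.

```python
from typing import Optional

# ISO 639-1 code -> code assumed to be accepted by the XTTS server.
# Only "zh" differs from its key; that mapping is an assumption.
XTTS_LANGUAGE_CODES = {
    "cs": "cs", "de": "de", "en": "en", "es": "es",
    "fr": "fr", "hi": "hi", "hu": "hu", "it": "it",
    "ja": "ja", "ko": "ko", "nl": "nl", "pl": "pl",
    "pt": "pt", "ru": "ru", "tr": "tr", "zh": "zh-cn",
}


def to_xtts_language(code: str) -> Optional[str]:
    """Map a code such as "es", "es-ES", or "zh_CN" to an XTTS code, or None."""
    # Keep only the base language, ignoring any region suffix.
    base = code.strip().lower().replace("_", "-").split("-")[0]
    return XTTS_LANGUAGE_CODES.get(base)
```

A None result indicates a language outside the supported list, which is worth rejecting early rather than sending to the server.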

Notes

  • Local server required: XTTS requires a locally running streaming server (via Docker). The service connects to this server over HTTP.
  • Studio speakers: On startup, the service fetches available “studio speakers” from the server’s /studio_speakers endpoint. The configured voice (the voice setting, or the deprecated voice_id argument) must match one of these speakers.
  • Audio resampling: XTTS natively outputs audio at 24kHz. The service automatically resamples to match the pipeline’s configured sample rate.
  • GPU recommended: The XTTS server performs best with CUDA-enabled GPU acceleration. CPU inference is significantly slower.
  • No API key required: XTTS runs locally, so no external API credentials are needed.
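
The studio speaker requirement above can be checked up front. The sketch below assumes the /studio_speakers response is a JSON object keyed by speaker name and has already been parsed into a dict; check_voice is a hypothetical helper, not part of Pipecat, and the second speaker name in the usage note is placeholder data.

```python
import difflib


def check_voice(voice: str, studio_speakers: dict) -> str:
    """Return `voice` if the server knows it, else raise with close matches."""
    if voice in studio_speakers:
        return voice
    # Offer near matches to catch typos in speaker names.
    suggestions = difflib.get_close_matches(voice, list(studio_speakers), n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown studio speaker {voice!r}.{hint}")
```

With speakers = {"Ana Florence": {}, "Claribel Dervla": {}}, check_voice("Ana Florence", speakers) returns the name unchanged, while a misspelling such as "Ana Florance" raises a ValueError that suggests the closest known speaker.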