XTTS

Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.

Overview

XTTSService provides multilingual voice synthesis with voice cloning through a locally hosted streaming server. The service supports real-time streaming and custom voice training, using Coqui’s XTTS-v2 model for cross-lingual text-to-speech.

XTTS API Reference

Pipecat’s API methods for XTTS integration

Example Implementation

Complete example with voice cloning

XTTS Repository

Official XTTS streaming server repository

Voice Cloning

Learn about custom voice training

Installation

XTTS requires a running streaming server. Start the server using Docker:

docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 \
  ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121

Prerequisites

XTTS Server Setup

Before using XTTSService, you need:

Docker Environment: Set up Docker with GPU support for optimal performance
XTTS Server: Run the XTTS streaming server container
Voice Models: Configure voice models and cloning samples as needed

Required Configuration

Server URL: Configure the XTTS server endpoint (default: http://localhost:8000)
Voice Selection: Set up voice models or voice cloning samples

GPU acceleration is recommended: the server performs best on a CUDA-capable GPU.
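
The container can take a while to load the model after `docker run` returns, so it may help to wait until the server answers before building the pipeline. The following is a minimal sketch using only the standard library. The `XTTS_BASE_URL` environment variable and both function names are hypothetical, chosen for this example; the `/studio_speakers` path is the endpoint mentioned in the Notes section.

```python
import os
import time
import urllib.error
import urllib.request

DEFAULT_BASE_URL = "http://localhost:8000"


def resolve_base_url(env=None):
    """Return the server base URL without a trailing slash.

    XTTS_BASE_URL is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    return env.get("XTTS_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


def wait_for_server(base_url, attempts=30, delay_s=2.0):
    """Poll /studio_speakers until it returns HTTP 200 or attempts run out."""
    url = f"{base_url}/studio_speakers"
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers URLError, timeouts, and refused connections:
            # the server is probably still starting up.
            pass
        time.sleep(delay_s)
    return False


# Example (requires a running server):
#   base_url = resolve_base_url()
#   if not wait_for_server(base_url):
#       raise RuntimeError(f"XTTS server not reachable at {base_url}")
```

The resolved URL can then be passed as `base_url` when constructing the service.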

Configuration

XTTSService

voice_id

str

required

deprecated

ID of the studio speaker to use for synthesis. Deprecated in v0.0.105. Use settings=XTTSService.Settings(voice=...) instead.

base_url

str

required

Base URL of the XTTS streaming server (e.g. http://localhost:8000).

aiohttp_session

aiohttp.ClientSession

required

An aiohttp session for HTTP requests to the XTTS server.

language

Language

default:"Language.EN"

deprecated

Language for synthesis. Supports Czech, German, English, Spanish, French, Hindi, Hungarian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Turkish, and Chinese. Deprecated in v0.0.105. Use settings=XTTSService.Settings(language=...) instead.

settings

XTTSService.Settings

default:"None"

Runtime-configurable settings. See Settings below.

sample_rate

int

default:"None"

Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate. Audio is automatically resampled from XTTS’s native 24kHz output.
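
To make the relationship between the native 24 kHz output and a different pipeline sample rate concrete, here is a small illustrative resampler. It is not Pipecat's resampling implementation: it uses plain linear interpolation with no anti-aliasing filter, and the function name is hypothetical. It only shows how the sample count and timing change between rates.

```python
def resample_linear(samples, src_rate=24000, dst_rate=16000):
    """Resample a mono sequence of samples by linear interpolation."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    # Same duration, different number of samples.
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring source samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# 20 ms of audio is 480 samples at 24 kHz and 320 samples at 16 kHz.
chunk_24k = [float(i) for i in range(480)]
chunk_16k = resample_linear(chunk_24k, 24000, 16000)
```

A production resampler would low-pass filter before downsampling to avoid aliasing, which this sketch omits.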

Settings

Runtime-configurable settings passed via the settings constructor argument using XTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.

Parameter	Type	Default	Description
`model`	`str`	`None`	Model identifier. (Inherited.)
`voice`	`str`	`None`	Voice identifier. (Inherited.)
`language`	`Language \| str`	`None`	Language for synthesis. (Inherited.)
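
Because `language` accepts either a `Language` value or a plain string, it can be useful to check a string against the supported languages before passing it in. The sketch below is an assumption-laden illustration: the code table is my reading of the XTTS-v2 language codes for the languages listed on this page (notably `zh-cn` for Chinese), and `to_xtts_language` is a hypothetical helper, not part of Pipecat.

```python
# Assumed XTTS-v2 codes for the languages this page lists as supported.
XTTS_LANGUAGE_CODES = {
    "cs": "cs",  # Czech
    "de": "de",  # German
    "en": "en",  # English
    "es": "es",  # Spanish
    "fr": "fr",  # French
    "hi": "hi",  # Hindi
    "hu": "hu",  # Hungarian
    "it": "it",  # Italian
    "ja": "ja",  # Japanese
    "ko": "ko",  # Korean
    "nl": "nl",  # Dutch
    "pl": "pl",  # Polish
    "pt": "pt",  # Portuguese
    "ru": "ru",  # Russian
    "tr": "tr",  # Turkish
    "zh": "zh-cn",  # Chinese
}


def to_xtts_language(language):
    """Map a code such as 'es' or 'es-ES' to an assumed XTTS code, or None."""
    base = str(language).strip().lower().split("-")[0]
    return XTTS_LANGUAGE_CODES.get(base)
```

Returning `None` for an unlisted language lets the caller decide whether to fall back to a default or raise an error.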

Usage

Basic Setup

import aiohttp
from pipecat.services.xtts import XTTSService

async with aiohttp.ClientSession() as session:
    tts = XTTSService(
        settings=XTTSService.Settings(
            voice="Ana Florence",
        ),
        base_url="http://localhost:8000",
        aiohttp_session=session,
    )

With Language Configuration

import aiohttp
from pipecat.services.xtts import XTTSService
from pipecat.transcriptions.language import Language

async with aiohttp.ClientSession() as session:
    tts = XTTSService(
        settings=XTTSService.Settings(
            voice="Ana Florence",
            language=Language.ES,
        ),
        base_url="http://localhost:8000",
        aiohttp_session=session,
    )

The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.

Notes

Local server required: XTTS requires a locally running streaming server (via Docker). The service connects to this server over HTTP.
Studio speakers: On startup, the service fetches available “studio speakers” from the server’s /studio_speakers endpoint. The configured voice must match the name of one of these speakers.
Audio resampling: XTTS natively outputs audio at 24kHz. The service automatically resamples to match the pipeline’s configured sample rate.
GPU recommended: The XTTS server performs best with CUDA-enabled GPU acceleration. CPU inference is significantly slower.
No API key required: XTTS runs locally, so no external API credentials are needed.
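
Since the configured voice has to match a studio speaker, it can help to list the available names before starting a session. The sketch below queries the `/studio_speakers` endpoint with the standard library. It assumes the response is a JSON object keyed by speaker name, which is not confirmed on this page, and both function names are hypothetical.

```python
import json
import urllib.request


def fetch_studio_speakers(base_url="http://localhost:8000"):
    """GET /studio_speakers and return the decoded JSON body.

    Assumption: the body is a JSON object whose keys are speaker names.
    """
    url = f"{base_url}/studio_speakers"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def check_voice(voice, speakers):
    """Return `voice` if it is a known speaker, else raise ValueError."""
    if voice not in speakers:
        available = ", ".join(sorted(speakers))
        raise ValueError(f"Unknown voice {voice!r}; available: {available}")
    return voice


# Example (requires a running server):
#   speakers = fetch_studio_speakers()
#   check_voice("Ana Florence", speakers)
```

Failing early with the list of valid names is usually easier to debug than a synthesis error in the middle of a conversation.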
