Create realtime text-to-speech
Establishes a WebSocket connection for real-time text-to-speech generation. This endpoint uses the WebSocket protocol (wss://api.together.ai/v1/audio/speech/websocket) for bidirectional streaming: the client sends text events and the server returns audio events over the same connection.
Connection Setup:
- Protocol: WebSocket (wss://)
- Authentication: Pass API key as Bearer token in Authorization header
- Parameters: Sent as query parameters (model, voice, max_partial_length, language)
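A minimal sketch of assembling the connection URL and headers with the Python standard library. The helper name `build_connection` and the example values are illustrative assumptions; only the endpoint, the query parameter names, and the Bearer scheme come from the text above. The resulting URL and headers can be passed to whichever WebSocket client you use.

```python
from urllib.parse import urlencode

# Endpoint taken from the description above.
WS_ENDPOINT = "wss://api.together.ai/v1/audio/speech/websocket"


def build_connection(api_key, model, voice="tara", max_partial_length=None, language=None):
    """Return (url, headers) for opening the TTS WebSocket.

    Hypothetical helper: not part of any official SDK.
    """
    params = {"model": model, "voice": voice}
    # Optional parameters are only sent when explicitly provided.
    if max_partial_length is not None:
        params["max_partial_length"] = max_partial_length
    if language is not None:
        params["language"] = language
    url = f"{WS_ENDPOINT}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


url, headers = build_connection("YOUR_API_KEY", "hexgrad/Kokoro-82M", language="en")
print(url)
```

`urlencode` percent-encodes the slash in the model name, so the model is sent as `hexgrad%2FKokoro-82M` in the query string.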
Client Events:
- tts_session.updated: Update session parameters such as voice. The session object also accepts an extra_params field for additional model-specific parameters that fine-tune speech generation behavior, such as pronunciation_dict (a list of pronunciation rules for specific characters or symbols, where each entry uses the format "<source>/<replacement>", e.g., ["omg/oh my god"], to override how the model pronounces matching tokens).
  { "type": "tts_session.updated", "session": { "voice": "tara", "extra_params": { "pronunciation_dict": ["omg/oh my god"] } } }
- input_text_buffer.append: Send text chunks for TTS generation.
  { "type": "input_text_buffer.append", "text": "Hello, this is a test." }
- input_text_buffer.clear: Clear the buffered text.
  { "type": "input_text_buffer.clear" }
- input_text_buffer.commit: Signal the end of text input and process any remaining text.
  { "type": "input_text_buffer.commit" }
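The client events above are plain JSON objects, so they can be produced with small builder functions. The sketch below uses only the standard library; the function names are hypothetical, and the payload shapes are copied from the examples above.

```python
import json


def session_update(voice=None, extra_params=None):
    """Build a tts_session.updated message (hypothetical helper)."""
    session = {}
    if voice is not None:
        session["voice"] = voice
    if extra_params is not None:
        session["extra_params"] = extra_params
    return json.dumps({"type": "tts_session.updated", "session": session})


def append_text(text):
    """Build an input_text_buffer.append message."""
    return json.dumps({"type": "input_text_buffer.append", "text": text})


def clear_buffer():
    """Build an input_text_buffer.clear message."""
    return json.dumps({"type": "input_text_buffer.clear"})


def commit_buffer():
    """Build an input_text_buffer.commit message."""
    return json.dumps({"type": "input_text_buffer.commit"})


# A typical sequence: configure the session, send text, then commit.
messages = [
    session_update(voice="tara", extra_params={"pronunciation_dict": ["omg/oh my god"]}),
    append_text("Hello, this is a test."),
    commit_buffer(),
]
for message in messages:
    print(message)
```

Each string in `messages` would be sent as one WebSocket text frame, in order.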
Server Events:
- session.created: Initial session confirmation (sent first).
  { "event_id": "evt_123456", "type": "session.created", "session": { "id": "session-id", "object": "realtime.tts.session", "modalities": ["text", "audio"], "model": "hexgrad/Kokoro-82M", "voice": "tara" } }
- conversation.item.input_text.received: Acknowledgment that text was received.
  { "type": "conversation.item.input_text.received", "text": "Hello, this is a test." }
- conversation.item.audio_output.delta: Audio chunks as base64-encoded data.
  { "type": "conversation.item.audio_output.delta", "item_id": "tts_1", "delta": "<base64_encoded_audio_chunk>" }
- conversation.item.audio_output.done: Audio generation is complete for an item.
  { "type": "conversation.item.audio_output.done", "item_id": "tts_1" }
- conversation.item.tts.failed: An error occurred.
  { "type": "conversation.item.tts.failed", "error": { "message": "Error description", "type": "invalid_request_error", "param": null, "code": "invalid_api_key" } }
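One way to consume the server events is to dispatch on the `type` field, decode each audio delta, and group the bytes by `item_id`. The following is a sketch under that assumption; `handle_server_event` and its bookkeeping arguments are hypothetical names, and the sample events mirror the examples above.

```python
import base64
import json


def handle_server_event(raw, audio_by_item, finished_items):
    """Process one server message and return its type.

    audio_by_item maps item_id -> bytearray of decoded PCM audio.
    finished_items is a set of item_ids whose audio is complete.
    """
    event = json.loads(raw)
    event_type = event.get("type")
    if event_type == "conversation.item.audio_output.delta":
        # Each delta is base64-encoded on its own, so decode it before appending.
        chunk = base64.b64decode(event["delta"])
        audio_by_item.setdefault(event["item_id"], bytearray()).extend(chunk)
    elif event_type == "conversation.item.audio_output.done":
        finished_items.add(event["item_id"])
    # Other types (session.created, input_text.received, tts.failed) are
    # returned to the caller unchanged so it can decide how to react.
    return event_type


# Simulated server messages in place of a live connection.
sample_delta = base64.b64encode(b"\x00\x01\x02\x03").decode("ascii")
incoming = [
    json.dumps({"type": "conversation.item.input_text.received", "text": "Hello, this is a test."}),
    json.dumps({"type": "conversation.item.audio_output.delta", "item_id": "tts_1", "delta": sample_delta}),
    json.dumps({"type": "conversation.item.audio_output.done", "item_id": "tts_1"}),
]

audio, finished = {}, set()
seen_types = [handle_server_event(raw, audio, finished) for raw in incoming]
print(seen_types, len(audio["tts_1"]), finished)
```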
Text Processing:
- Partial text (no sentence ending) is held in the buffer until one of the following occurs:
  - The service judges the text complete enough to be processed for TTS generation
  - The partial text exceeds max_partial_length characters (default: 250)
  - The input_text_buffer.commit event is received
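The buffering rules above can be illustrated with a small local model. This is not the service's implementation: the actual completeness heuristic is not documented, so the sketch assumes, purely for illustration, that a segment is complete when it ends in `.`, `!`, or `?`. The class and method names are hypothetical.

```python
# Assumption for illustration only: the real completeness check is not documented.
SENTENCE_ENDINGS = (".", "!", "?")


def first_sentence_end(text):
    """Return the index of the earliest sentence ending, or -1 if none."""
    positions = [text.find(mark) for mark in SENTENCE_ENDINGS]
    positions = [index for index in positions if index != -1]
    return min(positions) if positions else -1


class TextBuffer:
    """Toy model of the three release conditions described above."""

    def __init__(self, max_partial_length=250):
        self.max_partial_length = max_partial_length
        self.pending = ""

    def append(self, text):
        """Add text and return the segments released for synthesis."""
        self.pending += text
        released = []
        # Condition 1: release every complete sentence in the buffer.
        cut = first_sentence_end(self.pending)
        while cut != -1:
            released.append(self.pending[:cut + 1].strip())
            self.pending = self.pending[cut + 1:]
            cut = first_sentence_end(self.pending)
        # Condition 2: force out a partial that exceeds the length limit.
        if len(self.pending) > self.max_partial_length:
            released.append(self.pending.strip())
            self.pending = ""
        return released

    def commit(self):
        """Condition 3: release whatever text remains."""
        remaining = self.pending.strip()
        self.pending = ""
        return [remaining] if remaining else []


buffer = TextBuffer(max_partial_length=10)
print(buffer.append("Hello"))            # held: no sentence ending, under the limit
print(buffer.append(" world. How"))      # the completed sentence is released
print(buffer.append(" are you today"))   # the partial exceeds the limit and is forced out
print(buffer.append("Bye"))              # held again
print(buffer.commit())                   # commit releases the remainder
```

The small `max_partial_length` is chosen only to make the forced release visible in a short example.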
Audio Format:
- Format: Raw PCM (s16le, mono)
- Sample Rate: 24000 Hz
- Encoding: Base64 (per delta event)
- Delivered via conversation.item.audio_output.delta events
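Because the audio arrives as headerless PCM, it needs a container before most players can open it. The sketch below wraps decoded deltas in a WAV file using the standard-library `wave` module, with the format values listed above. The function name `deltas_to_wav` and the silent sample data are illustrative assumptions.

```python
import base64
import io
import wave

# Format values from the list above: s16le, mono, 24000 Hz.
SAMPLE_RATE = 24000
SAMPLE_WIDTH = 2  # bytes per sample (16-bit)
CHANNELS = 1


def deltas_to_wav(deltas, target):
    """Decode base64 deltas and write them to target as WAV.

    target may be a file path or a writable, seekable binary file object.
    Returns the number of audio frames written.
    """
    # Decode each delta separately: base64 padding makes it unsafe to
    # concatenate the encoded strings before decoding.
    pcm = b"".join(base64.b64decode(delta) for delta in deltas)
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm)
    return len(pcm) // (SAMPLE_WIDTH * CHANNELS)


# Two fake deltas of 120 silent samples each, standing in for server audio.
fake_delta = base64.b64encode(b"\x00\x00" * 120).decode("ascii")
output = io.BytesIO()
frames = deltas_to_wav([fake_delta, fake_delta], output)
print(frames, frames / SAMPLE_RATE)
```

Passing a path such as `"speech.wav"` instead of the `BytesIO` object writes the file to disk.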
Error Codes:
- invalid_api_key: Invalid API key provided (401)
- missing_api_key: Authorization header missing (401)
- model_not_available: Invalid or unavailable model (400)
- Invalid text format errors (400)
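On the client, a conversation.item.tts.failed event can be converted into an exception that carries the error code. The sketch below is one possible shape: `TTSError` and `error_from_event` are hypothetical names, and the status lookup simply restates the list above, since this page does not say whether the status is included in the event itself.

```python
# Status values restated from the error code list above.
KNOWN_ERROR_STATUS = {
    "invalid_api_key": 401,
    "missing_api_key": 401,
    "model_not_available": 400,
}


class TTSError(Exception):
    """Hypothetical exception wrapping a conversation.item.tts.failed event."""

    def __init__(self, code, message, error_type=None, param=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.error_type = error_type
        self.param = param
        # None when the code is not one of the documented ones.
        self.status = KNOWN_ERROR_STATUS.get(code)


def error_from_event(event):
    """Return a TTSError for a failure event, or None for any other event."""
    if event.get("type") != "conversation.item.tts.failed":
        return None
    error = event.get("error", {})
    return TTSError(
        error.get("code"),
        error.get("message"),
        error.get("type"),
        error.get("param"),
    )


failure = {
    "type": "conversation.item.tts.failed",
    "error": {
        "message": "Error description",
        "type": "invalid_request_error",
        "param": None,
        "code": "invalid_api_key",
    },
}
problem = error_from_event(failure)
print(problem, problem.status)
```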
Multi-context support
All client and server message types support an optional context_id field. This allows you to manage multiple independent TTS streams over a single WebSocket connection.
| Field | Type | Required | Description |
|---|---|---|---|
| context_id | string | No | Identifies which context this message applies to. Defaults to "default" if omitted. For tts_session.updated, omitting context_id updates all contexts. |
Additional client message types
context.cancel — Cancel and clean up a specific context.
Additional server message types
context.cancelled — Confirms a context was cancelled.
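A sketch of how a client might multiplex streams with context_id: outgoing messages are tagged with a context, and incoming audio deltas are routed into per-context buffers. The helper names and the context names "intro" and "outro" are hypothetical, and the shape of the context.cancel payload is an assumption, since only its type is documented here.

```python
import base64
import json
from collections import defaultdict

DEFAULT_CONTEXT = "default"  # documented default when context_id is omitted


def with_context(message, context_id=None):
    """Serialize a message, adding context_id only when one is given."""
    if context_id is None:
        return json.dumps(message)
    return json.dumps({**message, "context_id": context_id})


def route_audio(raw_events):
    """Group decoded audio bytes by the context_id of each delta event."""
    streams = defaultdict(bytearray)
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("type") == "conversation.item.audio_output.delta":
            context = event.get("context_id", DEFAULT_CONTEXT)
            streams[context].extend(base64.b64decode(event["delta"]))
    return {context: bytes(data) for context, data in streams.items()}


# Outgoing: two independent streams, then cancelling one of them.
outgoing = [
    with_context({"type": "input_text_buffer.append", "text": "Welcome."}, "intro"),
    with_context({"type": "input_text_buffer.append", "text": "Goodbye."}, "outro"),
    with_context({"type": "context.cancel"}, "outro"),  # assumed payload shape
]

# Incoming: simulated deltas, one without a context_id.
def fake_delta(payload, context_id=None):
    message = {
        "type": "conversation.item.audio_output.delta",
        "item_id": "tts_1",
        "delta": base64.b64encode(payload).decode("ascii"),
    }
    return with_context(message, context_id)

streams = route_audio([
    fake_delta(b"\x01\x02", "intro"),
    fake_delta(b"\x03\x04"),
    fake_delta(b"\x05\x06", "intro"),
])
print(streams)
```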
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Query Parameters
- model: The TTS model to use for speech generation. Can also be set via the tts_session.updated event. Available options: hexgrad/Kokoro-82M, cartesia/sonic-english
- voice: The voice to use for speech generation. Default is 'tara'. Available voices vary by model. Can also be updated via the tts_session.updated event.
- max_partial_length: Maximum number of characters in partial text before forcing TTS generation even without a sentence ending (default: 250). Helps reduce latency for long text without punctuation.
- language: Language or locale of the input text. Accepts ISO 639-1 language codes (e.g., en, fr, es, zh) as well as locale codes for region-specific variants. Locale codes must be lowercase (e.g., zh-hk for Cantonese). Can also be set via the tts_session.updated event. Example: "en"
Response
Switching Protocols - WebSocket connection established successfully.
Error message format:
{
  "type": "conversation.item.tts.failed",
  "error": {
    "message": "Error description",
    "type": "invalid_request_error",
    "param": null,
    "code": "error_code"
  }
}