Media Stream

When a call starts, Inkbox opens a WebSocket connection to your agent at the client_websocket_url you configured on the phone number or provided when placing the call. This connection carries the live call data between your agent and the caller for the duration of the call.

What flows over this connection (text, audio, or both) depends on how your agent configures itself. Inkbox can handle text-to-speech (TTS), speech-to-text (STT), or both on your behalf. See Choosing a mode below.

Connection flow

1. Inkbox connects to your `client_websocket_url` with an `X-Call-Context` header containing the `call_id`, `phone_number`, and `direction`. If your organization has a signing key, the connection also includes `X-Inkbox-Request-ID`, `X-Inkbox-Timestamp`, and `X-Inkbox-Signature` headers.
2. Your agent accepts the WebSocket and declares its capabilities via two response headers:

Header	Default	Description
`X-Use-Inkbox-Text-To-Speech`	`true`	If `true`, Inkbox converts your text responses to speech. If `false`, your agent sends audio directly.
`X-Use-Inkbox-Speech-To-Text`	`true`	If `true`, Inkbox transcribes the caller's speech and sends you text. If `false`, your agent receives raw audio.

If you omit these headers, both default to `true` (Inkbox handles everything).

3. Inkbox sends a `start` event to your agent with the `call_control_id` and media format details.
4. Streaming begins. Text or audio flows bidirectionally depending on the mode.
5. Inkbox sends a `stop` event when the call ends.
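
The capability declaration in the connection flow can be sketched as a small helper that builds the two response headers and reports what each combination implies. The header names and the `true` default come from the table above. The function names `capability_headers` and `describe_mode` are hypothetical, and how you attach the headers to the handshake response depends on your WebSocket server library, which is not shown here.

```python
def capability_headers(inkbox_tts: bool = True, inkbox_stt: bool = True) -> dict[str, str]:
    """Build the two capability response headers for the WebSocket handshake."""
    return {
        "X-Use-Inkbox-Text-To-Speech": "true" if inkbox_tts else "false",
        "X-Use-Inkbox-Speech-To-Text": "true" if inkbox_stt else "false",
    }


def describe_mode(headers: dict[str, str]) -> tuple[str, str]:
    """Return (what the agent sends, what the agent receives) for a header set.

    Missing headers fall back to "true", matching the documented default.
    """
    tts = headers.get("X-Use-Inkbox-Text-To-Speech", "true").lower() == "true"
    stt = headers.get("X-Use-Inkbox-Speech-To-Text", "true").lower() == "true"
    return ("text" if tts else "audio", "text" if stt else "audio")
```

For example, `describe_mode(capability_headers(inkbox_tts=False))` returns `("audio", "text")`: the agent sends its own audio but still receives transcribed text.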

Choosing a mode

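The two response headers combine into four configurations:

`X-Use-Inkbox-Text-To-Speech`	`X-Use-Inkbox-Speech-To-Text`	Your agent sends	Your agent receives
`true`	`true`	Text	Text
`true`	`false`	Text	Raw audio
`false`	`true`	Audio	Text
`false`	`false`	Audio	Raw audio

The default configuration, in which Inkbox handles both STT and TTS, is detailed below, including the WebSocket events your agent sends and receives.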
Inkbox handles STT + TTS

Inkbox transcribes the caller and synthesizes your responses. Your agent only deals with text.

Your agent receives: Text (transcribed caller speech)

Your agent sends: Text (to be spoken to the caller)

Inkbox handles: Speech-to-text + text-to-speech

Simplest setup. Ideal when your agent is a text-based LLM and you want Inkbox to handle all audio processing.

WebSocket response headers

Your agent declares this configuration by setting these headers when accepting the WebSocket connection:

X-Use-Inkbox-Text-To-Speech: true
X-Use-Inkbox-Speech-To-Text: true

Events you receive (Inkbox → your agent)

Event	Description
`start`	Call stream opened. Contains call metadata and media format.
`transcript`	Caller speech transcribed by Inkbox. Sent as interim results and a final result per utterance.
`barge_in`	The caller started speaking while your agent's TTS was playing. Inkbox interrupts playback.
`stop`	Call ended.
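
One way to handle the incoming events listed above is a small router that parses each WebSocket message and calls a handler per event name. This is only a sketch: the event names come from the table, but the message schema is not shown in this section, so the top-level `"event"` field used to carry the event name is an assumption, and `EventRouter` is a hypothetical helper rather than part of any Inkbox SDK.

```python
import json
from typing import Any, Callable

# ASSUMPTION: each message is a JSON object whose event name lives in a
# top-level "event" field. Check the real payloads before relying on this.
EVENT_KEY = "event"

Handler = Callable[[dict[str, Any]], None]


class EventRouter:
    """Route decoded WebSocket messages to per-event handlers."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}

    def on(self, name: str, handler: Handler) -> None:
        """Register the handler for one event name, e.g. "transcript"."""
        self._handlers[name] = handler

    def dispatch(self, raw: str) -> bool:
        """Decode one message and run its handler.

        Returns False when no handler is registered for the event.
        """
        message = json.loads(raw)
        handler = self._handlers.get(message.get(EVENT_KEY))
        if handler is None:
            return False
        handler(message)
        return True
```

An agent would register handlers for `start`, `transcript`, `barge_in`, and `stop`, then call `dispatch` for every message it reads from the socket. Returning `False` for unknown events lets the agent log and ignore event types it does not handle.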

Events you send (your agent → Inkbox)

Event	Description
`text`	Stream text to be spoken to the caller. Send chunks with `done: false`, then a final message with `done: true`. Inkbox converts each chunk to speech and plays it.
`stop`	Hang up the call from your side.
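
The chunked `text` protocol described above can be sketched as a generator that emits one message per chunk with `done: false`, followed by a closing message with `done: true`. Only the `done` flag is confirmed by this section. The `"event"` and `"text"` field names, and the choice to send an empty closing message rather than marking the last chunk as done, are assumptions to verify against the real payloads.

```python
import json
from typing import Iterable, Iterator


def text_messages(chunks: Iterable[str]) -> Iterator[str]:
    """Yield JSON messages for streaming a response to the caller.

    ASSUMPTION: messages look like {"event": "text", "text": ..., "done": ...}.
    Only the "done" flag is documented in this section.
    """
    for chunk in chunks:
        yield json.dumps({"event": "text", "text": chunk, "done": False})
    # Closing message: tells Inkbox the response is complete.
    yield json.dumps({"event": "text", "text": "", "done": True})
```

Because the function accepts any iterable, it can wrap a token stream from an LLM directly, so speech can start before the full response has been generated.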

Audio format

When your agent sends or receives audio (any mode where your agent handles TTS or STT), the audio is encoded as PCMU (G.711 μ-law) at 8 kHz and base64-encoded inside JSON messages. This is the standard telephony format used by Telnyx.
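
As an illustration of this format, the sketch below converts 16-bit linear PCM into base64-encoded PCMU using the standard G.711 μ-law companding algorithm, and computes the duration of a payload. It assumes the input is mono, little-endian, and already sampled at 8 kHz; resampling is not shown. The function names are hypothetical, and the JSON field that carries the payload is not covered here.

```python
import base64
import struct

BIAS = 0x84            # G.711 mu-law bias added before segment search
CLIP = 32635           # largest magnitude that can be encoded after biasing
SAMPLE_RATE_HZ = 8000  # PCMU carries one byte per sample


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7..0.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_payload(pcm: bytes) -> str:
    """Encode little-endian 16-bit PCM (8 kHz, mono) as base64 PCMU."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    ulaw = bytes(linear_to_ulaw(s) for s in samples)
    return base64.b64encode(ulaw).decode("ascii")


def payload_duration_ms(payload: str) -> float:
    """Duration of a base64 PCMU payload in milliseconds."""
    return len(base64.b64decode(payload)) * 1000 / SAMPLE_RATE_HZ
```

At 8 kHz with one byte per sample, 160 bytes of PCMU correspond to 20 ms of audio, so `payload_duration_ms` is a quick way to pace outgoing audio or measure how much has been received.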

Transcripts

Regardless of mode, Inkbox persists call transcripts to the database as the call progresses. In modes where Inkbox handles STT, transcripts are captured automatically. In modes where your agent handles STT, you should send transcript events so Inkbox can persist them. You can retrieve transcripts after the call via the Transcripts API.

Call duration

Each call has a maximum duration of 10 minutes. When the limit is reached, Inkbox hangs up the call with hangup_reason: "max_duration". See Rate Limits for organization-level limits.
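
If your agent should wrap up gracefully before Inkbox hangs up, it can track the time left against the 10-minute limit. The sketch below is a hypothetical helper: `CallClock` and the 30-second margin are illustrative choices, and it assumes the clock starts when your agent receives the `start` event.

```python
import time
from typing import Callable, Optional

MAX_CALL_SECONDS = 10 * 60  # documented per-call maximum


class CallClock:
    """Track how much of the maximum call duration is left."""

    def __init__(
        self,
        started_at: Optional[float] = None,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._now = now
        self._started_at = now() if started_at is None else started_at

    def remaining(self) -> float:
        """Seconds left before the limit is reached, never negative."""
        elapsed = self._now() - self._started_at
        return max(0.0, MAX_CALL_SECONDS - elapsed)

    def should_wrap_up(self, margin_seconds: float = 30.0) -> bool:
        """True once the remaining time is within the margin (hypothetical default)."""
        return self.remaining() <= margin_seconds
```

Injecting the `now` function keeps the helper testable without waiting in real time, and `time.monotonic` avoids jumps caused by system clock adjustments.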

Setting your stream URL

Configure `client_websocket_url` on a phone number to have it used automatically for all auto-accepted calls. Alternatively, provide a `client_websocket_url` per call when placing outbound calls or responding to incoming call webhooks.