Media Stream
When a call starts, Inkbox opens a WebSocket connection to your agent at the client_websocket_url you configured on the phone number or provided when placing the call. This connection carries the live call data between your agent and the caller for the duration of the call.
What flows over this connection (text, audio, or both) depends on how your agent configures itself. Inkbox can handle text-to-speech (TTS), speech-to-text (STT), or both on your behalf. See Choosing a mode below.
Connection flow
- Inkbox connects to your `client_websocket_url` with an `X-Call-Context` header containing the `call_id`, `phone_number`, and `direction`. If your organization has a signing key, the connection also includes `X-Inkbox-Request-ID`, `X-Inkbox-Timestamp`, and `X-Inkbox-Signature` headers.
- Your agent accepts the WebSocket and declares its capabilities via two response headers:

  | Header | Default | Description |
  |---|---|---|
  | `X-Use-Inkbox-Text-To-Speech` | true | If true, Inkbox converts your text responses to speech. If false, your agent sends audio directly. |
  | `X-Use-Inkbox-Speech-To-Text` | true | If true, Inkbox transcribes the caller's speech and sends you text. If false, your agent receives raw audio. |

  If you omit these headers, both default to true (Inkbox handles everything).
- A `start` event is sent to your agent with the `call_control_id` and media format details.
- Streaming begins. Text or audio flows bidirectionally depending on the mode.
- A `stop` event is sent when the call ends.
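The capability declaration in the second step can be sketched as a small helper that builds the two response headers. The function name `capability_headers` is hypothetical, and how the headers are attached to the handshake response depends on the WebSocket server library you use, which is not shown here.

```python
def capability_headers(use_inkbox_tts: bool = True,
                       use_inkbox_stt: bool = True) -> dict[str, str]:
    """Build the response headers that declare who handles TTS and STT.

    Hypothetical helper: only the header names and the true/false values
    come from the documentation above. Both arguments default to True,
    mirroring the documented default when the headers are omitted.
    """
    return {
        "X-Use-Inkbox-Text-To-Speech": "true" if use_inkbox_tts else "false",
        "X-Use-Inkbox-Speech-To-Text": "true" if use_inkbox_stt else "false",
    }


if __name__ == "__main__":
    # An agent that synthesizes its own audio but lets Inkbox transcribe.
    print(capability_headers(use_inkbox_tts=False))
```

Sending the headers explicitly, even when both are true, keeps the chosen mode visible in your agent's code instead of relying on the default.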
Choosing a mode
The combination of the two response headers gives you four configurations. The rest of this section describes the default configuration, in which Inkbox handles both STT and TTS, including the WebSocket events your agent sends and receives.
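As a rough illustration of the four configurations, the sketch below enumerates every combination of the two header values and derives what the agent receives and sends in each, following the header descriptions above. The names `describe_mode` and `ALL_MODES` are hypothetical and exist only for this example.

```python
from itertools import product


def describe_mode(use_inkbox_tts: bool, use_inkbox_stt: bool) -> dict[str, str]:
    """Map the two header values to the kind of data the agent handles."""
    return {
        # With Inkbox STT the agent gets transcribed text, otherwise raw audio.
        "agent_receives": "text" if use_inkbox_stt else "audio",
        # With Inkbox TTS the agent replies with text, otherwise with audio.
        "agent_sends": "text" if use_inkbox_tts else "audio",
    }


# Keys are (use_inkbox_tts, use_inkbox_stt) pairs.
ALL_MODES = {
    (tts, stt): describe_mode(tts, stt)
    for tts, stt in product((True, False), repeat=2)
}

if __name__ == "__main__":
    for (tts, stt), mode in ALL_MODES.items():
        print(f"TTS by Inkbox={tts}, STT by Inkbox={stt}: {mode}")
```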
Inkbox handles STT + TTS
Inkbox transcribes the caller and synthesizes your responses. Your agent only deals with text.
- Your agent receives: Text (transcribed caller speech)
- Your agent sends: Text (to be spoken to the caller)
- Inkbox handles: Speech-to-text + text-to-speech

Simplest setup. Ideal when your agent is a text-based LLM and you want Inkbox to handle all audio processing.
WebSocket response headers
Your agent declares this configuration by setting these headers when accepting the WebSocket connection:
X-Use-Inkbox-Text-To-Speech: true
X-Use-Inkbox-Speech-To-Text: true
Events you receive (Inkbox → your agent)
| Event | Description |
|---|---|
| `start` | Call stream opened. Contains call metadata and media format. |
| `transcript` | Caller speech transcribed by Inkbox. Sent as interim results and a final result per utterance. |
| `barge_in` | The caller started speaking while your agent's TTS was playing. Inkbox interrupts playback. |
| `stop` | Call ended. |
Events you send (your agent → Inkbox)
| Event | Description |
|---|---|
| `text` | Stream text to be spoken to the caller. Send chunks with `done: false`, then a final message with `done: true`. Inkbox converts each chunk to speech and plays it. |
| `stop` | Hang up the call from your side. |
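The chunked `text` pattern can be sketched as a generator that emits one message per chunk and then a closing message. Only the `done` flag is documented above. The `event` and `text` key names, and the choice to send an empty string in the final message, are assumptions made for illustration, so check the event reference for the real payload shape.

```python
import json
from typing import Iterable, Iterator


def text_messages(chunks: Iterable[str]) -> Iterator[str]:
    """Yield JSON messages: each chunk with done=false, then one done=true.

    Assumed payload shape: {"event": "text", "text": ..., "done": ...}.
    Only the "done" field is taken from the documentation.
    """
    for chunk in chunks:
        yield json.dumps({"event": "text", "text": chunk, "done": False})
    # Assumption: the final message carries no further text.
    yield json.dumps({"event": "text", "text": "", "done": True})


if __name__ == "__main__":
    # In a real agent each message would be sent over the WebSocket.
    for message in text_messages(["Thanks for calling. ", "How can I help?"]):
        print(message)
```

Streaming chunks as your LLM produces them lets Inkbox start speaking before the full response is available.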
Audio format
When your agent sends or receives audio (any mode where TTS or STT is handled by your agent), audio is encoded as PCMU (u-law) at 8 kHz, base64-encoded inside JSON messages. This is the standard telephony format used by Telnyx.
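For agents that handle audio themselves, the sketch below decodes a base64 PCMU payload into 16-bit linear PCM samples using the standard G.711 μ-law expansion, and computes its duration at 8 kHz. The functions take the base64 string directly, because the name of the JSON field that carries it is not given here. All function names are hypothetical.

```python
import base64

SAMPLE_RATE_HZ = 8000  # PCMU carries one 8-bit sample per 1/8000 s


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_pcmu_payload(b64_audio: str) -> list[int]:
    """Decode a base64 PCMU payload into a list of 16-bit PCM samples."""
    raw = base64.b64decode(b64_audio)
    return [ulaw_byte_to_pcm16(b) for b in raw]


def duration_seconds(b64_audio: str) -> float:
    """Playback duration of a base64 PCMU payload at 8 kHz."""
    return len(base64.b64decode(b64_audio)) / SAMPLE_RATE_HZ


if __name__ == "__main__":
    # 160 bytes of mu-law silence (0xFF) is a 20 ms frame.
    frame = base64.b64encode(bytes([0xFF] * 160)).decode("ascii")
    print(duration_seconds(frame), decode_pcmu_payload(frame)[:4])
```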
Transcripts
Regardless of mode, Inkbox persists call transcripts to the database as the call progresses. In modes where Inkbox handles STT, transcripts are captured automatically. In modes where your agent handles STT, you should send transcript events so Inkbox can persist them. You can retrieve transcripts after the call via the Transcripts API.
Call duration
Each call has a maximum duration of 10 minutes. When the limit is reached, Inkbox hangs up the call with hangup_reason: "max_duration". See Rate Limits for organization-level limits.
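One way an agent might account for the 10-minute limit is to track how much time is left and wrap up the conversation before Inkbox hangs up. This is a minimal sketch: `MAX_CALL_SECONDS` and `seconds_remaining` are hypothetical names, and the start time is assumed to be recorded by your agent when it receives the `start` event.

```python
import time
from typing import Optional

MAX_CALL_SECONDS = 10 * 60  # documented maximum call duration


def seconds_remaining(started_at: float, now: Optional[float] = None) -> float:
    """Seconds left before the call reaches the maximum duration.

    started_at and now are time.monotonic() readings; now defaults to
    the current reading. The result never goes below zero.
    """
    if now is None:
        now = time.monotonic()
    return max(0.0, MAX_CALL_SECONDS - (now - started_at))


if __name__ == "__main__":
    started = time.monotonic()
    if seconds_remaining(started) < 30:
        print("Less than 30 seconds left, start wrapping up.")
```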
Setting your stream URL
Configure `client_websocket_url` on a phone number so it's used automatically for all auto-accepted calls, or provide a `client_websocket_url` per call when placing outbound calls or responding to incoming call webhooks.