The Voice live API provides real-time, bidirectional communication for voice-enabled applications using WebSocket connections. This API supports advanced features including speech recognition, text-to-speech synthesis, avatar streaming, animation data, and comprehensive audio processing capabilities.
The API uses JSON-formatted events sent over WebSocket connections to manage conversations, audio streams, avatar interactions, and real-time responses. Events are categorized into client events (sent from client to server) and server events (sent from server to client).
Key Features
- Real-time Audio Processing: Support for multiple audio formats including PCM16 at various sample rates and G.711 codecs
- Advanced Voice Options: OpenAI voices, Azure custom voices, Azure standard voices, and Azure personal voices
- Avatar Integration: WebRTC-based avatar streaming with video, animation, and blendshapes
- Intelligent Turn Detection: Multiple VAD options including Azure semantic VAD and server-side detection
- Audio Enhancement: Built-in noise reduction and echo cancellation
- Function Calling: Tool integration for enhanced conversational capabilities
- Flexible Session Management: Configurable modalities, instructions, and response parameters
Client Events
The Voice live API supports the following client events that can be sent from the client to the server:
| Event | Description |
|---|---|
| session.update | Update the session configuration including voice, modalities, turn detection, and other settings |
| session.avatar.connect | Establish avatar connection by providing client SDP for WebRTC negotiation |
| input_audio_buffer.append | Append audio bytes to the input audio buffer |
| input_audio_buffer.commit | Commit the input audio buffer for processing |
| input_audio_buffer.clear | Clear the input audio buffer |
| conversation.item.create | Add a new item to the conversation context |
| conversation.item.retrieve | Retrieve a specific item from the conversation |
| conversation.item.truncate | Truncate an assistant audio message |
| conversation.item.delete | Remove an item from the conversation |
| response.create | Instruct the server to create a response via model inference |
| response.cancel | Cancel an in-progress response |
| mcp_approval_response | Send approval or rejection for an MCP tool call that requires approval |
session.update
Update the session's configuration. This event can be sent at any time to modify settings such as voice, modalities, turn detection, tools, and other session parameters. Note that once a session is initialized with a particular model, it can't be changed to another model.
Event Structure
{
"type": "session.update",
"session": {
"modalities": ["text", "audio"],
"voice": {
"type": "openai",
"name": "alloy"
},
"instructions": "You are a helpful assistant. Be concise and friendly.",
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"input_audio_sampling_rate": 24000,
"turn_detection": {
"type": "azure_semantic_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
},
"temperature": 0.8,
"max_response_output_tokens": "inf"
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "session.update" |
| session | RealtimeRequestSession | Session configuration object with fields to update |
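
As a rough illustration, the event above could be assembled in Python before being sent over the WebSocket. The helper name `build_session_update` is hypothetical and not part of any SDK; the field names are taken from the event structure above, and the transport itself is left out.

```python
import json


def build_session_update(**session_fields):
    """Return a session.update event as a JSON string.

    Only the fields passed in are included, so settings that are not
    mentioned stay out of the payload. (Hypothetical helper.)
    """
    event = {"type": "session.update", "session": dict(session_fields)}
    return json.dumps(event)


payload = build_session_update(
    modalities=["text", "audio"],
    voice={"type": "openai", "name": "alloy"},
    turn_detection={
        "type": "azure_semantic_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500,
    },
)
```

The resulting string is what a client would write to the open connection as a single text message.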
Example with Azure Custom Voice
{
"type": "session.update",
"session": {
"voice": {
"type": "azure-custom",
"name": "my-custom-voice",
"endpoint_id": "12345678-1234-1234-1234-123456789012",
"temperature": 0.7,
"style": "cheerful"
},
"input_audio_noise_reduction": {
"type": "azure_deep_noise_suppression"
},
"avatar": {
"character": "lisa",
"customized": false,
"video": {
"resolution": {
"width": 1920,
"height": 1080
},
"bitrate": 2000000
}
}
}
}
session.avatar.connect
Establish an avatar connection by providing the client's SDP (Session Description Protocol) offer for WebRTC media negotiation. This event is required when using avatar features.
Event Structure
{
"type": "session.avatar.connect",
"client_sdp": "<client_sdp>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "session.avatar.connect" |
| client_sdp | string | The client's SDP offer for WebRTC connection establishment |
input_audio_buffer.append
Append audio bytes to the input audio buffer.
Event Structure
{
"type": "input_audio_buffer.append",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA="
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "input_audio_buffer.append" |
| audio | string | Base64-encoded audio data |
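
A minimal sketch of splitting raw audio into chunks and wrapping each one in an append event. It assumes 16-bit mono PCM at 24 kHz, so the default of 4800 bytes is about 100 ms of audio; the chunk size and the function name `audio_append_events` are illustrative choices, not documented requirements.

```python
import base64
import json


def audio_append_events(pcm_bytes, chunk_size=4800):
    """Yield input_audio_buffer.append events for consecutive audio chunks.

    chunk_size is measured in raw bytes before base64 encoding.
    """
    for start in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[start:start + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# 12000 bytes of silence split into chunks of 4800, 4800, and 2400 bytes.
events = list(audio_append_events(b"\x00\x00" * 6000))
```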
input_audio_buffer.commit
Commit the input audio buffer for processing.
Event Structure
{
"type": "input_audio_buffer.commit"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "input_audio_buffer.commit" |
input_audio_buffer.clear
Clear the input audio buffer.
Event Structure
{
"type": "input_audio_buffer.clear"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "input_audio_buffer.clear" |
conversation.item.create
Add a new item to the conversation context. This can include messages, function calls, and function call responses. Items can be inserted at specific positions in the conversation history.
Event Structure
{
"type": "conversation.item.create",
"previous_item_id": "item_ABC123",
"item": {
"id": "item_DEF456",
"type": "message",
"role": "user",
"content": [
{
"type": "input_text",
"text": "Hello, how are you?"
}
]
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "conversation.item.create" |
| previous_item_id | string | Optional. ID of the item after which to insert this item. If not provided, appends to end |
| item | RealtimeConversationRequestItem | The item to add to the conversation |
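
The optional `previous_item_id` field can be handled with a small builder like the following sketch. The function name `build_user_text_item` is hypothetical; the payload shape follows the event structure above.

```python
def build_user_text_item(text, previous_item_id=None, item_id=None):
    """Build a conversation.item.create event carrying one user text message."""
    item = {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": text}],
    }
    if item_id is not None:
        item["id"] = item_id
    event = {"type": "conversation.item.create", "item": item}
    # Leaving previous_item_id out appends the item to the end of the conversation.
    if previous_item_id is not None:
        event["previous_item_id"] = previous_item_id
    return event


appended = build_user_text_item("Hello, how are you?")
inserted = build_user_text_item(
    "Hello, how are you?", previous_item_id="item_ABC123", item_id="item_DEF456"
)
```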
Example with Audio Content
{
"type": "conversation.item.create",
"item": {
"type": "message",
"role": "user",
"content": [
{
"type": "input_audio",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA=",
"transcript": "Hello there"
}
]
}
}
Example with Function Call
{
"type": "conversation.item.create",
"item": {
"type": "function_call",
"name": "get_weather",
"call_id": "call_123",
"arguments": "{\"location\": \"San Francisco\", \"unit\": \"celsius\"}"
}
}
Example with MCP call
{
"type": "conversation.item.create",
"item": {
"type": "mcp_call",
"approval_request_id": null,
"arguments": "",
"server_label": "deepwiki",
"name": "ask_question",
"output": null,
"error": null
}
}
conversation.item.retrieve
Retrieve a specific item from the conversation history. This is useful for inspecting processed audio after noise cancellation and VAD.
Event Structure
{
"type": "conversation.item.retrieve",
"item_id": "item_ABC123"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "conversation.item.retrieve" |
| item_id | string | The ID of the item to retrieve |
conversation.item.truncate
Truncate an assistant message's audio content. This is useful for stopping playback at a specific point and synchronizing the server's understanding with the client's state.
Event Structure
{
"type": "conversation.item.truncate",
"item_id": "item_ABC123",
"content_index": 0,
"audio_end_ms": 5000
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "conversation.item.truncate" |
| item_id | string | The ID of the assistant message item to truncate |
| content_index | integer | The index of the content part to truncate |
| audio_end_ms | integer | The duration up to which to truncate the audio, in milliseconds |
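
One way a client might derive `audio_end_ms` is from the number of audio bytes it has actually played. The sketch below assumes the output is 16-bit mono PCM at 24 kHz; those defaults and both function names are assumptions for illustration, so adjust them to the session's real output format.

```python
def played_audio_ms(bytes_played, sample_rate=24000, bytes_per_sample=2, channels=1):
    """Convert the number of PCM bytes already played into whole milliseconds."""
    frames = bytes_played // (bytes_per_sample * channels)
    return frames * 1000 // sample_rate


def build_truncate_event(item_id, bytes_played, content_index=0):
    """Build a conversation.item.truncate event from local playback progress."""
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": played_audio_ms(bytes_played),
    }


# 240000 bytes of 16-bit mono audio at 24 kHz is 5 seconds of playback.
event = build_truncate_event("item_ABC123", bytes_played=240000)
```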
conversation.item.delete
Remove an item from the conversation history.
Event Structure
{
"type": "conversation.item.delete",
"item_id": "item_ABC123"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "conversation.item.delete" |
| item_id | string | The ID of the item to delete |
response.create
Instruct the server to create a response via model inference. This event can specify response-specific configuration that overrides session defaults.
Event Structure
{
"type": "response.create",
"response": {
"modalities": ["text", "audio"],
"instructions": "Be extra helpful and detailed.",
"voice": {
"type": "openai",
"name": "alloy"
},
"output_audio_format": "pcm16",
"temperature": 0.7,
"max_response_output_tokens": 1000
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.create" |
| response | RealtimeResponseOptions | Optional response configuration that overrides session defaults |
Example with Tool Choice
{
"type": "response.create",
"response": {
"modalities": ["text"],
"tools": [
{
"type": "function",
"name": "get_current_time",
"description": "Get the current time",
"parameters": {
"type": "object",
"properties": {}
}
}
],
"tool_choice": "get_current_time",
"temperature": 0.3
}
}
Example with Animation
{
"type": "response.create",
"response": {
"modalities": ["audio", "animation"],
"animation": {
"model_name": "default",
"outputs": ["blendshapes", "viseme_id"]
},
"voice": {
"type": "azure-custom",
"name": "my-expressive-voice",
"endpoint_id": "12345678-1234-1234-1234-123456789012",
"style": "excited"
}
}
}
response.cancel
Cancel an in-progress response. This immediately stops response generation and related audio output.
Event Structure
{
"type": "response.cancel"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.cancel" |
RealtimeClientEventConversationItemRetrieve
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be conversation.item.retrieve. |
| item_id | string | The ID of the item to retrieve. |
| event_id | string | The ID of the event. |
RealtimeClientEventConversationItemTruncate
The client conversation.item.truncate event is used to truncate a previous assistant message's audio. The server produces audio faster than real time, so this event is useful when the user interrupts and audio that was already sent to the client hasn't been played yet. Truncating synchronizes the server's understanding of the audio with the client's playback.
Truncating audio deletes the server-side text transcript to ensure there isn't text in the context that the user doesn't know about.
If the client event is successful, the server responds with a conversation.item.truncated event.
Event structure
{
"type": "conversation.item.truncate",
"item_id": "<item_id>",
"content_index": 0,
"audio_end_ms": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be conversation.item.truncate. |
| item_id | string | The ID of the assistant message item to truncate. Only assistant message items can be truncated. |
| content_index | integer | The index of the content part to truncate. Set this property to 0. |
| audio_end_ms | integer | Inclusive duration up to which audio is truncated, in milliseconds. If the audio_end_ms is greater than the actual audio duration, the server responds with an error. |
RealtimeClientEventInputAudioBufferAppend
The client input_audio_buffer.append event is used to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit.
In Server VAD (Voice Activity Detection) mode, the audio buffer is used to detect speech and the server decides when to commit. When server VAD is disabled, the client can choose how much audio to place in each event up to a maximum of 15 MiB. For example, streaming smaller chunks from the client can allow the VAD to be more responsive.
Unlike most other client events, the server doesn't send a confirmation response to client input_audio_buffer.append event.
Event structure
{
"type": "input_audio_buffer.append",
"audio": "<audio>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be input_audio_buffer.append. |
| audio | string | Base64-encoded audio bytes. This value must be in the format specified by the input_audio_format field in the session configuration. |
RealtimeClientEventInputAudioBufferClear
The client input_audio_buffer.clear event is used to clear the audio bytes in the buffer.
The server responds with an input_audio_buffer.cleared event.
Event structure
{
"type": "input_audio_buffer.clear"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be input_audio_buffer.clear. |
RealtimeClientEventInputAudioBufferCommit
The client input_audio_buffer.commit event is used to commit the user input audio buffer, which creates a new user message item in the conversation. Audio is transcribed if input_audio_transcription is configured for the session.
In server VAD mode, the client doesn't need to send this event because the server commits the audio buffer automatically. Without server VAD, the client must commit the audio buffer to create a user message item. This client event produces an error if the input audio buffer is empty.
Committing the input audio buffer doesn't create a response from the model.
The server responds with an input_audio_buffer.committed event.
Event structure
{
"type": "input_audio_buffer.commit"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be input_audio_buffer.commit. |
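
The rules above imply a fixed order of client events for one user turn when server VAD is disabled: append the audio, commit the buffer, then request a response explicitly. A minimal sketch of that sequence follows; the function name `manual_turn_events` is hypothetical.

```python
import base64


def manual_turn_events(pcm_chunks):
    """Return the ordered client events for one user turn without server VAD."""
    events = [
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
        for chunk in pcm_chunks
        if chunk  # skip empty chunks so they do not count as audio
    ]
    if not events:
        # Committing an empty buffer produces an error, so send nothing.
        return []
    events.append({"type": "input_audio_buffer.commit"})
    # The commit alone does not trigger inference; ask for a response explicitly.
    events.append({"type": "response.create"})
    return events


turn = manual_turn_events([b"\x01\x00" * 100, b"\x02\x00" * 100])
```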
RealtimeClientEventResponseCancel
The client response.cancel event is used to cancel an in-progress response.
The server responds with a response.done event whose response.status is cancelled.
Event structure
{
"type": "response.cancel"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.cancel. |
RealtimeClientEventResponseCreate
The client response.create event is used to instruct the server to create a response via model inference. When the session is configured in server VAD mode, the server creates responses automatically.
A response includes at least one item, and can have two, in which case the second is a function call. These items are appended to the conversation history.
The server responds with a response.created event, one or more item and content events (such as conversation.item.created and response.content_part.added), and finally a response.done event to indicate the response is complete.
Event structure
{
"type": "response.create"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.create. |
| response | RealtimeResponseOptions | The response options. |
RealtimeClientEventSessionUpdate
The client session.update event is used to update the session's default configuration. The client can send this event at any time to update the session configuration, and any field can be updated at any time, except for voice.
Only fields that are present are updated. To clear a field (such as instructions), pass an empty string.
The server responds with a session.updated event that contains the full effective configuration.
Event structure
{
"type": "session.update"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be session.update. |
| session | RealtimeRequestSession | The session configuration. |
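
Because only the fields present in the event are updated, a client can send just the settings that changed. The following sketch compares a locally tracked configuration with the desired one; `session_update_diff` is a hypothetical helper, and the comparison is shallow, so a nested object such as `turn_detection` is sent whole whenever any part of it differs.

```python
def session_update_diff(current, desired):
    """Build a session.update event containing only the fields that changed."""
    changed = {
        key: value
        for key, value in desired.items()
        if current.get(key) != value
    }
    return {"type": "session.update", "session": changed}


current = {"instructions": "You are a helpful assistant.", "temperature": 0.8}
# An empty string clears the instructions; temperature is unchanged and omitted.
desired = {"instructions": "", "temperature": 0.8}
update = session_update_diff(current, desired)
```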
Server Events
The Voice live API sends the following server events to communicate status, responses, and data to the client:
| Event | Description |
|---|---|
| error | Indicates an error occurred during processing |
| session.created | Sent when a new session is successfully established |
| session.updated | Sent when session configuration is updated |
| session.avatar.connecting | Indicates avatar WebRTC connection is being established |
| conversation.item.created | Sent when a new item is added to the conversation |
| conversation.item.retrieved | Response to conversation.item.retrieve request |
| conversation.item.truncated | Confirms item truncation |
| conversation.item.deleted | Confirms item deletion |
| conversation.item.input_audio_transcription.completed | Input audio transcription is complete |
| conversation.item.input_audio_transcription.delta | Streaming input audio transcription |
| conversation.item.input_audio_transcription.failed | Input audio transcription failed |
| input_audio_buffer.committed | Input audio buffer has been committed for processing |
| input_audio_buffer.cleared | Input audio buffer has been cleared |
| input_audio_buffer.speech_started | Speech detected in input audio buffer (VAD) |
| input_audio_buffer.speech_stopped | Speech ended in input audio buffer (VAD) |
| response.created | New response generation has started |
| response.done | Response generation is complete |
| response.output_item.added | New output item added to response |
| response.output_item.done | Output item is complete |
| response.content_part.added | New content part added to output item |
| response.content_part.done | Content part is complete |
| response.text.delta | Streaming text content from the model |
| response.text.done | Text content is complete |
| response.audio_transcript.delta | Streaming audio transcript |
| response.audio_transcript.done | Audio transcript is complete |
| response.audio.delta | Streaming audio content from the model |
| response.audio.done | Audio content is complete |
| response.animation_blendshapes.delta | Streaming animation blendshapes data |
| response.animation_blendshapes.done | Animation blendshapes data is complete |
| response.audio_timestamp.delta | Streaming audio timestamp information |
| response.audio_timestamp.done | Audio timestamp information is complete |
| response.animation_viseme.delta | Streaming animation viseme data |
| response.animation_viseme.done | Animation viseme data is complete |
| response.function_call_arguments.delta | Streaming function call arguments |
| response.function_call_arguments.done | Function call arguments are complete |
| mcp_list_tools.in_progress | MCP tool listing is in progress |
| mcp_list_tools.completed | MCP tool listing is completed |
| mcp_list_tools.failed | MCP tool listing has failed |
| response.mcp_call_arguments.delta | Streaming MCP call arguments |
| response.mcp_call_arguments.done | MCP call arguments are complete |
| response.mcp_call.in_progress | MCP call is in progress |
| response.mcp_call.completed | MCP call is completed |
| response.mcp_call.failed | MCP call has failed |
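
With this many event types, clients typically route each incoming message by its `type` field. The sketch below shows one possible shape for that routing; the `EventRouter` class is hypothetical and is not tied to any particular WebSocket library.

```python
import json


class EventRouter:
    """Route decoded server events to handlers registered by event type."""

    def __init__(self):
        self._handlers = {}
        self.unhandled = []

    def on(self, event_type, handler):
        self._handlers[event_type] = handler

    def dispatch(self, raw_message):
        event = json.loads(raw_message)
        handler = self._handlers.get(event["type"])
        if handler is None:
            # Keep unknown types visible instead of silently dropping them.
            self.unhandled.append(event["type"])
            return None
        return handler(event)


router = EventRouter()
router.on("session.created", lambda e: e["session"]["id"])
session_id = router.dispatch(
    '{"type": "session.created", "session": {"id": "sess_ABC123DEF456"}}'
)
router.dispatch('{"type": "response.audio.done"}')
```

Recording unhandled types makes it easier to notice events the client has not been written to process yet.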
session.created
Sent when a new session is successfully established. This is the first event received after connecting to the API.
Event Structure
{
"type": "session.created",
"session": {
"id": "sess_ABC123DEF456",
"object": "realtime.session",
"model": "gpt-realtime",
"modalities": ["text", "audio"],
"instructions": "You are a helpful assistant.",
"voice": {
"type": "openai",
"name": "alloy"
},
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"input_audio_sampling_rate": 24000,
"turn_detection": {
"type": "azure_semantic_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
},
"temperature": 0.8,
"max_response_output_tokens": "inf"
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "session.created" |
| session | RealtimeResponseSession | The created session object |
session.updated
Sent when session configuration is successfully updated in response to a session.update client event.
Event Structure
{
"type": "session.updated",
"session": {
"id": "sess_ABC123DEF456",
"voice": {
"type": "azure-custom",
"name": "my-voice",
"endpoint_id": "12345678-1234-1234-1234-123456789012"
},
"temperature": 0.7,
"avatar": {
"character": "lisa",
"customized": false
}
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "session.updated" |
| session | RealtimeResponseSession | The updated session object |
session.avatar.connecting
Indicates that an avatar WebRTC connection is being established. This event is sent in response to a session.avatar.connect client event.
Event Structure
{
"type": "session.avatar.connecting",
"server_sdp": "<server_sdp>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "session.avatar.connecting" |
| server_sdp | string | The server's SDP for WebRTC connection establishment |
conversation.item.created
Sent when a new item is added to the conversation, either through a client conversation.item.create event or automatically during response generation.
Event Structure
{
"type": "conversation.item.created",
"previous_item_id": "item_ABC123",
"item": {
"id": "item_DEF456",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "user",
"content": [
{
"type": "input_text",
"text": "Hello, how are you?"
}
]
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "conversation.item.created" |
| previous_item_id | string | ID of the item after which this item was inserted |
| item | RealtimeConversationResponseItem | The created conversation item |
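
A client that wants a local copy of the conversation can use `previous_item_id` to keep items in order. The sketch below is one way to do that; the `ConversationMirror` class is hypothetical, and treating a missing or unknown `previous_item_id` as "append to the end" is an assumption.

```python
class ConversationMirror:
    """Keep a client-side ordered copy of conversation items."""

    def __init__(self):
        self.items = []

    def on_item_created(self, event):
        item = event["item"]
        previous_id = event.get("previous_item_id")
        ids = [existing["id"] for existing in self.items]
        if previous_id in ids:
            # Insert directly after the referenced item.
            self.items.insert(ids.index(previous_id) + 1, item)
        else:
            # Assumption: a missing or unknown previous_item_id means "append".
            self.items.append(item)

    def on_item_deleted(self, event):
        self.items = [i for i in self.items if i["id"] != event["item_id"]]


mirror = ConversationMirror()
mirror.on_item_created({"type": "conversation.item.created", "item": {"id": "item_A"}})
mirror.on_item_created({"type": "conversation.item.created", "item": {"id": "item_C"}})
mirror.on_item_created({
    "type": "conversation.item.created",
    "previous_item_id": "item_A",
    "item": {"id": "item_B"},
})
order = [item["id"] for item in mirror.items]
```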
Example with Audio Item
{
"type": "conversation.item.created",
"item": {
"id": "item_GHI789",
"type": "message",
"status": "completed",
"role": "user",
"content": [
{
"type": "input_audio",
"audio": null,
"transcript": "What's the weather like today?"
}
]
}
}
conversation.item.retrieved
Sent in response to a conversation.item.retrieve client event, providing the requested conversation item.
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "conversation.item.retrieved" |
| item | RealtimeConversationResponseItem | The retrieved conversation item |
conversation.item.truncated
The server conversation.item.truncated event is returned when the client truncates an earlier assistant audio message item with a conversation.item.truncate event. This event is used to synchronize the server's understanding of the audio with the client's playback.
This event truncates the audio and removes the server-side text transcript to ensure there's no text in the context that the user doesn't know about.
Event structure
{
"type": "conversation.item.truncated",
"item_id": "<item_id>",
"content_index": 0,
"audio_end_ms": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be conversation.item.truncated. |
| item_id | string | The ID of the assistant message item that was truncated. |
| content_index | integer | The index of the content part that was truncated. |
| audio_end_ms | integer | The duration up to which the audio was truncated, in milliseconds. |
conversation.item.deleted
Sent in response to a conversation.item.delete client event, confirming that the specified item has been removed from the conversation.
Event Structure
{
"type": "conversation.item.deleted",
"item_id": "item_ABC123"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "conversation.item.deleted" |
| item_id | string | ID of the deleted item |
response.created
Sent when a new response generation begins. This is the first event in a response sequence.
Event Structure
{
"type": "response.created",
"response": {
"id": "resp_ABC123",
"object": "realtime.response",
"status": "in_progress",
"status_details": null,
"output": [],
"usage": {
"total_tokens": 0,
"input_tokens": 0,
"output_tokens": 0
}
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.created" |
| response | RealtimeResponse | The response object that was created |
response.done
Sent when response generation is complete. This event contains the final response with all output items and usage statistics.
Event Structure
{
"type": "response.done",
"response": {
"id": "resp_ABC123",
"object": "realtime.response",
"status": "completed",
"status_details": null,
"output": [
{
"id": "item_DEF456",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello! I'm doing well, thank you for asking. How can I help you today?"
}
]
}
],
"usage": {
"total_tokens": 87,
"input_tokens": 52,
"output_tokens": 35,
"input_token_details": {
"cached_tokens": 0,
"text_tokens": 45,
"audio_tokens": 7
},
"output_token_details": {
"text_tokens": 15,
"audio_tokens": 20
}
}
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.done" |
| response | RealtimeResponse | The completed response object |
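
As an illustration, a client could reduce this event to the assistant text and token counts it cares about. The function name `summarize_response` is hypothetical; the keys it reads are the ones shown in the event structure above, and the sample event is a shortened version of it.

```python
def summarize_response(event):
    """Collect assistant text and token counts from a response.done event."""
    response = event["response"]
    texts = [
        part["text"]
        for item in response.get("output", [])
        for part in item.get("content", [])
        if part.get("type") == "text"
    ]
    usage = response.get("usage", {})
    return {
        "status": response["status"],
        "text": " ".join(texts),
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }


sample = {
    "type": "response.done",
    "response": {
        "status": "completed",
        "output": [
            {
                "type": "message",
                "role": "assistant",
                "content": [{"type": "text", "text": "Hello!"}],
            }
        ],
        "usage": {"total_tokens": 87, "input_tokens": 52, "output_tokens": 35},
    },
}
summary = summarize_response(sample)
```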
response.output_item.added
Sent when a new output item is added to the response during generation.
Event Structure
{
"type": "response.output_item.added",
"response_id": "resp_ABC123",
"output_index": 0,
"item": {
"id": "item_DEF456",
"object": "realtime.item",
"type": "message",
"status": "in_progress",
"role": "assistant",
"content": []
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.output_item.added" |
| response_id | string | ID of the response this item belongs to |
| output_index | integer | Index of the item in the response's output array |
| item | RealtimeConversationResponseItem | The output item that was added |
response.output_item.done
Sent when an output item is complete.
Event Structure
{
"type": "response.output_item.done",
"response_id": "resp_ABC123",
"output_index": 0,
"item": {
"id": "item_DEF456",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello! I'm doing well, thank you for asking."
}
]
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.output_item.done" |
| response_id | string | ID of the response this item belongs to |
| output_index | integer | Index of the item in the response's output array |
| item | RealtimeConversationResponseItem | The completed output item |
response.content_part.added
The server response.content_part.added event is returned when a new content part is added to an assistant message item during response generation.
Event Structure
{
"type": "response.content_part.added",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"part": {
"type": "text",
"text": ""
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.content_part.added" |
| response_id | string | ID of the response |
| item_id | string | ID of the item this content part belongs to |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of this content part in the item |
| part | RealtimeContentPart | The content part that was added |
response.content_part.done
The server response.content_part.done event is returned when a content part is done streaming in an assistant message item.
This event is also returned when a response is interrupted, incomplete, or canceled.
Event Structure
{
"type": "response.content_part.done",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"part": {
"type": "text",
"text": "Hello! I'm doing well, thank you for asking."
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.content_part.done" |
| response_id | string | ID of the response |
| item_id | string | ID of the item this content part belongs to |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of this content part in the item |
| part | RealtimeContentPart | The completed content part |
response.text.delta
Streaming text content from the model. Sent incrementally as the model generates text.
Event Structure
{
"type": "response.text.delta",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"delta": "Hello! I'm"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.text.delta" |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| delta | string | Incremental text content |
response.text.done
Sent when text content generation is complete.
Event Structure
{
"type": "response.text.done",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"text": "Hello! I'm doing well, thank you for asking. How can I help you today?"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.text.done" |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| text | string | The complete text content |
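
The delta and done events for text can be combined on the client as in the following sketch, which keys partial text by item ID and content index. The `TextAssembler` class is hypothetical and covers only the two text events described above.

```python
from collections import defaultdict


class TextAssembler:
    """Accumulate response.text.delta events until response.text.done arrives."""

    def __init__(self):
        self._parts = defaultdict(list)
        self.finished = {}

    def handle(self, event):
        key = (event["item_id"], event["content_index"])
        if event["type"] == "response.text.delta":
            self._parts[key].append(event["delta"])
        elif event["type"] == "response.text.done":
            # Prefer the server's final text; the joined deltas are a fallback.
            self.finished[key] = event.get("text", "".join(self._parts[key]))
            self._parts.pop(key, None)

    def partial(self, item_id, content_index=0):
        return "".join(self._parts.get((item_id, content_index), []))


assembler = TextAssembler()
base = {"item_id": "item_DEF456", "content_index": 0}
assembler.handle({"type": "response.text.delta", "delta": "Hello! I'm", **base})
assembler.handle({"type": "response.text.delta", "delta": " doing well.", **base})
in_progress = assembler.partial("item_DEF456")
assembler.handle({"type": "response.text.done", "text": "Hello! I'm doing well.", **base})
```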
response.audio.delta
Streaming audio content from the model. Audio is provided as base64-encoded data.
Event Structure
{
"type": "response.audio.delta",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"delta": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA="
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.audio.delta" |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| delta | string | Base64-encoded audio data chunk |
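
On the client side, each delta has to be base64-decoded before it can be queued for playback. A minimal sketch that collects decoded bytes per item follows; the `AudioCollector` class is hypothetical, and the playback step is left out.

```python
import base64


class AudioCollector:
    """Decode response.audio.delta events and collect raw bytes per item."""

    def __init__(self):
        self._buffers = {}

    def on_delta(self, event):
        chunk = base64.b64decode(event["delta"])
        self._buffers.setdefault(event["item_id"], bytearray()).extend(chunk)
        return len(chunk)

    def audio_for(self, item_id):
        return bytes(self._buffers.get(item_id, b""))


collector = AudioCollector()
for raw in (b"\x01\x02", b"\x03\x04"):
    collector.on_delta({
        "type": "response.audio.delta",
        "item_id": "item_DEF456",
        "delta": base64.b64encode(raw).decode("ascii"),
    })
audio = collector.audio_for("item_DEF456")
```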
response.audio.done
Sent when audio content generation is complete.
Event Structure
{
"type": "response.audio.done",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.audio.done" |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
response.audio_transcript.delta
Streaming transcript of the generated audio content.
Event Structure
{
"type": "response.audio_transcript.delta",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"delta": "Hello! I'm doing"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.audio_transcript.delta" |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| delta | string | Incremental transcript text |
response.audio_transcript.done
Sent when audio transcript generation is complete.
Event Structure
{
"type": "response.audio_transcript.done",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"transcript": "Hello! I'm doing well, thank you for asking. How can I help you today?"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | Must be "response.audio_transcript.done" |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| transcript | string | The complete transcript text |
conversation.item.input_audio_transcription.completed
The server conversation.item.input_audio_transcription.completed event is the result of audio transcription for speech written to the audio buffer.
Transcription begins when the input audio buffer is committed by the client or server (in server_vad mode). Transcription runs asynchronously with response creation, so this event can come before or after the response events.
Realtime API models accept audio natively, so input transcription is a separate process that runs on a dedicated speech recognition model such as whisper-1. The transcript can therefore diverge somewhat from the model's interpretation and should be treated as a rough guide.
Event structure
{
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "<item_id>",
"content_index": 0,
"transcript": "<transcript>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be conversation.item.input_audio_transcription.completed. |
| item_id | string | The ID of the user message item containing the audio. |
| content_index | integer | The index of the content part containing the audio. |
| transcript | string | The transcribed text. |
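
Because this event can arrive before or after the response events, a client can't rely on arrival order and may instead attach transcripts to items by item_id. A minimal sketch, assuming events are already parsed into dictionaries; `ConversationLog` is a hypothetical helper, not part of the API:

```python
class ConversationLog:
    """Hypothetical helper: pairs committed user audio items with their transcripts."""

    def __init__(self):
        self.order = []        # item IDs in the order the audio was committed
        self.transcripts = {}  # item_id -> transcript text

    def on_event(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "input_audio_buffer.committed":
            self.order.append(event["item_id"])
        elif kind == "conversation.item.input_audio_transcription.completed":
            # Stored by item_id, so late or out-of-order transcripts still land correctly.
            self.transcripts[event["item_id"]] = event["transcript"]

    def lines(self):
        """Return (item_id, transcript or None) pairs in commit order."""
        return [(item_id, self.transcripts.get(item_id)) for item_id in self.order]
```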
conversation.item.input_audio_transcription.delta
The server conversation.item.input_audio_transcription.delta event is returned when input audio transcription is configured, and a transcription request for a user message is in progress. This event provides partial transcription results as they become available.
Event structure
{
"type": "conversation.item.input_audio_transcription.delta",
"item_id": "<item_id>",
"content_index": 0,
"delta": "<delta>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be conversation.item.input_audio_transcription.delta. |
| item_id | string | The ID of the user message item. |
| content_index | integer | The index of the content part containing the audio. |
| delta | string | The incremental transcription text. |
conversation.item.input_audio_transcription.failed
The server conversation.item.input_audio_transcription.failed event is returned when input audio transcription is configured, and a transcription request for a user message failed. This event is separate from other error events so that the client can identify the related item.
Event structure
{
"type": "conversation.item.input_audio_transcription.failed",
"item_id": "<item_id>",
"content_index": 0,
"error": {
"code": "<code>",
"message": "<message>",
"param": "<param>"
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be conversation.item.input_audio_transcription.failed. |
| item_id | string | The ID of the user message item. |
| content_index | integer | The index of the content part containing the audio. |
| error | object | Details of the transcription error. See nested properties in the next table. |
Error properties
| Field | Type | Description |
|---|---|---|
| type | string | The type of error. |
| code | string | Error code, if any. |
| message | string | A human-readable error message. |
| param | string | Parameter related to the error, if any. |
response.animation_blendshapes.delta
The server response.animation_blendshapes.delta event is returned when the model generates animation blendshapes data as part of a response. This event provides incremental blendshapes data as it becomes available.
Event structure
{
"type": "response.animation_blendshapes.delta",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"frame_index": 0,
"frames": [
[0.0, 0.1, 0.2, ..., 1.0]
...
]
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.animation_blendshapes.delta. |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| frame_index | integer | Index of the first frame in this batch of frames |
| frames | array of array of float | Array of blendshape frames, each frame is an array of blendshape values |
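
Since frame_index identifies the first frame of each batch, a client can place every frame on an absolute timeline. The sketch below assumes frames within a batch are consecutive; the function names are hypothetical.

```python
def apply_blendshape_delta(timeline: dict, event: dict) -> None:
    """Store each frame of a response.animation_blendshapes.delta event by absolute index."""
    first = event["frame_index"]
    for offset, frame in enumerate(event["frames"]):
        # Assumption: frames in one batch are consecutive, starting at frame_index.
        timeline[first + offset] = frame


def ordered_frames(timeline: dict) -> list:
    """Return the frames sorted by absolute frame index."""
    return [timeline[index] for index in sorted(timeline)]
```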
response.animation_blendshapes.done
The server response.animation_blendshapes.done event is returned when the model has finished generating animation blendshapes data as part of a response.
Event structure
{
"type": "response.animation_blendshapes.done",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.animation_blendshapes.done. |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
response.audio_timestamp.delta
The server response.audio_timestamp.delta event is returned when the model generates audio timestamp data as part of a response. This event provides incremental timestamp data for output audio and text alignment as it becomes available.
Event structure
{
"type": "response.audio_timestamp.delta",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"audio_offset_ms": 0,
"audio_duration_ms": 500,
"text": "Hello",
"timestamp_type": "word"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.audio_timestamp.delta. |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| audio_offset_ms | integer | Audio offset in milliseconds from the start of the audio |
| audio_duration_ms | integer | Duration of the audio segment in milliseconds |
| text | string | The text segment corresponding to this audio timestamp |
| timestamp_type | string | The type of timestamp, currently only "word" is supported |
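
The offset and duration fields are enough to derive a start and end time for each word, which is useful for captions or text highlighting. A minimal sketch; `WordTiming`, `to_word_timing`, and `word_at` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordTiming:
    text: str
    start_ms: int
    end_ms: int


def to_word_timing(event: dict) -> WordTiming:
    """Convert a response.audio_timestamp.delta event into a start/end interval."""
    start = event["audio_offset_ms"]
    return WordTiming(event["text"], start, start + event["audio_duration_ms"])


def word_at(timings: list, playback_ms: int) -> Optional[str]:
    """Return the word being spoken at playback_ms, or None between words."""
    for word in timings:
        if word.start_ms <= playback_ms < word.end_ms:
            return word.text
    return None
```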
response.audio_timestamp.done
Sent when audio timestamp generation is complete.
Event Structure
{
"type": "response.audio_timestamp.done",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.audio_timestamp.done. |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
response.animation_viseme.delta
The server response.animation_viseme.delta event is returned when the model generates animation viseme data as part of a response. This event provides incremental viseme data as it becomes available.
Event Structure
{
"type": "response.animation_viseme.delta",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0,
"audio_offset_ms": 0,
"viseme_id": 1
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.animation_viseme.delta. |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
| audio_offset_ms | integer | Audio offset in milliseconds from the start of the audio |
| viseme_id | integer | The viseme ID corresponding to the mouth shape for animation |
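
To drive a mouth animation, a renderer needs to know which viseme applies at the current playback position. The sketch below assumes a viseme stays active until the next viseme's offset is reached; that rule and the `VisemeTrack` class are assumptions, not behavior documented here.

```python
import bisect


class VisemeTrack:
    """Hypothetical helper: looks up the active viseme for a playback time."""

    def __init__(self):
        self._offsets = []  # audio_offset_ms values, kept sorted
        self._ids = []      # viseme_id values, parallel to _offsets

    def add(self, event: dict) -> None:
        """Insert a response.animation_viseme.delta event, keeping offsets sorted."""
        index = bisect.bisect_right(self._offsets, event["audio_offset_ms"])
        self._offsets.insert(index, event["audio_offset_ms"])
        self._ids.insert(index, event["viseme_id"])

    def active_at(self, playback_ms: int):
        """Return the viseme_id whose offset most recently passed, or None before the first."""
        index = bisect.bisect_right(self._offsets, playback_ms) - 1
        return self._ids[index] if index >= 0 else None
```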
response.animation_viseme.done
The server response.animation_viseme.done event is returned when the model has finished generating animation viseme data as part of a response.
Event Structure
{
"type": "response.animation_viseme.done",
"response_id": "resp_ABC123",
"item_id": "item_DEF456",
"output_index": 0,
"content_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.animation_viseme.done. |
| response_id | string | ID of the response |
| item_id | string | ID of the item |
| output_index | integer | Index of the item in the response |
| content_index | integer | Index of the content part |
error
The server error event is returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session stays open.
Event structure
{
"type": "error",
"error": {
"code": "<code>",
"message": "<message>",
"param": "<param>",
"event_id": "<event_id>"
}
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be error. |
| error | object | Details of the error. See nested properties in the next table. |
Error properties
| Field | Type | Description |
|---|---|---|
| type | string | The type of error. For example, "invalid_request_error" and "server_error" are error types. |
| code | string | Error code, if any. |
| message | string | A human-readable error message. |
| param | string | Parameter related to the error, if any. |
| event_id | string | The ID of the client event that caused the error, if applicable. |
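
Because most errors leave the session open, a client usually logs the error and continues. The following sketch formats an error event for logging and tolerates missing optional fields; `describe_error` is a hypothetical helper.

```python
def describe_error(event: dict) -> str:
    """Build a one-line log message from a server error event."""
    if event.get("type") != "error":
        raise ValueError("not an error event")
    err = event.get("error") or {}
    # Only include the optional fields that are actually present.
    fields = [f"{key}={err[key]}" for key in ("type", "code", "param", "event_id") if err.get(key)]
    message = err.get("message") or "(no message)"
    return f"{message} [{', '.join(fields)}]" if fields else message
```

The `event_id` value, when present, lets the client correlate the failure with the client event it sent.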
input_audio_buffer.cleared
The server input_audio_buffer.cleared event is returned when the client clears the input audio buffer with an input_audio_buffer.clear event.
Event structure
{
"type": "input_audio_buffer.cleared"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be input_audio_buffer.cleared. |
input_audio_buffer.committed
The server input_audio_buffer.committed event is returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id property is the ID of the user message item created. Thus a conversation.item.created event is also sent to the client.
Event structure
{
"type": "input_audio_buffer.committed",
"previous_item_id": "<previous_item_id>",
"item_id": "<item_id>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be input_audio_buffer.committed. |
| previous_item_id | string | The ID of the preceding item after which the new item is inserted. |
| item_id | string | The ID of the user message item created. |
input_audio_buffer.speech_started
The server input_audio_buffer.speech_started event is returned in server_vad mode when speech is detected in the audio buffer. This event can happen any time audio is added to the buffer (unless speech is already detected).
Note
The client might want to use this event to interrupt audio playback or provide visual feedback to the user.
The client should expect to receive an input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item created when speech stops. The item_id is also included in the input_audio_buffer.speech_stopped event unless the client manually commits the audio buffer during VAD activation.
Event structure
{
"type": "input_audio_buffer.speech_started",
"audio_start_ms": 0,
"item_id": "<item_id>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be input_audio_buffer.speech_started. |
| audio_start_ms | integer | Milliseconds from the start of all audio written to the buffer during the session when speech was first detected. This property corresponds to the beginning of audio sent to the model, and thus includes the prefix_padding_ms configured in the session. |
| item_id | string | The ID of the user message item created when speech stops. |
input_audio_buffer.speech_stopped
The server input_audio_buffer.speech_stopped event is returned in server_vad mode when the server detects the end of speech in the audio buffer.
The server also sends a conversation.item.created event with the user message item created from the audio buffer.
Event structure
{
"type": "input_audio_buffer.speech_stopped",
"audio_end_ms": 0,
"item_id": "<item_id>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be input_audio_buffer.speech_stopped. |
| audio_end_ms | integer | Milliseconds since the session started when speech stopped. This property corresponds to the end of audio sent to the model, and thus includes the min_silence_duration_ms configured in the session. |
| item_id | string | The ID of the user message item created. |
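
Pairing the speech_started and speech_stopped events by item_id gives the length of each detected speech segment. The sketch below assumes audio_start_ms and audio_end_ms share the same time base; per the property descriptions, the result includes the configured padding and trailing silence. `SpeechSegments` is a hypothetical helper.

```python
class SpeechSegments:
    """Hypothetical helper: measures detected speech segments per user message item."""

    def __init__(self):
        self._starts = {}       # item_id -> audio_start_ms
        self.durations_ms = {}  # item_id -> segment length in milliseconds

    def on_event(self, event: dict) -> None:
        kind = event.get("type")
        item_id = event.get("item_id")  # may be absent after a manual commit
        if kind == "input_audio_buffer.speech_started":
            self._starts[item_id] = event["audio_start_ms"]
        elif kind == "input_audio_buffer.speech_stopped":
            start = self._starts.pop(item_id, None)
            if start is not None:
                self.durations_ms[item_id] = event["audio_end_ms"] - start
```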
rate_limits.updated
The server rate_limits.updated event is emitted at the beginning of a response to indicate the updated rate limits.
When a response is created, some tokens are reserved for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the response is completed.
Event structure
{
"type": "rate_limits.updated",
"rate_limits": [
{
"name": "<name>",
"limit": 0,
"remaining": 0,
"reset_seconds": 0
}
]
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be rate_limits.updated. |
| rate_limits | array of RealtimeRateLimitsItem | The list of rate limit information. |
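
A client can use this event to decide whether to delay its next request. A minimal sketch, assuming the item fields shown in the event structure above; the function names and the limit name used in the usage note are hypothetical.

```python
def index_rate_limits(event: dict) -> dict:
    """Map each rate limit name to its entry from a rate_limits.updated event."""
    return {entry["name"]: entry for entry in event.get("rate_limits", [])}


def seconds_until_available(event: dict, name: str, needed: int) -> float:
    """Return 0.0 if enough quota remains, otherwise the reported reset time."""
    entry = index_rate_limits(event).get(name)
    if entry is None or entry["remaining"] >= needed:
        return 0.0
    return float(entry["reset_seconds"])
```

For example, `seconds_until_available(event, "tokens", 500)` returns the reset time only when fewer than 500 units remain under a limit named `tokens`.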
response.audio.delta
The server response.audio.delta event is returned when the model-generated audio is updated.
Event structure
{
"type": "response.audio.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"delta": "<delta>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.audio.delta. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the item. |
| output_index | integer | The index of the output item in the response. |
| content_index | integer | The index of the content part in the item's content array. |
| delta | string | Base64-encoded audio data delta. |
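
Each delta must be Base64-decoded before the audio bytes can be played or stored. The following sketch appends the decoded bytes to a per-item buffer; `AudioSink` is a hypothetical helper, and playback is left out.

```python
import base64


class AudioSink:
    """Hypothetical helper: collects decoded output audio per item."""

    def __init__(self):
        self._buffers = {}  # item_id -> bytearray of raw audio

    def on_event(self, event: dict) -> None:
        if event.get("type") == "response.audio.delta":
            chunk = base64.b64decode(event["delta"])
            self._buffers.setdefault(event["item_id"], bytearray()).extend(chunk)

    def audio(self, item_id: str) -> bytes:
        """Return all audio bytes received so far for the item."""
        return bytes(self._buffers.get(item_id, b""))
```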
response.audio.done
The server response.audio.done event is returned when the model-generated audio is done.
This event is also returned when a response is interrupted, incomplete, or canceled.
Event structure
{
"type": "response.audio.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.audio.done. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the item. |
| output_index | integer | The index of the output item in the response. |
| content_index | integer | The index of the content part in the item's content array. |
response.audio_transcript.delta
The server response.audio_transcript.delta event is returned when the model-generated transcription of audio output is updated.
Event structure
{
"type": "response.audio_transcript.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"delta": "<delta>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.audio_transcript.delta. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the item. |
| output_index | integer | The index of the output item in the response. |
| content_index | integer | The index of the content part in the item's content array. |
| delta | string | The transcript delta. |
response.audio_transcript.done
The server response.audio_transcript.done event is returned when the model-generated transcription of audio output is done streaming.
This event is also returned when a response is interrupted, incomplete, or canceled.
Event structure
{
"type": "response.audio_transcript.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"transcript": "<transcript>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.audio_transcript.done. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the item. |
| output_index | integer | The index of the output item in the response. |
| content_index | integer | The index of the content part in the item's content array. |
| transcript | string | The final transcript of the audio. |
response.function_call_arguments.delta
The server response.function_call_arguments.delta event is returned when the model-generated function call arguments are updated.
Event structure
{
"type": "response.function_call_arguments.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"call_id": "<call_id>",
"delta": "<delta>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.function_call_arguments.delta. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the function call item. |
| output_index | integer | The index of the output item in the response. |
| call_id | string | The ID of the function call. |
| delta | string | The arguments delta as a JSON string. |
response.function_call_arguments.done
The server response.function_call_arguments.done event is returned when the model-generated function call arguments are done streaming.
This event is also returned when a response is interrupted, incomplete, or canceled.
Event structure
{
"type": "response.function_call_arguments.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"call_id": "<call_id>",
"arguments": "<arguments>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.function_call_arguments.done. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the function call item. |
| output_index | integer | The index of the output item in the response. |
| call_id | string | The ID of the function call. |
| arguments | string | The final arguments as a JSON string. |
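
The arguments field is a JSON string, so it has to be parsed before a local function can be called. Because the done event is also returned for interrupted or canceled responses, the sketch below assumes the string might not be valid JSON in those cases and returns None rather than raising; `parse_function_call` is a hypothetical helper.

```python
import json


def parse_function_call(event: dict):
    """Return (call_id, arguments) from a response.function_call_arguments.done event.

    arguments is None when the JSON string cannot be parsed.
    """
    if event.get("type") != "response.function_call_arguments.done":
        raise ValueError("unexpected event type")
    raw = event.get("arguments") or "{}"
    try:
        return event["call_id"], json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: an interrupted response may leave the arguments truncated.
        return event["call_id"], None
```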
mcp_list_tools.in_progress
The server mcp_list_tools.in_progress event is returned when the service starts listing available tools from an MCP server.
Event structure
{
"type": "mcp_list_tools.in_progress",
"item_id": "<mcp_list_tools_item_id>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be mcp_list_tools.in_progress. |
| item_id | string | The ID of the MCP list tools item being processed. |
mcp_list_tools.completed
The server mcp_list_tools.completed event is returned when the service completes listing available tools from an MCP server.
Event structure
{
"type": "mcp_list_tools.completed",
"item_id": "<mcp_list_tools_item_id>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be mcp_list_tools.completed. |
| item_id | string | The ID of the MCP list tools item being processed. |
mcp_list_tools.failed
The server mcp_list_tools.failed event is returned when the service fails to list available tools from an MCP server.
Event structure
{
"type": "mcp_list_tools.failed",
"item_id": "<mcp_list_tools_item_id>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be mcp_list_tools.failed. |
| item_id | string | The ID of the MCP list tools item being processed. |
response.mcp_call_arguments.delta
The server response.mcp_call_arguments.delta event is returned when the model-generated MCP tool call arguments are updated.
Event structure
{
"type": "response.mcp_call_arguments.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"delta": "<delta>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.mcp_call_arguments.delta. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the MCP tool call item. |
| output_index | integer | The index of the output item in the response. |
| delta | string | The arguments delta as a JSON string. |
response.mcp_call_arguments.done
The server response.mcp_call_arguments.done event is returned when the model-generated MCP tool call arguments are done streaming.
Event structure
{
"type": "response.mcp_call_arguments.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"arguments": "<arguments>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.mcp_call_arguments.done. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the MCP tool call item. |
| output_index | integer | The index of the output item in the response. |
| arguments | string | The final arguments as a JSON string. |
response.mcp_call.in_progress
The server response.mcp_call.in_progress event is returned when an MCP tool call starts processing.
Event structure
{
"type": "response.mcp_call.in_progress",
"item_id": "<item_id>",
"output_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.mcp_call.in_progress. |
| item_id | string | The ID of the MCP tool call item. |
| output_index | integer | The index of the output item in the response. |
response.mcp_call.completed
The server response.mcp_call.completed event is returned when an MCP tool call completes successfully.
Event structure
{
"type": "response.mcp_call.completed",
"item_id": "<item_id>",
"output_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.mcp_call.completed. |
| item_id | string | The ID of the MCP tool call item. |
| output_index | integer | The index of the output item in the response. |
response.mcp_call.failed
The server response.mcp_call.failed event is returned when an MCP tool call fails.
Event structure
{
"type": "response.mcp_call.failed",
"item_id": "<item_id>",
"output_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.mcp_call.failed. |
| item_id | string | The ID of the MCP tool call item. |
| output_index | integer | The index of the output item in the response. |
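
The mcp_list_tools.* and response.mcp_call.* events share the same in_progress, completed, and failed suffixes and all carry an item_id, so one small tracker can follow both. A sketch with a hypothetical `track_mcp_status` helper:

```python
def track_mcp_status(statuses: dict, event: dict) -> dict:
    """Record the latest status of an MCP list-tools or tool-call item by item_id."""
    prefix, _, status = event.get("type", "").rpartition(".")
    if prefix in ("mcp_list_tools", "response.mcp_call") and status in ("in_progress", "completed", "failed"):
        statuses[event["item_id"]] = status
    return statuses
```

Events such as response.mcp_call_arguments.delta are ignored because their prefix doesn't match either lifecycle family.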
response.output_item.added
The server response.output_item.added event is returned when a new item is created during response generation.
Event structure
{
"type": "response.output_item.added",
"response_id": "<response_id>",
"output_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.output_item.added. |
| response_id | string | The ID of the response to which the item belongs. |
| output_index | integer | The index of the output item in the response. |
| item | RealtimeConversationResponseItem | The item that was added. |
response.output_item.done
The server response.output_item.done event is returned when an item is done streaming.
This event is also returned when a response is interrupted, incomplete, or canceled.
Event structure
{
"type": "response.output_item.done",
"response_id": "<response_id>",
"output_index": 0
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.output_item.done. |
| response_id | string | The ID of the response to which the item belongs. |
| output_index | integer | The index of the output item in the response. |
| item | RealtimeConversationResponseItem | The item that is done streaming. |
response.text.delta
The server response.text.delta event is returned when the model-generated text is updated. The text corresponds to the text content part of an assistant message item.
Event structure
{
"type": "response.text.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"delta": "<delta>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.text.delta. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the item. |
| output_index | integer | The index of the output item in the response. |
| content_index | integer | The index of the content part in the item's content array. |
| delta | string | The text delta. |
response.text.done
The server response.text.done event is returned when the model-generated text is done streaming. The text corresponds to the text content part of an assistant message item.
This event is also returned when a response is interrupted, incomplete, or canceled.
Event structure
{
"type": "response.text.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"text": "<text>"
}
Properties
| Field | Type | Description |
|---|---|---|
| type | string | The event type must be response.text.done. |
| response_id | string | The ID of the response. |
| item_id | string | The ID of the item. |
| output_index | integer | The index of the output item in the response. |
| content_index | integer | The index of the content part in the item's content array. |
| text | string | The final text content. |
Components
Audio Formats
RealtimeAudioFormat
Base audio format used for input audio.
Allowed Values:
- pcm16: 16-bit PCM audio format
- g711_ulaw: G.711 μ-law audio format
- g711_alaw: G.711 A-law audio format
RealtimeOutputAudioFormat
Audio format used for output audio with specific sampling rates.
Allowed Values:
- pcm16: 16-bit PCM audio format at the default sampling rate (24 kHz)
- pcm16_8000hz: 16-bit PCM audio format at 8 kHz sampling rate
- pcm16_16000hz: 16-bit PCM audio format at 16 kHz sampling rate
- g711_ulaw: G.711 μ-law (mu-law) audio format at 8 kHz sampling rate
- g711_alaw: G.711 A-law audio format at 8 kHz sampling rate
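
The sampling rate and sample width of each output format determine how much playback time a given number of bytes represents. The sketch below uses 2 bytes per sample for 16-bit PCM and 1 byte per sample for G.711, and assumes single-channel (mono) audio, which this reference doesn't state; `OUTPUT_FORMATS` and `duration_ms` are hypothetical names.

```python
# format name -> (sampling rate in Hz, bytes per sample)
OUTPUT_FORMATS = {
    "pcm16": (24000, 2),
    "pcm16_8000hz": (8000, 2),
    "pcm16_16000hz": (16000, 2),
    "g711_ulaw": (8000, 1),
    "g711_alaw": (8000, 1),
}


def duration_ms(audio_format: str, num_bytes: int, channels: int = 1) -> float:
    """Return the playback duration of num_bytes of audio in the given output format."""
    rate, sample_width = OUTPUT_FORMATS[audio_format]
    # channels defaults to 1 (mono), which is an assumption.
    return num_bytes * 1000 / (rate * sample_width * channels)
```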
RealtimeAudioInputTranscriptionSettings
Configuration for input audio transcription.
| Field | Type | Description |
|---|---|---|
| model | string | The transcription model. Supported with gpt-realtime and gpt-realtime-mini: whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize. Supported with all other models and agents: azure-speech. |
| language | string | Optional language code in BCP-47 (e.g., en-US) or ISO-639-1 (e.g., en) format, or a comma-separated list of languages for automatic detection (e.g., en,zh). |
| custom_speech | object | Optional configuration for custom speech models, only valid for azure-speech model. |
| phrase_list | string[] | Optional list of phrase hints to bias recognition, only valid for azure-speech model. |
| prompt | string | Optional prompt text to guide transcription, only valid for whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-transcribe-diarize models. |
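
Several of these fields are valid only with particular models. The sketch below builds a settings object and rejects the combinations the table rules out; `build_transcription_settings` is a hypothetical helper, and any server-side validation may differ.

```python
PROMPT_MODELS = {"whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-4o-transcribe-diarize"}


def build_transcription_settings(model, language=None, prompt=None, phrase_list=None, custom_speech=None):
    """Return a transcription settings dict, enforcing the per-model field rules."""
    if model != "azure-speech" and model not in PROMPT_MODELS:
        raise ValueError(f"unknown transcription model: {model}")
    if prompt is not None and model not in PROMPT_MODELS:
        raise ValueError("prompt is not valid for azure-speech")
    if (phrase_list is not None or custom_speech is not None) and model != "azure-speech":
        raise ValueError("phrase_list and custom_speech are only valid for azure-speech")
    settings = {"model": model}
    optional = {"language": language, "prompt": prompt, "phrase_list": phrase_list, "custom_speech": custom_speech}
    # Omit fields that were not provided.
    settings.update({key: value for key, value in optional.items() if value is not None})
    return settings
```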
RealtimeInputAudioNoiseReductionSettings
This can be:
- A RealtimeOpenAINoiseReduction object
- A RealtimeAzureDeepNoiseSuppression object
RealtimeOpenAINoiseReduction
OpenAI noise reduction configuration with explicit type field, only available for gpt-realtime and gpt-realtime-mini models.
| Field | Type | Description |
|---|---|---|
| type | string | near_field or far_field |
RealtimeAzureDeepNoiseSuppression
Configuration for input audio noise reduction.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "azure_deep_noise_suppression" |
RealtimeInputAudioEchoCancellationSettings
Echo cancellation configuration for server-side audio processing.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "server_echo_cancellation" |
Voice Configuration
RealtimeVoice
Union of all supported voice configurations.
This can be:
- A RealtimeOpenAIVoice object
- A RealtimeAzureVoice object
RealtimeOpenAIVoice
OpenAI voice configuration with explicit type field.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "openai" |
| name | string | OpenAI voice name: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar |
RealtimeAzureVoice
Base for Azure voice configurations. This is a discriminated union with different types:
RealtimeAzureCustomVoice
Azure custom voice configuration (preferred for custom voices).
| Field | Type | Description |
|---|---|---|
| type | string | Must be "azure-custom" |
| name | string | Voice name (cannot be empty) |
| endpoint_id | string | Endpoint ID (cannot be empty) |
| temperature | number | Optional. Temperature between 0.0 and 1.0 |
| custom_lexicon_url | string | Optional. URL to custom lexicon |
| prefer_locales | string[] | Optional. Preferred locales, which change the accent used for each language. If the value isn't set, TTS uses the default accent of each language. For example, it speaks English with an American English accent and Spanish with a Mexican Spanish accent. If prefer_locales is set to ["en-GB", "es-ES"], English is spoken with a British English accent and Spanish with a European Spanish accent. TTS can still speak other languages, such as French or Chinese. |
| locale | string | Optional. Enforces the locale for TTS output. If set, TTS always speaks in the given locale. For example, with locale set to en-US, TTS always uses the American English accent, even if the text content is in another language, and it outputs silence if the text content is in Chinese. |
| style | string | Optional. Voice style |
| pitch | string | Optional. Pitch adjustment |
| rate | string | Optional. Speech rate adjustment |
| volume | string | Optional. Volume adjustment |
Example:
{
"type": "azure-custom",
"name": "my-custom-voice",
"endpoint_id": "12345678-1234-1234-1234-123456789012",
"temperature": 0.7,
"style": "cheerful",
"locale": "en-US"
}
RealtimeAzureStandardVoice
Azure standard voice configuration.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "azure-standard" |
| name | string | Voice name (cannot be empty) |
| temperature | number | Optional. Temperature between 0.0 and 1.0 |
| custom_lexicon_url | string | Optional. URL to custom lexicon |
| prefer_locales | string[] | Optional. Preferred locales |
| locale | string | Optional. Locale specification |
| style | string | Optional. Voice style |
| pitch | string | Optional. Pitch adjustment |
| rate | string | Optional. Speech rate adjustment |
| volume | string | Optional. Volume adjustment |
RealtimeAzurePersonalVoice
Azure personal voice configuration.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "azure-personal" |
| name | string | Voice name (cannot be empty) |
| temperature | number | Optional. Temperature between 0.0 and 1.0 |
| model | string | Underlying neural model: DragonLatestNeural, PhoenixLatestNeural, PhoenixV2Neural |
Turn Detection
RealtimeTurnDetection
Configuration for turn detection. This is a discriminated union supporting multiple VAD types.
RealtimeServerVAD
Base VAD-based turn detection.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "server_vad" |
| threshold | number | Optional. Activation threshold (0.0-1.0) |
| prefix_padding_ms | integer | Optional. Audio padding before speech starts |
| silence_duration_ms | integer | Optional. Silence duration to detect speech end |
| end_of_utterance_detection | RealtimeEOUDetection | Optional. End-of-utterance detection config |
| create_response | boolean | Optional. Enable or disable whether a response is generated. |
| interrupt_response | boolean | Optional. Enable or disable barge-in interruption (default: false) |
| auto_truncate | boolean | Optional. Auto-truncate on interruption (default: false) |
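
The following sketch builds a RealtimeServerVAD object from the fields in the table and checks the documented threshold range before anything is sent. `server_vad` is a hypothetical helper, and end_of_utterance_detection is omitted for brevity.

```python
def server_vad(threshold=None, prefix_padding_ms=None, silence_duration_ms=None,
               create_response=None, interrupt_response=None, auto_truncate=None) -> dict:
    """Return a server_vad turn detection object containing only the fields that were set."""
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    optional = {
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
        "create_response": create_response,
        "interrupt_response": interrupt_response,
        "auto_truncate": auto_truncate,
    }
    config = {"type": "server_vad"}
    # Unset fields are left out so that the service defaults apply.
    config.update({key: value for key, value in optional.items() if value is not None})
    return config
```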
RealtimeOpenAISemanticVAD
OpenAI semantic VAD configuration which uses a model to determine when the user has finished speaking. Only available for gpt-realtime and gpt-realtime-mini models.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "semantic_vad" |
| eagerness | string | Optional. Controls how eager the model is to interrupt the user by tuning the maximum wait timeout. In transcription mode, even if the model doesn't reply, it affects how the audio is chunked. Allowed values: auto (default), which is equivalent to medium; low, which lets the user take their time to speak; high, which chunks the audio as soon as possible. Set eagerness to high if you want the model to respond more often in conversation mode or to return transcription events faster in transcription mode. Set eagerness to low if you want to let the user speak uninterrupted in conversation mode or if you want larger transcript chunks in transcription mode. |
| create_response | boolean | Optional. Enable or disable whether a response is generated. |
| interrupt_response | boolean | Optional. Enable or disable barge-in interruption (default: false) |
RealtimeAzureSemanticVAD
Azure semantic VAD, which determines when the user starts and stops speaking by using a semantic speech model, providing more robust detection in noisy environments.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "azure_semantic_vad" |
| threshold | number | Optional. Activation threshold |
| prefix_padding_ms | integer | Optional. Audio padding before speech |
| silence_duration_ms | integer | Optional. Silence duration for speech end |
| end_of_utterance_detection | RealtimeEOUDetection | Optional. EOU detection config |
| speech_duration_ms | integer | Optional. Minimum speech duration |
| remove_filler_words | boolean | Optional. Remove filler words (default: false) |
| languages | string[] | Optional. Supports English. Other languages will be ignored. |
| create_response | boolean | Optional. Enable or disable whether a response is generated. |
| interrupt_response | boolean | Optional. Enable or disable barge-in interruption (default: false) |
| auto_truncate | boolean | Optional. Auto-truncate on interruption (default: false) |
RealtimeAzureSemanticVADMultilingual
Azure semantic VAD (multilingual variant).
| Field | Type | Description |
|---|---|---|
| type | string | Must be "azure_semantic_vad_multilingual" |
| threshold | number | Optional. Activation threshold |
| prefix_padding_ms | integer | Optional. Audio padding before speech |
| silence_duration_ms | integer | Optional. Silence duration for speech end |
| end_of_utterance_detection | RealtimeEOUDetection | Optional. EOU detection config |
| speech_duration_ms | integer | Optional. Minimum speech duration |
| remove_filler_words | boolean | Optional. Remove filler words (default: false). |
| languages | string[] | Optional. Supports English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi. Other languages will be ignored. |
| create_response | boolean | Optional. Enable or disable whether a response is generated. |
| interrupt_response | boolean | Optional. Enable or disable barge-in interruption (default: false) |
| auto_truncate | boolean | Optional. Auto-truncate on interruption (default: false) |
RealtimeEOUDetection
Azure end-of-utterance (EOU) detection indicates when the end user has stopped speaking while allowing for natural pauses. EOU detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency.
| Field | Type | Description |
|---|---|---|
| model | string | The EOU model. Either semantic_detection_v1, which supports English, or semantic_detection_v1_multilingual, which supports English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, and Hindi. |
| threshold_level | string | Optional. Detection threshold level: low, medium, high, or default. The default value is equivalent to medium. With a lower setting, the probability that the sentence is complete will be higher. |
| timeout_ms | number | Optional. Maximum time in milliseconds to wait for more user speech. Defaults to 1000 ms. |
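The following Python sketch shows one way to combine a RealtimeEOUDetection object with a multilingual Azure semantic VAD configuration. The helper names are hypothetical, and the language code format (`"en"`, `"es"`) is an assumption, since the tables above don't specify it.

```python
EOU_MODELS = ("semantic_detection_v1", "semantic_detection_v1_multilingual")
THRESHOLD_LEVELS = ("low", "medium", "high", "default")


def eou_detection(model, threshold_level="default", timeout_ms=1000):
    """Build a RealtimeEOUDetection dict after checking the documented enum values."""
    if model not in EOU_MODELS:
        raise ValueError(f"unknown EOU model: {model!r}")
    if threshold_level not in THRESHOLD_LEVELS:
        raise ValueError(f"unknown threshold level: {threshold_level!r}")
    return {"model": model, "threshold_level": threshold_level, "timeout_ms": timeout_ms}


def azure_multilingual_vad(languages, eou=None):
    """Build a RealtimeAzureSemanticVADMultilingual dict with optional EOU detection."""
    config = {"type": "azure_semantic_vad_multilingual", "languages": list(languages)}
    if eou is not None:
        config["end_of_utterance_detection"] = eou
    return config


# Language codes below are assumed values for illustration.
turn_detection = azure_multilingual_vad(
    ["en", "es"],
    eou=eou_detection("semantic_detection_v1_multilingual", timeout_ms=1500),
)
```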
Avatar Configuration
RealtimeAvatarConfig
Configuration for avatar streaming and behavior.
| Field | Type | Description |
|---|---|---|
| ice_servers | RealtimeIceServer[] | Optional. ICE servers for WebRTC |
| character | string | Character name or ID for the avatar |
| style | string | Optional. Avatar style (emotional tone, speaking style) |
| customized | boolean | Whether the avatar is customized |
| video | RealtimeVideoParams | Optional. Video configuration |
RealtimeIceServer
ICE server configuration for WebRTC connection negotiation.
| Field | Type | Description |
|---|---|---|
| urls | string[] | ICE server URLs (TURN or STUN endpoints) |
| username | string | Optional. Username for authentication |
| credential | string | Optional. Credential for authentication |
RealtimeVideoParams
Video streaming parameters for avatar.
| Field | Type | Description |
|---|---|---|
| bitrate | integer | Optional. Bitrate in bits per second (default: 2000000) |
| codec | string | Optional. Video codec, currently only h264 (default: h264) |
| crop | RealtimeVideoCrop | Optional. Cropping settings |
| resolution | RealtimeVideoResolution | Optional. Resolution settings |
RealtimeVideoCrop
Video crop rectangle definition.
| Field | Type | Description |
|---|---|---|
| top_left | integer[] | Top-left corner [x, y], non-negative integers |
| bottom_right | integer[] | Bottom-right corner [x, y], non-negative integers |
RealtimeVideoResolution
Video resolution specification.
| Field | Type | Description |
|---|---|---|
| width | integer | Width in pixels (must be > 0) |
| height | integer | Height in pixels (must be > 0) |
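To illustrate how the avatar objects nest, here is a small Python sketch that builds a RealtimeAvatarConfig with video parameters and enforces the constraints stated in the tables (positive resolution, non-negative crop coordinates). The function name `video_params` and the character name `example-character` are placeholders, not values defined by the API.

```python
def video_params(width, height, top_left=None, bottom_right=None, bitrate=2_000_000):
    """Build a RealtimeVideoParams dict with resolution and an optional crop rectangle."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be greater than 0")
    params = {
        "bitrate": bitrate,
        "codec": "h264",  # the only codec listed in the table
        "resolution": {"width": width, "height": height},
    }
    if top_left is not None and bottom_right is not None:
        for corner in (top_left, bottom_right):
            if len(corner) != 2 or any(value < 0 for value in corner):
                raise ValueError("crop corners must be [x, y] with non-negative integers")
        params["crop"] = {"top_left": list(top_left), "bottom_right": list(bottom_right)}
    return params


avatar = {
    "character": "example-character",  # placeholder character name
    "customized": False,
    "video": video_params(1280, 720, top_left=[0, 0], bottom_right=[1280, 720]),
}
```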
Animation Configuration
RealtimeAnimation
Configuration for animation outputs including blendshapes and visemes.
| Field | Type | Description |
|---|---|---|
| model_name | string | Optional. Animation model name (default: "default") |
| outputs | RealtimeAnimationOutputType[] | Optional. Output types (default: ["blendshapes"]) |
RealtimeAnimationOutputType
Types of animation data to output.
Allowed Values:
- blendshapes: Facial blendshapes data
- viseme_id: Viseme identifier data
Session Configuration
RealtimeRequestSession
Session configuration object used in session.update events.
| Field | Type | Description |
|---|---|---|
| model | string | Optional. Model name to use |
| modalities | RealtimeModality[] | Optional. The supported modalities for the session. For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"]. To enable avatar output, set "modalities": ["text", "audio", "avatar"]. You can't enable only audio. |
| animation | RealtimeAnimation | Optional. Animation configuration |
| voice | RealtimeVoice | Optional. Voice configuration |
| instructions | string | Optional. System instructions for the model. The instructions can guide the output audio if OpenAI voices are used, but might not apply to Azure voices. |
| input_audio_sampling_rate | integer | Optional. Input audio sampling rate in Hz (default: 24000 for pcm16, 8000 for g711_ulaw and g711_alaw) |
| input_audio_format | RealtimeAudioFormat | Optional. Input audio format (default: pcm16) |
| output_audio_format | RealtimeOutputAudioFormat | Optional. Output audio format (default: pcm16) |
| input_audio_noise_reduction | RealtimeInputAudioNoiseReductionSettings | Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio. This property is nullable. |
| input_audio_echo_cancellation | RealtimeInputAudioEchoCancellationSettings | Configuration for input audio echo cancellation. This can be set to null to turn off. This service side echo cancellation can help improve the quality of the input audio by reducing the impact of echo and reverberation. This property is nullable. |
| input_audio_transcription | RealtimeAudioInputTranscriptionSettings | The configuration for input audio transcription. The configuration is null (off) by default. Input audio transcription isn't native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. For additional guidance to the transcription service, the client can optionally set the language and prompt for transcription. This property is nullable. |
| turn_detection | RealtimeTurnDetection | The turn detection settings for the session. This can be set to null to turn off. |
| tools | array of RealtimeTool | The tools available to the model for the session. |
| tool_choice | RealtimeToolChoice | The tool choice for the session. Allowed values: auto, none, and required. Otherwise, you can specify the name of the function to use. |
| temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
| max_response_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000. To allow the maximum number of tokens, set "max_response_output_tokens": "inf". Defaults to "inf". |
| avatar | RealtimeAvatarConfig | Optional. Avatar configuration |
| output_audio_timestamp_types | RealtimeAudioTimestampType[] | Optional. Timestamp types for output audio |
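The constraints in this table can be checked on the client before sending. The sketch below builds a session.update event in Python and validates the modalities, temperature, and token limit as documented above. The function name `session_update` is hypothetical, and placing the configuration under a `session` key is an assumption about the event envelope, which this section doesn't show.

```python
import json


def session_update(modalities=("text", "audio"), temperature=0.8,
                   max_response_output_tokens="inf", **extra):
    """Build a session.update event from RealtimeRequestSession fields."""
    if set(modalities) == {"audio"}:
        raise ValueError("audio-only sessions aren't supported; include 'text'")
    if not 0.6 <= temperature <= 1.2:
        raise ValueError("temperature must be within [0.6, 1.2]")
    limit = max_response_output_tokens
    if limit != "inf" and not (isinstance(limit, int) and 1 <= limit <= 4096):
        raise ValueError('max_response_output_tokens must be 1..4096 or "inf"')
    session = {
        "modalities": list(modalities),
        "temperature": temperature,
        "max_response_output_tokens": limit,
        **extra,  # any other RealtimeRequestSession fields, passed through unchanged
    }
    # The "session" envelope key is an assumption.
    return {"type": "session.update", "session": session}


event = session_update(
    max_response_output_tokens=1000,
    instructions="be succinct",
    turn_detection=None,  # null turns off turn detection
)
payload = json.dumps(event)  # JSON text to send over the WebSocket connection
```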
RealtimeModality
Supported session modalities.
Allowed Values:
- text: Text input/output
- audio: Audio input/output
- animation: Animation output
- avatar: Avatar video output
RealtimeAudioTimestampType
Output timestamp types supported in audio response content.
Allowed Values:
- word: Timestamps per word in the output audio
Tool Configuration
Two types of tools are supported: function calling tools and MCP tools, which let you connect to an MCP server.
RealtimeTool
Tool definition for function calling.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "function" |
| name | string | Function name |
| description | string | Function description and usage guidelines |
| parameters | object | Function parameters as JSON schema object |
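As an illustration, the following Python sketch defines a function tool and a matching tool choice value. The tool name `get_current_time` and its `timezone` parameter are hypothetical. The `tool_choice` helper follows the RealtimeToolChoice definition: a strategy string, or an object that names a specific function.

```python
def function_tool(name, description, parameters):
    """Build a RealtimeTool dict for function calling."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": parameters,  # JSON schema object
    }


def tool_choice(strategy_or_name):
    """Return a strategy string as-is, or wrap a function name in the object form."""
    if strategy_or_name in ("auto", "none", "required"):
        return strategy_or_name
    return {"type": "function", "name": strategy_or_name}


# Hypothetical tool used only for illustration.
time_tool = function_tool(
    "get_current_time",
    "Use this function to get the current time.",
    {
        "type": "object",
        "properties": {"timezone": {"type": "string"}},
        "required": [],
    },
)
session_fields = {"tools": [time_tool], "tool_choice": tool_choice("get_current_time")}
```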
RealtimeToolChoice
Tool selection strategy.
This can be:
- "auto": Let the model choose
- "none": Don't use tools
- "required": Must use a tool
- { "type": "function", "name": "function_name" }: Use a specific function
MCPTool
MCP tool configuration.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "mcp" |
| server_label | string | Required. The label of the MCP server. |
| server_url | string | Required. The server URL of the MCP server. |
| allowed_tools | string[] | Optional. The list of allowed tool names. If not specified, all tools are allowed. |
| headers | object | Optional. Additional headers to include in MCP requests. |
| authorization | string | Optional. Authorization token for MCP requests. |
| require_approval | string or dictionary | Optional. If set to a string, the value must be never or always. If set to a dictionary, it must be in the format {"never": ["<tool_name_1>", "<tool_name_2>"], "always": ["<tool_name_3>"]}. The default value is always. When set to always, tool execution requires approval: an mcp_approval_request is sent to the client once the MCP call arguments are complete, and the tool is executed only after an mcp_approval_response with approve=true is received. When set to never, the tool is executed automatically without approval. |
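The two shapes of require_approval can be resolved with a small helper. The Python sketch below is a client-side illustration only; the function name `needs_approval` is hypothetical, and treating a tool that appears in neither list of the dictionary form as requiring approval is an assumption based on the documented default of always.

```python
def needs_approval(require_approval, tool_name):
    """Return True if calling tool_name is expected to trigger an approval request."""
    if require_approval is None:
        return True  # documented default is "always"
    if isinstance(require_approval, str):
        if require_approval not in ("never", "always"):
            raise ValueError('require_approval must be "never" or "always"')
        return require_approval == "always"
    if tool_name in require_approval.get("never", []):
        return False
    # Tools in "always", or in neither list, are assumed to require approval.
    return True


mcp_tool = {
    "type": "mcp",
    "server_label": "example-server",          # placeholder label
    "server_url": "https://example.com/mcp",   # placeholder URL
    "require_approval": {"never": ["search"], "always": ["delete_record"]},
}
```

With this configuration, `needs_approval(mcp_tool["require_approval"], "search")` returns `False`, while `delete_record` still requires approval.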
RealtimeConversationResponseItem
This is a union type that can be one of the following:
RealtimeConversationUserMessageItem
User message item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "message" |
| object | string | Must be "conversation.item" |
| role | string | Must be "user" |
| content | RealtimeInputTextContentPart | The content of the message. |
| status | RealtimeItemStatus | The status of the item. |
RealtimeConversationAssistantMessageItem
Assistant message item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "message" |
| object | string | Must be "conversation.item" |
| role | string | Must be "assistant" |
| content | RealtimeOutputTextContentPart[] or RealtimeOutputAudioContentPart[] | The content of the message. |
| status | RealtimeItemStatus | The status of the item. |
RealtimeConversationSystemMessageItem
System message item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "message" |
| object | string | Must be "conversation.item" |
| role | string | Must be "system" |
| content | RealtimeInputTextContentPart[] | The content of the message. |
| status | RealtimeItemStatus | The status of the item. |
RealtimeConversationFunctionCallItem
Function call request item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "function_call" |
| object | string | Must be "conversation.item" |
| name | string | The name of the function to call. |
| arguments | string | The arguments for the function call as a JSON string. |
| call_id | string | The unique ID of the function call. |
| status | RealtimeItemStatus | The status of the item. |
RealtimeConversationFunctionCallOutputItem
Function call response item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "function_call_output" |
| object | string | Must be "conversation.item" |
| name | string | The name of the function that was called. |
| output | string | The output of the function call. |
| call_id | string | The unique ID of the function call. |
| status | RealtimeItemStatus | The status of the item. |
RealtimeConversationMCPListToolsItem
MCP list tools response item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "mcp_list_tools" |
| server_label | string | The label of the MCP server. |
RealtimeConversationMCPCallItem
MCP call response item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "mcp_call" |
| server_label | string | The label of the MCP server. |
| name | string | The name of the tool to call. |
| approval_request_id | string | The approval request ID for the MCP call. |
| arguments | string | The arguments for the MCP call. |
| output | string | The output of the MCP call. |
| error | object | The error details if the MCP call failed. |
RealtimeConversationMCPApprovalRequestItem
MCP approval request item.
| Field | Type | Description |
|---|---|---|
| id | string | The unique ID of the item. |
| type | string | Must be "mcp_approval_request" |
| server_label | string | The label of the MCP server. |
| name | string | The name of the tool to call. |
| arguments | string | The arguments for the MCP call. |
RealtimeItemStatus
Status of conversation items.
Allowed Values:
- in_progress: Currently being processed
- completed: Successfully completed
- incomplete: Incomplete (interrupted or failed)
RealtimeContentPart
Content part within a message.
RealtimeInputTextContentPart
Text content part.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "input_text" |
| text | string | The text content |
RealtimeOutputTextContentPart
Text content part.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "text" |
| text | string | The text content |
RealtimeInputAudioContentPart
Audio content part.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "input_audio" |
| audio | string | Optional. Base64-encoded audio data |
| transcript | string | Optional. Audio transcript |
RealtimeOutputAudioContentPart
Audio content part.
| Field | Type | Description |
|---|---|---|
| type | string | Must be "audio" |
| audio | string | Base64-encoded audio data |
| transcript | string | Optional. Audio transcript |
Response Objects
RealtimeResponse
Response object representing a model inference response.
| Field | Type | Description |
|---|---|---|
| id | string | Optional. Response ID |
| object | string | Optional. Always "realtime.response" |
| status | RealtimeResponseStatus | Optional. Response status |
| status_details | RealtimeResponseStatusDetails | Optional. Status details |
| output | RealtimeConversationResponseItem[] | Optional. Output items |
| usage | RealtimeUsage | Optional. Token usage statistics |
| conversation_id | string | Optional. Associated conversation ID |
| voice | RealtimeVoice | Optional. Voice used for response |
| modalities | string[] | Optional. Modalities used |
| output_audio_format | RealtimeOutputAudioFormat | Optional. Audio format used |
| temperature | number | Optional. Temperature used |
| max_response_output_tokens | integer or "inf" | Optional. Max tokens used |
RealtimeResponseStatus
Response status values.
Allowed Values:
- in_progress: Response is being generated
- completed: Response completed successfully
- cancelled: Response was cancelled
- incomplete: Response incomplete (interrupted)
- failed: Response failed with error
RealtimeUsage
Token usage statistics.
| Field | Type | Description |
|---|---|---|
| total_tokens | integer | Total tokens used |
| input_tokens | integer | Input tokens used |
| output_tokens | integer | Output tokens generated |
| input_token_details | TokenDetails | Breakdown of input tokens |
| output_token_details | TokenDetails | Breakdown of output tokens |
TokenDetails
Detailed token usage breakdown.
| Field | Type | Description |
|---|---|---|
| cached_tokens | integer | Optional. Cached tokens used |
| text_tokens | integer | Optional. Text tokens used |
| audio_tokens | integer | Optional. Audio tokens used |
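Because the TokenDetails fields are optional, client code that reads usage should tolerate missing keys. The following Python sketch flattens a RealtimeUsage object into a summary; the function name `summarize_usage` and the sample numbers are made up for illustration.

```python
def summarize_usage(usage):
    """Flatten a RealtimeUsage dict, treating missing optional fields as 0."""
    input_details = usage.get("input_token_details") or {}
    output_details = usage.get("output_token_details") or {}
    return {
        "total": usage.get("total_tokens", 0),
        "input": usage.get("input_tokens", 0),
        "output": usage.get("output_tokens", 0),
        "input_cached": input_details.get("cached_tokens", 0),
        "input_audio": input_details.get("audio_tokens", 0),
        "output_text": output_details.get("text_tokens", 0),
        "output_audio": output_details.get("audio_tokens", 0),
    }


# Illustrative values only, not real measurements.
sample_usage = {
    "total_tokens": 150,
    "input_tokens": 100,
    "output_tokens": 50,
    "input_token_details": {"audio_tokens": 80, "text_tokens": 20},
    "output_token_details": {"audio_tokens": 40, "text_tokens": 10},
}
summary = summarize_usage(sample_usage)
```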
Error Handling
RealtimeErrorDetails
Error information object.
| Field | Type | Description |
|---|---|---|
| type | string | Error type (e.g., "invalid_request_error", "server_error") |
| code | string | Optional. Specific error code |
| message | string | Human-readable error description |
| param | string | Optional. Parameter related to the error |
| event_id | string | Optional. ID of the client event that caused the error |
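For logging, a RealtimeErrorDetails object can be rendered as a single line. This Python sketch includes the optional fields only when they're present; the function name `describe_error` and the sample values are hypothetical.

```python
def describe_error(error):
    """Render a RealtimeErrorDetails dict as one log line."""
    parts = [f"[{error['type']}]"]
    for field in ("code", "param", "event_id"):
        if error.get(field):  # optional fields may be absent or null
            parts.append(f"{field}={error[field]}")
    parts.append(error["message"])
    return " ".join(parts)


line = describe_error({
    "type": "invalid_request_error",
    "message": "Temperature out of range.",  # sample message for illustration
    "param": "session.temperature",
    "event_id": "evt_123",
})
```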
RealtimeConversationRequestItem
You use the RealtimeConversationRequestItem object to create a new item in the conversation via the conversation.item.create event.
This is a union type that can be one of the following:
RealtimeSystemMessageItem
A system message item.
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: message |
| role | string | The role of the message. Allowed values: system |
| content | array of RealtimeInputTextContentPart | The content of the message. |
| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
RealtimeUserMessageItem
A user message item.
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: message |
| role | string | The role of the message. Allowed values: user |
| content | array of RealtimeInputTextContentPart or RealtimeInputAudioContentPart | The content of the message. |
| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
RealtimeAssistantMessageItem
An assistant message item.
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: message |
| role | string | The role of the message. Allowed values: assistant |
| content | array of RealtimeOutputTextContentPart | The content of the message. |
RealtimeFunctionCallItem
A function call item.
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: function_call |
| name | string | The name of the function to call. |
| arguments | string | The arguments of the function call as a JSON string. |
| call_id | string | The ID of the function call item. |
| id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
RealtimeFunctionCallOutputItem
A function call output item.
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: function_call_output |
| call_id | string | The ID of the function call item. |
| output | string | The output of the function call. This is a free-form string containing the function result, and it can be empty. |
| id | string | The unique ID of the item. If the client doesn't provide an ID, the server generates one. |
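The function call round trip can be sketched as follows: take a function call item from a response, run a local handler, and return the result as a RealtimeFunctionCallOutputItem through conversation.item.create. This is a Python sketch under assumptions: the `item` envelope key isn't shown in this section, and the handler `get_weather` and its return value are hypothetical.

```python
import json


def answer_function_call(call_item, handlers):
    """Run the local handler for a function call item and build the reply event."""
    handler = handlers[call_item["name"]]
    arguments = json.loads(call_item["arguments"] or "{}")  # arguments is a JSON string
    result = handler(**arguments)
    return {
        "type": "conversation.item.create",
        "item": {  # the "item" envelope key is an assumption
            "type": "function_call_output",
            "call_id": call_item["call_id"],  # must match the originating call
            "output": json.dumps(result),     # output is a free-form string
        },
    }


def get_weather(city):
    """Hypothetical handler returning fixed data for illustration."""
    return {"city": city, "temp_c": 21}


call_item = {
    "type": "function_call",
    "name": "get_weather",
    "call_id": "call_abc",
    "arguments": '{"city": "Paris"}',
}
reply = answer_function_call(call_item, {"get_weather": get_weather})
```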
RealtimeMCPApprovalResponseItem
An MCP approval response item.
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: mcp_approval_response |
| approve | boolean | Whether the MCP request is approved. |
| approval_request_id | string | The ID of the MCP approval request. |
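A client might answer approval requests with a simple allowlist policy. The Python sketch below builds a RealtimeMCPApprovalResponseItem from an approval request item. It assumes that approval_request_id should carry the `id` of the mcp_approval_request item; the function name and the tool names are hypothetical.

```python
def approval_response(request_item, allowed_tools):
    """Approve an MCP approval request only if its tool is on the allowlist."""
    return {
        "type": "mcp_approval_response",
        # Assumption: the approval request ID is the request item's own ID.
        "approval_request_id": request_item["id"],
        "approve": request_item["name"] in allowed_tools,
    }


request_item = {
    "id": "item_001",
    "type": "mcp_approval_request",
    "server_label": "example-server",  # placeholder label
    "name": "search",
    "arguments": '{"query": "weather"}',
}
response_item = approval_response(request_item, allowed_tools={"search"})
```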
RealtimeFunctionTool
The definition of a function tool as used by the realtime endpoint.
| Field | Type | Description |
|---|---|---|
| type | string | The type of the tool. Allowed values: function |
| name | string | The name of the function. |
| description | string | The description of the function, including usage guidelines. For example, "Use this function to get the current time." |
| parameters | object | The parameters of the function in the form of a JSON object. |
RealtimeItemStatus
Allowed Values:
- in_progress
- completed
- incomplete
RealtimeResponseAudioContentPart
| Field | Type | Description |
|---|---|---|
| type | string | The type of the content part. Allowed values: audio |
| transcript | string | The transcript of the audio. This property is nullable. |
RealtimeResponseFunctionCallItem
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: function_call |
| name | string | The name of the function call item. |
| call_id | string | The ID of the function call item. |
| arguments | string | The arguments of the function call item. |
| status | RealtimeItemStatus | The status of the item. |
RealtimeResponseFunctionCallOutputItem
| Field | Type | Description |
|---|---|---|
| type | string | The type of the item. Allowed values: function_call_output |
| call_id | string | The ID of the function call item. |
| output | string | The output of the function call item. |
RealtimeResponseOptions
| Field | Type | Description |
|---|---|---|
| modalities | array | The modalities that the session supports. Allowed values: text, audio. For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"]. You can't enable only audio. |
| instructions | string | The instructions (the system message) to guide the model's responses. |
| voice | RealtimeVoice | The voice used for the model response for the session. Once the voice is used in the session for the model's audio response, it can't be changed. |
| tools | array of RealtimeTool | The tools available to the model for the session. |
| tool_choice | RealtimeToolChoice | The tool choice for the session. |
| temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
| max_response_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000. To allow the maximum number of tokens, set "max_response_output_tokens": "inf". Defaults to "inf". |
| conversation | string | Controls which conversation the response is added to. The supported values are auto and none. The auto value (or not setting this property) ensures that the contents of the response are added to the session's default conversation. Set this property to none to create an out-of-band response where items won't be added to the default conversation. Defaults to "auto". |
| metadata | map | Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys can be a maximum of 64 characters long and values can be a maximum of 512 characters long. For example: metadata: { topic: "classification" } |
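The metadata limits above are easy to check before sending. This Python sketch validates a metadata map and builds RealtimeResponseOptions for an out-of-band, text-only response. The function names are hypothetical, and the limits are applied to string lengths as stated in the table.

```python
def validate_metadata(metadata):
    """Enforce the documented limits: 16 pairs, 64-char keys, 512-char values."""
    if len(metadata) > 16:
        raise ValueError("metadata supports at most 16 key-value pairs")
    for key, value in metadata.items():
        if len(key) > 64:
            raise ValueError(f"metadata key too long: {key[:20]!r}...")
        if len(str(value)) > 512:
            raise ValueError(f"metadata value too long for key {key!r}")
    return metadata


def out_of_band_options(instructions, metadata):
    """Build response options whose output isn't added to the default conversation."""
    return {
        "conversation": "none",
        "modalities": ["text"],
        "instructions": instructions,
        "metadata": validate_metadata(metadata),
    }


options = out_of_band_options(
    "Classify the topic of the conversation so far.",
    {"topic": "classification"},
)
```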
RealtimeResponseSession
The RealtimeResponseSession object represents a session in the Realtime API. It's used in some of the server events.
| Field | Type | Description |
|---|---|---|
| object | string | The session object. Allowed values: realtime.session |
| id | string | The unique ID of the session. |
| model | string | The model used for the session. |
| modalities | array | The modalities that the session supports. Allowed values: text, audio. For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"]. You can't enable only audio. |
| instructions | string | The instructions (the system message) to guide the model's text and audio responses. Example instructions to guide the content and format of text and audio responses: "instructions": "be succinct", "instructions": "act friendly", "instructions": "here are examples of good responses". Example instructions to guide audio behavior: "instructions": "talk quickly", "instructions": "inject emotion into your voice", "instructions": "laugh frequently". While the model might not always follow these instructions, they provide guidance on the desired behavior. |
| voice | RealtimeVoice | The voice used for the model response for the session. Once the voice is used in the session for the model's audio response, it can't be changed. |
| input_audio_sampling_rate | integer | The sampling rate for the input audio. |
| input_audio_format | RealtimeAudioFormat | The format for the input audio. |
| output_audio_format | RealtimeAudioFormat | The format for the output audio. |
| input_audio_transcription | RealtimeAudioInputTranscriptionSettings | The settings for audio input transcription. This property is nullable. |
| turn_detection | RealtimeTurnDetection | The turn detection settings for the session. This property is nullable. |
| tools | array of RealtimeTool | The tools available to the model for the session. |
| tool_choice | RealtimeToolChoice | The tool choice for the session. |
| temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
| max_response_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000. To allow the maximum number of tokens, set "max_response_output_tokens": "inf". |
RealtimeResponseStatusDetails
| Field | Type | Description |
|---|---|---|
| type | RealtimeResponseStatus | The status of the response. |
RealtimeRateLimitsItem
| Field | Type | Description |
|---|---|---|
| name | string | The rate limit property name that this item includes information about. |
| limit | integer | The maximum configured limit for this rate limit property. |
| remaining | integer | The remaining quota available against the configured limit for this rate limit property. |
| reset_seconds | number | The remaining time, in seconds, until this rate limit property is reset. |
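One possible client-side use of these items is to decide how long to back off. The Python sketch below returns the longest reset time among exhausted limits, or 0 when quota remains everywhere. The function name and the sample limit names (`requests`, `tokens`) are illustrative assumptions.

```python
def seconds_until_available(rate_limits):
    """Return how long to wait until every exhausted rate limit has reset."""
    exhausted = [
        item["reset_seconds"]
        for item in rate_limits
        if item["remaining"] <= 0
    ]
    return max(exhausted, default=0.0)


# Sample RealtimeRateLimitsItem values for illustration only.
rate_limits = [
    {"name": "requests", "limit": 100, "remaining": 0, "reset_seconds": 1.5},
    {"name": "tokens", "limit": 1000, "remaining": 200, "reset_seconds": 30.0},
]
wait = seconds_until_available(rate_limits)
```

Here only the `requests` limit is exhausted, so the wait is driven by its reset time rather than by the longer `tokens` window.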
Related Resources
- Try the Voice live quickstart
- Try the Voice live agents quickstart
- Learn more about How to use the Voice live API