Voice live API Reference

The Voice live API provides real-time, bidirectional communication for voice-enabled applications using WebSocket connections. This API supports advanced features including speech recognition, text-to-speech synthesis, avatar streaming, animation data, and comprehensive audio processing capabilities.

The API uses JSON-formatted events sent over WebSocket connections to manage conversations, audio streams, avatar interactions, and real-time responses. Events are categorized into client events (sent from client to server) and server events (sent from server to client).
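Because every message in either direction is a JSON object with a `type` field, client code typically reduces to serializing event dicts before sending and dispatching on `type` after receiving. The sketch below shows only that serialization step; the `build_event` helper is illustrative, not part of the API, and the actual WebSocket connection (endpoint URL, authentication) is omitted.

```python
import json

# Illustrative helper: build a client event as a JSON string ready to send
# over the WebSocket connection. Event names match the tables below.
def build_event(event_type: str, **fields) -> str:
    return json.dumps({"type": event_type, **fields})

# Example: a session.update event configuring voice and turn detection.
session_update = build_event(
    "session.update",
    session={
        "modalities": ["text", "audio"],
        "voice": {"type": "openai", "name": "alloy"},
        "turn_detection": {"type": "azure_semantic_vad"},
    },
)

# Received server events are parsed the same way and dispatched on "type".
parsed = json.loads(session_update)
```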

Key Features

  • Real-time Audio Processing: Support for multiple audio formats including PCM16 at various sample rates and G.711 codecs
  • Advanced Voice Options: OpenAI voices, Azure custom voices, Azure standard voices, and Azure personal voices
  • Avatar Integration: WebRTC-based avatar streaming with video, animation, and blendshapes
  • Intelligent Turn Detection: Multiple VAD options including Azure semantic VAD and server-side detection
  • Audio Enhancement: Built-in noise reduction and echo cancellation
  • Function Calling: Tool integration for enhanced conversational capabilities
  • Flexible Session Management: Configurable modalities, instructions, and response parameters

Client Events

The Voice live API supports the following client events that can be sent from the client to the server:

Event Description
session.update Update the session configuration including voice, modalities, turn detection, and other settings
session.avatar.connect Establish avatar connection by providing client SDP for WebRTC negotiation
input_audio_buffer.append Append audio bytes to the input audio buffer
input_audio_buffer.commit Commit the input audio buffer for processing
input_audio_buffer.clear Clear the input audio buffer
conversation.item.create Add a new item to the conversation context
conversation.item.retrieve Retrieve a specific item from the conversation
conversation.item.truncate Truncate an assistant audio message
conversation.item.delete Remove an item from the conversation
response.create Instruct the server to create a response via model inference
response.cancel Cancel an in-progress response
mcp_approval_response Send approval or rejection for an MCP tool call that requires approval

session.update

Update the session's configuration. This event can be sent at any time to modify settings such as voice, modalities, turn detection, tools, and other session parameters. Note that once a session is initialized with a particular model, it can't be changed to another model.

Event Structure

{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": {
      "type": "openai",
      "name": "alloy"
    },
    "instructions": "You are a helpful assistant. Be concise and friendly.",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_sampling_rate": 24000,
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    },
    "temperature": 0.8,
    "max_response_output_tokens": "inf"
  }
}

Properties

Field Type Description
type string Must be "session.update"
session RealtimeRequestSession Session configuration object with fields to update

Example with Azure Custom Voice

{
  "type": "session.update",
  "session": {
    "voice": {
      "type": "azure-custom",
      "name": "my-custom-voice",
      "endpoint_id": "12345678-1234-1234-1234-123456789012",
      "temperature": 0.7,
      "style": "cheerful"
    },
    "input_audio_noise_reduction": {
      "type": "azure_deep_noise_suppression"
    },
    "avatar": {
      "character": "lisa",
      "customized": false,
      "video": {
        "resolution": {
          "width": 1920,
          "height": 1080
        },
        "bitrate": 2000000
      }
    }
  }
}

session.avatar.connect

Establish an avatar connection by providing the client's SDP (Session Description Protocol) offer for WebRTC media negotiation. This event is required when using avatar features.

Event Structure

{
  "type": "session.avatar.connect",
  "client_sdp": "<client_sdp>"
}

Properties

Field Type Description
type string Must be "session.avatar.connect"
client_sdp string The client's SDP offer for WebRTC connection establishment

input_audio_buffer.append

Append audio bytes to the input audio buffer.

Event Structure

{
  "type": "input_audio_buffer.append",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA="
}

Properties

Field Type Description
type string Must be "input_audio_buffer.append"
audio string Base64-encoded audio data
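Since the `audio` field carries base64-encoded bytes, a client streaming local PCM16 audio usually slices it into small chunks and wraps each chunk in an append event. This is a minimal sketch, assuming 24 kHz mono pcm16 and a 100 ms chunk size; match these to your session's `input_audio_format` and `input_audio_sampling_rate`.

```python
import base64

SAMPLE_RATE = 24000       # assumed input_audio_sampling_rate
BYTES_PER_SAMPLE = 2      # pcm16 = 16-bit mono
CHUNK_MS = 100            # illustrative chunk size
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def append_events(pcm: bytes):
    """Yield one input_audio_buffer.append event dict per audio chunk."""
    for start in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[start:start + CHUNK_BYTES]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }

# One second of (silent) audio produces ten 100 ms events.
events = list(append_events(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE))
```

Smaller chunks let server-side VAD react sooner; larger chunks reduce message overhead.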

input_audio_buffer.commit

Commit the input audio buffer for processing.

Event Structure

{
  "type": "input_audio_buffer.commit"
}

Properties

Field Type Description
type string Must be "input_audio_buffer.commit"

input_audio_buffer.clear

Clear the input audio buffer.

Event Structure

{
  "type": "input_audio_buffer.clear"
}

Properties

Field Type Description
type string Must be "input_audio_buffer.clear"

conversation.item.create

Add a new item to the conversation context. This can include messages, function calls, and function call responses. Items can be inserted at specific positions in the conversation history.

Event Structure

{
  "type": "conversation.item.create",
  "previous_item_id": "item_ABC123",
  "item": {
    "id": "item_DEF456",
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "Hello, how are you?"
      }
    ]
  }
}

Properties

Field Type Description
type string Must be "conversation.item.create"
previous_item_id string Optional. ID of the item after which to insert this item. If not provided, appends to end
item RealtimeConversationRequestItem The item to add to the conversation

Example with Audio Content

{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA=",
        "transcript": "Hello there"
      }
    ]
  }
}

Example with Function Call

{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call",
    "name": "get_weather",
    "call_id": "call_123",
    "arguments": "{\"location\": \"San Francisco\", \"unit\": \"celsius\"}"
  }
}
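A function call item like the one above is typically answered by running the function locally and returning its result to the model. The sketch below assumes the Realtime-style `function_call_output` item type for the result (an assumption here; it is not shown in the tables above), and `get_weather` is a hypothetical local function.

```python
import json

def get_weather(location: str, unit: str) -> dict:
    # Hypothetical local implementation of the tool.
    return {"location": location, "temperature": 18, "unit": unit}

def handle_function_call(done_event: dict) -> list[dict]:
    """Turn a finished function call into the client events that answer it."""
    args = json.loads(done_event["arguments"])
    result = get_weather(**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",   # assumed item type
                "call_id": done_event["call_id"],
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},  # ask the model to use the result
    ]

events = handle_function_call({
    "type": "response.function_call_arguments.done",
    "call_id": "call_123",
    "arguments": "{\"location\": \"San Francisco\", \"unit\": \"celsius\"}",
})
```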

Example with MCP call

{
  "type": "conversation.item.create",
  "item": {
    "type": "mcp_call",
    "approval_request_id": null,
    "arguments": "",
    "server_label": "deepwiki",
    "name": "ask_question",
    "output": null,
    "error": null
  }
}

conversation.item.retrieve

Retrieve a specific item from the conversation history. This is useful for inspecting processed audio after noise cancellation and VAD.

Event Structure

{
  "type": "conversation.item.retrieve",
  "item_id": "item_ABC123"
}

Properties

Field Type Description
type string Must be "conversation.item.retrieve"
item_id string The ID of the item to retrieve

conversation.item.truncate

Truncate an assistant message's audio content. This is useful for stopping playback at a specific point and synchronizing the server's understanding with the client's state.

Event Structure

{
  "type": "conversation.item.truncate",
  "item_id": "item_ABC123",
  "content_index": 0,
  "audio_end_ms": 5000
}

Properties

Field Type Description
type string Must be "conversation.item.truncate"
item_id string The ID of the assistant message item to truncate
content_index integer The index of the content part to truncate
audio_end_ms integer The duration up to which to truncate the audio, in milliseconds

conversation.item.delete

Remove an item from the conversation history.

Event Structure

{
  "type": "conversation.item.delete",
  "item_id": "item_ABC123"
}

Properties

Field Type Description
type string Must be "conversation.item.delete"
item_id string The ID of the item to delete

response.create

Instruct the server to create a response via model inference. This event can specify response-specific configuration that overrides session defaults.

Event Structure

{
  "type": "response.create",
  "response": {
    "modalities": ["text", "audio"],
    "instructions": "Be extra helpful and detailed.",
    "voice": {
      "type": "openai",
      "name": "alloy"
    },
    "output_audio_format": "pcm16",
    "temperature": 0.7,
    "max_response_output_tokens": 1000
  }
}

Properties

Field Type Description
type string Must be "response.create"
response RealtimeResponseOptions Optional response configuration that overrides session defaults

Example with Tool Choice

{
  "type": "response.create",
  "response": {
    "modalities": ["text"],
    "tools": [
      {
        "type": "function",
        "name": "get_current_time",
        "description": "Get the current time",
        "parameters": {
          "type": "object",
          "properties": {}
        }
      }
    ],
    "tool_choice": "get_current_time",
    "temperature": 0.3
  }
}

Example with Animation

{
  "type": "response.create",
  "response": {
    "modalities": ["audio", "animation"],
    "animation": {
      "model_name": "default",
      "outputs": ["blendshapes", "viseme_id"]
    },
    "voice": {
      "type": "azure-custom",
      "name": "my-expressive-voice",
      "endpoint_id": "12345678-1234-1234-1234-123456789012",
      "style": "excited"
    }
  }
}

response.cancel

Cancel an in-progress response. This immediately stops response generation and related audio output.

Event Structure

{
  "type": "response.cancel"
}

Properties

Field Type Description
type string Must be "response.cancel"

RealtimeClientEventConversationItemRetrieve

The client conversation.item.retrieve event is used to retrieve a specific item from the conversation history.

Properties

Field Type Description
type string The event type must be conversation.item.retrieve.
item_id string The ID of the item to retrieve.
event_id string The ID of the event.

RealtimeClientEventConversationItemTruncate

The client conversation.item.truncate event is used to truncate a previous assistant message's audio. The server produces audio faster than realtime, so this event is useful when the user interrupts to truncate audio that was already sent to the client but not yet played. It synchronizes the server's understanding of the audio with the client's playback.

Truncating audio deletes the server-side text transcript to ensure there isn't text in the context that the user doesn't know about.

If the client event is successful, the server responds with a conversation.item.truncated event.

Event structure

{
  "type": "conversation.item.truncate",
  "item_id": "<item_id>",
  "content_index": 0,
  "audio_end_ms": 0
}

Properties

Field Type Description
type string The event type must be conversation.item.truncate.
item_id string The ID of the assistant message item to truncate. Only assistant message items can be truncated.
content_index integer The index of the content part to truncate. Set this property to "0".
audio_end_ms integer Inclusive duration up to which audio is truncated, in milliseconds. If the audio_end_ms is greater than the actual audio duration, the server responds with an error.
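In practice the client derives audio_end_ms from how many audio bytes it has actually played. A minimal sketch, assuming 24 kHz mono pcm16 output (match your session's output format):

```python
SAMPLE_RATE = 24000   # assumed output sampling rate
BYTES_PER_SAMPLE = 2  # pcm16 mono

def truncate_event(item_id: str, bytes_played: int) -> dict:
    """Build a conversation.item.truncate event from played byte count."""
    samples_played = bytes_played // BYTES_PER_SAMPLE
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,
        "audio_end_ms": samples_played * 1000 // SAMPLE_RATE,
    }

# 48,000 bytes of 24 kHz pcm16 is exactly one second of audio.
event = truncate_event("item_ABC123", 48000)
```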

RealtimeClientEventInputAudioBufferAppend

The client input_audio_buffer.append event is used to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit.

In Server VAD (Voice Activity Detection) mode, the audio buffer is used to detect speech and the server decides when to commit. When server VAD is disabled, the client can choose how much audio to place in each event up to a maximum of 15 MiB. For example, streaming smaller chunks from the client can allow the VAD to be more responsive.

Unlike most other client events, the server doesn't send a confirmation response to the client input_audio_buffer.append event.

Event structure

{
  "type": "input_audio_buffer.append",
  "audio": "<audio>"
}

Properties

Field Type Description
type string The event type must be input_audio_buffer.append.
audio string Base64-encoded audio bytes. This value must be in the format specified by the input_audio_format field in the session configuration.

RealtimeClientEventInputAudioBufferClear

The client input_audio_buffer.clear event is used to clear the audio bytes in the buffer.

The server responds with an input_audio_buffer.cleared event.

Event structure

{
  "type": "input_audio_buffer.clear"
}

Properties

Field Type Description
type string The event type must be input_audio_buffer.clear.

RealtimeClientEventInputAudioBufferCommit

The client input_audio_buffer.commit event is used to commit the user input audio buffer, which creates a new user message item in the conversation. Audio is transcribed if input_audio_transcription is configured for the session.

When in server VAD mode, the client doesn't need to send this event; the server commits the audio buffer automatically. Without server VAD, the client must commit the audio buffer to create a user message item. This client event produces an error if the input audio buffer is empty.

Committing the input audio buffer doesn't create a response from the model.

The server responds with an input_audio_buffer.committed event.

Event structure

{
  "type": "input_audio_buffer.commit"
}

Properties

Field Type Description
type string The event type must be input_audio_buffer.commit.
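Because committing doesn't trigger inference, a client running without server VAD drives each turn explicitly: commit the buffered audio (creating the user message item), then request a response. A minimal sketch of that ordered sequence:

```python
def manual_turn_events() -> list[dict]:
    """Events a non-VAD client sends to finish a user turn."""
    return [
        {"type": "input_audio_buffer.commit"},  # creates the user item
        {"type": "response.create"},            # triggers model inference
    ]

events = manual_turn_events()
```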

RealtimeClientEventResponseCancel

The client response.cancel event is used to cancel an in-progress response.

The server responds with a response.done event whose status is cancelled.

Event structure

{
  "type": "response.cancel"
}

Properties

Field Type Description
type string The event type must be response.cancel.

RealtimeClientEventResponseCreate

The client response.create event is used to instruct the server to create a response via model inference. When the session is configured in server VAD mode, the server creates responses automatically.

A response includes at least one item, and can have two, in which case the second is a function call. These items are appended to the conversation history.

The server responds with a response.created event, one or more item and content events (such as conversation.item.created and response.content_part.added), and finally a response.done event to indicate the response is complete.

Event structure

{
  "type": "response.create"
}

Properties

Field Type Description
type string The event type must be response.create.
response RealtimeResponseOptions The response options.

RealtimeClientEventSessionUpdate

The client session.update event is used to update the session's default configuration. The client can send this event at any time to update the session configuration, and any field can be updated at any time, except for voice.

Only fields that are present are updated. To clear a field (such as instructions), pass an empty string.

The server responds with a session.updated event that contains the full effective configuration.

Event structure

{
  "type": "session.update"
}

Properties

Field Type Description
type string The event type must be session.update.
session RealtimeRequestSession The session configuration.

Server Events

The Voice live API sends the following server events to communicate status, responses, and data to the client:

Event Description
error Indicates an error occurred during processing
session.created Sent when a new session is successfully established
session.updated Sent when session configuration is updated
session.avatar.connecting Indicates avatar WebRTC connection is being established
conversation.item.created Sent when a new item is added to the conversation
conversation.item.retrieved Response to conversation.item.retrieve request
conversation.item.truncated Confirms item truncation
conversation.item.deleted Confirms item deletion
conversation.item.input_audio_transcription.completed Input audio transcription is complete
conversation.item.input_audio_transcription.delta Streaming input audio transcription
conversation.item.input_audio_transcription.failed Input audio transcription failed
input_audio_buffer.committed Input audio buffer has been committed for processing
input_audio_buffer.cleared Input audio buffer has been cleared
input_audio_buffer.speech_started Speech detected in input audio buffer (VAD)
input_audio_buffer.speech_stopped Speech ended in input audio buffer (VAD)
response.created New response generation has started
response.done Response generation is complete
response.output_item.added New output item added to response
response.output_item.done Output item is complete
response.content_part.added New content part added to output item
response.content_part.done Content part is complete
response.text.delta Streaming text content from the model
response.text.done Text content is complete
response.audio_transcript.delta Streaming audio transcript
response.audio_transcript.done Audio transcript is complete
response.audio.delta Streaming audio content from the model
response.audio.done Audio content is complete
response.animation_blendshapes.delta Streaming animation blendshapes data
response.animation_blendshapes.done Animation blendshapes data is complete
response.audio_timestamp.delta Streaming audio timestamp information
response.audio_timestamp.done Audio timestamp information is complete
response.animation_viseme.delta Streaming animation viseme data
response.animation_viseme.done Animation viseme data is complete
response.function_call_arguments.delta Streaming function call arguments
response.function_call_arguments.done Function call arguments are complete
mcp_list_tools.in_progress MCP tool listing is in progress
mcp_list_tools.completed MCP tool listing is completed
mcp_list_tools.failed MCP tool listing has failed
response.mcp_call_arguments.delta Streaming MCP call arguments
response.mcp_call_arguments.done MCP call arguments are complete
response.mcp_call.in_progress MCP call is in progress
response.mcp_call.completed MCP call is completed
response.mcp_call.failed MCP call has failed

session.created

Sent when a new session is successfully established. This is the first event received after connecting to the API.

Event Structure

{
  "type": "session.created",
  "session": {
    "id": "sess_ABC123DEF456",
    "object": "realtime.session",
    "model": "gpt-realtime",
    "modalities": ["text", "audio"],
    "instructions": "You are a helpful assistant.",
    "voice": {
      "type": "openai",
      "name": "alloy"
    },
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_sampling_rate": 24000,
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    },
    "temperature": 0.8,
    "max_response_output_tokens": "inf"
  }
}

Properties

Field Type Description
type string Must be "session.created"
session RealtimeResponseSession The created session object

session.updated

Sent when session configuration is successfully updated in response to a session.update client event.

Event Structure

{
  "type": "session.updated",
  "session": {
    "id": "sess_ABC123DEF456",
    "voice": {
      "type": "azure-custom",
      "name": "my-voice",
      "endpoint_id": "12345678-1234-1234-1234-123456789012"
    },
    "temperature": 0.7,
    "avatar": {
      "character": "lisa",
      "customized": false
    }
  }
}

Properties

Field Type Description
type string Must be "session.updated"
session RealtimeResponseSession The updated session object

session.avatar.connecting

Indicates that an avatar WebRTC connection is being established. This event is sent in response to a session.avatar.connect client event.

Event Structure

{
  "type": "session.avatar.connecting",
  "server_sdp": "<server_sdp>"
}

Properties

Field Type Description
type string Must be "session.avatar.connecting"
server_sdp string The server's SDP answer for WebRTC connection establishment

conversation.item.created

Sent when a new item is added to the conversation, either through a client conversation.item.create event or automatically during response generation.

Event Structure

{
  "type": "conversation.item.created",
  "previous_item_id": "item_ABC123",
  "item": {
    "id": "item_DEF456",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "Hello, how are you?"
      }
    ]
  }
}

Properties

Field Type Description
type string Must be "conversation.item.created"
previous_item_id string ID of the item after which this item was inserted
item RealtimeConversationResponseItem The created conversation item

Example with Audio Item

{
  "type": "conversation.item.created",
  "item": {
    "id": "item_GHI789",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "audio": null,
        "transcript": "What's the weather like today?"
      }
    ]
  }
}

conversation.item.retrieved

Sent in response to a conversation.item.retrieve client event, providing the requested conversation item.

Properties

Field Type Description
type string Must be "conversation.item.retrieved"
item RealtimeConversationResponseItem The retrieved conversation item

conversation.item.truncated

The server conversation.item.truncated event is returned when the client truncates an earlier assistant audio message item with a conversation.item.truncate event. This event is used to synchronize the server's understanding of the audio with the client's playback.

This event truncates the audio and removes the server-side text transcript to ensure there's no text in the context that the user doesn't know about.

Event structure

{
  "type": "conversation.item.truncated",
  "item_id": "<item_id>",
  "content_index": 0,
  "audio_end_ms": 0
}

Properties

Field Type Description
type string The event type must be conversation.item.truncated.
item_id string The ID of the assistant message item that was truncated.
content_index integer The index of the content part that was truncated.
audio_end_ms integer The duration up to which the audio was truncated, in milliseconds.

conversation.item.deleted

Sent in response to a conversation.item.delete client event, confirming that the specified item has been removed from the conversation.

Event Structure

{
  "type": "conversation.item.deleted",
  "item_id": "item_ABC123"
}

Properties

Field Type Description
type string Must be "conversation.item.deleted"
item_id string ID of the deleted item

response.created

Sent when a new response generation begins. This is the first event in a response sequence.

Event Structure

{
  "type": "response.created",
  "response": {
    "id": "resp_ABC123",
    "object": "realtime.response",
    "status": "in_progress",
    "status_details": null,
    "output": [],
    "usage": {
      "total_tokens": 0,
      "input_tokens": 0,
      "output_tokens": 0
    }
  }
}

Properties

Field Type Description
type string Must be "response.created"
response RealtimeResponse The response object that was created

response.done

Sent when response generation is complete. This event contains the final response with all output items and usage statistics.

Event Structure

{
  "type": "response.done",
  "response": {
    "id": "resp_ABC123",
    "object": "realtime.response",
    "status": "completed",
    "status_details": null,
    "output": [
      {
        "id": "item_DEF456",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "Hello! I'm doing well, thank you for asking. How can I help you today?"
          }
        ]
      }
    ],
    "usage": {
      "total_tokens": 87,
      "input_tokens": 52,
      "output_tokens": 35,
      "input_token_details": {
        "cached_tokens": 0,
        "text_tokens": 45,
        "audio_tokens": 7
      },
      "output_token_details": {
        "text_tokens": 15,
        "audio_tokens": 20
      }
    }
  }
}

Properties

Field Type Description
type string Must be "response.done"
response RealtimeResponse The completed response object

response.output_item.added

Sent when a new output item is added to the response during generation.

Event Structure

{
  "type": "response.output_item.added",
  "response_id": "resp_ABC123",
  "output_index": 0,
  "item": {
    "id": "item_DEF456",
    "object": "realtime.item",
    "type": "message",
    "status": "in_progress",
    "role": "assistant",
    "content": []
  }
}

Properties

Field Type Description
type string Must be "response.output_item.added"
response_id string ID of the response this item belongs to
output_index integer Index of the item in the response's output array
item RealtimeConversationResponseItem The output item that was added

response.output_item.done

Sent when an output item is complete.

Event Structure

{
  "type": "response.output_item.done",
  "response_id": "resp_ABC123",
  "output_index": 0,
  "item": {
    "id": "item_DEF456",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "Hello! I'm doing well, thank you for asking."
      }
    ]
  }
}

Properties

Field Type Description
type string Must be "response.output_item.done"
response_id string ID of the response this item belongs to
output_index integer Index of the item in the response's output array
item RealtimeConversationResponseItem The completed output item

response.content_part.added

The server response.content_part.added event is returned when a new content part is added to an assistant message item during response generation.

Event Structure

{
  "type": "response.content_part.added",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "part": {
    "type": "text",
    "text": ""
  }
}

Properties

Field Type Description
type string Must be "response.content_part.added"
response_id string ID of the response
item_id string ID of the item this content part belongs to
output_index integer Index of the item in the response
content_index integer Index of this content part in the item
part RealtimeContentPart The content part that was added

response.content_part.done

The server response.content_part.done event is returned when a content part is done streaming in an assistant message item.

This event is also returned when a response is interrupted, incomplete, or canceled.

Event Structure

{
  "type": "response.content_part.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "part": {
    "type": "text",
    "text": "Hello! I'm doing well, thank you for asking."
  }
}

Properties

Field Type Description
type string Must be "response.content_part.done"
response_id string ID of the response
item_id string ID of the item this content part belongs to
output_index integer Index of the item in the response
content_index integer Index of this content part in the item
part RealtimeContentPart The completed content part

response.text.delta

Streaming text content from the model. Sent incrementally as the model generates text.

Event Structure

{
  "type": "response.text.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "delta": "Hello! I'm"
}

Properties

Field Type Description
type string Must be "response.text.delta"
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
delta string Incremental text content

response.text.done

Sent when text content generation is complete.

Event Structure

{
  "type": "response.text.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "text": "Hello! I'm doing well, thank you for asking. How can I help you today?"
}

Properties

Field Type Description
type string Must be "response.text.done"
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
text string The complete text content

response.audio.delta

Streaming audio content from the model. Audio is provided as base64-encoded data.

Event Structure

{
  "type": "response.audio.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "delta": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA="
}

Properties

Field Type Description
type string Must be "response.audio.delta"
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
delta string Base64-encoded audio data chunk
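Each delta is an independent base64 payload; concatenating the decoded bytes yields a continuous PCM stream, and tracking the running duration is what a later conversation.item.truncate's audio_end_ms refers to. A sketch assuming 24 kHz mono pcm16 output:

```python
import base64

SAMPLE_RATE = 24000  # assumed output sampling rate

def append_delta(buffer: bytearray, delta_b64: str) -> int:
    """Append one decoded audio delta; return total buffered duration in ms."""
    buffer.extend(base64.b64decode(delta_b64))
    return len(buffer) // 2 * 1000 // SAMPLE_RATE  # 2 bytes per pcm16 sample

pcm = bytearray()
# 4,800 bytes of pcm16 at 24 kHz is 100 ms of audio.
ms = append_delta(pcm, base64.b64encode(b"\x00" * 4800).decode("ascii"))
```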

response.audio.done

Sent when audio content generation is complete.

Event Structure

{
  "type": "response.audio.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0
}

Properties

Field Type Description
type string Must be "response.audio.done"
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part

response.audio_transcript.delta

Streaming transcript of the generated audio content.

Event Structure

{
  "type": "response.audio_transcript.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "delta": "Hello! I'm doing"
}

Properties

Field Type Description
type string Must be "response.audio_transcript.delta"
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
delta string Incremental transcript text

response.audio_transcript.done

Sent when audio transcript generation is complete.

Event Structure

{
  "type": "response.audio_transcript.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "transcript": "Hello! I'm doing well, thank you for asking. How can I help you today?"
}

Properties

Field Type Description
type string Must be "response.audio_transcript.done"
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
transcript string The complete transcript text

conversation.item.input_audio_transcription.completed

The server conversation.item.input_audio_transcription.completed event is the result of audio transcription for speech written to the audio buffer.

Transcription begins when the input audio buffer is committed by the client or server (in server_vad mode). Transcription runs asynchronously with response creation, so this event can come before or after the response events.

The underlying model accepts audio natively, so input transcription is a separate process run on a dedicated speech recognition model such as whisper-1. The transcript can therefore diverge somewhat from the model's own interpretation of the audio and should be treated as a rough guide.

Event structure

{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "<item_id>",
  "content_index": 0,
  "transcript": "<transcript>"
}

Properties

Field Type Description
type string The event type must be conversation.item.input_audio_transcription.completed.
item_id string The ID of the user message item containing the audio.
content_index integer The index of the content part containing the audio.
transcript string The transcribed text.

conversation.item.input_audio_transcription.delta

The server conversation.item.input_audio_transcription.delta event is returned when input audio transcription is configured, and a transcription request for a user message is in progress. This event provides partial transcription results as they become available.

Event structure

{
  "type": "conversation.item.input_audio_transcription.delta",
  "item_id": "<item_id>",
  "content_index": 0,
  "delta": "<delta>"
}

Properties

Field Type Description
type string The event type must be conversation.item.input_audio_transcription.delta.
item_id string The ID of the user message item.
content_index integer The index of the content part containing the audio.
delta string The incremental transcription text.
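As a sketch of how a client might consume these events, the following Python snippet accumulates delta events into a running transcript per (item_id, content_index) and then replaces it with the authoritative text from the completed event. The event payloads are illustrative.

```python
import json

# Running transcripts keyed by (item_id, content_index).
transcripts = {}

def handle_transcription_event(raw: str) -> None:
    event = json.loads(raw)
    key = (event["item_id"], event["content_index"])
    if event["type"] == "conversation.item.input_audio_transcription.delta":
        transcripts[key] = transcripts.get(key, "") + event["delta"]
    elif event["type"] == "conversation.item.input_audio_transcription.completed":
        # The completed event carries the full transcript; prefer it over
        # the concatenation of deltas.
        transcripts[key] = event["transcript"]

handle_transcription_event(
    '{"type": "conversation.item.input_audio_transcription.delta",'
    ' "item_id": "item_1", "content_index": 0, "delta": "Hello "}')
handle_transcription_event(
    '{"type": "conversation.item.input_audio_transcription.delta",'
    ' "item_id": "item_1", "content_index": 0, "delta": "world"}')
print(transcripts[("item_1", 0)])  # Hello world
```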

conversation.item.input_audio_transcription.failed

The server conversation.item.input_audio_transcription.failed event is returned when input audio transcription is configured, and a transcription request for a user message failed. This event is separate from other error events so that the client can identify the related item.

Event structure

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}

Properties

Field Type Description
type string The event type must be conversation.item.input_audio_transcription.failed.
item_id string The ID of the user message item.
content_index integer The index of the content part containing the audio.
error object Details of the transcription error.

See nested properties in the next table.

Error properties

Field Type Description
type string The type of error.
code string Error code, if any.
message string A human-readable error message.
param string Parameter related to the error, if any.

response.animation_blendshapes.delta

The server response.animation_blendshapes.delta event is returned when the model generates animation blendshapes data as part of a response. This event provides incremental blendshapes data as it becomes available.

Event structure

{
  "type": "response.animation_blendshapes.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "frame_index": 0,
  "frames": [
    [0.0, 0.1, 0.2, ..., 1.0],
    ...
  ]
}

Properties

Field Type Description
type string The event type must be response.animation_blendshapes.delta.
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
frame_index integer Index of the first frame in this batch of frames
frames array of array of float Array of blendshape frames, each frame is an array of blendshape values
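Because frame_index marks the absolute position of the first frame in each batch, a client can reassemble batches into one contiguous frame sequence even if it processes them out of order. A minimal sketch (the frame values are illustrative):

```python
# Reassemble blendshape frame batches into one contiguous frame list.
# frame_index gives the position of the first frame in each batch, so
# batches can be merged even when handled out of order.
def merge_blendshape_frames(batches):
    frames = []
    for event in sorted(batches, key=lambda e: e["frame_index"]):
        start = event["frame_index"]
        # Grow the list so the batch fits at its absolute position.
        frames.extend([None] * (start + len(event["frames"]) - len(frames)))
        for i, frame in enumerate(event["frames"]):
            frames[start + i] = frame
    return frames

batches = [
    {"frame_index": 2, "frames": [[0.3], [0.4]]},
    {"frame_index": 0, "frames": [[0.1], [0.2]]},
]
print(merge_blendshape_frames(batches))  # [[0.1], [0.2], [0.3], [0.4]]
```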

response.animation_blendshapes.done

The server response.animation_blendshapes.done event is returned when the model has finished generating animation blendshapes data as part of a response.

Event structure

{
  "type": "response.animation_blendshapes.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
}

Properties

Field Type Description
type string The event type must be response.animation_blendshapes.done.
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response

response.audio_timestamp.delta

The server response.audio_timestamp.delta event is returned when the model generates audio timestamp data as part of a response. This event provides incremental timestamp data for output audio and text alignment as it becomes available.

Event structure

{
  "type": "response.audio_timestamp.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "audio_offset_ms": 0,
  "audio_duration_ms": 500,
  "text": "Hello",
  "timestamp_type": "word"
}

Properties

Field Type Description
type string The event type must be response.audio_timestamp.delta.
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
audio_offset_ms integer Audio offset in milliseconds from the start of the audio
audio_duration_ms integer Duration of the audio segment in milliseconds
text string The text segment corresponding to this audio timestamp
timestamp_type string The type of timestamp, currently only "word" is supported
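A client might collect these word-level timestamps into (word, start, end) tuples to drive captions or highlighting in sync with the output audio. A sketch, with illustrative payloads:

```python
# Collect word-level timestamp events into (word, start_ms, end_ms) tuples
# for aligning captions with the output audio.
def collect_word_timings(events):
    timings = []
    for event in events:
        if event["type"] != "response.audio_timestamp.delta":
            continue
        start = event["audio_offset_ms"]
        end = start + event["audio_duration_ms"]
        timings.append((event["text"], start, end))
    return timings

events = [
    {"type": "response.audio_timestamp.delta", "audio_offset_ms": 0,
     "audio_duration_ms": 500, "text": "Hello", "timestamp_type": "word"},
    {"type": "response.audio_timestamp.delta", "audio_offset_ms": 520,
     "audio_duration_ms": 300, "text": "there", "timestamp_type": "word"},
]
print(collect_word_timings(events))  # [('Hello', 0, 500), ('there', 520, 820)]
```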

response.audio_timestamp.done

Sent when audio timestamp generation is complete.

Event structure

{
  "type": "response.audio_timestamp.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0
}

Properties

Field Type Description
type string The event type must be response.audio_timestamp.done.
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part

response.animation_viseme.delta

The server response.animation_viseme.delta event is returned when the model generates animation viseme data as part of a response. This event provides incremental viseme data as it becomes available.

Event structure

{
  "type": "response.animation_viseme.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "audio_offset_ms": 0,
  "viseme_id": 1
}

Properties

Field Type Description
type string The event type must be response.animation_viseme.delta.
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part
audio_offset_ms integer Audio offset in milliseconds from the start of the audio
viseme_id integer The viseme ID corresponding to the mouth shape for animation

response.animation_viseme.done

The server response.animation_viseme.done event is returned when the model has finished generating animation viseme data as part of a response.

Event structure

{
  "type": "response.animation_viseme.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0
}

Properties

Field Type Description
type string The event type must be response.animation_viseme.done.
response_id string ID of the response
item_id string ID of the item
output_index integer Index of the item in the response
content_index integer Index of the content part

error

The server error event is returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session stays open.

Event structure

{
  "type": "error",
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>",
    "event_id": "<event_id>"
  }
}

Properties

Field Type Description
type string The event type must be error.
error object Details of the error.

See nested properties in the next table.

Error properties

Field Type Description
type string The type of error. For example, "invalid_request_error" and "server_error" are error types.
code string Error code, if any.
message string A human-readable error message.
param string Parameter related to the error, if any.
event_id string The ID of the client event that caused the error, if applicable.
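Because most errors are recoverable and the session stays open, a client typically logs the error and correlates it with the originating client event via event_id rather than closing the connection. A hedged sketch, assuming the client sets an event_id on the events it sends:

```python
import json

# Client events sent so far, keyed by event_id, so a server error event
# can be correlated with the event that caused it. Assumes the client
# attaches an event_id to each event it sends.
sent_events = {}

def record_sent(event: dict) -> None:
    if "event_id" in event:
        sent_events[event["event_id"]] = event

def handle_error(raw: str) -> dict:
    error = json.loads(raw)["error"]
    offender = sent_events.get(error.get("event_id"))
    # The session stays open for most errors, so summarize and carry on.
    return {
        "code": error.get("code"),
        "message": error.get("message"),
        "caused_by": offender["type"] if offender else None,
    }

record_sent({"event_id": "evt_1", "type": "session.update"})
summary = handle_error(
    '{"type": "error", "error": {"code": "invalid_request_error",'
    ' "message": "bad parameter", "param": "voice", "event_id": "evt_1"}}')
print(summary["caused_by"])  # session.update
```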

input_audio_buffer.cleared

The server input_audio_buffer.cleared event is returned when the client clears the input audio buffer with an input_audio_buffer.clear event.

Event structure

{
  "type": "input_audio_buffer.cleared"
}

Properties

Field Type Description
type string The event type must be input_audio_buffer.cleared.

input_audio_buffer.committed

The server input_audio_buffer.committed event is returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id property is the ID of the user message item that's created, so a conversation.item.created event is also sent to the client.

Event structure

{
  "type": "input_audio_buffer.committed",
  "previous_item_id": "<previous_item_id>",
  "item_id": "<item_id>"
}

Properties

Field Type Description
type string The event type must be input_audio_buffer.committed.
previous_item_id string The ID of the preceding item after which the new item is inserted.
item_id string The ID of the user message item created.

input_audio_buffer.speech_started

The server input_audio_buffer.speech_started event is returned in server_vad mode when speech is detected in the audio buffer. This event can happen any time audio is added to the buffer (unless speech is already detected).

Note

The client might want to use this event to interrupt audio playback or provide visual feedback to the user.

The client should expect to receive an input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item that's created when speech stops. The item_id is also included in the input_audio_buffer.speech_stopped event, unless the client manually commits the audio buffer during VAD activation.

Event structure

{
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 0,
  "item_id": "<item_id>"
}

Properties

Field Type Description
type string The event type must be input_audio_buffer.speech_started.
audio_start_ms integer Milliseconds from the start of all audio written to the buffer during the session when speech was first detected. This property corresponds to the beginning of audio sent to the model, and thus includes the prefix_padding_ms configured in the session.
item_id string The ID of the user message item created when speech stops.
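A client often uses speech_started to implement barge-in: stop local audio playback and drop any unplayed assistant audio. In this sketch, a plain list stands in for a real audio sink:

```python
# On speech_started, interrupt local playback (barge-in). The playback
# queue below is a plain list standing in for a real audio sink.
playback_queue = ["assistant_chunk_1", "assistant_chunk_2"]
playing = True

def handle_speech_started(event: dict) -> None:
    global playing
    if event["type"] == "input_audio_buffer.speech_started":
        playback_queue.clear()  # discard unplayed assistant audio
        playing = False         # pause the output stream

handle_speech_started({"type": "input_audio_buffer.speech_started",
                       "audio_start_ms": 120, "item_id": "item_1"})
print(playing)  # False
```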

input_audio_buffer.speech_stopped

The server input_audio_buffer.speech_stopped event is returned in server_vad mode when the server detects the end of speech in the audio buffer.

The server also sends a conversation.item.created event with the user message item created from the audio buffer.

Event structure

{
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 0,
  "item_id": "<item_id>"
}

Properties

Field Type Description
type string The event type must be input_audio_buffer.speech_stopped.
audio_end_ms integer Milliseconds since the session started when speech stopped. This property corresponds to the end of audio sent to the model, and thus includes the min_silence_duration_ms configured in the session.
item_id string The ID of the user message item created.

rate_limits.updated

The server rate_limits.updated event is emitted at the beginning of a response to indicate the updated rate limits.

When a response is created, some tokens are reserved for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the response is completed.

Event structure

{
  "type": "rate_limits.updated",
  "rate_limits": [
    {
      "name": "<name>",
      "limit": 0,
      "remaining": 0,
      "reset_seconds": 0
    }
  ]
}

Properties

Field Type Description
type string The event type must be rate_limits.updated.
rate_limits array of RealtimeRateLimitsItem The list of rate limit information.

response.audio.delta

The server response.audio.delta event is returned when the model-generated audio is updated.

Event structure

{
  "type": "response.audio.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

Properties

Field Type Description
type string The event type must be response.audio.delta.
response_id string The ID of the response.
item_id string The ID of the item.
output_index integer The index of the output item in the response.
content_index integer The index of the content part in the item's content array.
delta string Base64-encoded audio data delta.
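Since each delta carries base64-encoded audio, a client decodes it and appends the raw bytes to a playback buffer. A minimal sketch with an illustrative payload:

```python
import base64

# Raw PCM bytes accumulated from base64-encoded audio deltas.
audio_buffer = bytearray()

def handle_audio_delta(event: dict) -> None:
    if event["type"] == "response.audio.delta":
        audio_buffer.extend(base64.b64decode(event["delta"]))

chunk = base64.b64encode(b"\x00\x01\x02\x03").decode("ascii")
handle_audio_delta({"type": "response.audio.delta", "response_id": "resp_1",
                    "item_id": "item_1", "output_index": 0,
                    "content_index": 0, "delta": chunk})
print(len(audio_buffer))  # 4
```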

response.audio.done

The server response.audio.done event is returned when the model-generated audio is done.

This event is also returned when a response is interrupted, incomplete, or canceled.

Event structure

{
  "type": "response.audio.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0
}

Properties

Field Type Description
type string The event type must be response.audio.done.
response_id string The ID of the response.
item_id string The ID of the item.
output_index integer The index of the output item in the response.
content_index integer The index of the content part in the item's content array.

response.function_call_arguments.delta

The server response.function_call_arguments.delta event is returned when the model-generated function call arguments are updated.

Event structure

{
  "type": "response.function_call_arguments.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "call_id": "<call_id>",
  "delta": "<delta>"
}

Properties

Field Type Description
type string The event type must be response.function_call_arguments.delta.
response_id string The ID of the response.
item_id string The ID of the function call item.
output_index integer The index of the output item in the response.
call_id string The ID of the function call.
delta string The arguments delta as a JSON string.

response.function_call_arguments.done

The server response.function_call_arguments.done event is returned when the model-generated function call arguments are done streaming.

This event is also returned when a response is interrupted, incomplete, or canceled.

Event structure

{
  "type": "response.function_call_arguments.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "call_id": "<call_id>",
  "arguments": "<arguments>"
}

Properties

Field Type Description
type string The event type must be response.function_call_arguments.done.
response_id string The ID of the response.
item_id string The ID of the function call item.
output_index integer The index of the output item in the response.
call_id string The ID of the function call.
arguments string The final arguments as a JSON string.
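A typical client accumulates the argument deltas per call_id and parses the JSON string when the done event arrives. A sketch, with illustrative payloads:

```python
import json

# Accumulate function-call argument deltas per call_id, then parse the
# complete JSON string when the done event arrives.
pending_args = {}

def handle_function_call_event(event: dict):
    if event["type"] == "response.function_call_arguments.delta":
        call_id = event["call_id"]
        pending_args[call_id] = pending_args.get(call_id, "") + event["delta"]
        return None
    if event["type"] == "response.function_call_arguments.done":
        # The done event carries the complete arguments string.
        return json.loads(event["arguments"])
    return None

handle_function_call_event({"type": "response.function_call_arguments.delta",
                            "call_id": "call_1", "delta": '{"city": '})
handle_function_call_event({"type": "response.function_call_arguments.delta",
                            "call_id": "call_1", "delta": '"Paris"}'})
args = handle_function_call_event({"type": "response.function_call_arguments.done",
                                   "call_id": "call_1",
                                   "arguments": pending_args["call_1"]})
print(args)  # {'city': 'Paris'}
```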

mcp_list_tools.in_progress

The server mcp_list_tools.in_progress event is returned when the service starts listing available tools from an MCP server.

Event structure

{
  "type": "mcp_list_tools.in_progress",
  "item_id": "<mcp_list_tools_item_id>"
}

Properties

Field Type Description
type string The event type must be mcp_list_tools.in_progress.
item_id string The ID of the MCP list tools item being processed.

mcp_list_tools.completed

The server mcp_list_tools.completed event is returned when the service completes listing available tools from an MCP server.

Event structure

{
  "type": "mcp_list_tools.completed",
  "item_id": "<mcp_list_tools_item_id>"
}

Properties

Field Type Description
type string The event type must be mcp_list_tools.completed.
item_id string The ID of the MCP list tools item being processed.

mcp_list_tools.failed

The server mcp_list_tools.failed event is returned when the service fails to list available tools from an MCP server.

Event structure

{
  "type": "mcp_list_tools.failed",
  "item_id": "<mcp_list_tools_item_id>"
}

Properties

Field Type Description
type string The event type must be mcp_list_tools.failed.
item_id string The ID of the MCP list tools item being processed.

response.mcp_call_arguments.delta

The server response.mcp_call_arguments.delta event is returned when the model-generated MCP tool call arguments are updated.

Event structure

{
  "type": "response.mcp_call_arguments.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "delta": "<delta>"
}

Properties

Field Type Description
type string The event type must be response.mcp_call_arguments.delta.
response_id string The ID of the response.
item_id string The ID of the MCP tool call item.
output_index integer The index of the output item in the response.
delta string The arguments delta as a JSON string.

response.mcp_call_arguments.done

The server response.mcp_call_arguments.done event is returned when the model-generated MCP tool call arguments are done streaming.

Event structure

{
  "type": "response.mcp_call_arguments.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "arguments": "<arguments>"
}

Properties

Field Type Description
type string The event type must be response.mcp_call_arguments.done.
response_id string The ID of the response.
item_id string The ID of the MCP tool call item.
output_index integer The index of the output item in the response.
arguments string The final arguments as a JSON string.

response.mcp_call.in_progress

The server response.mcp_call.in_progress event is returned when an MCP tool call starts processing.

Event structure

{
  "type": "response.mcp_call.in_progress",
  "item_id": "<item_id>",
  "output_index": 0
}

Properties

Field Type Description
type string The event type must be response.mcp_call.in_progress.
item_id string The ID of the MCP tool call item.
output_index integer The index of the output item in the response.

response.mcp_call.completed

The server response.mcp_call.completed event is returned when an MCP tool call completes successfully.

Event structure

{
  "type": "response.mcp_call.completed",
  "item_id": "<item_id>",
  "output_index": 0
}

Properties

Field Type Description
type string The event type must be response.mcp_call.completed.
item_id string The ID of the MCP tool call item.
output_index integer The index of the output item in the response.

response.mcp_call.failed

The server response.mcp_call.failed event is returned when an MCP tool call fails.

Event structure

{
  "type": "response.mcp_call.failed",
  "item_id": "<item_id>",
  "output_index": 0
}

Properties

Field Type Description
type string The event type must be response.mcp_call.failed.
item_id string The ID of the MCP tool call item.
output_index integer The index of the output item in the response.

response.output_item.added

The server response.output_item.added event is returned when a new item is created during response generation.

Event structure

{
  "type": "response.output_item.added",
  "response_id": "<response_id>",
  "output_index": 0
}

Properties

Field Type Description
type string The event type must be response.output_item.added.
response_id string The ID of the response to which the item belongs.
output_index integer The index of the output item in the response.
item RealtimeConversationResponseItem The item that was added.

response.output_item.done

The server response.output_item.done event is returned when an item is done streaming.

This event is also returned when a response is interrupted, incomplete, or canceled.

Event structure

{
  "type": "response.output_item.done",
  "response_id": "<response_id>",
  "output_index": 0
}

Properties

Field Type Description
type string The event type must be response.output_item.done.
response_id string The ID of the response to which the item belongs.
output_index integer The index of the output item in the response.
item RealtimeConversationResponseItem The item that is done streaming.

response.text.delta

The server response.text.delta event is returned when the model-generated text is updated. The text corresponds to the text content part of an assistant message item.

Event structure

{
  "type": "response.text.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

Properties

Field Type Description
type string The event type must be response.text.delta.
response_id string The ID of the response.
item_id string The ID of the item.
output_index integer The index of the output item in the response.
content_index integer The index of the content part in the item's content array.
delta string The text delta.

response.text.done

The server response.text.done event is returned when the model-generated text is done streaming. The text corresponds to the text content part of an assistant message item.

This event is also returned when a response is interrupted, incomplete, or canceled.

Event structure

{
  "type": "response.text.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "text": "<text>"
}

Properties

Field Type Description
type string The event type must be response.text.done.
response_id string The ID of the response.
item_id string The ID of the item.
output_index integer The index of the output item in the response.
content_index integer The index of the content part in the item's content array.
text string The final text content.

Components

Audio Formats

RealtimeAudioFormat

Base audio format used for input audio.

Allowed Values:

  • pcm16 - 16-bit PCM audio format
  • g711_ulaw - G.711 μ-law audio format
  • g711_alaw - G.711 A-law audio format

RealtimeOutputAudioFormat

Audio format used for output audio with specific sampling rates.

Allowed Values:

  • pcm16 - 16-bit PCM audio format at default sampling rate (24kHz)
  • pcm16_8000hz - 16-bit PCM audio format at 8kHz sampling rate
  • pcm16_16000hz - 16-bit PCM audio format at 16kHz sampling rate
  • g711_ulaw - G.711 μ-law (mu-law) audio format at 8kHz sampling rate
  • g711_alaw - G.711 A-law audio format at 8kHz sampling rate
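Since each format pins a sample rate and sample width, the implied bandwidth can be computed directly (pcm16 uses 2 bytes per sample, G.711 uses 1 byte per sample; mono audio is assumed). This is a derived illustration, not part of the API surface:

```python
# Byte rate implied by each output format, assuming mono audio:
# pcm16 is 2 bytes per sample; G.711 is 1 byte per sample at 8 kHz.
FORMAT_SPECS = {
    "pcm16": (24000, 2),
    "pcm16_8000hz": (8000, 2),
    "pcm16_16000hz": (16000, 2),
    "g711_ulaw": (8000, 1),
    "g711_alaw": (8000, 1),
}

def bytes_per_second(fmt: str) -> int:
    sample_rate, bytes_per_sample = FORMAT_SPECS[fmt]
    return sample_rate * bytes_per_sample

print(bytes_per_second("pcm16"))      # 48000
print(bytes_per_second("g711_ulaw"))  # 8000
```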

RealtimeAudioInputTranscriptionSettings

Configuration for input audio transcription.

Field Type Description
model string The transcription model.
Supported with gpt-realtime and gpt-realtime-mini:
whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize.
Supported with all other models and agents: azure-speech
language string Optional language code in BCP-47 (e.g., en-US) or ISO-639-1 (e.g., en) format, or multiple languages with automatic detection (e.g., en,zh).
custom_speech object Optional configuration for custom speech models, only valid for azure-speech model.
phrase_list string[] Optional list of phrase hints to bias recognition, only valid for azure-speech model.
prompt string Optional prompt text to guide transcription, only valid for whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-transcribe-diarize models.

RealtimeInputAudioNoiseReductionSettings

This can be:

RealtimeOpenAINoiseReduction

OpenAI noise reduction configuration with explicit type field, only available for gpt-realtime and gpt-realtime-mini models.

Field Type Description
type string near_field or far_field

RealtimeAzureDeepNoiseSuppression

Azure deep noise suppression configuration for input audio.

Field Type Description
type string Must be "azure_deep_noise_suppression"

RealtimeInputAudioEchoCancellationSettings

Echo cancellation configuration for server-side audio processing.

Field Type Description
type string Must be "server_echo_cancellation"

Voice Configuration

RealtimeVoice

Union of all supported voice configurations.

This can be:

RealtimeOpenAIVoice

OpenAI voice configuration with explicit type field.

Field Type Description
type string Must be "openai"
name string OpenAI voice name: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar

RealtimeAzureVoice

Base for Azure voice configurations. This is a discriminated union with different types:

RealtimeAzureCustomVoice

Azure custom voice configuration (preferred for custom voices).

Field Type Description
type string Must be "azure-custom"
name string Voice name (cannot be empty)
endpoint_id string Endpoint ID (cannot be empty)
temperature number Optional. Temperature between 0.0 and 1.0
custom_lexicon_url string Optional. URL to custom lexicon
prefer_locales string[] Optional. Preferred locales.
Preferred locales change the accents used for each language. If the value isn't set, TTS uses the default accent of each language. For example, when speaking English, TTS uses an American English accent, and when speaking Spanish, it uses a Mexican Spanish accent.
If prefer_locales is set to ["en-GB", "es-ES"], the English accent is British English and the Spanish accent is European Spanish. TTS can still speak other languages, such as French and Chinese.
locale string Optional. Locale specification.
Enforces the locale for TTS output. If set, TTS always uses the given locale to speak. For example, if locale is set to en-US, TTS always uses an American English accent, even when the text content is in another language, and outputs silence if the text content is in Chinese.
style string Optional. Voice style
pitch string Optional. Pitch adjustment
rate string Optional. Speech rate adjustment
volume string Optional. Volume adjustment

Example:

{
  "type": "azure-custom",
  "name": "my-custom-voice",
  "endpoint_id": "12345678-1234-1234-1234-123456789012",
  "temperature": 0.7,
  "style": "cheerful",
  "locale": "en-US"
}

RealtimeAzureStandardVoice

Azure standard voice configuration.

Field Type Description
type string Must be "azure-standard"
name string Voice name (cannot be empty)
temperature number Optional. Temperature between 0.0 and 1.0
custom_lexicon_url string Optional. URL to custom lexicon
prefer_locales string[] Optional. Preferred locales
locale string Optional. Locale specification
style string Optional. Voice style
pitch string Optional. Pitch adjustment
rate string Optional. Speech rate adjustment
volume string Optional. Volume adjustment

RealtimeAzurePersonalVoice

Azure personal voice configuration.

Field Type Description
type string Must be "azure-personal"
name string Voice name (cannot be empty)
temperature number Optional. Temperature between 0.0 and 1.0
model string Underlying neural model: DragonLatestNeural, PhoenixLatestNeural, PhoenixV2Neural

Turn Detection

RealtimeTurnDetection

Configuration for turn detection. This is a discriminated union supporting multiple VAD types.

RealtimeServerVAD

Base VAD-based turn detection.

Field Type Description
type string Must be "server_vad"
threshold number Optional. Activation threshold (0.0-1.0)
prefix_padding_ms integer Optional. Audio padding before speech starts
silence_duration_ms integer Optional. Silence duration to detect speech end
end_of_utterance_detection RealtimeEOUDetection Optional. End-of-utterance detection config
create_response boolean Optional. Whether a response is automatically generated.
interrupt_response boolean Optional. Enable or disable barge-in interruption (default: false)
auto_truncate boolean Optional. Auto-truncate on interruption (default: false)

RealtimeOpenAISemanticVAD

OpenAI semantic VAD configuration which uses a model to determine when the user has finished speaking. Only available for gpt-realtime and gpt-realtime-mini models.

Field Type Description
type string Must be "semantic_vad"
eagerness string Optional. This is a way to control how eager the model is to interrupt the user, tuning the maximum wait timeout. In transcription mode, even if the model doesn't reply, it affects how the audio is chunked.
The following values are allowed:
- auto (default) is equivalent to medium,
- low will let the user take their time to speak,
- high will chunk the audio as soon as possible.

If you want the model to respond more often in conversation mode, or to return transcription events faster in transcription mode, you can set eagerness to high.
On the other hand, if you want to let the user speak uninterrupted in conversation mode, or if you would like larger transcript chunks in transcription mode, you can set eagerness to low.
create_response boolean Optional. Whether a response is automatically generated.
interrupt_response boolean Optional. Enable or disable barge-in interruption (default: false)

RealtimeAzureSemanticVAD

Azure semantic VAD, which determines when the user starts and stops speaking by using a semantic speech model, providing more robust detection in noisy environments.

Field Type Description
type string Must be "azure_semantic_vad"
threshold number Optional. Activation threshold
prefix_padding_ms integer Optional. Audio padding before speech
silence_duration_ms integer Optional. Silence duration for speech end
end_of_utterance_detection RealtimeEOUDetection Optional. EOU detection config
speech_duration_ms integer Optional. Minimum speech duration
remove_filler_words boolean Optional. Remove filler words (default: false)
languages string[] Optional. Supports English. Other languages will be ignored.
create_response boolean Optional. Whether a response is automatically generated.
interrupt_response boolean Optional. Enable or disable barge-in interruption (default: false)
auto_truncate boolean Optional. Auto-truncate on interruption (default: false)

RealtimeAzureSemanticVADMultilingual

Azure semantic VAD (multilingual variant).

Field Type Description
type string Must be "azure_semantic_vad_multilingual"
threshold number Optional. Activation threshold
prefix_padding_ms integer Optional. Audio padding before speech
silence_duration_ms integer Optional. Silence duration for speech end
end_of_utterance_detection RealtimeEOUDetection Optional. EOU detection config
speech_duration_ms integer Optional. Minimum speech duration
remove_filler_words boolean Optional. Remove filler words (default: false).
languages string[] Optional. Supports English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi. Other languages will be ignored.
create_response boolean Optional. Whether a response is automatically generated.
interrupt_response boolean Optional. Enable or disable barge-in interruption (default: false)
auto_truncate boolean Optional. Auto-truncate on interruption (default: false)

RealtimeEOUDetection

Azure end-of-utterance (EOU) detection indicates when the end user has stopped speaking while allowing for natural pauses. End-of-utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency.

Field Type Description
model string Could be semantic_detection_v1 supporting English or semantic_detection_v1_multilingual supporting English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi
threshold_level string Optional. Detection threshold level (low, medium, high and default), the default equals medium setting. With a lower setting the probability the sentence is complete will be higher.
timeout_ms number Optional. Maximum time in milliseconds to wait for more user speech. Defaults to 1000 ms.
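
For example, a session.update payload might enable multilingual semantic VAD with EOU detection as follows. The values shown are illustrative; the supported fields are described in the tables above.

```json
{
  "type": "session.update",
  "session": {
    "turn_detection": {
      "type": "azure_semantic_vad_multilingual",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500,
      "remove_filler_words": false,
      "end_of_utterance_detection": {
        "model": "semantic_detection_v1_multilingual",
        "threshold_level": "default",
        "timeout_ms": 1000
      }
    }
  }
}
```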

Avatar Configuration

RealtimeAvatarConfig

Configuration for avatar streaming and behavior.

Field Type Description
ice_servers RealtimeIceServer[] Optional. ICE servers for WebRTC
character string Character name or ID for the avatar
style string Optional. Avatar style (emotional tone, speaking style)
customized boolean Whether the avatar is customized
video RealtimeVideoParams Optional. Video configuration

RealtimeIceServer

ICE server configuration for WebRTC connection negotiation.

Field Type Description
urls string[] ICE server URLs (TURN or STUN endpoints)
username string Optional. Username for authentication
credential string Optional. Credential for authentication

RealtimeVideoParams

Video streaming parameters for avatar.

Field Type Description
bitrate integer Optional. Bitrate in bits per second (default: 2000000)
codec string Optional. Video codec, currently only h264 (default: h264)
crop RealtimeVideoCrop Optional. Cropping settings
resolution RealtimeVideoResolution Optional. Resolution settings

RealtimeVideoCrop

Video crop rectangle definition.

Field Type Description
top_left integer[] Top-left corner [x, y], non-negative integers
bottom_right integer[] Bottom-right corner [x, y], non-negative integers

RealtimeVideoResolution

Video resolution specification.

Field Type Description
width integer Width in pixels (must be > 0)
height integer Height in pixels (must be > 0)
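
Putting these objects together, the avatar portion of a session configuration might look like the following sketch. The character name, server URLs, and credentials are hypothetical placeholders.

```json
"avatar": {
  "character": "lisa",
  "style": "casual-sitting",
  "customized": false,
  "ice_servers": [
    {
      "urls": ["turn:relay.example.com:3478"],
      "username": "example-user",
      "credential": "example-credential"
    }
  ],
  "video": {
    "bitrate": 2000000,
    "codec": "h264",
    "crop": { "top_left": [560, 0], "bottom_right": [1360, 1080] },
    "resolution": { "width": 1080, "height": 1080 }
  }
}
```

Avatar output also requires "modalities": ["text", "audio", "avatar"] in the session configuration.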

Animation Configuration

RealtimeAnimation

Configuration for animation outputs including blendshapes and visemes.

Field Type Description
model_name string Optional. Animation model name (default: "default")
outputs RealtimeAnimationOutputType[] Optional. Output types (default: ["blendshapes"])

RealtimeAnimationOutputType

Types of animation data to output.

Allowed Values:

  • blendshapes - Facial blendshapes data
  • viseme_id - Viseme identifier data
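
For example, a session.update payload that requests both blendshape and viseme output might look like this sketch (illustrative values):

```json
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio", "animation"],
    "animation": {
      "model_name": "default",
      "outputs": ["blendshapes", "viseme_id"]
    }
  }
}
```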

Session Configuration

RealtimeRequestSession

Session configuration object used in session.update events.

Field Type Description
model string Optional. Model name to use
modalities RealtimeModality[] Optional. The supported modalities for the session.

For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"]. To enable avatar output, set "modalities": ["text", "audio", "avatar"]. You can't enable only audio.
animation RealtimeAnimation Optional. Animation configuration
voice RealtimeVoice Optional. Voice configuration
instructions string Optional. System instructions for the model. The instructions could guide the output audio if OpenAI voices are used but may not apply to Azure voices.
input_audio_sampling_rate integer Optional. Input audio sampling rate in Hz (default: 24000 for pcm16, 8000 for g711_ulaw and g711_alaw)
input_audio_format RealtimeAudioFormat Optional. Input audio format (default: pcm16)
output_audio_format RealtimeOutputAudioFormat Optional. Output audio format (default: pcm16)
input_audio_noise_reduction RealtimeInputAudioNoiseReductionSettings Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.

This property is nullable.
input_audio_echo_cancellation RealtimeInputAudioEchoCancellationSettings Configuration for input audio echo cancellation. This can be set to null to turn off. This service side echo cancellation can help improve the quality of the input audio by reducing the impact of echo and reverberation.

This property is nullable.
input_audio_transcription RealtimeAudioInputTranscriptionSettings The configuration for input audio transcription. The configuration is null (off) by default. Input audio transcription isn't native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. For additional guidance to the transcription service, the client can optionally set the language and prompt for transcription.

This property is nullable.
turn_detection RealtimeTurnDetection The turn detection settings for the session. This can be set to null to turn off.
tools array of RealtimeTool The tools available to the model for the session.
tool_choice RealtimeToolChoice The tool choice for the session.

Allowed values: auto, none, and required. Otherwise, you can specify the name of the function to use.
temperature number The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens integer or "inf" The maximum number of output tokens per assistant response, inclusive of tool calls.

Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens.

For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000. To allow the maximum number of tokens, set "max_response_output_tokens": "inf".

Defaults to "inf".
avatar RealtimeAvatarConfig Optional. Avatar configuration
output_audio_timestamp_types RealtimeAudioTimestampType[] Optional. Timestamp types for output audio
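
A typical session.update event combining several of these fields might look like the following sketch. The voice name and turn detection type are illustrative; see the RealtimeVoice and RealtimeTurnDetection sections for the supported shapes.

```json
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": { "name": "en-US-AvaNeural", "type": "azure-standard" },
    "instructions": "You are a helpful assistant. Be succinct.",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_sampling_rate": 24000,
    "turn_detection": { "type": "server_vad" },
    "temperature": 0.8,
    "max_response_output_tokens": "inf"
  }
}
```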

RealtimeModality

Supported session modalities.

Allowed Values:

  • text - Text input/output
  • audio - Audio input/output
  • animation - Animation output
  • avatar - Avatar video output

RealtimeAudioTimestampType

Output timestamp types supported in audio response content.

Allowed Values:

  • word - Timestamps per word in the output audio

Tool Configuration

Two types of tools are supported: function tools for function calling, and MCP tools, which let you connect to an MCP server.

RealtimeTool

Tool definition for function calling.

Field Type Description
type string Must be "function"
name string Function name
description string Function description and usage guidelines
parameters object Function parameters as JSON schema object

RealtimeToolChoice

Tool selection strategy.

This can be:

  • "auto" - Let the model choose
  • "none" - Don't use tools
  • "required" - Must use a tool
  • { "type": "function", "name": "function_name" } - Use specific function
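
For example, a session configuration might register a hypothetical get_weather function and let the model decide when to call it:

```json
"tools": [
  {
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a location. Use this when the user asks about the weather.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": { "type": "string", "description": "City name" }
      },
      "required": ["location"]
    }
  }
],
"tool_choice": "auto"
```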

MCPTool

MCP tool configuration.

Field Type Description
type string Must be "mcp"
server_label string Required. The label of the MCP server.
server_url string Required. The server URL of the MCP server.
allowed_tools string[] Optional. The list of allowed tool names. If not specified, all tools are allowed.
headers object Optional. Additional headers to include in MCP requests.
authorization string Optional. Authorization token for MCP requests.
require_approval string or dictionary Optional.
If set to a string, the value must be never or always.
If set to a dictionary, it must be in the format {"never": ["<tool_name_1>", "<tool_name_2>"], "always": ["<tool_name_3>"]}.
Defaults to always.
When set to always, tool execution requires approval: an mcp_approval_request is sent to the client when the MCP arguments are done, and the tool is executed only after an mcp_approval_response with approve=true is received.
When set to never, the tool is executed automatically without approval.
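
An MCP tool entry in the session's tools array might look like the following sketch. The server label, URL, and tool names are hypothetical.

```json
{
  "type": "mcp",
  "server_label": "docs-search",
  "server_url": "https://mcp.example.com/sse",
  "allowed_tools": ["search", "fetch_page"],
  "authorization": "<token>",
  "require_approval": { "never": ["search"], "always": ["fetch_page"] }
}
```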

RealtimeConversationResponseItem

This is a union type that can be one of the following:

RealtimeConversationUserMessageItem

User message item.

Field Type Description
id string The unique ID of the item.
type string Must be "message"
object string Must be "conversation.item"
role string Must be "user"
content RealtimeInputTextContentPart[] The content of the message.
status RealtimeItemStatus The status of the item.

RealtimeConversationAssistantMessageItem

Assistant message item.

Field Type Description
id string The unique ID of the item.
type string Must be "message"
object string Must be "conversation.item"
role string Must be "assistant"
content RealtimeOutputTextContentPart[] or RealtimeOutputAudioContentPart[] The content of the message.
status RealtimeItemStatus The status of the item.

RealtimeConversationSystemMessageItem

System message item.

Field Type Description
id string The unique ID of the item.
type string Must be "message"
object string Must be "conversation.item"
role string Must be "system"
content RealtimeInputTextContentPart[] The content of the message.
status RealtimeItemStatus The status of the item.

RealtimeConversationFunctionCallItem

Function call request item.

Field Type Description
id string The unique ID of the item.
type string Must be "function_call"
object string Must be "conversation.item"
name string The name of the function to call.
arguments string The arguments for the function call as a JSON string.
call_id string The unique ID of the function call.
status RealtimeItemStatus The status of the item.

RealtimeConversationFunctionCallOutputItem

Function call response item.

Field Type Description
id string The unique ID of the item.
type string Must be "function_call_output"
object string Must be "conversation.item"
name string The name of the function that was called.
output string The output of the function call.
call_id string The unique ID of the function call.
status RealtimeItemStatus The status of the item.

RealtimeConversationMCPListToolsItem

MCP list tools response item.

Field Type Description
id string The unique ID of the item.
type string Must be "mcp_list_tools"
server_label string The label of the MCP server.

RealtimeConversationMCPCallItem

MCP call response item.

Field Type Description
id string The unique ID of the item.
type string Must be "mcp_call"
server_label string The label of the MCP server.
name string The name of the tool to call.
approval_request_id string The approval request ID for the MCP call.
arguments string The arguments for the MCP call.
output string The output of the MCP call.
error object The error details if the MCP call failed.

RealtimeConversationMCPApprovalRequestItem

MCP approval request item.

Field Type Description
id string The unique ID of the item.
type string Must be "mcp_approval_request"
server_label string The label of the MCP server.
name string The name of the tool to call.
arguments string The arguments for the MCP call.

RealtimeItemStatus

Status of conversation items.

Allowed Values:

  • in_progress - Currently being processed
  • completed - Successfully completed
  • incomplete - Incomplete (interrupted or failed)

RealtimeContentPart

Content part within a message.

RealtimeInputTextContentPart

Text content part.

Field Type Description
type string Must be "input_text"
text string The text content

RealtimeOutputTextContentPart

Text content part.

Field Type Description
type string Must be "text"
text string The text content

RealtimeInputAudioContentPart

Audio content part.

Field Type Description
type string Must be "input_audio"
audio string Optional. Base64-encoded audio data
transcript string Optional. Audio transcript

RealtimeOutputAudioContentPart

Audio content part.

Field Type Description
type string Must be "audio"
audio string Base64-encoded audio data
transcript string Optional. Audio transcript
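
As an illustration, an output audio content part inside an assistant message item might look like this (the base64 payload is elided):

```json
{
  "type": "audio",
  "audio": "<base64-encoded audio>",
  "transcript": "Hello! How can I help you today?"
}
```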

Response Objects

RealtimeResponse

Response object representing a model inference response.

Field Type Description
id string Optional. Response ID
object string Optional. Always "realtime.response"
status RealtimeResponseStatus Optional. Response status
status_details RealtimeResponseStatusDetails Optional. Status details
output RealtimeConversationResponseItem[] Optional. Output items
usage RealtimeUsage Optional. Token usage statistics
conversation_id string Optional. Associated conversation ID
voice RealtimeVoice Optional. Voice used for response
modalities string[] Optional. Modalities used
output_audio_format RealtimeOutputAudioFormat Optional. Audio format used
temperature number Optional. Temperature used
max_response_output_tokens integer or "inf" Optional. Max tokens used

RealtimeResponseStatus

Response status values.

Allowed Values:

  • in_progress - Response is being generated
  • completed - Response completed successfully
  • cancelled - Response was cancelled
  • incomplete - Response incomplete (interrupted)
  • failed - Response failed with error

RealtimeUsage

Token usage statistics.

Field Type Description
total_tokens integer Total tokens used
input_tokens integer Input tokens used
output_tokens integer Output tokens generated
input_token_details TokenDetails Breakdown of input tokens
output_token_details TokenDetails Breakdown of output tokens

TokenDetails

Detailed token usage breakdown.

Field Type Description
cached_tokens integer Optional. Cached tokens used
text_tokens integer Optional. Text tokens used
audio_tokens integer Optional. Audio tokens used

Error Handling

RealtimeErrorDetails

Error information object.

Field Type Description
type string Error type (e.g., "invalid_request_error", "server_error")
code string Optional. Specific error code
message string Human-readable error description
param string Optional. Parameter related to the error
event_id string Optional. ID of the client event that caused the error
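
Server-side errors carry this object. A sketch of an error event with these details (illustrative values):

```json
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value for 'input_audio_format'.",
    "param": "session.input_audio_format",
    "event_id": "event_abc123"
  }
}
```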

RealtimeConversationRequestItem

You use the RealtimeConversationRequestItem object to create a new item in the conversation via the conversation.item.create event.

This is a union type that can be one of the following:

RealtimeSystemMessageItem

A system message item.

Field Type Description
type string The type of the item.

Allowed values: message
role string The role of the message.

Allowed values: system
content array of RealtimeInputTextContentPart The content of the message.
id string The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.

RealtimeUserMessageItem

A user message item.

Field Type Description
type string The type of the item.

Allowed values: message
role string The role of the message.

Allowed values: user
content array of RealtimeInputTextContentPart or RealtimeInputAudioContentPart The content of the message.
id string The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.
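
For example, a conversation.item.create client event that adds a user text message might look like the following sketch:

```json
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      { "type": "input_text", "text": "What's the weather like in Seattle?" }
    ]
  }
}
```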

RealtimeAssistantMessageItem

An assistant message item.

Field Type Description
type string The type of the item.

Allowed values: message
role string The role of the message.

Allowed values: assistant
content array of RealtimeOutputTextContentPart The content of the message.

RealtimeFunctionCallItem

A function call item.

Field Type Description
type string The type of the item.

Allowed values: function_call
name string The name of the function to call.
arguments string The arguments of the function call as a JSON string.
call_id string The ID of the function call item.
id string The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.

RealtimeFunctionCallOutputItem

A function call output item.

Field Type Description
type string The type of the item.

Allowed values: function_call_output
call_id string The ID of the function call item.
output string The output of the function call. This is a free-form string containing the function result, and it can be empty.
id string The unique ID of the item. If the client doesn't provide an ID, the server generates one.
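
After executing a function requested by the model, the client returns the result by creating a function_call_output item. A sketch, assuming a hypothetical weather function whose call_id came from the model's earlier function_call item:

```json
{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": "call_abc123",
    "output": "{\"temperature_c\": 18, \"condition\": \"cloudy\"}"
  }
}
```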

RealtimeMCPApprovalResponseItem

An MCP approval response item.

Field Type Description
type string The type of the item.

Allowed values: mcp_approval_response
approve boolean Whether the MCP request is approved.
approval_request_id string The ID of the MCP approval request.

RealtimeFunctionTool

The definition of a function tool as used by the realtime endpoint.

Field Type Description
type string The type of the tool.

Allowed values: function
name string The name of the function.
description string The description of the function, including usage guidelines. For example, "Use this function to get the current time."
parameters object The parameters of the function in the form of a JSON object.

RealtimeResponseAudioContentPart

Field Type Description
type string The type of the content part.

Allowed values: audio
transcript string The transcript of the audio.

This property is nullable.

RealtimeResponseFunctionCallItem

Field Type Description
type string The type of the item.

Allowed values: function_call
name string The name of the function call item.
call_id string The ID of the function call item.
arguments string The arguments of the function call item.
status RealtimeItemStatus The status of the item.

RealtimeResponseFunctionCallOutputItem

Field Type Description
type string The type of the item.

Allowed values: function_call_output
call_id string The ID of the function call item.
output string The output of the function call item.

RealtimeResponseOptions

Field Type Description
modalities array The modalities that the session supports.

Allowed values: text, audio

For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"]. You can't enable only audio.
instructions string The instructions (the system message) to guide the model's responses.
voice RealtimeVoice The voice used for the model response for the session.

Once the voice is used in the session for the model's audio response, it can't be changed.
tools array of RealtimeTool The tools available to the model for the session.
tool_choice RealtimeToolChoice The tool choice for the session.
temperature number The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens integer or "inf" The maximum number of output tokens per assistant response, inclusive of tool calls.

Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens.

For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000. To allow the maximum number of tokens, set "max_response_output_tokens": "inf".

Defaults to "inf".
conversation string Controls which conversation the response is added to. The supported values are auto and none.

The auto value (or not setting this property) ensures that the contents of the response are added to the session's default conversation.

Set this property to none to create an out-of-band response where items won't be added to the default conversation.

Defaults to "auto"
metadata map Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys can be a maximum of 64 characters long and values can be a maximum of 512 characters long.

For example: metadata: { topic: "classification" }
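
These options are typically sent with a response.create client event. For example, an out-of-band classification request that stays out of the default conversation might look like this sketch:

```json
{
  "type": "response.create",
  "response": {
    "modalities": ["text"],
    "instructions": "Classify the user's last utterance as a question or a statement.",
    "conversation": "none",
    "metadata": { "topic": "classification" },
    "max_response_output_tokens": 200
  }
}
```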

RealtimeResponseSession

The RealtimeResponseSession object represents a session in the Realtime API. It's used in some of the server events, such as session.created and session.updated.

Field Type Description
object string The session object.

Allowed values: realtime.session
id string The unique ID of the session.
model string The model used for the session.
modalities array The modalities that the session supports.

Allowed values: text, audio

For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"]. You can't enable only audio.
instructions string The instructions (the system message) to guide the model's text and audio responses.

Here are some example instructions to help guide content and format of text and audio responses:
"instructions": "be succinct"
"instructions": "act friendly"
"instructions": "here are examples of good responses"

Here are some example instructions to help guide audio behavior:
"instructions": "talk quickly"
"instructions": "inject emotion into your voice"
"instructions": "laugh frequently"

While the model might not always follow these instructions, they provide guidance on the desired behavior.
voice RealtimeVoice The voice used for the model response for the session.

Once the voice is used in the session for the model's audio response, it can't be changed.
input_audio_sampling_rate integer The sampling rate for the input audio.
input_audio_format RealtimeAudioFormat The format for the input audio.
output_audio_format RealtimeAudioFormat The format for the output audio.
input_audio_transcription RealtimeAudioInputTranscriptionSettings The settings for audio input transcription.

This property is nullable.
turn_detection RealtimeTurnDetection The turn detection settings for the session.

This property is nullable.
tools array of RealtimeTool The tools available to the model for the session.
tool_choice RealtimeToolChoice The tool choice for the session.
temperature number The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens integer or "inf" The maximum number of output tokens per assistant response, inclusive of tool calls.

Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens.

For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000. To allow the maximum number of tokens, set "max_response_output_tokens": "inf".

RealtimeResponseStatusDetails

Field Type Description
type RealtimeResponseStatus The status of the response.

RealtimeRateLimitsItem

Field Type Description
name string The rate limit property name that this item includes information about.
limit integer The maximum configured limit for this rate limit property.
remaining integer The remaining quota available against the configured limit for this rate limit property.
reset_seconds number The remaining time, in seconds, until this rate limit property is reset.
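
Rate limit items are typically delivered as an array in a rate_limits.updated server event, for example (illustrative values):

```json
{
  "type": "rate_limits.updated",
  "rate_limits": [
    { "name": "requests", "limit": 1000, "remaining": 999, "reset_seconds": 60 },
    { "name": "tokens", "limit": 50000, "remaining": 49550, "reset_seconds": 6.2 }
  ]
}
```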