models Package
Classes
| AgentConfig |
Configuration for the agent. |
| Animation |
Configuration for animation outputs including blendshapes and visemes metadata. |
| AssistantMessageItem |
An assistant message item within a conversation. |
| AudioEchoCancellation |
Echo cancellation configuration for server-side audio processing. |
| AudioInputTranscriptionOptions |
Configuration for input audio transcription. |
| AudioNoiseReduction |
Configuration for input audio noise reduction. |
| AvatarConfig |
Configuration for avatar streaming and behavior during the session. |
| AzureCustomVoice |
Azure custom voice configuration. |
| AzurePersonalVoice |
Azure personal voice configuration. |
| AzureSemanticDetection |
Azure semantic end-of-utterance detection (default). |
| AzureSemanticDetectionEn |
Azure semantic end-of-utterance detection (English-optimized). |
| AzureSemanticDetectionMultilingual |
Azure semantic end-of-utterance detection (multilingual). |
| AzureSemanticVad |
Server Speech Detection (Azure semantic VAD, default variant). |
| AzureSemanticVadEn |
Server Speech Detection (Azure semantic VAD, English-only). |
| AzureSemanticVadMultilingual |
Server Speech Detection (Azure semantic VAD, multilingual). |
| AzureStandardVoice |
Azure standard voice configuration. |
| AzureVoice |
Base for Azure voice configurations. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureCustomVoice, AzurePersonalVoice, AzureStandardVoice |
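To make the hierarchy concrete, here is a minimal sketch of constructing one of the voice sub-classes for later use in a session configuration. The `name` keyword is an assumption based on this package's class list, not a confirmed signature.

```python
# Minimal sketch: constructing a concrete Azure voice. The ``name`` keyword
# argument is an assumption, not a confirmed signature.
from azure.ai.voicelive.models import AzureStandardVoice

voice = AzureStandardVoice(name="en-US-AvaNeural")  # assumed kwarg
```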
| Background |
Defines a video background, either a solid color or an image URL (mutually exclusive). |
| CachedTokenDetails |
Details of cached token usage. |
| ClientEvent |
A voicelive client event. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ClientEventConversationItemCreate, ClientEventConversationItemDelete, ClientEventConversationItemRetrieve, ClientEventConversationItemTruncate, ClientEventInputAudioClear, ClientEventInputAudioTurnAppend, ClientEventInputAudioTurnCancel, ClientEventInputAudioTurnEnd, ClientEventInputAudioTurnStart, ClientEventInputAudioBufferAppend, ClientEventInputAudioBufferClear, ClientEventInputAudioBufferCommit, ClientEventResponseCancel, ClientEventResponseCreate, ClientEventSessionAvatarConnect, ClientEventSessionUpdate |
| ClientEventConversationItemCreate |
Add a new Item to the Conversation's context, including messages, function
calls, and function call responses. This event can be used both to populate a
"history" of the conversation and to add new items mid-stream, but it currently
cannot populate assistant audio messages.
If successful, the server will respond with a ServerEventConversationItemCreated event. |
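As a rough illustration of populating history, the sketch below creates a user message item and sends it as a ClientEventConversationItemCreate. The `item` and `content` keywords, the `text` field, and the `connection.send` coroutine are assumptions inferred from this package's class names, not confirmed API.

```python
# Sketch: seeding conversation history with a user text message. The ``item``
# and ``content`` kwargs, the ``text`` field, and ``connection.send`` are all
# assumptions inferred from the class names in this package.
from azure.ai.voicelive.models import (
    ClientEventConversationItemCreate,
    InputTextContentPart,
    UserMessageItem,
)

async def seed_history(connection) -> None:
    item = UserMessageItem(
        content=[InputTextContentPart(text="What's the weather in Seattle?")],
    )
    await connection.send(ClientEventConversationItemCreate(item=item))
```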
| ClientEventConversationItemDelete |
Send this event when you want to remove any item from the conversation
history. The server will respond with a ServerEventConversationItemDeleted event. |
| ClientEventConversationItemRetrieve |
Send this event when you want to retrieve the server's representation of a specific item in the
conversation history. This is useful, for example, to inspect user audio after noise
cancellation and VAD.
The server will respond with a ServerEventConversationItemRetrieved event. |
| ClientEventConversationItemTruncate |
Send this event to truncate a previous assistant message's audio. The server
will produce audio faster than real time, so this event is useful when the user
interrupts to truncate audio that has already been sent to the client but not
yet played. This will synchronize the server's understanding of the audio with
the client's playback.
Truncating audio will delete the server-side text transcript to ensure there
is no text in the context that the user hasn't heard.
If successful, the server will respond with a ServerEventConversationItemTruncated event. |
| ClientEventInputAudioBufferAppend |
Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. In Server VAD mode, the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. The client may choose how much audio to place in each event, up to a maximum of 15 MiB; for example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event. |
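A sketch of the chunked-append pattern described above, keeping each event far below the 15 MiB cap. The `audio` keyword (and whether it takes raw bytes or a base64 string) and the `connection.send` coroutine are assumptions.

```python
# Sketch: streaming PCM audio into the input buffer in ~100 ms chunks so
# server VAD stays responsive. The ``audio`` kwarg (bytes vs. base64 string)
# and ``connection.send`` are assumptions.
from azure.ai.voicelive.models import ClientEventInputAudioBufferAppend

CHUNK_BYTES = 3200  # ~100 ms of 16 kHz, 16-bit mono PCM

async def stream_pcm(connection, pcm: bytes) -> None:
    for offset in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[offset : offset + CHUNK_BYTES]
        await connection.send(ClientEventInputAudioBufferAppend(audio=chunk))
```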
| ClientEventInputAudioBufferClear |
Send this event to clear the audio bytes in the buffer. The server will
respond with a ServerEventInputAudioBufferCleared event. |
| ClientEventInputAudioBufferCommit |
Send this event to commit the user input audio buffer, which will create a
new user message item in the conversation. This event will produce an error
if the input audio buffer is empty. When in Server VAD mode, the client does
not need to send this event; the server will commit the audio buffer
automatically.
Committing the input audio buffer will trigger input audio transcription
(if enabled in session configuration), but it will not create a response
from the model. The server will respond with a ServerEventInputAudioBufferCommitted event. |
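When server VAD is disabled, ending the user's turn looks roughly like the sketch below: commit the buffer, which creates the user message item but does not by itself start a response (see ClientEventResponseCreate for that). `connection.send` is an assumed method.

```python
# Sketch: manual end-of-turn when server VAD is disabled. Committing creates
# the user message item and may trigger transcription, but it does not start
# a model response; ``connection.send`` is an assumed method.
from azure.ai.voicelive.models import ClientEventInputAudioBufferCommit

async def end_user_turn(connection) -> None:
    await connection.send(ClientEventInputAudioBufferCommit())
```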
| ClientEventInputAudioClear |
Clears all input audio currently being streamed. |
| ClientEventInputAudioTurnAppend |
Appends audio data to an ongoing input turn. |
| ClientEventInputAudioTurnCancel |
Cancels an in-progress input audio turn. |
| ClientEventInputAudioTurnEnd |
Marks the end of an audio input turn. |
| ClientEventInputAudioTurnStart |
Indicates the start of a new audio input turn. |
| ClientEventResponseCancel |
Send this event to cancel an in-progress response. The server will respond
with a ServerEventResponseDone event with a status of cancelled. |
| ClientEventResponseCreate |
This event instructs the server to create a Response, which means triggering
model inference. When in Server VAD mode, the server will create Responses
automatically.
A Response will include at least one Item, and may have two, in which case
the second will be a function call. These Items will be appended to the
conversation history.
The server will respond with a ServerEventResponseCreated event. |
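A follow-on sketch: explicitly requesting model inference after a manual commit. The `response` keyword, the ResponseCreateParams fields, and the Modality member names are assumptions.

```python
# Sketch: explicitly triggering model inference. The ``response`` kwarg, the
# ResponseCreateParams fields, and the Modality member names are assumptions.
from azure.ai.voicelive.models import (
    ClientEventResponseCreate,
    Modality,
    ResponseCreateParams,
)

async def request_response(connection) -> None:
    params = ResponseCreateParams(modalities=[Modality.TEXT, Modality.AUDIO])
    await connection.send(ClientEventResponseCreate(response=params))
```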
| ClientEventSessionAvatarConnect |
Sent when the client connects and provides its SDP (Session Description Protocol) for avatar-related media negotiation. |
| ClientEventSessionUpdate |
Send this event to update the session's default configuration.
The client may send this event at any time to update any field,
except for fields that cannot be changed once the session is created. |
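A sketch of a typical session update using the request-side models from this package. Every keyword shown (`session`, `instructions`, `voice`, `turn_detection`) is an assumption inferred from the class descriptions, not a confirmed signature.

```python
# Sketch: updating session defaults. All kwargs shown are assumptions
# inferred from the model classes in this package.
from azure.ai.voicelive.models import (
    AzureStandardVoice,
    ClientEventSessionUpdate,
    RequestSession,
    ServerVad,
)

async def configure_session(connection) -> None:
    session = RequestSession(
        instructions="You are a concise voice assistant.",  # assumed field
        voice=AzureStandardVoice(name="en-US-AvaNeural"),   # assumed kwarg
        turn_detection=ServerVad(),                         # default server VAD
    )
    await connection.send(ClientEventSessionUpdate(session=session))
```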
| ContentPart |
Base for any content part; discriminated by type. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseAudioContentPart, RequestAudioContentPart, RequestTextContentPart, ResponseTextContentPart |
| ConversationItemBase |
Base type for an item to add to the conversation. |
| ConversationRequestItem |
Base for any conversation request item; discriminated by type. You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionCallItem, FunctionCallOutputItem, MessageItem |
| EouDetection |
Top-level union for end-of-utterance (EOU) semantic detection configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticDetection, AzureSemanticDetectionEn, AzureSemanticDetectionMultilingual |
| ErrorResponse |
Standard error response envelope. |
| FunctionCallItem |
A function call item within a conversation. |
| FunctionCallOutputItem |
A function call output item within a conversation. |
| FunctionTool |
The definition of a function tool as used by the voicelive endpoint. |
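Function tools conventionally pair a name and description with a JSON-schema `parameters` object; the sketch below assumes FunctionTool follows that shape (the keyword names are not confirmed).

```python
# Sketch: declaring a function tool with a JSON-schema parameters object.
# The ``name``/``description``/``parameters`` kwargs are assumptions based on
# the conventional function-tool shape.
from azure.ai.voicelive.models import FunctionTool

get_weather = FunctionTool(
    name="get_weather",
    description="Look up the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)
```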
| IceServer |
ICE server configuration for WebRTC connection negotiation. |
| InputAudioContentPart |
Input audio content part. |
| InputTextContentPart |
Input text content part. |
| InputTokenDetails |
Details of input token usage. |
| LogProbProperties |
A single log probability entry for a token. |
| MessageContentPart |
Base for any message content part; discriminated by type. You probably want to use the sub-classes and not this class directly. Known sub-classes are: InputAudioContentPart, InputTextContentPart, OutputTextContentPart |
| MessageItem |
A message item within a conversation. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AssistantMessageItem, SystemMessageItem, UserMessageItem |
| OpenAIVoice |
OpenAI voice configuration with explicit type field. This provides a unified interface for OpenAI voices, complementing the existing string-based OpenAIVoiceName for backward compatibility. |
| OutputTextContentPart |
Output text content part. |
| OutputTokenDetails |
Details of output token usage. |
| RequestAudioContentPart |
An audio content part for a request. |
| RequestSession |
Base for session configuration shared between request and response. |
| RequestTextContentPart |
A text content part for a request. |
| Response |
The response resource. |
| ResponseAudioContentPart |
An audio content part for a response. |
| ResponseCancelledDetails |
Details for a cancelled response. |
| ResponseCreateParams |
Create a new VoiceLive response with these parameters. |
| ResponseFailedDetails |
Details for a failed response. |
| ResponseFunctionCallItem |
A function call item within a conversation. |
| ResponseFunctionCallOutputItem |
A function call output item within a conversation. |
| ResponseIncompleteDetails |
Details for an incomplete response. |
| ResponseItem |
Base for any response item; discriminated by type. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseFunctionCallItem, ResponseFunctionCallOutputItem, ResponseMessageItem |
| ResponseMessageItem |
Base type for a message item within a conversation. |
| ResponseSession |
Base for session configuration in the response. |
| ResponseStatusDetails |
Base for all non-success response details. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseCancelledDetails, ResponseFailedDetails, ResponseIncompleteDetails |
| ResponseTextContentPart |
A text content part for a response. |
| ServerEvent |
A voicelive server event. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ServerEventConversationItemCreated, ServerEventConversationItemDeleted, ServerEventConversationItemInputAudioTranscriptionCompleted, ServerEventConversationItemInputAudioTranscriptionDelta, ServerEventConversationItemInputAudioTranscriptionFailed, ServerEventConversationItemRetrieved, ServerEventConversationItemTruncated, ServerEventError, ServerEventInputAudioBufferCleared, ServerEventInputAudioBufferCommitted, ServerEventInputAudioBufferSpeechStarted, ServerEventInputAudioBufferSpeechStopped, ServerEventResponseAnimationBlendshapeDelta, ServerEventResponseAnimationBlendshapeDone, ServerEventResponseAnimationVisemeDelta, ServerEventResponseAnimationVisemeDone, ServerEventResponseAudioDelta, ServerEventResponseAudioDone, ServerEventResponseAudioTimestampDelta, ServerEventResponseAudioTimestampDone, ServerEventResponseAudioTranscriptDelta, ServerEventResponseAudioTranscriptDone, ServerEventResponseContentPartAdded, ServerEventResponseContentPartDone, ServerEventResponseCreated, ServerEventResponseDone, ServerEventResponseFunctionCallArgumentsDelta, ServerEventResponseFunctionCallArgumentsDone, ServerEventResponseOutputItemAdded, ServerEventResponseOutputItemDone, ServerEventResponseTextDelta, ServerEventResponseTextDone, ServerEventSessionAvatarConnecting, ServerEventSessionCreated, ServerEventSessionUpdated |
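Because every server event is a ServerEvent sub-class, a receive loop can dispatch with isinstance checks, as in the sketch below. The `connection.recv()` coroutine and the `delta` field name are assumptions, not confirmed API.

```python
# Sketch: dispatching parsed server events. ``connection.recv()`` returning
# the next ServerEvent, and the ``delta`` field on audio deltas, are
# assumptions, not confirmed API.
from azure.ai.voicelive.models import (
    ServerEventError,
    ServerEventResponseAudioDelta,
    ServerEventResponseDone,
)

async def pump_events(connection, play_audio) -> None:
    while True:
        event = await connection.recv()  # assumed: next parsed ServerEvent
        if isinstance(event, ServerEventResponseAudioDelta):
            play_audio(event.delta)  # ``delta`` field name is an assumption
        elif isinstance(event, ServerEventResponseDone):
            break
        elif isinstance(event, ServerEventError):
            # Most errors are recoverable; log and keep the session open.
            print("server error:", event)
```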
| ServerEventConversationItemCreated |
Returned when a conversation item is created. Several scenarios produce this event: the server is generating a Response, which if successful will produce one or two Items of type message (role assistant) or type function_call; the input audio buffer has been committed, either by the client or the server (in server_vad mode), in which case the server takes the content of the input audio buffer and adds it to a new user message Item; or the client has sent a conversation.item.create event to add a new Item to the Conversation. |
| ServerEventConversationItemDeleted |
Returned when an item in the conversation is deleted by the client with a
ClientEventConversationItemDelete event. |
| ServerEventConversationItemInputAudioTranscriptionCompleted |
This event is the output of audio transcription for user audio written to the
user audio buffer. Transcription begins when the input audio buffer is
committed by the client or server (in server VAD mode). |
| ServerEventConversationItemInputAudioTranscriptionDelta |
Returned when the text value of an input audio transcription content part is updated. |
| ServerEventConversationItemInputAudioTranscriptionFailed |
Returned when input audio transcription is configured, and a transcription
request for a user message failed. These events are separate from other
error events so that the client can identify the related item. |
| ServerEventConversationItemRetrieved |
Returned when a conversation item is retrieved with a ClientEventConversationItemRetrieve event. |
| ServerEventConversationItemTruncated |
Returned when an earlier assistant audio message item is truncated by the
client with a ClientEventConversationItemTruncate event. |
| ServerEventError |
Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open; we recommend that implementers monitor and log error messages by default. |
| ServerEventErrorDetails |
Details of the error. |
| ServerEventInputAudioBufferCleared |
Returned when the input audio buffer is cleared by the client with a
ClientEventInputAudioBufferClear event. |
| ServerEventInputAudioBufferCommitted |
Returned when an input audio buffer is committed, either by the client or
automatically in server VAD mode. The item_id property is the ID of the user message item that will be created. |
| ServerEventInputAudioBufferSpeechStarted |
Sent by the server in server VAD mode to indicate that speech has been detected in the audio buffer. |
| ServerEventInputAudioBufferSpeechStopped |
Returned in server VAD mode when the server detects the end of speech in the audio buffer. |
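These two speech events enable barge-in handling: on speech start during playback, stop local audio output and cancel the in-progress response. The sketch below assumes a `connection.send` coroutine and a caller-supplied `stop_playback` callback.

```python
# Sketch: barge-in handling built on the speech-started event. The
# ``connection.send`` method and the caller-supplied ``stop_playback``
# callback are assumptions.
from azure.ai.voicelive.models import (
    ClientEventResponseCancel,
    ServerEventInputAudioBufferSpeechStarted,
)

async def handle_barge_in(connection, event, stop_playback) -> None:
    if isinstance(event, ServerEventInputAudioBufferSpeechStarted):
        stop_playback()  # user interrupted: silence the assistant locally
        await connection.send(ClientEventResponseCancel())
```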
| ServerEventResponseAnimationBlendshapeDelta |
Represents a delta update of blendshape animation frames for a specific output of a response. |
| ServerEventResponseAnimationBlendshapeDone |
Indicates the completion of blendshape animation processing for a specific output of a response. |
| ServerEventResponseAnimationVisemeDelta |
Represents a viseme ID delta update for animation based on audio. |
| ServerEventResponseAnimationVisemeDone |
Indicates completion of viseme animation delivery for a response. |
| ServerEventResponseAudioDelta |
Returned when the model-generated audio is updated. |
| ServerEventResponseAudioDone |
Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseAudioTimestampDelta |
Represents a word-level audio timestamp delta for a response. |
| ServerEventResponseAudioTimestampDone |
Indicates completion of audio timestamp delivery for a response. |
| ServerEventResponseAudioTranscriptDelta |
Returned when the model-generated transcription of audio output is updated. |
| ServerEventResponseAudioTranscriptDone |
Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseContentPartAdded |
Returned when a new content part is added to an assistant message item during response generation. |
| ServerEventResponseContentPartDone |
Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseCreated |
Returned when a new Response is created. The first event of response creation,
where the response is in an initial state of in_progress. |
| ServerEventResponseDone |
Returned when a Response is done streaming. Always emitted, no matter the
final state. The Response object included in this event contains all output items for the Response but omits the raw audio data. |
| ServerEventResponseFunctionCallArgumentsDelta |
Returned when the model-generated function call arguments are updated. |
| ServerEventResponseFunctionCallArgumentsDone |
Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseOutputItemAdded |
Returned when a new Item is created during Response generation. |
| ServerEventResponseOutputItemDone |
Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseTextDelta |
Returned when the text value of a "text" content part is updated. |
| ServerEventResponseTextDone |
Returned when the text value of a "text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventSessionAvatarConnecting |
Sent when the server is in the process of establishing an avatar media connection and provides its SDP answer. |
| ServerEventSessionCreated |
Returned when a Session is created. Emitted automatically when a new connection is established as the first server event. This event will contain the default Session configuration. |
| ServerEventSessionUpdated |
Returned when a session is updated with a ClientEventSessionUpdate event. |
| ServerVad |
Base model for VAD-based turn detection. |
| SessionBase |
VoiceLive session object configuration. |
| SystemMessageItem |
A system message item within a conversation. |
| TokenUsage |
Overall usage statistics for a response. |
| Tool |
The base representation of a voicelive tool definition. You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionTool |
| ToolChoiceFunctionSelection |
The representation of a voicelive tool_choice selecting a named function tool. |
| ToolChoiceSelection |
A base representation for a voicelive tool_choice selecting a named tool. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ToolChoiceFunctionSelection |
| TurnDetection |
Top-level union for turn detection configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticVad, AzureSemanticVadEn, AzureSemanticVadMultilingual, ServerVad |
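To contrast the two families in this union, the sketch below builds a plain ServerVad and an AzureSemanticVad with semantic end-of-utterance detection layered on. The threshold/silence kwargs and the `end_of_utterance_detection` field are assumptions based on the class descriptions above.

```python
# Sketch: the two turn-detection families. The threshold/silence kwargs and
# the ``end_of_utterance_detection`` field are assumptions based on the class
# descriptions above.
from azure.ai.voicelive.models import (
    AzureSemanticDetection,
    AzureSemanticVad,
    ServerVad,
)

# Silence-based server VAD.
basic = ServerVad(threshold=0.5, silence_duration_ms=500)  # assumed kwargs

# Azure semantic VAD plus semantic end-of-utterance detection.
semantic = AzureSemanticVad(
    end_of_utterance_detection=AzureSemanticDetection(),  # assumed field
)
```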
| UserMessageItem |
A user message item within a conversation. |
| VideoCrop |
Defines a video crop rectangle using top-left and bottom-right coordinates. |
| VideoParams |
Video streaming parameters for avatar. |
| VideoResolution |
Resolution of the video feed in pixels. |
| VoiceLiveErrorDetails |
Error object returned in case of API failure. |
Enums
| AnimationOutputType |
Specifies the types of animation data to output. |
| AudioTimestampType |
Output timestamp types supported in audio response content. |
| AzureVoiceType |
Union of all supported Azure voice types. |
| ClientEventType |
Client event types used in VoiceLive protocol. |
| ContentPartType |
Type discriminators for content parts. |
| EouThresholdLevel |
Threshold level settings for Azure semantic end-of-utterance detection. |
| InputAudioFormat |
Input audio format types supported. |
| ItemParamStatus |
Indicates the processing status of an item or parameter. |
| ItemType |
Type discriminators for conversation items. |
| MessageRole |
Role of a message author: system, user, or assistant. |
| Modality |
Supported modalities for the session. |
| OpenAIVoiceName |
Supported OpenAI voice names (string enum). |
| OutputAudioFormat |
Output audio format types supported. |
| PersonalVoiceModels |
Supported personal voice models. |
| ResponseItemStatus |
Indicates the processing status of a response item. |
| ResponseStatus |
Terminal status of a response. |
| ServerEventType |
Server event types used in VoiceLive protocol. |
| ToolChoiceLiteral |
The available set of model-level, string-literal tool_choice options for the voicelive endpoint. |
| ToolType |
The supported tool type discriminators for voicelive tools. Currently, only 'function' tools are supported. |
| TurnDetectionType |
Type discriminators for turn detection configurations. |