models Package

Classes

AgentConfig

Configuration for the agent.

Animation

Configuration for animation outputs, including blendshape and viseme metadata.

AssistantMessageItem

An assistant message item within a conversation.

AudioEchoCancellation

Echo cancellation configuration for server-side audio processing.

AudioInputTranscriptionOptions

Configuration for input audio transcription.

AudioNoiseReduction

Configuration for input audio noise reduction.

AvatarConfig

Configuration for avatar streaming and behavior during the session.

AzureCustomVoice

Azure custom voice configuration.

AzurePersonalVoice

Azure personal voice configuration.

AzureSemanticDetection

Azure semantic end-of-utterance detection (default).

AzureSemanticDetectionEn

Azure semantic end-of-utterance detection (English-optimized).

AzureSemanticDetectionMultilingual

Azure semantic end-of-utterance detection (multilingual).

AzureSemanticVad

Server Speech Detection (Azure semantic VAD, default variant).

AzureSemanticVadEn

Server Speech Detection (Azure semantic VAD, English-only).

AzureSemanticVadMultilingual

Server Speech Detection (Azure semantic VAD, multilingual).

AzureStandardVoice

Azure standard voice configuration.

AzureVoice

Base for Azure voice configurations.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureCustomVoice, AzurePersonalVoice, AzureStandardVoice

Background

Defines a video background, either a solid color or an image URL (mutually exclusive).

CachedTokenDetails

Details of cached token usage.

ClientEvent

A voicelive client event.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: ClientEventConversationItemCreate, ClientEventConversationItemDelete, ClientEventConversationItemRetrieve, ClientEventConversationItemTruncate, ClientEventInputAudioClear, ClientEventInputAudioTurnAppend, ClientEventInputAudioTurnCancel, ClientEventInputAudioTurnEnd, ClientEventInputAudioTurnStart, ClientEventInputAudioBufferAppend, ClientEventInputAudioBufferClear, ClientEventInputAudioBufferCommit, ClientEventResponseCancel, ClientEventResponseCreate, ClientEventSessionAvatarConnect, ClientEventSessionUpdate

ClientEventConversationItemCreate

Add a new Item to the Conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages. If successful, the server will respond with a conversation.item.created event, otherwise an error event will be sent.
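
A minimal sketch of the wire payload a client might send to populate the history with a user text message. Field names beyond "type" (item, role, content) are assumptions borrowed from the Realtime-style protocol, not confirmed by this reference:

    import json

    # Hypothetical conversation.item.create payload; the item field names
    # are assumed from the Realtime-style protocol.
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "What is my account balance?"}],
        },
    }
    payload = json.dumps(event)  # send this text frame over the session WebSocket

If accepted, the server responds with conversation.item.created; otherwise it sends an error event.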

ClientEventConversationItemDelete

Send this event when you want to remove any item from the conversation history. The server will respond with a conversation.item.deleted event, unless the item does not exist in the conversation history, in which case the server will respond with an error.

ClientEventConversationItemRetrieve

Send this event when you want to retrieve the server's representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD. The server will respond with a conversation.item.retrieved event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
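
Both delete and retrieve address the target item by ID only. A sketch of the two payloads (the item_id value is hypothetical):

    # Hypothetical payloads; each takes only the target item's ID.
    delete_event = {"type": "conversation.item.delete", "item_id": "item_123"}
    retrieve_event = {"type": "conversation.item.retrieve", "item_id": "item_123"}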

ClientEventConversationItemTruncate

Send this event to truncate a previous assistant message's audio. The server will produce audio faster than real time, so this event is useful when the user interrupts to truncate audio that has already been sent to the client but not yet played. This will synchronize the server's understanding of the audio with the client's playback. Truncating audio will delete the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user. If successful, the server will respond with a conversation.item.truncated event.
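
A sketch of a truncate payload sent after a user interruption. The content_index and audio_end_ms field names are assumptions from the Realtime-style protocol (audio_end_ms being the amount of audio the client actually played):

    # Hypothetical conversation.item.truncate payload.
    truncate_event = {
        "type": "conversation.item.truncate",
        "item_id": "item_456",   # hypothetical ID of the interrupted assistant item
        "content_index": 0,      # assumed: which content part to truncate
        "audio_end_ms": 1500,    # assumed: milliseconds of audio actually played
    }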

ClientEventInputAudioBufferAppend

Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. In Server VAD mode, the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. The client may choose how much audio to place in each event, up to a maximum of 15 MiB; for example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event.
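
A minimal sketch of chunked appends, assuming the audio bytes travel base64-encoded in an "audio" field as in the Realtime-style protocol:

    import base64
    import json

    CHUNK_SIZE = 4096  # smaller chunks can make server VAD more responsive

    def append_events(pcm_bytes: bytes):
        """Yield input_audio_buffer.append payloads for one audio stream."""
        for offset in range(0, len(pcm_bytes), CHUNK_SIZE):
            chunk = pcm_bytes[offset:offset + CHUNK_SIZE]
            yield json.dumps({
                "type": "input_audio_buffer.append",
                # "audio" as a base64 payload field is an assumption.
                "audio": base64.b64encode(chunk).decode("ascii"),
            })

Remember that the server sends no confirmation for these events.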

ClientEventInputAudioBufferClear

Send this event to clear the audio bytes in the buffer. The server will respond with an input_audio_buffer.cleared event.

ClientEventInputAudioBufferCommit

Send this event to commit the user input audio buffer, which will create a new user message item in the conversation. This event will produce an error if the input audio buffer is empty. When in Server VAD mode, the client does not need to send this event; the server will commit the audio buffer automatically. Committing the input audio buffer will trigger input audio transcription (if enabled in session configuration), but it will not create a response from the model. The server will respond with an input_audio_buffer.committed event.
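
When server VAD is disabled, the manual commit carries only the event type; a sketch:

    # Manual commit, needed only when server VAD is disabled. The server
    # replies with input_audio_buffer.committed and then creates the user
    # message item (conversation.item.created).
    commit_event = {"type": "input_audio_buffer.commit"}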

ClientEventInputAudioClear

Clears all input audio currently being streamed.

ClientEventInputAudioTurnAppend

Appends audio data to an ongoing input turn.

ClientEventInputAudioTurnCancel

Cancels an in-progress input audio turn.

ClientEventInputAudioTurnEnd

Marks the end of an audio input turn.

ClientEventInputAudioTurnStart

Indicates the start of a new audio input turn.

ClientEventResponseCancel

Send this event to cancel an in-progress response. The server will respond with a response.cancelled event or an error if there is no response to cancel.

ClientEventResponseCreate

This event instructs the server to create a Response, which means triggering model inference. When in Server VAD mode, the server will create Responses automatically. A Response will include at least one Item, and may have two, in which case the second will be a function call. These Items will be appended to the conversation history. The server will respond with a response.created event, events for the Items and content created, and finally a response.done event to indicate the Response is complete. The response.create event includes inference configuration like instructions and temperature; these fields override the Session's configuration for this Response only.
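
A sketch of a response.create carrying per-response overrides. Nesting the overrides under a "response" object is an assumption borrowed from the Realtime-style protocol:

    # Hypothetical response.create payload; these fields override the
    # session configuration for this Response only.
    response_event = {
        "type": "response.create",
        "response": {  # assumed wrapper object
            "instructions": "Answer in one short sentence.",
            "temperature": 0.7,
        },
    }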

ClientEventSessionAvatarConnect

Sent when the client connects and provides its SDP (Session Description Protocol) for avatar-related media negotiation.

ClientEventSessionUpdate

Send this event to update the session's default configuration. The client may send this event at any time to update any field, except for voice. However, note that once a session has been initialized with a particular model, it can't be changed to another model using session.update. When the server receives a session.update, it will respond with a session.updated event showing the full, effective configuration. Only the fields that are present are updated. To clear a field like instructions, pass an empty string.
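
A sketch of a mid-session update, including clearing instructions with an empty string as described above. The "session" wrapper object is an assumption:

    # Hypothetical session.update payload; only the fields present are updated.
    update_event = {
        "type": "session.update",
        "session": {             # assumed wrapper object
            "instructions": "",  # empty string clears the field
        },
    }

The server answers with session.updated showing the full, effective configuration.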

ContentPart

Base for any content part; discriminated by type.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseAudioContentPart, RequestAudioContentPart, RequestTextContentPart, ResponseTextContentPart

ConversationItemBase

The item to add to the conversation.

ConversationRequestItem

Base for any request item; discriminated by type.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionCallItem, FunctionCallOutputItem, MessageItem

EouDetection

Top-level union for end-of-utterance (EOU) semantic detection configuration.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticDetection, AzureSemanticDetectionEn, AzureSemanticDetectionMultilingual

ErrorResponse

Standard error response envelope.

FunctionCallItem

A function call item within a conversation.

FunctionCallOutputItem

A function call output item within a conversation.

FunctionTool

The definition of a function tool as used by the voicelive endpoint.

IceServer

ICE server configuration for WebRTC connection negotiation.

InputAudioContentPart

Input audio content part.

InputTextContentPart

Input text content part.

InputTokenDetails

Details of input token usage.

LogProbProperties

A single log probability entry for a token.

MessageContentPart

Base for any message content part; discriminated by type.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: InputAudioContentPart, InputTextContentPart, OutputTextContentPart

MessageItem

A message item within a conversation.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: AssistantMessageItem, SystemMessageItem, UserMessageItem

OpenAIVoice

OpenAI voice configuration with explicit type field. This provides a unified interface for OpenAI voices, complementing the existing string-based OpenAIVoiceName for backward compatibility.

OutputTextContentPart

Output text content part.

OutputTokenDetails

Details of output token usage.

RequestAudioContentPart

An audio content part for a request.

RequestSession

Base for session configuration shared between request and response.

RequestTextContentPart

A text content part for a request.

Response

The response resource.

ResponseAudioContentPart

An audio content part for a response.

ResponseCancelledDetails

Details for a cancelled response.

ResponseCreateParams

Create a new VoiceLive response with these parameters.

ResponseFailedDetails

Details for a failed response.

ResponseFunctionCallItem

A function call item within a conversation.

ResponseFunctionCallOutputItem

A function call output item within a conversation.

ResponseIncompleteDetails

Details for an incomplete response.

ResponseItem

Base for any response item; discriminated by type.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseFunctionCallItem, ResponseFunctionCallOutputItem, ResponseMessageItem

ResponseMessageItem

Base type for a message item within a conversation.

ResponseSession

Base for session configuration in the response.

ResponseStatusDetails

Base for all non-success response details.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseCancelledDetails, ResponseFailedDetails, ResponseIncompleteDetails

ResponseTextContentPart

A text content part for a response.

ServerEvent

A voicelive server event.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: ServerEventConversationItemCreated, ServerEventConversationItemDeleted, ServerEventConversationItemInputAudioTranscriptionCompleted, ServerEventConversationItemInputAudioTranscriptionDelta, ServerEventConversationItemInputAudioTranscriptionFailed, ServerEventConversationItemRetrieved, ServerEventConversationItemTruncated, ServerEventError, ServerEventInputAudioBufferCleared, ServerEventInputAudioBufferCommitted, ServerEventInputAudioBufferSpeechStarted, ServerEventInputAudioBufferSpeechStopped, ServerEventResponseAnimationBlendshapeDelta, ServerEventResponseAnimationBlendshapeDone, ServerEventResponseAnimationVisemeDelta, ServerEventResponseAnimationVisemeDone, ServerEventResponseAudioDelta, ServerEventResponseAudioDone, ServerEventResponseAudioTimestampDelta, ServerEventResponseAudioTimestampDone, ServerEventResponseAudioTranscriptDelta, ServerEventResponseAudioTranscriptDone, ServerEventResponseContentPartAdded, ServerEventResponseContentPartDone, ServerEventResponseCreated, ServerEventResponseDone, ServerEventResponseFunctionCallArgumentsDelta, ServerEventResponseFunctionCallArgumentsDone, ServerEventResponseOutputItemAdded, ServerEventResponseOutputItemDone, ServerEventResponseTextDelta, ServerEventResponseTextDone, ServerEventSessionAvatarConnecting, ServerEventSessionCreated, ServerEventSessionUpdated

ServerEventConversationItemCreated

Returned when a conversation item is created. There are several scenarios that produce this event:

- The server is generating a Response, which if successful will produce one or two Items of type message (role assistant) or type function_call.
- The input audio buffer has been committed, either by the client or by the server (in server_vad mode); the server takes the content of the input audio buffer and adds it to a new user message Item.
- The client has sent a conversation.item.create event to add a new Item to the Conversation.

ServerEventConversationItemDeleted

Returned when an item in the conversation is deleted by the client with a conversation.item.delete event. This event is used to synchronize the server's understanding of the conversation history with the client's view.

ServerEventConversationItemInputAudioTranscriptionCompleted

This event is the output of audio transcription for user audio written to the user audio buffer. Transcription begins when the input audio buffer is committed by the client or server (in server_vad mode). Transcription runs asynchronously with Response creation, so this event may come before or after the Response events. VoiceLive API models accept audio natively, and thus input transcription is a separate process run on a separate ASR (Automatic Speech Recognition) model. The transcript may diverge somewhat from the model's interpretation, and should be treated as a rough guide.

ServerEventConversationItemInputAudioTranscriptionDelta

Returned when the text value of an input audio transcription content part is updated.

ServerEventConversationItemInputAudioTranscriptionFailed

Returned when input audio transcription is configured, and a transcription request for a user message failed. These events are separate from other error events so that the client can identify the related Item.

ServerEventConversationItemRetrieved

Returned when a conversation item is retrieved with conversation.item.retrieve.

ServerEventConversationItemTruncated

Returned when an earlier assistant audio message item is truncated by the client with a conversation.item.truncate event. This event is used to synchronize the server's understanding of the audio with the client's playback. This action will truncate the audio and remove the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.

ServerEventError

Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open; we recommend that implementors monitor and log error messages by default.

ServerEventErrorDetails

Details of the error.

ServerEventInputAudioBufferCleared

Returned when the input audio buffer is cleared by the client with an input_audio_buffer.clear event.

ServerEventInputAudioBufferCommitted

Returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id property is the ID of the user message item that will be created, thus a conversation.item.created event will also be sent to the client.

ServerEventInputAudioBufferSpeechStarted

Sent by the server when in server_vad mode to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client may want to use this event to interrupt audio playback or provide visual feedback to the user. The client should expect to receive an input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item that will be created when speech stops and will also be included in the input_audio_buffer.speech_stopped event (unless the client manually commits the audio buffer during VAD activation).
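
A common use is barge-in: stop local playback as soon as the server reports speech. A minimal dispatch sketch over the event type strings named above; stop_playback is a hypothetical hook:

    import json

    def stop_playback() -> None:
        """Hypothetical hook: interrupt any assistant audio currently playing."""

    def handle_server_event(raw: str) -> None:
        event = json.loads(raw)
        if event["type"] == "input_audio_buffer.speech_started":
            stop_playback()
        elif event["type"] == "input_audio_buffer.speech_stopped":
            # item_id names the user message item the server will create
            print("speech stopped; item:", event.get("item_id"))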

ServerEventInputAudioBufferSpeechStopped

Returned in server_vad mode when the server detects the end of speech in the audio buffer. The server will also send a conversation.item.created event with the user message item that is created from the audio buffer.

ServerEventResponseAnimationBlendshapeDelta

Represents a delta update of blendshape animation frames for a specific output of a response.

ServerEventResponseAnimationBlendshapeDone

Indicates the completion of blendshape animation processing for a specific output of a response.

ServerEventResponseAnimationVisemeDelta

Represents a viseme ID delta update for animation based on audio.

ServerEventResponseAnimationVisemeDone

Indicates completion of viseme animation delivery for a response.

ServerEventResponseAudioDelta

Returned when the model-generated audio is updated.

ServerEventResponseAudioDone

Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled.

ServerEventResponseAudioTimestampDelta

Represents a word-level audio timestamp delta for a response.

ServerEventResponseAudioTimestampDone

Indicates completion of audio timestamp delivery for a response.

ServerEventResponseAudioTranscriptDelta

Returned when the model-generated transcription of audio output is updated.

ServerEventResponseAudioTranscriptDone

Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.

ServerEventResponseContentPartAdded

Returned when a new content part is added to an assistant message item during response generation.

ServerEventResponseContentPartDone

Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled.

ServerEventResponseCreated

Returned when a new Response is created. The first event of response creation, where the response is in an initial state of in_progress.

ServerEventResponseDone

Returned when a Response is done streaming. Always emitted, no matter the final state. The Response object included in the response.done event will include all output Items in the Response but will omit the raw audio data.
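
Because response.done omits the raw audio, a client that wants the complete waveform must accumulate the audio deltas itself. A sketch, assuming response.audio.delta events carry base64 audio in a "delta" field (inferred from the class names above, not confirmed by this reference):

    import base64

    audio_chunks: list[bytes] = []

    def on_event(event: dict) -> None:
        if event["type"] == "response.audio.delta":
            audio_chunks.append(base64.b64decode(event["delta"]))  # assumed field
        elif event["type"] == "response.done":
            pcm = b"".join(audio_chunks)  # response.done itself carries no audio
            print(f"collected {len(pcm)} bytes of audio")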

ServerEventResponseFunctionCallArgumentsDelta

Returned when the model-generated function call arguments are updated.

ServerEventResponseFunctionCallArgumentsDone

Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.

ServerEventResponseOutputItemAdded

Returned when a new Item is created during Response generation.

ServerEventResponseOutputItemDone

Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.

ServerEventResponseTextDelta

Returned when the text value of a "text" content part is updated.

ServerEventResponseTextDone

Returned when the text value of a "text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.

ServerEventSessionAvatarConnecting

Sent when the server is in the process of establishing an avatar media connection and provides its SDP answer.

ServerEventSessionCreated

Returned when a Session is created. Emitted automatically as the first server event when a new connection is established. This event will contain the default Session configuration.

ServerEventSessionUpdated

Returned when a session is updated with a session.update event, unless there is an error.

ServerVad

Base model for VAD-based turn detection.

SessionBase

VoiceLive session object configuration.

SystemMessageItem

A system message item within a conversation.

TokenUsage

Overall usage statistics for a response.

Tool

The base representation of a voicelive tool definition.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionTool

ToolChoiceFunctionSelection

The representation of a voicelive tool_choice selecting a named function tool.

ToolChoiceSelection

A base representation for a voicelive tool_choice selecting a named tool.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: ToolChoiceFunctionSelection

TurnDetection

Top-level union for turn detection configuration.

You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticVad, AzureSemanticVadEn, AzureSemanticVadMultilingual, ServerVad

UserMessageItem

A user message item within a conversation.

VideoCrop

Defines a video crop rectangle using top-left and bottom-right coordinates.

VideoParams

Video streaming parameters for avatar.

VideoResolution

Resolution of the video feed in pixels.

VoiceLiveErrorDetails

Error object returned in case of API failure.

Enums

AnimationOutputType

Specifies the types of animation data to output.

AudioTimestampType

Output timestamp types supported in audio response content.

AzureVoiceType

Union of all supported Azure voice types.

ClientEventType

Client event types used in VoiceLive protocol.

ContentPartType

Type discriminator values for content parts.

EouThresholdLevel

Threshold level settings for Azure semantic end-of-utterance detection.

InputAudioFormat

Input audio format types supported.

ItemParamStatus

Indicates the processing status of an item or parameter.

ItemType

Type discriminator values for conversation items.

MessageRole

The role of a message item within a conversation.

Modality

Supported modalities for the session.

OpenAIVoiceName

Supported OpenAI voice names (string enum).

OutputAudioFormat

Output audio format types supported.

PersonalVoiceModels

Supported models for Azure personal voice.

ResponseItemStatus

Indicates the processing status of a response item.

ResponseStatus

Terminal status of a response.

ServerEventType

Server event types used in VoiceLive protocol.

ToolChoiceLiteral

The available set of mode-level, string literal tool_choice options for the voicelive endpoint.

ToolType

The supported tool type discriminators for voicelive tools. Currently, only 'function' tools are supported.

TurnDetectionType

Type discriminator values for turn detection configurations.