Yes, direct, authenticated real-time streaming from Azure OpenAI/GPT to a web frontend is a supported scenario. You can use the Realtime API over WebRTC or WebSockets; it provides low-latency, "speech in, speech out" conversational interactions, which suits your use case.
To securely manage authentication in a client-side environment, use ephemeral tokens. These short-lived tokens are minted by a REST call to your Azure OpenAI resource, and that call is authenticated server-side with an API key or a Microsoft Entra ID token, so you never expose a long-lived API key in your frontend application.
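As an illustration, below is a minimal sketch of such a token service written as a Next.js App Router route handler. The sessions endpoint path, api-version, request body, and response shape are assumptions based on the Realtime API preview, so verify them against the Azure OpenAI documentation for your resource before relying on them.

```typescript
// app/api/realtime-token/route.ts
// Minimal sketch of a server-side token service (Next.js App Router).
// NOTE: the sessions endpoint path, api-version, and response shape are
// assumptions -- confirm them against the Azure OpenAI Realtime API docs.
import { NextResponse } from "next/server";

export async function POST() {
  const endpoint = process.env.AZURE_OPENAI_ENDPOINT!; // e.g. https://<resource>.openai.azure.com
  const apiKey = process.env.AZURE_OPENAI_API_KEY!; // stays server-side only
  const deployment = process.env.AZURE_OPENAI_REALTIME_DEPLOYMENT!;

  // Request an ephemeral session key; the long-lived API key never leaves the server.
  const res = await fetch(
    `${endpoint}/openai/realtimeapi/sessions?api-version=2025-04-01-preview`, // assumed path/version
    {
      method: "POST",
      headers: { "api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ model: deployment, voice: "verse" }),
    }
  );

  if (!res.ok) {
    return NextResponse.json({ error: "Failed to create realtime session" }, { status: 500 });
  }

  const session = await res.json();
  // The ephemeral key is typically returned as session.client_secret.value and is
  // short-lived, so mint a fresh one per client connection.
  return NextResponse.json({
    ephemeralKey: session.client_secret?.value,
    sessionId: session.id,
  });
}
```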
Here’s a general outline of the steps you can follow:
- Set up a token service: Create a server-side endpoint that retrieves ephemeral tokens from the Azure OpenAI REST API (as in the route-handler sketch above), authenticating with an API key or a Microsoft Entra ID token.
- Integrate with your frontend: Your Next.js application can call this token service to obtain the ephemeral token and then use it to establish a WebRTC or WebSocket connection to the Realtime API.
- Implement the connection: Use the ephemeral token to authenticate the WebRTC or WebSocket connection, and handle the streaming of audio and text appropriately (a browser-side sketch follows this list).
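To illustrate the last two steps, here is a hedged browser-side sketch in TypeScript using the WebRTC path. It assumes the token service is exposed at /api/realtime-token (matching the route handler above), and the regional WebRTC endpoint, deployment name, and data-channel label are placeholders; confirm all of them, along with the event schema, against the Azure Realtime API documentation.

```typescript
// Browser-side sketch: fetch an ephemeral key, then open a WebRTC session.
// The WebRTC endpoint URL and deployment name below are placeholders -- use the
// values documented for your Azure OpenAI resource and region.
async function startRealtimeSession(): Promise<RTCPeerConnection> {
  // 1. Get a short-lived key from our own token service (never the real API key).
  const tokenRes = await fetch("/api/realtime-token", { method: "POST" });
  const { ephemeralKey } = await tokenRes.json();

  // 2. Set up the peer connection: send mic audio, play back model audio.
  const pc = new RTCPeerConnection();
  const audioEl = new Audio();
  audioEl.autoplay = true;
  pc.ontrack = (event) => {
    audioEl.srcObject = event.streams[0];
  };

  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // 3. A data channel carries JSON events (text deltas, transcripts, etc.).
  //    The channel label here is illustrative.
  const dc = pc.createDataChannel("realtime-events");
  dc.onmessage = (event) => {
    const serverEvent = JSON.parse(event.data);
    console.log("server event:", serverEvent.type);
  };

  // 4. Exchange SDP with the Realtime API, authenticating with the ephemeral key.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const webrtcUrl = "https://<region>.realtimeapi-preview.ai.azure.com/v1/realtimertc"; // placeholder
  const sdpRes = await fetch(`${webrtcUrl}?model=<your-deployment-name>`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
    body: offer.sdp,
  });

  const answerSdp = await sdpRes.text();
  await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
  return pc;
}
```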
The sketches above are illustrative outlines rather than a drop-in Next.js/TypeScript implementation; the Azure documentation for setting up the Realtime API and managing ephemeral tokens remains the authoritative reference for the exact endpoints, API versions, and event schemas.