Hi, we have a deployment of GPT-4o in Azure that's acting strangely in comparison to GPT-4 Turbo.
We access this deployment via either Semantic Kernel or the Azure OpenAI SDK for .NET, depending on the use case, but both frameworks ultimately make the same chat completion API calls.
Regardless, when prompting these models via the chat completion API there is a default response timeout of 2 minutes.
This timeout can be overridden via the OpenAI client options, and doing so has always worked with GPT-3.5, GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo.
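For reference, this is roughly how we override the timeout. This is a minimal sketch assuming the Azure.AI.OpenAI 1.x SDK, where the network timeout is configured through the `Azure.Core` retry options; the endpoint, key, and deployment name below are placeholders, not our actual values.

```csharp
using Azure;
using Azure.AI.OpenAI;

// Extend the default 100-second/2-minute network timeout.
// NetworkTimeout lives on the Azure.Core RetryOptions exposed by the client options.
var options = new OpenAIClientOptions
{
    Retry = { NetworkTimeout = TimeSpan.FromMinutes(10) }
};

var client = new OpenAIClient(
    new Uri("https://my-resource.openai.azure.com/"), // placeholder endpoint
    new AzureKeyCredential("<api-key>"),              // placeholder key
    options);

// "gpt-4o-deployment" is a hypothetical deployment name.
var response = await client.GetChatCompletionsAsync(
    new ChatCompletionsOptions(
        "gpt-4o-deployment",
        new[] { new ChatRequestUserMessage("Long-running prompt here") }));
```

With the GPT-4 Turbo deployments this configuration behaves as expected; with GPT-4o the connection is still cut at 2 minutes.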
Since moving to GPT-4o, overriding this timeout no longer appears to have any effect: if completion generation takes longer than 2 minutes, the connection from the host is cut at exactly the 2-minute mark.
This behaviour occurs regardless of the framework used and can be reproduced by calling the chat completion API directly (every API version we tried shows the issue) or via the Chat Playground in Azure OpenAI Studio.
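To reproduce it without any SDK involved, a direct call like the following is enough; the resource name, deployment name, and API version are placeholders, and `--max-time` is set well above 2 minutes so that curl itself is not the thing timing out.

```shell
# Placeholders: substitute your resource, deployment, api-version, and key.
curl --max-time 600 \
  "https://my-resource.openai.azure.com/openai/deployments/gpt-4o-deployment/chat/completions?api-version=2024-06-01" \
  -H "Content-Type: application/json" \
  -H "api-key: <api-key>" \
  -d '{"messages": [{"role": "user", "content": "Prompt that takes >2 minutes to complete"}]}'
```

Against the GPT-4o deployment, the server closes the connection at the 2-minute mark even though the client allows 10 minutes.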
Is this a bug in Azure's implementation of GPT-4o, or an issue on OpenAI's side?
Kindly assist.