Azure OpenAI reasoning models are designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, and math compared to previous iterations.
Key capabilities of reasoning models:
- Complex code generation: Capable of generating algorithms and handling advanced coding tasks to support developers.
- Advanced problem solving: Ideal for comprehensive brainstorming sessions and addressing multifaceted challenges.
- Complex document comparison: Well suited for analyzing contracts, case files, or legal documents to identify subtle differences.
- Instruction following and workflow management: Particularly effective for managing workflows that require shorter contexts.
Usage
These models don't currently support the same set of parameters as other models that use the chat completions API.
Chat Completions API
using Azure.Identity;
using OpenAI;
using OpenAI.Chat;
using System.ClientModel.Primitives;

#pragma warning disable OPENAI001 //currently required for token based authentication

BearerTokenPolicy tokenPolicy = new(
    new DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default");

ChatClient client = new(
    model: "o4-mini",
    authenticationPolicy: tokenPolicy,
    options: new OpenAIClientOptions()
    {
        Endpoint = new Uri("https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1")
    }
);

ChatCompletionOptions options = new ChatCompletionOptions
{
    MaxOutputTokenCount = 100000
};

// Pass the options so MaxOutputTokenCount is applied to the request.
ChatCompletion completion = client.CompleteChat(
    new ChatMessage[]
    {
        new DeveloperChatMessage("You are a helpful assistant"),
        new UserChatMessage("Tell me about the bitter lesson")
    },
    options
);

Console.WriteLine($"[ASSISTANT]: {completion.Content[0].Text}");
Microsoft Entra ID:
If you aren't familiar with using Microsoft Entra ID for authentication, see How to configure Azure OpenAI in Microsoft Foundry Models with Microsoft Entra ID authentication.
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=token_provider,
)

response = client.chat.completions.create(
    model="o1-new",  # replace with your model deployment name
    messages=[
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens=5000
)

print(response.model_dump_json(indent=2))
API key:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="gpt-5-mini",  # replace with your model deployment name
    messages=[
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens=5000
)

print(response.model_dump_json(indent=2))
curl -X POST "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
  -d '{
    "model": "gpt-5",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What steps should I think about when writing my first Python API?"}
    ],
    "max_completion_tokens": 1000
  }'
Python chat completions API output:
{
  "id": "chatcmpl-AEj7pKFoiTqDPHuxOcirA9KIvf3yz",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Writing your first Python API is an exciting step in developing software that can communicate with other applications. An API (Application Programming Interface) allows different software systems to interact with each other, enabling data exchange and functionality sharing. Here are the steps you should consider when creating your first Python API...truncated for brevity.",
        "refusal": null,
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "protected_material_code": {
          "filtered": false,
          "detected": false
        },
        "protected_material_text": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ],
  "created": 1728073417,
  "model": "o1-2024-12-17",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": "fp_503a95a7d8",
  "usage": {
    "completion_tokens": 1843,
    "prompt_tokens": 20,
    "total_tokens": 1863,
    "completion_tokens_details": {
      "audio_tokens": null,
      "reasoning_tokens": 448
    },
    "prompt_tokens_details": {
      "audio_tokens": null,
      "cached_tokens": 0
    }
  },
  "prompt_filter_results": [
    {
      "prompt_index": 0,
      "content_filter_results": {
        "custom_blocklists": {
          "filtered": false
        },
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ]
}
Reasoning effort
Note
Reasoning models return reasoning_tokens as part of completion_tokens_details in the model response. These are hidden tokens that aren't returned as part of the message response content but are used by the model to help generate a final answer to your request.
For all reasoning models except o1-mini, reasoning_effort can be set to low, medium, or high. GPT-5 series reasoning models support a new reasoning_effort setting of minimal. The higher the effort setting, the longer the model spends processing the request, which generally results in a larger number of reasoning_tokens.
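Because reasoning tokens are billed as completion tokens even though they never appear in the message content, it can be useful to inspect them when tuning effort levels. Here's a minimal sketch of reading them from the usage details (assuming the API-key client pattern from the examples above and a hypothetical gpt-5-mini deployment name):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="gpt-5-mini",  # replace with your model deployment name
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    max_completion_tokens=4000,
)

# reasoning_tokens are hidden "thinking" tokens: billed as completion tokens,
# but never returned in response.choices[0].message.content.
details = response.usage.completion_tokens_details
print(f"Visible answer tokens: {response.usage.completion_tokens - details.reasoning_tokens}")
print(f"Hidden reasoning tokens: {details.reasoning_tokens}")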
Developer messages
Functionally, developer messages ("role": "developer") are the same as system messages.
Adding a developer message to the previous code example looks as follows:
using Azure.Identity;
using OpenAI;
using OpenAI.Chat;
using System.ClientModel.Primitives;

#pragma warning disable OPENAI001 //currently required for token based authentication

BearerTokenPolicy tokenPolicy = new(
    new DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default");

ChatClient client = new(
    model: "o4-mini",
    authenticationPolicy: tokenPolicy,
    options: new OpenAIClientOptions()
    {
        Endpoint = new Uri("https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1")
    }
);

ChatCompletionOptions options = new ChatCompletionOptions
{
    ReasoningEffortLevel = ChatReasoningEffortLevel.Low,
    MaxOutputTokenCount = 100000
};

// Pass the options so the reasoning effort and token limit are applied to the request.
ChatCompletion completion = client.CompleteChat(
    new ChatMessage[]
    {
        new DeveloperChatMessage("You are a helpful assistant"),
        new UserChatMessage("Tell me about the bitter lesson")
    },
    options
);

Console.WriteLine($"[ASSISTANT]: {completion.Content[0].Text}");
Microsoft Entra ID:
If you aren't familiar with using Microsoft Entra ID for authentication, see How to configure Azure OpenAI with Microsoft Entra ID authentication.
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=token_provider,
)

response = client.chat.completions.create(
    model="gpt-5-mini",  # replace with your model deployment name
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},  # optional equivalent to a system message for reasoning models
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens=5000,
    reasoning_effort="medium"  # low, medium, or high
)

print(response.model_dump_json(indent=2))
API key:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="gpt-5-mini",  # replace with your model deployment name
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},  # optional equivalent to a system message for reasoning models
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens=5000,
    reasoning_effort="medium"  # low, medium, or high
)

print(response.model_dump_json(indent=2))
curl -X POST "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
  -d '{
    "model": "gpt-5",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What steps should I think about when writing my first Python API?"}
    ],
    "max_completion_tokens": 1000,
    "reasoning_effort": "medium"
  }'
Python chat completions API output:
{
  "id": "chatcmpl-CaODNsQOHoRLcb9JVSKYY1e2Iss5s",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Here’s a practical, beginner‑friendly checklist to guide you through writing your first Python API, from idea to production.\n\n1) Clarify goals and constraints\n- Who will use it (internal team, public), what problems it solves, expected traffic, latency requirements.\n- Resources you’ll expose (users, orders, etc.) and core operations.\n- Non‑functional needs: security, compliance, uptime, scalability.\n\n2) Choose your API style\n- REST (most common for CRUD and simple integrations).\n- GraphQL (flexible queries, more complex to secure/monitor).\n- gRPC (high‑performance, strongly typed, good for service‑to‑service).\n- For a first API, REST + JSON is usually best.\n\n3) Design the contract first\n- Draft an OpenAPI/Swagger spec: endpoints, request/response schemas, status codes, error model.\n- Decide naming conventions, pagination, filtering, sorting.\n- Define consistent time/date format (ISO‑8601, UTC), ID format, and field casing.\n- Plan versioning strategy (e.g., /v1) and deprecation policy.\n\n4) Plan security and auth\n- Pick auth: API keys for simple internal use; OAuth2/JWT for user auth; mTLS for service‑to‑service.\n- CORS policy for browsers; HTTPS everywhere; security headers.\n- Validate all inputs; avoid leaking stack traces; define rate limits and quotas.\n\n5) Pick your Python stack\n- Frameworks: FastAPI (great typing, validation, auto docs), Flask (minimal), Django REST Framework (batteries included).\n- ASGI/WSGI server: Uvicorn or Gunicorn.\n- Data layer: PostgreSQL + SQLAlchemy/Django ORM; migrations with Alembic/Django migrations.\n- Caching: Redis (optional).\n- Background jobs: Celery/RQ (if needed).\n\n6) Set up the project\n- Create a virtual environment; choose dependency management (pip, Poetry).\n- Establish project structure (app, api, models, services, tests).\n- Add linting/formatting/type checks: black, isort, flake8, mypy; pre‑commit hooks.\n- Configuration via environment variables; secrets via a manager (not in code).\n\n7) Implement core functionality\n- Build endpoints that match your spec; keep business logic in a service layer, not in route handlers.\n- Schema validation (Pydantic with FastAPI, Marshmallow for Flask).\n- Consistent responses and errors; use clear status codes (201 create, 204 no content, 400/404/409/422, 500).\n- Pagination and filtering; idempotency for certain POST operations; ETags/conditional requests if useful.\n\n8) Error handling and an error model\n- Define a standard error body (code, message, details, correlation_id).\n- Log errors with context; don’t expose internal details to clients.\n\n9) Testing strategy\n- Unit tests for services/validators.\n- Integration tests for endpoints (pytest + httpx/requests) with a test database.\n- Contract tests to assert the API matches the OpenAPI spec.\n- Mock external services; measure coverage and focus on critical paths.\n\n10) Documentation and developer experience\n- Auto‑generated docs (FastAPI provides Swagger/ReDoc).\n- Write examples for each endpoint; onboarding and usage notes.\n- Keep a changelog and release notes.\n\n11) Observability and reliability\n- Structured logging (JSON), include request IDs/correlation IDs.\n- Metrics (requests, latency, error rates), health/readiness endpoints.\n- Tracing (OpenTelemetry) if you have multiple services.\n- Error reporting (Sentry or similar).\n\n12) Deployment and operations\n- Containerize with Docker; follow 12‑factor app principles.\n- CI/CD pipeline: run tests, build image, deploy, run migrations.\n- Choose hosting (Render, Fly.io, Railway, Heroku, AWS/GCP/Azure).\n- Configure scaling, connection pools, and timeouts; use a reverse proxy if needed.\n\n13) Performance and data concerns\n- Index your database; avoid N+1 queries; use connection pooling.\n- Load test key endpoints; profile hotspots.\n- Caching strategies where appropriate; consider async I/O for high‑concurrency workloads.\n\n14) Versioning and lifecycle management\n- Keep backward compatibility for minor changes; add fields rather than changing semantics.\n- Communicate deprecations; sunset old versions with a timeline.\n\n15) Governance, compliance, and safety\n- Handle PII correctly; data retention and audit logs if required.\n- Least‑privilege DB access; rotate secrets; review third‑party dependencies.\n\nBeginner‑friendly defaults\n- FastAPI + Pydantic + Uvicorn\n- PostgreSQL + SQLAlchemy + Alembic\n- pytest + httpx + coverage\n- black, isort, flake8, mypy, pre‑commit\n- Docker + simple CI (GitHub Actions) + a managed host\n\nCommon pitfalls to avoid\n- Inconsistent status codes or error formats.\n- Weak input validation and missing authentication.\n- Business logic inside route handlers (hard to test/maintain).\n- No migrations or tests; no logging/metrics.\n- Ignoring pagination and timezones; returning unbounded lists.\n\nIf you share whether it’s public vs internal, expected traffic, and preferred framework, I can tailor this to a concrete starter plan and recommended tools.",
        "refusal": null,
        "role": "assistant",
        "annotations": [],
        "audio": null,
        "function_call": null,
        "tool_calls": null
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "protected_material_code": {
          "filtered": false,
          "detected": false
        },
        "protected_material_text": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ],
  "created": 1762788925,
  "model": "gpt-5-2025-08-07",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 2919,
    "prompt_tokens": 29,
    "total_tokens": 2948,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 1792,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  },
  "prompt_filter_results": [
    {
      "prompt_index": 0,
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ]
}
Reasoning summary
When using the latest reasoning models with the Responses API, you can use the reasoning summary parameter to receive summaries of the model's chain-of-thought reasoning.
Important
Attempting to extract raw reasoning through methods other than the reasoning summary parameter isn't supported, may violate the Acceptable Use Policy, and can result in throttling or account suspension when detected.
using OpenAI;
using OpenAI.Responses;
using System.ClientModel.Primitives;
using Azure.Identity;

#pragma warning disable OPENAI001 //currently required for token based authentication

BearerTokenPolicy tokenPolicy = new(
    new DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default");

OpenAIResponseClient client = new(
    model: "o4-mini",
    authenticationPolicy: tokenPolicy,
    options: new OpenAIClientOptions()
    {
        Endpoint = new Uri("https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1")
    }
);

OpenAIResponse response = await client.CreateResponseAsync(
    userInputText: "What's the optimal strategy to win at poker?",
    new ResponseCreationOptions()
    {
        ReasoningOptions = new ResponseReasoningOptions()
        {
            ReasoningEffortLevel = ResponseReasoningEffortLevel.High,
            ReasoningSummaryVerbosity = ResponseReasoningSummaryVerbosity.Auto,
        },
    });

// Get the reasoning summary from the first OutputItem (ReasoningResponseItem)
Console.WriteLine("=== Reasoning Summary ===");
foreach (var item in response.OutputItems)
{
    if (item is ReasoningResponseItem reasoningItem)
    {
        foreach (var summaryPart in reasoningItem.SummaryParts)
        {
            if (summaryPart is ReasoningSummaryTextPart textPart)
            {
                Console.WriteLine(textPart.Text);
            }
        }
    }
}

Console.WriteLine("\n=== Assistant Response ===");
// Get the assistant's output
Console.WriteLine(response.GetOutputText());
You need to upgrade your OpenAI client library to access the latest parameters.
pip install openai --upgrade
Microsoft Entra ID:
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=token_provider,
)

response = client.responses.create(
    input="Tell me about the curious case of neural text degeneration",
    model="gpt-5",  # replace with model deployment name
    reasoning={
        "effort": "medium",
        "summary": "auto"  # auto, concise, or detailed (gpt-5 series models don't support concise)
    },
    text={
        "verbosity": "low"  # New with GPT-5 models
    }
)

print(response.model_dump_json(indent=2))
API key:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

response = client.responses.create(
    input="Tell me about the curious case of neural text degeneration",
    model="gpt-5",  # replace with model deployment name
    reasoning={
        "effort": "medium",
        "summary": "auto"  # auto, concise, or detailed (gpt-5 series models don't support concise)
    },
    text={
        "verbosity": "low"  # New with GPT-5 models
    }
)

print(response.model_dump_json(indent=2))
curl -X POST "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/responses" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
  -d '{
    "model": "gpt-5",
    "input": "Tell me about the curious case of neural text degeneration",
    "reasoning": {"summary": "auto"},
    "text": {"verbosity": "low"}
  }'
{
  "id": "resp_689a0a3090808190b418acf12b5cc40e0fc1c31bc69d8719",
  "created_at": 1754925616.0,
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "metadata": {},
  "model": "gpt-5",
  "object": "response",
  "output": [
    {
      "id": "rs_689a0a329298819095d90c34dc9b80db0fc1c31bc69d8719",
      "summary": [],
      "type": "reasoning",
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_689a0a33009881909fe0fcf57cba30200fc1c31bc69d8719",
      "content": [
        {
          "annotations": [],
"text": "Neural text degeneration refers to the ways language models produce low-quality, repetitive, or vacuous text, especially when generating long outputs. It’s “curious” because models trained to imitate fluent text can still spiral into unnatural patterns. Key aspects:\n\n- Repetition and loops: The model repeats phrases or sentences (“I’m sorry, but...”), often due to high-confidence tokens reinforcing themselves.\n- Loss of specificity: Vague, generic, agreeable text that avoids concrete details.\n- Drift and contradiction: The output gradually departs from context or contradicts itself over long spans.\n- Exposure bias: During training, models see gold-standard prefixes; at inference, they must condition on their own imperfect outputs, compounding errors.\n- Likelihood vs. quality mismatch: Maximizing token-level likelihood doesn’t align with human preferences for diversity, coherence, or factuality.\n- Token over-optimization: Frequent, safe tokens get overused; certain phrases become attractors.\n- Entropy collapse: With greedy or low-temperature decoding, the distribution narrows too much, causing repetitive, low-entropy text.\n- Length and beam search issues: Larger beams or long generations can favor bland, repetitive sequences (the “likelihood trap”).\n\nCommon mitigations:\n\n- Decoding strategies:\n - Top-k, nucleus (top-p), or temperature sampling to keep sufficient entropy.\n - Typical sampling and locally typical sampling to avoid dull but high-probability tokens.\n - Repetition penalties, presence/frequency penalties, no-repeat n-grams.\n - Contrastive decoding (and variants like DoLa) to filter generic continuations.\n - Min/max length, stop sequences, and beam search with diversity/penalties.\n\n- Training and alignment:\n - RLHF/DPO to better match human preferences for non-repetitive, helpful text.\n - Supervised fine-tuning on high-quality, diverse data; instruction tuning.\n - Debiasing objectives (unlikelihood training) to penalize repetition and banned patterns.\n - Mixture-of-denoisers or latent planning to improve long-range coherence.\n\n- Architectural and planning aids:\n - Retrieval-augmented generation to ground outputs.\n - Tool use and structured prompting to constrain drift.\n - Memory and planning modules, hierarchical decoding, or sentence-level control.\n\n- Prompting tips:\n - Ask for concise answers, set token limits, and specify structure.\n - Provide concrete constraints or content to reduce generic filler.\n - Use “say nothing if uncertain” style instructions to avoid vacuity.\n\nRepresentative papers/terms to search:\n- Holtzman et al., “The Curious Case of Neural Text Degeneration” (2020): nucleus sampling.\n- Welleck et al., “Neural Text Degeneration with Unlikelihood Training.”\n- Li et al., “A Contrastive Framework for Decoding.”\n- Su et al., “DoLa: Decoding by Contrasting Layers.”\n- Meister et al., “Typical Decoding.”\n- Ouyang et al., “Training language models to follow instructions with human feedback.”\n\nIn short, degeneration arises from a mismatch between next-token likelihood and human preferences plus decoding choices; careful decoding, training objectives, and grounding help prevent it.",
"type": "output_text",
"logprobs": null
}
],
"role": "assistant",
"status": "completed",
"type": "message"
}
],
"parallel_tool_calls": true,
"temperature": 1.0,
"tool_choice": "auto",
"tools": [],
"top_p": 1.0,
"background": false,
"max_output_tokens": null,
"max_tool_calls": null,
"previous_response_id": null,
"prompt": null,
"prompt_cache_key": null,
"reasoning": {
"effort": "minimal",
"generate_summary": null,
"summary": "detailed"
},
"safety_identifier": null,
"service_tier": "default",
"status": "completed",
"text": {
"format": {
"type": "text"
}
},
"top_logprobs": null,
"truncation": "disabled",
"usage": {
"input_tokens": 16,
"input_tokens_details": {
"cached_tokens": 0
},
"output_tokens": 657,
"output_tokens_details": {
"reasoning_tokens": 0
},
"total_tokens": 673
},
"user": null,
"content_filters": null,
"store": true
}
Note
Even when enabled, reasoning summaries aren't guaranteed to be generated for every step/request. This is expected behavior.
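The C# example above reads the summary from the response output items; in Python the same information can be pulled out of response.output rather than dumping the full JSON. A minimal sketch, assuming `response` is the object returned by one of the client.responses.create() calls above:

# Walk response.output and print any reasoning summary text.
# Assumes `response` was returned by client.responses.create() above.
for item in response.output:
    if item.type == "reasoning":
        for part in item.summary:  # may be empty; summaries aren't guaranteed
            if part.type == "summary_text":
                print("[REASONING SUMMARY]:", part.text)
    elif item.type == "message":
        for part in item.content:
            if part.type == "output_text":
                print("[ASSISTANT]:", part.text)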
Python Lark
GPT-5 series reasoning models can call a new custom_tool named lark_tool. This tool is based on Python lark and can be used to constrain model output more flexibly.
Responses API
{
  "model": "gpt-5-2025-08-07",
  "input": "please calculate the area of a circle with radius equal to the number of 'r's in strawberry",
  "tools": [
    {
      "type": "custom",
      "name": "lark_tool",
      "format": {
        "type": "grammar",
        "syntax": "lark",
        "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
      }
    }
  ],
  "tool_choice": "required"
}
Microsoft Entra ID:
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=token_provider,
)

response = client.responses.create(
    model="gpt-5",  # replace with your model deployment name
    tools=[
        {
            "type": "custom",
            "name": "lark_tool",
            "format": {
                "type": "grammar",
                "syntax": "lark",
                "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
            }
        }
    ],
    input=[{"role": "user", "content": "Please calculate the area of a circle with radius equal to the number of 'r's in strawberry"}],
)

print(response.model_dump_json(indent=2))
API key:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

response = client.responses.create(
    model="gpt-5",  # replace with your model deployment name
    tools=[
        {
            "type": "custom",
            "name": "lark_tool",
            "format": {
                "type": "grammar",
                "syntax": "lark",
                "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
            }
        }
    ],
    input=[{"role": "user", "content": "Please calculate the area of a circle with radius equal to the number of 'r's in strawberry"}],
)

print(response.model_dump_json(indent=2))
Output:
{
  "id": "resp_689a0cf927408190b8875915747667ad01c936c6ffb9d0d3",
  "created_at": 1754926332.0,
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "metadata": {},
  "model": "gpt-5",
  "object": "response",
  "output": [
    {
      "id": "rs_689a0cfd1c888190a2a67057f471b5cc01c936c6ffb9d0d3",
      "summary": [],
      "type": "reasoning",
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_689a0d00e60c81908964e5e9b2d6eeb501c936c6ffb9d0d3",
      "content": [
        {
          "annotations": [],
          "text": "“strawberry” has 3 r’s, so the radius is 3.\nArea = πr² = π × 3² = 9π ≈ 28.27 square units.",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [
    {
      "name": "lark_tool",
      "parameters": null,
      "strict": null,
      "type": "custom",
      "description": null,
      "format": {
        "type": "grammar",
        "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/",
        "syntax": "lark"
      }
    }
  ],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": null,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "prompt_cache_key": null,
  "reasoning": {
    "effort": "medium",
    "generate_summary": null,
    "summary": null
  },
  "safety_identifier": null,
  "service_tier": "default",
  "status": "completed",
  "text": {
    "format": {
      "type": "text"
    }
  },
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 139,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 240,
    "output_tokens_details": {
      "reasoning_tokens": 192
    },
    "total_tokens": 379
  },
  "user": null,
  "content_filters": null,
  "store": true
}
Chat Completions
{
  "messages": [
    {
      "role": "user",
      "content": "Which one is larger, 42 or 0?"
    }
  ],
  "tools": [
    {
      "type": "custom",
      "name": "custom_tool",
      "custom": {
        "name": "lark_tool",
        "format": {
          "type": "grammar",
          "grammar": {
            "syntax": "lark",
            "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
          }
        }
      }
    }
  ],
  "tool_choice": "required",
  "model": "gpt-5-2025-08-07"
}
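If you want to sanity-check a grammar before sending it to the service, the same definition can be parsed locally with the lark package. This is an optional offline check (it isn't required by, or part of, the API), shown here as a minimal sketch:

from lark import Lark

# The same grammar definition passed to the lark_tool above.
grammar = r"""
start: QUESTION NEWLINE ANSWER
QUESTION: /[^\n?]{1,200}\?/
NEWLINE: /\n/
ANSWER: /[^\n!]{1,200}!/
"""

parser = Lark(grammar)

# A sample string in the shape the grammar allows: a question line, then an answer line.
tree = parser.parse("Which one is larger, 42 or 0?\n42 is larger!")
print(tree.pretty())  # raises an exception if the text doesn't match the grammar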
Availability
Region availability

| Model | Region | Limited access |
| --- | --- | --- |
| gpt-5.1-codex-max | East US2 & Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed. |
| gpt-5.1 | East US2 & Sweden Central (Global Standard & Data Zone Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed. |
| gpt-5.1-chat | East US2 & Sweden Central (Global Standard) | No access request needed. |
| gpt-5.1-codex | East US2 & Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed. |
| gpt-5.1-codex-mini | East US2 & Sweden Central (Global Standard) | No access request needed. |
| gpt-5-pro | East US2 & Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed. |
| gpt-5-codex | East US2 & Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed. |
| gpt-5 | Model availability | Request access: limited access model application. If you already have limited access model access, no request is needed. |
| gpt-5-mini | Model availability | No access request needed. |
| gpt-5-nano | Model availability | No access request needed. |
| o3-pro | East US2 & Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed. |
| codex-mini | East US2 & Sweden Central (Global Standard) | No access request needed. |
| o4-mini | Model availability | No access request is needed to use the core capabilities of this model. Request access: o4-mini reasoning summary feature |
| o3 | Model availability | Request access: limited access model application |
| o3-mini | Model availability | Access is no longer restricted for this model. |
| o1 | Model availability | Access is no longer restricted for this model. |
| o1-mini | Model availability | No access request needed for Global Standard deployments. Standard (regional) deployments are currently only available to select customers who were granted access as part of the o1-preview release. |
API and feature support

| Feature | gpt-5.1-codex-max | gpt-5.1, 2025-11-13 | gpt-5.1-chat, 2025-11-13 | gpt-5.1-codex, 2025-11-13 | gpt-5.1-codex-mini, 2025-11-13 | gpt-5-pro, 2025-10-06 | gpt-5-codex, 2025-09-011 | gpt-5, 2025-08-07 | gpt-5-mini, 2025-08-07 | gpt-5-nano, 2025-08-07 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| API version | v1 | v1 | v1 | v1 | v1 | v1 | v1 | v1 | v1 | v1 |
| Developer messages | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Structured outputs | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Context window | 400,000 (Input: 272,000 / Output: 128,000) | 400,000 (Input: 272,000 / Output: 128,000) | 128,000 (Input: 111,616 / Output: 16,384) | 400,000 (Input: 272,000 / Output: 128,000) | 400,000 (Input: 272,000 / Output: 128,000) | 400,000 (Input: 272,000 / Output: 128,000) | 400,000 (Input: 272,000 / Output: 128,000) | 400,000 (Input: 272,000 / Output: 128,000) | 400,000 (Input: 272,000 / Output: 128,000) | 400,000 (Input: 272,000 / Output: 128,000) |
| Reasoning effort | ✅ ⁶ | ✅ ⁴ | ✅ | ✅ | ✅ | ✅ ⁵ | ✅ | ✅ | ✅ | ✅ |
| Image input | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Chat Completions API | - | ✅ | ✅ | - | - | - | - | ✅ | ✅ | ✅ |
| Responses API | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Functions/tools | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Parallel tool calls ¹ | ✅ | ✅ | ✅ | ✅ | ✅ | - | ✅ | ✅ | ✅ | ✅ |
| max_completion_tokens ² | - | ✅ | ✅ | - | - | - | - | ✅ | ✅ | ✅ |
| System messages ³ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Reasoning summary | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Streaming | ✅ | ✅ | ✅ | ✅ | ✅ | - | ✅ | ✅ | ✅ | ✅ |
¹ Parallel tool calls aren't supported when reasoning_effort is set to minimal.
² Reasoning models only work with the max_completion_tokens parameter when using the Chat Completions API. Use max_output_tokens with the Responses API.
³ The latest reasoning models support system messages to make migration easier. You shouldn't use a developer message and a system message in the same API request.
⁴ gpt-5.1 reasoning_effort defaults to none. If you're upgrading from a previous reasoning model to gpt-5.1, keep in mind that you might need to update your code to explicitly pass a reasoning_effort level if you want reasoning to be triggered.
⁵ gpt-5-pro only supports a reasoning_effort of high, which is the default even when it isn't explicitly passed to the model.
⁶ gpt-5.1-codex-max adds support for a new reasoning_effort level of xhigh, the highest level that reasoning effort can be set to.
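To make footnote ² concrete, here's a minimal sketch of the equivalent token cap on each API, assuming the API-key client and a hypothetical gpt-5-mini deployment name from the earlier examples:

# Chat Completions API: cap output (including hidden reasoning tokens)
# with max_completion_tokens.
chat_response = client.chat.completions.create(
    model="gpt-5-mini",  # replace with your model deployment name
    messages=[{"role": "user", "content": "Summarize the bitter lesson in one paragraph."}],
    max_completion_tokens=2000,
)

# Responses API: the equivalent parameter is max_output_tokens.
responses_response = client.responses.create(
    model="gpt-5-mini",  # replace with your model deployment name
    input="Summarize the bitter lesson in one paragraph.",
    max_output_tokens=2000,
)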
New GPT-5 reasoning features

| Feature | Description |
| --- | --- |
| reasoning_effort | xhigh is only supported with gpt-5.1-codex-max. minimal is now supported with GPT-5 series reasoning models.* none is only supported with gpt-5.1. Options: none, minimal, low, medium, high, xhigh |
| verbosity | A new parameter that provides more granular control over how concise the model's output is. Options: low, medium, high. |
| preamble | GPT-5 series reasoning models can spend extra time "thinking" before executing a function/tool call. When this planning occurs, the model can provide insight into its planning steps in the model response through a new object called preamble. Generation of preambles in the model response isn't guaranteed, but you can encourage the model by using the instructions parameter and passing content like "Before you call any function, you must plan extensively and always output your plan to the user before calling any function." |
| Allowed tools | You can specify more than one tool under tool_choice, rather than only a single tool. |
| Custom tool type | Enables raw text (non-JSON) output. |
| lark_tool | Allows using some of the capabilities of Python lark for more flexible constraining of model responses. |

* gpt-5-codex doesn't support a reasoning_effort of minimal.

For more information, we also recommend reading OpenAI's GPT-5 prompting guide and GPT-5 features guide.
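As an illustration of the allowed tools feature, the following is a minimal Responses API sketch that restricts the model to a subset of the declared tools via tool_choice. The get_weather and get_time function definitions are hypothetical placeholders, and this assumes the client object from the earlier examples:

response = client.responses.create(
    model="gpt-5",  # replace with your model deployment name
    input="What's the weather in Stockholm right now?",
    tools=[
        {"type": "function", "name": "get_weather",
         "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}},
        {"type": "function", "name": "get_time",
         "parameters": {"type": "object", "properties": {"timezone": {"type": "string"}}, "required": ["timezone"]}},
    ],
    # Allowed tools: constrain the model to a subset of the declared tools.
    tool_choice={
        "type": "allowed_tools",
        "mode": "auto",  # or "required" to force a call to one of the allowed tools
        "tools": [
            {"type": "function", "name": "get_weather"},
        ],
    },
)
print(response.model_dump_json(indent=2))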
| Feature | codex-mini, 2025-05-16 | o3-pro, 2025-06-10 | o4-mini, 2025-04-16 | o3, 2025-04-16 | o3-mini, 2025-01-31 | o1, 2024-12-17 | o1-mini, 2024-09-12 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| API version | 2025-04-01-preview & v1 | 2025-04-01-preview & v1 | 2025-04-01-preview & v1 | 2025-04-01-preview & v1 | 2025-04-01-preview & v1 preview | 2025-04-01-preview & v1 preview | 2025-04-01-preview & v1 preview |
| Developer messages | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - |
| Structured outputs | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - |
| Context window | Input: 200,000 / Output: 100,000 | Input: 200,000 / Output: 100,000 | Input: 200,000 / Output: 100,000 | Input: 200,000 / Output: 100,000 | Input: 200,000 / Output: 100,000 | Input: 200,000 / Output: 100,000 | Input: 128,000 / Output: 65,536 |
| Reasoning effort | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - |
| Image input | ✅ | ✅ | ✅ | ✅ | - | ✅ | - |
| Chat Completions API | - | - | ✅ | ✅ | ✅ | ✅ | ✅ |
| Responses API | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - |
| Functions/tools | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - |
| Parallel tool calls | - | - | - | - | - | - | - |
| max_completion_tokens ¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| System messages ² | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - |
| Reasoning summary | ✅ | - | ✅ | ✅ | - | - | - |
| Streaming ³ | ✅ | - | ✅ | ✅ | ✅ | - | - |
¹ Reasoning models only work with the max_completion_tokens parameter when using the Chat Completions API. Use max_output_tokens with the Responses API.
² The latest o* series models support system messages to make migration easier. When you use a system message with o4-mini, o3, o3-mini, and o1, it's treated as a developer message. You shouldn't use a developer message and a system message in the same API request.
³ Streaming for o3 is limited access only.
Note
- To avoid timeouts, background mode is recommended for o3-pro.
- o3-pro doesn't currently support image generation.
Not supported
The following are currently unsupported with reasoning models:
- temperature, top_p, presence_penalty, frequency_penalty, logprobs, top_logprobs, logit_bias, max_tokens
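Passing one of these parameters to a reasoning model deployment is rejected by the service with a 400 error. A minimal sketch of catching that error, reusing the API-key client and a hypothetical gpt-5-mini deployment name from the earlier examples:

import openai

try:
    response = client.chat.completions.create(
        model="gpt-5-mini",  # replace with your model deployment name
        messages=[{"role": "user", "content": "Hello"}],
        temperature=0.7,  # not supported by reasoning models
    )
except openai.BadRequestError as e:
    # The service returns a 400 error identifying the unsupported parameter.
    print(f"Request rejected: {e}")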
Markdown output
By default the o3-mini and o1 models won't attempt to produce output that includes markdown formatting. A common use case where this behavior is undesirable is when you want the model to output code contained within a markdown code block. When the model generates output without markdown formatting, you lose features like syntax highlighting and copyable code blocks in interactive playground experiences. To override this new default behavior and encourage markdown inclusion in model responses, add the string Formatting re-enabled to the beginning of your developer message.
Adding Formatting re-enabled to the beginning of your developer message doesn't guarantee that the model will include markdown formatting in its response; it only increases the likelihood. We've found from internal testing that Formatting re-enabled is less effective by itself with the o1 model than with o3-mini.
To improve the performance of Formatting re-enabled, you can further augment the beginning of the developer message, which often produces the desired output. Rather than just adding Formatting re-enabled to the beginning of your developer message, you can experiment with adding a more descriptive initial instruction like one of the following examples:
Formatting re-enabled - please enclose code blocks with appropriate markdown tags.
Formatting re-enabled - code output should be wrapped in markdown.
Depending on your expected output, you may need to customize your initial developer message further to target your specific use case.
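As a concrete illustration, here's a minimal sketch of sending one of these augmented developer messages, assuming the API-key client from the earlier examples and a hypothetical o3-mini deployment name:

response = client.chat.completions.create(
    model="o3-mini",  # replace with your model deployment name
    messages=[
        # Starting the developer message with "Formatting re-enabled" encourages
        # (but doesn't guarantee) markdown-formatted output.
        {"role": "developer", "content": "Formatting re-enabled - please enclose code blocks with appropriate markdown tags."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_completion_tokens=4000,
)
print(response.choices[0].message.content)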