Note
This article applies to the Microsoft Foundry (new) portal.
Important
Items marked "(preview)" in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
AI agents are powerful productivity assistants that create workflows for business needs. However, they can be challenging to observe because of their complex interaction patterns. This article shows you how to run built-in evaluators locally on simple agent data or agent messages.
To build production-ready agentic applications and ensure observability and transparency, developers need tools to assess not only the final output of an agent's workflows, but also the quality and efficiency of the workflows themselves.
An event like a user query, such as "weather tomorrow," triggers an agentic workflow. To produce the final response, the workflow includes reasoning through user intents, calling tools, and using retrieval-augmented generation.
In this process, it's crucial to evaluate each step of the workflow, along with the quality and safety of the final output. We formulate these evaluation aspects into the following evaluation practices for agents:
As a best practice, it's crucial to perform the following:
- System evaluation: evaluate the overall quality and efficiency of the agentic workflow.
- Process evaluation: evaluate the steps in the workflow, such as tool calls.
For details on the use cases of these two practices and examples for each agent evaluator, see Agent evaluators.
You can also evaluate other quality and safety aspects of your agentic workflows with our comprehensive suite of built-in evaluators, or by writing custom evaluators.
If you're building Foundry agents, you can evaluate them seamlessly.
If you build agents outside of Foundry, you can still use our evaluators as appropriate to your agentic workflow by parsing your agent messages into the required data formats. See Evaluating other agents for details.
Get started
Install the evaluation package from the Azure AI Evaluation SDK:
pip install azure-ai-evaluation
To follow the Microsoft Foundry SDK flow later in this article, also install these packages:
pip install "azure-ai-projects>=2.0.0b1" azure-identity python-dotenv
Set these environment variables with your values in a .env file:
AZURE_AI_PROJECT_ENDPOINT="<your-endpoint>" # The Azure AI Foundry project endpoint, as found on the Home page of your Microsoft Foundry portal.
AZURE_AI_MODEL_DEPLOYMENT_NAME="<your-model-deployment-name>" # The deployment name of the AI model, as found under the "Build" page in the "Models" tab in your Foundry project.
Evaluate Microsoft Foundry agents
If you use Foundry Agent Service, you can seamlessly evaluate your agents through our converter support for Microsoft Foundry agents and Semantic Kernel agents. The following evaluators support the evaluation data returned by the converter: IntentResolution, ToolCallAccuracy, TaskAdherence, Relevance, and Groundedness.
Note
If you're building other agents that output different schemas, convert them to the generic OpenAI-style agent message schema and use the preceding evaluators. More generally, if you can parse your agent messages into the required data formats, you can also use all of our evaluators.
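For instance, here's a minimal sketch of reshaping a custom agent's output into OpenAI-style messages before evaluation. The custom_log structure is a hypothetical stand-in for whatever schema your agent emits:
# Hypothetical custom agent log: a list of (speaker, text) tuples.
custom_log = [
    ("system", "You are a helpful travel assistant."),
    ("user", "What are the opening hours of the Eiffel Tower?"),
]
# Reshape into OpenAI-style messages: the system message keeps a plain string,
# while other roles carry a content list of typed parts.
query = [
    {"role": role, "content": text} if role == "system"
    else {"role": role, "content": [{"type": "text", "text": text}]}
    for role, text in custom_log
]
response = [
    {"role": "assistant", "content": [{"type": "text", "text": "9:00 AM to 11:00 PM."}]}
]
# query and response can now be passed to evaluators that accept agent messages.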
Model support for AI-assisted evaluators
Azure OpenAI and OpenAI reasoning and non-reasoning models are supported as the LLM judge, depending on the evaluator:
| Evaluators | Reasoning models as judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (example: gpt-4.1, gpt-4o) | To enable |
|---|---|---|---|
| `IntentResolution`, `TaskAdherence`, `ToolCallAccuracy`, `ResponseCompleteness`, `Coherence`, `Fluency`, `Similarity`, `Groundedness`, `Retrieval`, `Relevance` | Supported | Supported | Set the additional parameter `is_reasoning_model=True` when initializing the evaluator |
| Other evaluators | Not supported | Supported | -- |
For complex evaluations that require deep reasoning, use a strong model such as 4.1-mini to balance reasoning performance and cost efficiency.
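For example, here's a minimal sketch of initializing an evaluator with a reasoning model as the judge; the deployment name and environment variable names are assumptions, so adjust them to your setup:
import os
from azure.ai.evaluation import IntentResolutionEvaluator

# Assumes an o-series reasoning model deployment named "o3-mini";
# swap in your own deployment and endpoint values.
reasoning_model_config = {
    "azure_deployment": "o3-mini",
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_version": os.environ["AZURE_OPENAI_API_VERSION"],
}
intent_resolution = IntentResolutionEvaluator(
    model_config=reasoning_model_config,
    is_reasoning_model=True,  # required when the judge is a reasoning model
)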
Tool call evaluation support
ToolCallAccuracyEvaluator supports evaluating the following tools in Microsoft Foundry agents:
- File Search
- Azure AI Search
- Bing Grounding
- Bing Custom Search
- SharePoint Grounding
- Code Interpreter
- Fabric Data Agent
- OpenAPI
- Function Tool (user-defined tools)
However, if an unsupported tool is used in an agent run, the evaluator outputs a "pass" along with a reason explaining that evaluating the invoked tool isn't supported, so that you can easily filter out these cases. To enable evaluation, wrap unsupported tools as user-defined tools.
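For example, here's a minimal sketch of wrapping an otherwise unsupported capability as a user-defined function tool; the wiki-search helper is hypothetical:
from azure.ai.projects.models import FunctionTool

# Hypothetical helper that fronts an unsupported capability with a plain Python
# function, so that ToolCallAccuracyEvaluator can score calls to it.
def search_company_wiki(query: str) -> str:
    """Searches an internal wiki and returns matching snippets as a string."""
    # Call your real backend here; this stub keeps the sketch self-contained.
    return f"No results found for: {query}"

wrapped_functions = FunctionTool({search_company_wiki})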
This example shows how to build and evaluate a Microsoft Foundry agent. Separately from evaluation, Foundry Agent Service requires pip install azure-ai-projects azure-identity, a Foundry project connection string, and a supported model.
Create an agent thread and run
Agents can use tools. Here's an example of creating a custom tool for an agent (using a mock weather function):
from azure.ai.projects.models import FunctionTool, ToolSet
from typing import Set, Callable, Any
import json
# Define a custom Python function.
def fetch_weather(location: str) -> str:
"""
Fetches the weather information for the specified location.
:param location (str): The location to fetch weather for.
:return: Weather information as a JSON string.
:rtype: str
"""
# In a real-world scenario, you'd integrate with a weather API.
# In the following code snippet, we mock the response.
mock_weather_data = {"Seattle": "Sunny, 25°C", "London": "Cloudy, 18°C", "Tokyo": "Rainy, 22°C"}
weather = mock_weather_data.get(location, "Weather data not available for this location.")
weather_json = json.dumps({"weather": weather})
return weather_json
user_functions: Set[Callable[..., Any]] = {
fetch_weather,
}
# Add tools that the agent will use.
functions = FunctionTool(user_functions)
toolset = ToolSet()
toolset.add(functions)
AGENT_NAME = "Seattle Tourist Assistant"
If you're using a Foundry (non-hub) project, create an agent with the toolset as follows:
Note
If you're using a Foundry hub-based project (supported only with older package versions, azure-ai-projects<1.0.0b10 and azure-ai-agents<1.0.0b10), we strongly recommend that you migrate to the latest Foundry Agent Service SDK Python client library and set up a Foundry project for logging batch evaluation results.
import os
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv
load_dotenv()
# Create an Azure AI Client from an endpoint, copied from your Foundry project.
# You need to login to Azure subscription via Azure CLI and set the environment variables
# Foundry project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
project_endpoint = os.environ["AZURE_AI_PROJECT"] # Ensure the AZURE_AI_PROJECT environment variable is set
# Create an AIProjectClient instance
project_client = AIProjectClient(
endpoint=project_endpoint,
credential=DefaultAzureCredential(), # Use Azure Default Credential for authentication
)
# Create an agent with the toolset
agent = project_client.agents.create_agent(
model=os.environ["MODEL_DEPLOYMENT_NAME"], # Model deployment name
name="my-agent", # Name of the agent
instructions="You are a helpful agent", # Instructions for the agent
toolset=toolset
)
print(f"Created agent, ID: {agent.id}")
# Create a thread for communication
thread = project_client.agents.threads.create()
print(f"Created thread, ID: {thread.id}")
# Add a message to the thread
message = project_client.agents.messages.create(
thread_id=thread.id,
role="user", # Role of the message sender
content="What is the weather in Seattle today?", # Message content
)
print(f"Created message, ID: {message['id']}")
# Create and process an agent run
run = project_client.agents.runs.create_and_process(thread_id=thread.id, agent_id=agent.id)
print(f"Run finished with status: {run.status}")
# Check if the run failed
if run.status == "failed":
print(f"Run failed: {run.last_error}")
# Fetch and log all messages
messages = project_client.agents.messages.list(thread_id=thread.id)
for message in messages:
print(f"Role: {message.role}, Content: {message.content}")
Evaluate a single agent run
After you create an agent run, you can use the converter to transform the Microsoft Foundry agent thread data into the evaluation data that the evaluators can understand.
import json, os
from azure.ai.evaluation import AIAgentConverter, IntentResolutionEvaluator
# Initialize the converter for Microsoft Foundry agents.
converter = AIAgentConverter(project_client)
# Specify the thread and run ID.
thread_id = thread.id
run_id = run.id
converted_data = converter.convert(thread_id, run_id)
And that's it!
converted_data contains all the inputs required by these evaluators. You don't need to read the input requirements for each evaluator or do any work to parse the inputs. Select your evaluator and call it on this single run. For the judge model, we support Azure OpenAI or OpenAI reasoning and non-reasoning models, depending on the evaluator:
| Evaluators | Reasoning models as judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (example: gpt-4.1, gpt-4o) | To enable |
|---|---|---|---|
| All quality evaluators except `GroundednessProEvaluator` | Supported | Supported | Set the additional parameter `is_reasoning_model=True` when initializing the evaluator |
| `GroundednessProEvaluator` | No model needed from the user | No model needed from the user | -- |
For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model like o3-mini, or the o-series mini models released afterward, which balance reasoning performance with cost efficiency.
We set up a list of quality and safety evaluators in quality_evaluators and safety_evaluators, and reference them later when evaluating multiple agent runs or threads.
# This is specific to agentic workflows.
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator
# Other quality, risk, and safety metrics:
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator, CodeVulnerabilityEvaluator, ContentSafetyEvaluator, IndirectAttackEvaluator, FluencyEvaluator
from azure.identity import DefaultAzureCredential
import os
from dotenv import load_dotenv
load_dotenv()
model_config = {
"azure_deployment": os.getenv("AZURE_DEPLOYMENT_NAME"),
"api_key": os.getenv("AZURE_API_KEY"),
"azure_endpoint": os.getenv("AZURE_ENDPOINT"),
"api_version": os.getenv("AZURE_API_VERSION"),
}
# example config for a reasoning model
reasoning_model_config = {
"azure_deployment": "o3-mini",
"api_key": os.getenv("AZURE_API_KEY"),
"azure_endpoint": os.getenv("AZURE_ENDPOINT"),
"api_version": os.getenv("AZURE_API_VERSION"),
}
# Evaluators you might want to use with reasoning models
quality_evaluators = {evaluator.__name__: evaluator(model_config=reasoning_model_config, is_reasoning_model=True) for evaluator in [IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator]}
# Other evaluators you might NOT want to use with reasoning models
quality_evaluators.update({ evaluator.__name__: evaluator(model_config=model_config) for evaluator in [CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator]})
## Using Foundry (non-Hub) project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
azure_ai_project = os.environ.get("AZURE_AI_PROJECT")
safety_evaluators = {evaluator.__name__: evaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()) for evaluator in [ContentSafetyEvaluator, IndirectAttackEvaluator, CodeVulnerabilityEvaluator]}
# Reference the quality and safety evaluator list above.
quality_and_safety_evaluators = {**quality_evaluators, **safety_evaluators}
for name, evaluator in quality_and_safety_evaluators.items():
result = evaluator(**converted_data)
print(name)
print(json.dumps(result, indent=4))
Output format
The AI-assisted quality evaluators return a result for a query-and-response pair. The result is a dictionary that contains:
- `{metric_name}`: Provides a numerical score, on a Likert scale (an integer from 1 to 5) or a float between 0 and 1.
- `{metric_name}_label`: Provides a binary label (if the metric naturally outputs a binary score).
- `{metric_name}_reason`: Explains why a certain score or label was given for each data point.
- `details`: Optional output that contains debugging information about the quality of a single agent run.
To improve understandability, all evaluators accept a binary threshold (unless their outputs are already binary) and output two new keys. A default value is set for the binarization threshold, and you can override it. The two new keys are:
- `{metric_name}_result`: A "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: The numerical binarization threshold, either the default or as set by the user.
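For example, here's a minimal sketch of overriding the default threshold when initializing an evaluator. It reuses the model_config defined in the earlier snippet, and the threshold value is an assumption for illustration:
from azure.ai.evaluation import IntentResolutionEvaluator

# Require at least 4 on the 1-5 scale for a "pass" instead of the default of 3.
strict_intent_resolution = IntentResolutionEvaluator(model_config, threshold=4)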
See the following example outputs for some evaluators:
{
"intent_resolution": 5.0, # likert scale: 1-5 integer
"intent_resolution_threshold": 3,
"intent_resolution_result": "pass", # pass because 5 > 3 the threshold
"intent_resolution_reason": "The assistant correctly understood the user's request to fetch the weather in Seattle. It used the appropriate tool to get the weather information and provided a clear and accurate response with the current weather conditions in Seattle. The response fully resolves the user's query with all necessary information."
}
{
"task_adherence": 5.0, # likert scale: 1-5 integer
"task_adherence_threshold": 3,
"task_adherence_result": "pass", # pass because 5 > 3 the threshold
"task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
}
{
"tool_call_accuracy": 5, # a score between 1-5, higher is better
"tool_call_accuracy_threshold": 3,
"tool_call_accuracy_result": "pass", # pass because 5 > 3 the threshold
"details": { ... } # helpful details for debugging the tool calls made by the agent
}
Evaluate multiple agent runs or threads
To evaluate multiple agent runs or threads, use the batch evaluate() API for asynchronous evaluation. First, convert your agent thread data into a file through our converter support:
import os
import json
from azure.ai.evaluation import AIAgentConverter
# Initialize the converter.
converter = AIAgentConverter(project_client)
# Specify a file path to save the agent output (evaluation input data) to.
filename = os.path.join(os.getcwd(), "evaluation_input_data.jsonl")
evaluation_data = converter.prepare_evaluation_data(thread_ids=thread_id, filename=filename)
print(f"Evaluation data saved to {filename}")
With the evaluation data prepared in one line of code, you can select evaluators to assess agent quality and submit a batch evaluation run. The following example references the same list of quality and safety evaluators, quality_and_safety_evaluators, from the Evaluate a single agent run section:
import os
from dotenv import load_dotenv
load_dotenv()
# Batch evaluation API (local):
from azure.ai.evaluation import evaluate
response = evaluate(
data=filename,
evaluation_name="agent demo - batch run",
evaluators=quality_and_safety_evaluators,
# optionally, log your results to your Foundry project for rich visualization
azure_ai_project=os.environ.get("AZURE_AI_PROJECT"), # example: https://your-account.services.ai.azure.com/api/projects/your-project
)
# Inspect the average scores at a high level.
print(response["metrics"])
# Use the URL to inspect the results on the UI.
print(f'Foundry URL: {response.get("studio_url")}')
After you select the URL, you're redirected to Foundry. View your evaluation results in your Foundry project and debug your application. Use the reason fields and the pass/fail results to assess the quality and safety performance of your applications. You can run and compare multiple runs to test for regressions or improvements.
Using the Azure AI Evaluation SDK client library with converter support to evaluate your Microsoft Foundry agents improves observability and transparency in agentic workflows.
You can seamlessly evaluate Foundry agents by using the evaluators available in Agent evaluators and RAG evaluators. This section walks you through creating an agent and evaluating it.
Note
If you're building other agents that output different schemas, convert them to the generic OpenAI-style agent message schema and use the preceding evaluators. More generally, if you can parse your agent messages into the required data formats, you can also use all of our evaluators.
Prerequisites
import json
from azure.ai.projects.models import Tool, FunctionTool
# Define a function tool for the model to use
func_tool = FunctionTool(
name="fetch_weather",
parameters={
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to fetch weather for.",
},
},
"required": ["location"],
"additionalProperties": False,
},
description="Get the current weather for a location.",
strict=True,
)
tools: list[Tool] = [func_tool]
# Define a custom Python function.
async def fetch_weather(location: str) -> str:
"""
Fetches the weather information for the specified location.
:param location (str): The location to fetch weather for.
:return: Weather information as a JSON string.
:rtype: str
"""
# In a real-world scenario, you'd integrate with a weather API.
# In the following code snippet, we mock the response.
mock_weather_data = {"Seattle": "Sunny, 25°C", "London": "Cloudy, 18°C", "Tokyo": "Rainy, 22°C"}
weather = mock_weather_data.get(location, "Weather data not available for this location.")
weather_json = json.dumps({"weather": weather})
return weather_json
Set up the agent with the toolset, and create a response to test for evaluation.
Create an agent with the toolset as follows:
import os
import json
import time
from pprint import pprint
from dotenv import load_dotenv
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import PromptAgentDefinition
from azure.identity import DefaultAzureCredential
from openai.types.responses.response_input_param import FunctionCallOutput, ResponseInputParam
credential = DefaultAzureCredential()
project_client = AIProjectClient(
endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
credential=DefaultAzureCredential(),
)
with project_client:
openai_client = project_client.get_openai_client()
agent = project_client.agents.create_version(
agent_name="MyAgent",
definition=PromptAgentDefinition(
model=os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"],
instructions="You are a helpful assistant that can use function tools.",
tools=tools,
),
)
print(f"Agent created (id: {agent.id}, name: {agent.name}, version: {agent.version})")
conversation = openai_client.conversations.create(
items=[{"type": "message", "role": "user", "content": "What is the weather in Seattle?"}],
)
print(f"Created conversation with initial user message (id: {conversation.id})")
response = openai_client.responses.create(
conversation=conversation.id,
extra_body={"agent": {"name": agent.name, "type": "agent_reference"}},
)
print(f"Response output: {response.output_text} (id: {response.id})")
# Now create evaluation for the response
data_source_config = {"type": "azure_ai_source", "scenario": "responses"}
# add your desired evaluators here
testing_criteria = [
{"type": "azure_ai_evaluator", "name": "task_adherence", "evaluator_name": "builtin.task_adherence"},
{"type": "azure_ai_evaluator", "name": "groundedness", "evaluator_name": "builtin.groundedness"},
]
eval_object = openai_client.evals.create(
name="Agent Response Evaluation",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
print(f"Evaluation created (id: {eval_object.id}, name: {eval_object.name})")
data_source = {
"type": "azure_ai_responses",
"item_generation_params": {
"type": "response_retrieval",
"data_mapping": {"response_id": "{{item.resp_id}}"},
"source": {"type": "file_content", "content": [{"item": {"resp_id": response.id}}]},
},
}
response_eval_run = openai_client.evals.runs.create(
eval_id=eval_object.id, name=f"Evaluation Run for Agent {agent.name}", data_source=data_source
)
print(f"Evaluation run created (id: {response_eval_run.id})")
while response_eval_run.status not in ["completed", "failed"]:
response_eval_run = openai_client.evals.runs.retrieve(run_id=response_eval_run.id, eval_id=eval_object.id)
print(f"Waiting for eval run to complete... current status: {response_eval_run.status}")
time.sleep(5)
if response_eval_run.status == "completed":
print("\n✓ Evaluation run completed successfully!")
print(f"Result Counts: {response_eval_run.result_counts}")
print(f"Eval Run Report URL: {response_eval_run.report_url}")
output_items = list(
openai_client.evals.runs.output_items.list(run_id=response_eval_run.id, eval_id=eval_object.id)
)
print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
print(f"{'-'*60}")
pprint(output_items)
print(f"{'-'*60}")
else:
print("\n✗ Evaluation run failed.")
Interpret the results
For a single data example, all evaluators output the following schema:
- Label: A "pass" or "fail" label, similar to the output of a unit test. Use this result to facilitate comparisons across evaluators.
- Score: A score on each evaluator's natural scale. Some evaluators use fine-grained rubrics and score on a 5-point scale (quality evaluators) or a 7-point scale (content safety evaluators). Others, like the textual similarity evaluators, use F1 scores that are floats between 0 and 1. The label field binarizes any non-binary score into "pass" or "fail" based on the threshold.
- Threshold: Any non-binary score is binarized into "pass" or "fail" based on a default threshold, which you can override in the SDK experience.
- Reason: To improve understandability, all LLM-judge evaluators also output a reasoning field that explains why a certain score was given.
- Details: Optionally, some evaluators (such as tool_call_accuracy) have a details field or flag that contains extra information to help you debug your application.

For aggregate results over multiple data examples (a dataset), the average rate of examples that pass makes up the pass rate for the dataset, as shown in the sketch after this list.
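Here's a minimal sketch of computing a dataset-level pass rate from per-example results; the rows list is hypothetical:
# Hypothetical per-example results emitted by an evaluator over a dataset.
rows = [
    {"intent_resolution_result": "pass"},
    {"intent_resolution_result": "fail"},
    {"intent_resolution_result": "pass"},
]
pass_rate = sum(r["intent_resolution_result"] == "pass" for r in rows) / len(rows)
print(f"intent_resolution pass rate: {pass_rate:.2%}")  # 66.67%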
After you follow the URL, you're redirected to Foundry. You can view the evaluation results in your Foundry project and debug your application. Use the reason fields and pass/fail results to assess the quality and safety performance of your applications. You can run and compare multiple runs to test for regressions or improvements.
Evaluate your Microsoft Foundry agents with the Microsoft Foundry SDK Python client library, which enables observability and transparency in agentic workflows.
Evaluate other agents
If you use agents outside of Foundry Agent Service, you can still evaluate them by preparing the right data for the evaluators of your choice.
Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types, such as strings, in query, response, and ground_truth, according to the single-turn data input requirements. However, extracting these simple data types from agent messages can be a challenge because of the complex interaction patterns of agents and the differences across frameworks. For example, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
As shown in the following examples, we enable agent message support for these built-in evaluators so you can evaluate these aspects of an agentic workflow. When evaluating agents, these evaluators might take tool_calls or tool_definitions as parameters that are unique to agents.
| Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
|---|---|---|---|---|
| `IntentResolutionEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |
| `ToolCallAccuracyEvaluator` | Required: `Union[str, list[Message]]` | Optional: `Union[str, list[Message]]` | Optional: `Union[dict, list[ToolCall]]` | Required: `list[ToolDefinition]` |
| `TaskAdherenceEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |
| `GroundednessEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Required: `list[ToolCall]` |
- `Message`: A `dict` of OpenAI-style messages that describe agent interactions with a user. For `query`, a system message must be the first message in the list.
- `ToolCall`: A `dict` that specifies tool calls invoked during agent interactions with a user.
- `ToolDefinition`: A `dict` that describes the tools available to an agent.
For ToolCallAccuracyEvaluator, you must provide either response or tool_calls.
GroundednessEvaluator requires tool_definitions to evaluate whether the agent's response is grounded in the tool outputs that the agent receives.
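For example, here's a minimal sketch of calling GroundednessEvaluator with tool definitions, assuming a model_config like the one created in the Simple agent data example that follows; the weather tool mirrors the samples in this article:
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)
result = groundedness(
    query="How is the weather in Seattle?",
    response="The weather in Seattle is sunny, 25°C.",
    tool_definitions=[{
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The location to fetch weather for."}
            }
        }
    }],
)
print(result)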
The following examples show the two data formats: simple agent data and agent messages. However, because of the unique requirements of these evaluators, we recommend referring to the sample notebooks, which illustrate the possible input paths for each evaluator.
As with other built-in AI-assisted quality evaluators, IntentResolutionEvaluator and TaskAdherenceEvaluator output a Likert score (an integer from 1 to 5; a higher score is better). ToolCallAccuracyEvaluator outputs the pass rate of all tool calls made (a float between 0 and 1) based on the user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. A default value is set for the binarization threshold, and you can override it. The two new keys are:
- `{metric_name}_result`: A "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: The numerical binarization threshold, either the default or as set by the user.
Simple agent data
In the simple agent data format, query and response are simple Python strings. For example:
import os
import json
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import IntentResolutionEvaluator, ResponseCompletenessEvaluator
model_config = AzureOpenAIModelConfiguration(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
api_key=os.environ["AZURE_OPENAI_API_KEY"],
api_version=os.environ["AZURE_OPENAI_API_VERSION"],
azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)
intent_resolution_evaluator = IntentResolutionEvaluator(model_config)
# Evaluate the query and response as strings.
# The following is a positive example. Intent is identified and understood and the response correctly resolves user intent.
result = intent_resolution_evaluator(
query="What are the opening hours of the Eiffel Tower?",
response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.",
)
print(json.dumps(result, indent=4))
See the following output (for more information, see Output format):
{
"intent_resolution": 5.0,
"intent_resolution_result": "pass",
"intent_resolution_threshold": 3,
"intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower, which directly addresses the user's query. The information is clear, accurate, and complete, fully resolving the user's intent.",
}
Agent tool calls and definitions
See the following example of tool_calls and tool_definitions for ToolCallAccuracyEvaluator:
import json
query = "How is the weather in Seattle?"
tool_calls = [{
"type": "tool_call",
"tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
"name": "fetch_weather",
"arguments": {
"location": "Seattle"
}
},
{
"type": "tool_call",
"tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
"name": "fetch_weather",
"arguments": {
"location": "London"
}
}]
tool_definitions = [{
"name": "fetch_weather",
"description": "Fetches the weather information for the specified location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to fetch weather for."
}
}
}
}]
from azure.ai.evaluation import ToolCallAccuracyEvaluator
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config) # reuse the config defined above
response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definitions)
print(json.dumps(response, indent=4))
See the following output (for more information, see Output format):
{
"tool_call_accuracy": 3, # a score between 1-5, higher is better
"tool_call_accuracy_result": "fail",
"tool_call_accuracy_threshold": 4,
"details": { ... } # helpful details for debugging the tool calls made by the agent
}
Agent message schema
In the agent message format, query and response are lists of OpenAI-style messages. Specifically, query carries the past agent-user interactions leading up to the last user query and requires the system message (for the agent) at the top of the list. response carries the last message from the agent, in response to the last user query.
The expected input format for the evaluators is a Python list of messages, as follows:
[
{
"role": "system" | "user" | "assistant" | "tool",
"createdAt": "ISO 8601 timestamp", // Optional for 'system'
"run_id": "string", // Optional, only for assistant/tool in tool call context
"tool_call_id": "string", // Optional, only for tool/tool_result
"name": "string", // Present if it's a tool call
"arguments": { ... }, // Parameters passed to the tool (if tool call)
"content": [
{
"type": "text" | "tool_call" | "tool_result",
"text": "string", // if type == text
"tool_call_id": "string", // if type == tool_call
"name": "string", // tool name if type == tool_call
"arguments": { ... }, // tool args if type == tool_call
"tool_result": { ... } // result if type == tool_result
}
]
}
]
Example query and response objects:
query = [
{
"role": "system",
"content": "You are an AI assistant interacting with Azure Maps services to serve user requests."
},
{
"createdAt": "2025-04-25T23:55:43Z",
"role": "user",
"content": [
{
"type": "text",
"text": "Find the address for coordinates 41.8781,-87.6298."
}
]
},
{
"createdAt": "2025-04-25T23:55:45Z",
"run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
"role": "assistant",
"content": [
{
"type": "tool_call",
"tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
"name": "azure_maps_reverse_address_search",
"arguments": {
"lat": "41.8781",
"lon": "-87.6298"
}
}
]
},
{
"createdAt": "2025-04-25T23:55:47Z",
"run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
"tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
"role": "tool",
"content": [
{
"type": "tool_result",
"tool_result": {
"address": "300 South Federal Street, Chicago, IL 60604",
"position": {
"lat": "41.8781",
"lon": "-87.6298"
}
}
}
]
},
{
"createdAt": "2025-04-25T23:55:48Z",
"run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
"role": "assistant",
"content": [
{
"type": "text",
"text": "The address for the coordinates 41.8781, -87.6298 is 300 South Federal Street, Chicago, IL 60604."
}
]
},
{
"createdAt": "2025-04-25T23:55:50Z",
"role": "user",
"content": [
{
"type": "text",
"text": "What timezone corresponds to 41.8781,-87.6298?"
}
]
},
]
response = [
{
"createdAt": "2025-04-25T23:55:52Z",
"run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
"role": "assistant",
"content": [
{
"type": "tool_call",
"tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
"name": "azure_maps_timezone",
"arguments": {
"lat": 41.878100000000003,
"lon": -87.629800000000003
}
}
]
},
{
"createdAt": "2025-04-25T23:55:54Z",
"run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
"tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
"role": "tool",
"content": [
{
"type": "tool_result",
"tool_result": {
"ianaId": "America/Chicago",
"utcOffset": None,
"abbreviation": None,
"isDaylightSavingTime": None
}
}
]
},
{
"createdAt": "2025-04-25T23:55:55Z",
"run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
"role": "assistant",
"content": [
{
"type": "text",
"text": "The timezone for the coordinates 41.8781, -87.6298 is America/Chicago."
}
]
}
]
Note
If the query (the conversation history up to the current run) or the agent response (responding to that query) isn't in the expected format, the evaluator displays a warning indicating that it can't parse the content.
See the following example of evaluating agent messages with ToolCallAccuracyEvaluator:
import json
# The user asked a question.
query = [
{
"role": "system",
"content": "You are a friendly and helpful customer service agent."
},
# Past interactions are omitted.
# ...
{
"createdAt": "2025-03-14T06:14:20Z",
"role": "user",
"content": [
{
"type": "text",
"text": "Hi, I need help with the last 2 orders on my account #888. Could you please update me on their status?"
}
]
}
]
# The agent emits multiple messages to fulfill the request.
response = [
{
"createdAt": "2025-03-14T06:14:30Z",
"run_id": "0",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello! Let me quickly look up your account details."
}
]
},
{
"createdAt": "2025-03-14T06:14:35Z",
"run_id": "0",
"role": "assistant",
"content": [
{
"type": "tool_call",
"tool_call_id": "tool_call_20250310_001",
"name": "get_orders",
"arguments": {
"account_number": "888"
}
}
]
},
# Many more messages are omitted.
# ...
# Here is the agent's final response:
{
"createdAt": "2025-03-14T06:15:05Z",
"run_id": "0",
"role": "assistant",
"content": [
{
"type": "text",
"text": "The order with ID 123 has been shipped and is expected to be delivered on March 15, 2025. However, the order with ID 124 is delayed and should now arrive by March 20, 2025. Is there anything else I can help you with?"
}
]
}
]
# An example of tool definitions available to the agent:
tool_definitions = [
{
"name": "get_orders",
"description": "Get the list of orders for a given account number.",
"parameters": {
"type": "object",
"properties": {
"account_number": {
"type": "string",
"description": "The account number to get the orders for."
}
}
}
},
# Other tool definitions are omitted.
# ...
]
result = tool_call_accuracy(
query=query,
response=response,
tool_definitions=tool_definitions
)
print(json.dumps(result, indent=4))
See the following output (for more information, see Output format):
{
"tool_call_accuracy": 2, # a score between 1-5, higher is better
"tool_call_accuracy_result": "fail",
"tool_call_accuracy_threshold": 3,
"details": { ... } # helpful details for debugging the tool calls made by the agent
}
This evaluation schema helps you parse your agent data outside of Foundry Agent Service, so that you can use our built-in evaluators to support observability in your agentic workflows.
Sample notebooks
Try an example for each of these evaluators in our sample repository.