
Azure OpenAI graders (preview)


Azure OpenAI graders are a new set of evaluation tools in the Microsoft Foundry SDK for assessing the performance of AI models and their outputs. These graders include:

Name | Type | What it does
label_grader | label_model | Uses an LLM to classify sentiment as positive, neutral, or negative.
text_check_grader | text_similarity | Compares ground truth and response for similarity using the BLEU score.
string_check_grader | string_check | Performs a string equality check between two values.
score | score_model | Assigns a similarity score (1-5) based on semantic and structural comparison.

You can run graders locally or remotely. Each grader evaluates specific aspects of AI models and their outputs.

Important

Items marked "(preview)" in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Model configuration for AI-assisted graders

The following code snippet shows the model configuration used by the AI-assisted graders:

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION")
)
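
The configuration above assumes these values are available as environment variables, typically loaded from a local .env file by load_dotenv(). A minimal sketch of such a file, with placeholder values only (replace them with your own resource details):

# .env (placeholder values)
AZURE_ENDPOINT=https://<your-resource-name>.openai.azure.com/
AZURE_API_KEY=<your-api-key>
AZURE_DEPLOYMENT_NAME=gpt-4o
AZURE_API_VERSION=2024-12-01-preview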

Label grader

AzureOpenAILabelGrader uses a custom prompt to instruct a model to classify outputs based on defined labels. It returns structured results along with explanations for why each label was chosen.

Note

We recommend using Azure OpenAI o3-mini for the best results.

Here's an example of the data.jsonl file used in the following code snippets:

[
    {
        "query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
        "ground_truth": "Choosing an in-network provider helps you save money and ensures better, more personalized care. [Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer better coverage, and support continuity of care, leading to more effective and personalized treatment. [Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What should you do when choosing an in-network provider for your health care needs?",
        "ground_truth": "Check with Northwind Health Plus to confirm the provider is in-network, as this helps reduce costs.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "To choose an in-network provider, confirm they are part of your plan using the Northwind website or by calling their office. Also consider location, availability, and office hours. [Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What range of in-network providers does Northwind Health Plus offer?",
        "ground_truth": "Northwind Health Plus offers a wide range of in-network providers, including doctors, specialists, hospitals, and pharmacies.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus offers a wide range of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]. The exact range of in-network providers may vary depending on the area, but the Northwind Health Plus website provides a searchable directory of all in-network providers in your area [Northwind_Health_Plus_Benefits_Details.pdf]. It is important to check with Northwind Health Plus to ensure that the provider you are considering is in-network [Northwind_Health_Plus_Benefits_Details.pdf]."
    },
    {
        "query": "What is the limitation of in-network providers?",
        "ground_truth": "The limitation of in-network providers is that they may not always be available in every area.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "The limitation of in-network providers is that they may not accept the amount of payment offered by Northwind Health, which means you may be responsible for a greater portion of the cost [Northwind_Standard_Benefits_Details.pdf]. Additionally, out-of-network providers may not offer additional services or discounts that are available to in-network providers [Northwind_Standard_Benefits_Details.pdf]. It is recommended to choose an in-network provider whenever possible to ensure the best value for your health care expenses [Northwind_Standard_Benefits_Details.pdf].\n\nSources:\n- Northwind_Standard_Benefits_Details.pdf"
    },
    {
        "query": "What resource does Northwind Health Plus provide to find in-network providers in your area?",
        "ground_truth": "The Northwind Health Plus website offers a searchable directory of all in-network providers in your area. This directory is regularly updated, so you can be sure that you are choosing from in-network providers that are available.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus provides a variety of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]."
    }
]

Label grader example

from azure.ai.evaluation import AzureOpenAILabelGrader, evaluate

data_file_name="data.jsonl"

#  Evaluation criteria: Determine if the response column contains text that is "too short," "just right," or "too long," and pass if it is "just right."
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[{"content": "{{item.response}}", "role": "user"},
           {"content": "Any text including space that's more than 600 characters is too long, less than 500 characters is too short; 500 to 600 characters is just right.", "role": "user", "type": "message"}],
    labels=["too short", "just right", "too long"],
    passing_labels=["just right"],
    model="gpt-4o",
    name="label",
)

label_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "label": label_grader
    },
)

Label grader output

For each sample dataset in the data file, an evaluation result of True or False is returned, indicating whether the output matches the defined passing labels. The score is 1.0 for True cases and 0.0 for False cases. The reason the model assigned the label to the data is provided in content under outputs.label.sample.

'outputs.label.sample':
...
...
    'output': [{'role': 'assistant',
      'content': '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.","conclusion":"The provided text contains 575 characters."},{"description":"Evaluate if the character count falls within the given ranges (greater than 600 too long, less than 500 too short, 500 to 600 just right).","conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],"result":"just right"}'}],
...
...
'outputs.label.passed': True,
'outputs.label.score': 1.0

In addition to individual data evaluation results, the grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'label.pass_rate': 0.2}, #1/5 in this case
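
If you want to work with these results in code rather than reading the printed output, the following is a minimal sketch, assuming the value returned by evaluate() can be treated as a dictionary with the 'metrics' and per-row 'rows' keys shown in the output above:

# Aggregated metric for the whole dataset, for example {'label.pass_rate': 0.2}
print(label_grader_evaluation["metrics"])

# Per-row results: each row carries the input columns plus the grader outputs
for row in label_grader_evaluation["rows"]:
    print(row["outputs.label.passed"], row["outputs.label.score"])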

String check grader

Compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. Useful for flexible text validation and pattern matching.

String check grader example

from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Evaluation criteria: Pass if the query column contains "What is"
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="starts with what is",
    operation="like", # "eq" for equal, "ne" for not equal, "like" for contains, "ilike" for case-insensitive contains
    reference="What is",
)

string_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "string": string_grader
    },
)

String check grader output

For each sample dataset in the data file, an evaluation result of True or False is returned, indicating whether the input text matches the defined pattern-matching rule. The score is 1.0 for True cases and 0.0 for False cases.

'outputs.string.passed': True,
'outputs.string.score': 1.0

The grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'string.pass_rate': 0.4}, # 2/5 in this case

Text similarity

Evaluates how closely input text matches a reference value using similarity metrics such as fuzzy_match, BLEU, ROUGE, or METEOR. This is useful for assessing text quality or semantic closeness.

Text similarity example

from azure.ai.evaluation import AzureOpenAITextSimilarityGrader

# Evaluation criteria: Pass if response column and ground_truth column similarity score >= 0.5 using "fuzzy_match"
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    evaluation_metric="fuzzy_match", # support evaluation metrics including: "fuzzy_match", "bleu", "gleu", "meteor", "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l", "cosine",
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "similarity": sim_grader
    },
)
sim_grader_evaluation

Text similarity output

For each sample dataset in the data file, a numeric similarity score is generated. This score ranges from 0 to 1, with higher scores indicating greater similarity. An evaluation result of True or False is also returned, indicating whether the similarity score meets or exceeds the threshold specified for the evaluation metric defined in the grader.

'outputs.similarity.passed': True,
'outputs.similarity.score': 0.6117136659436009

The grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'similarity.pass_rate': 0.4}, # 2 out of 5 in this case

Python grader

Advanced users can create or import custom Python grader functions and integrate them into the Azure OpenAI Python grader. This lets you tailor evaluations to specific areas of interest beyond the capabilities of the existing Azure OpenAI graders. The following example shows how to import a custom similarity grader function and configure it to run as an Azure OpenAI Python grader by using the Microsoft Foundry SDK.

Python grader example

from azure.ai.evaluation import AzureOpenAIPythonGrader

python_similarity_grader = AzureOpenAIPythonGrader(
    model_config=model_config,
    name="custom_similarity",
    image_tag="2025-05-08",
    pass_threshold=0.3,
    source="""
def grade(sample, item) -> float:
    \"\"\"
    Custom similarity grader using word overlap.
    Note: All data is in the 'item' parameter.
    \"\"\"
    # Extract from item, not sample!
    response = item.get("response", "") if isinstance(item, dict) else ""
    ground_truth = item.get("ground_truth", "") if isinstance(item, dict) else ""

    # Simple word overlap similarity
    response_words = set(response.lower().split())
    truth_words = set(ground_truth.lower().split())

    if not truth_words:
        return 0.0

    overlap = response_words.intersection(truth_words)
    similarity = len(overlap) / len(truth_words)

    return min(1.0, similarity)
""",
)

evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "custom_similarity": python_similarity_grader,
    },
    #azure_ai_project=azure_ai_project,
)
evaluation
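
Because the grader source is passed as a plain string, it can help to sanity-check the same logic locally before submitting an evaluation. The sketch below re-defines the word-overlap function as ordinary Python and calls it on a hypothetical item dictionary; the field names mirror the response and ground_truth columns in data.jsonl:

def grade(sample, item) -> float:
    """Word-overlap similarity between item['response'] and item['ground_truth']."""
    response = item.get("response", "") if isinstance(item, dict) else ""
    ground_truth = item.get("ground_truth", "") if isinstance(item, dict) else ""

    response_words = set(response.lower().split())
    truth_words = set(ground_truth.lower().split())
    if not truth_words:
        return 0.0
    return min(1.0, len(response_words.intersection(truth_words)) / len(truth_words))

# Hypothetical item for a quick local check
print(grade(None, {
    "response": "In-network providers reduce costs and support continuity of care.",
    "ground_truth": "Choosing an in-network provider helps you save money.",
}))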

Output

For each sample dataset in the data file, the Python grader returns a numeric score based on the defined function. Given the numeric pass threshold defined as part of the custom grader, the result is True when score >= threshold and False otherwise.

For example:

"outputs.custom_similarity.passed": false,
"outputs.custom_similarity.score": 0.0

In addition to individual data evaluation results, the grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'custom_similarity.pass_rate': 0.0}, #0/5 in this case

Example

The following sample uploads the evaluation dataset to a Microsoft Foundry project, then creates and polls an evaluation run that uses the Azure OpenAI graders (label, text similarity, string check, and score model) as testing criteria:

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    DatasetVersion,
)
import json
import time
import os
from pprint import pprint
from openai.types.evals.create_eval_jsonl_run_data_source_param import CreateEvalJSONLRunDataSourceParam, SourceFileID
from dotenv import load_dotenv
from datetime import datetime


load_dotenv()

endpoint = os.environ[
    "AZURE_AI_PROJECT_ENDPOINT"
]  # Sample : https://<account_name>.services.ai.azure.com/api/projects/<project_name>

connection_name = os.environ.get("CONNECTION_NAME", "")
model_endpoint = os.environ.get("MODEL_ENDPOINT", "")  # Sample: https://<account_name>.openai.azure.com.
model_api_key = os.environ.get("MODEL_API_KEY", "")
model_deployment_name = os.environ.get("MODEL_DEPLOYMENT_NAME", "")  # Sample : gpt-4o-mini
dataset_name = os.environ.get("DATASET_NAME", "")
dataset_version = os.environ.get("DATASET_VERSION", "1")

# Construct the paths to the data folder and data file used in this sample
script_dir = os.path.dirname(os.path.abspath(__file__))
data_folder = os.environ.get("DATA_FOLDER", os.path.join(script_dir, "data_folder"))
data_file = os.path.join(data_folder, "sample_data_evaluation.jsonl")

with DefaultAzureCredential() as credential:

    with AIProjectClient(endpoint=endpoint, credential=credential) as project_client:

        print("Upload a single file and create a new Dataset to reference the file.")
        dataset: DatasetVersion = project_client.datasets.upload_file(
            name=dataset_name or f"eval-data-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S_UTC')}",
            version=dataset_version,
            file_path=data_file,
        )
        pprint(dataset)

        print("Creating an OpenAI client from the AI Project client")

        client = project_client.get_openai_client()

        data_source_config = {
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "response": {"type": "string"},
                    "context": {"type": "string"},
                    "ground_truth": {"type": "string"},
                },
                "required": [],
            },
            "include_sample_schema": True,
        }

        testing_criteria = [
            {
                "type": "label_model",
                "model": "{{aoai_deployment_and_model}}",
                "input": [
                    {
                        "role": "developer",
                        "content": "Classify the sentiment of the following statement as one of 'positive', 'neutral', or 'negative'",
                    },
                    {"role": "user", "content": "Statement: {{item.query}}"},
                ],
                "passing_labels": ["positive", "neutral"],
                "labels": ["positive", "neutral", "negative"],
                "name": "label_grader",
            },
            {
                "type": "text_similarity",
                "input": "{{item.ground_truth}}",
                "evaluation_metric": "bleu",
                "reference": "{{item.response}}",
                "pass_threshold": 1,
                "name": "text_check_grader",
            },
            {
                "type": "string_check",
                "input": "{{item.ground_truth}}",
                "reference": "{{item.ground_truth}}",
                "operation": "eq",
                "name": "string_check_grader",
            },
            {
                "type": "score_model",
                "name": "score",
                "model": "{{aoai_deployment_and_model}}",
                "input": [
                    {
                        "role": "system",
                        "content": 'Evaluate the degree of similarity between the given output and the ground truth on a scale from 1 to 5, using a chain of thought to ensure step-by-step reasoning before reaching the conclusion.\n\nConsider the following criteria:\n\n- 5: Highly similar - The output and ground truth are nearly identical, with only minor, insignificant differences.\n- 4: Somewhat similar - The output is largely similar to the ground truth but has few noticeable differences.\n- 3: Moderately similar - There are some evident differences, but the core essence is captured in the output.\n- 2: Slightly similar - The output only captures a few elements of the ground truth and contains several differences.\n- 1: Not similar - The output is significantly different from the ground truth, with few or no matching elements.\n\n# Steps\n\n1. Identify and list the key elements present in both the output and the ground truth.\n2. Compare these key elements to evaluate their similarities and differences, considering both content and structure.\n3. Analyze the semantic meaning conveyed by both the output and the ground truth, noting any significant deviations.\n4. Based on these comparisons, categorize the level of similarity according to the defined criteria above.\n5. Write out the reasoning for why a particular score is chosen, to ensure transparency and correctness.\n6. Assign a similarity score based on the defined criteria above.\n\n# Output Format\n\nProvide the final similarity score as an integer (1, 2, 3, 4, or 5).\n\n# Examples\n\n**Example 1:**\n\n- Output: "The cat sat on the mat."\n- Ground Truth: "The feline is sitting on the rug."\n- Reasoning: Both sentences describe a cat sitting on a surface, but they use different wording. The structure is slightly different, but the core meaning is preserved. There are noticeable differences, but the overall meaning is conveyed well.\n- Similarity Score: 3\n\n**Example 2:**\n\n- Output: "The quick brown fox jumps over the lazy dog."\n- Ground Truth: "A fast brown animal leaps over a sleeping canine."\n- Reasoning: The meaning of both sentences is very similar, with only minor differences in wording. The structure and intent are well preserved.\n- Similarity Score: 4\n\n# Notes\n\n- Always aim to provide a fair and balanced assessment.\n- Consider both syntactic and semantic differences in your evaluation.\n- Consistency in scoring similar pairs is crucial for accurate measurement.',
                    },
                    {"role": "user", "content": "Output: {{item.response}}\nGround Truth: {{item.ground_truth}}"},
                ],
                "image_tag": "2025-05-08",
                "pass_threshold": 0.5,
            },
        ]

        print("Creating Eval Group")
        eval_object = client.evals.create(
            name="aoai graders test",
            data_source_config=data_source_config,
            testing_criteria=testing_criteria,
        )
        print("Eval Group created")

        print("Get Eval Group by Id")
        eval_object_response = client.evals.retrieve(eval_object.id)
        print("Eval Group Response:")
        pprint(eval_object_response)

        print("Creating Eval Run")
        eval_run_object = client.evals.runs.create(
            eval_id=eval_object.id,
            name="dataset",
            metadata={"team": "eval-exp", "scenario": "notifications-v1"},
            data_source=CreateEvalJSONLRunDataSourceParam(
                source=SourceFileID(id=dataset.id or "", type="file_id"), type="jsonl"
            ),
        )
        print("Eval Run created")
        pprint(eval_run_object)

        print("Get Eval Run by Id")
        eval_run_response = client.evals.runs.retrieve(run_id=eval_run_object.id, eval_id=eval_object.id)
        print("Eval Run Response:")
        pprint(eval_run_response)

        while True:
            run = client.evals.runs.retrieve(run_id=eval_run_response.id, eval_id=eval_object.id)
            if run.status == "completed" or run.status == "failed":
                output_items = list(client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id))
                pprint(output_items)
                print(f"Eval Run Report URL: {run.report_url}")

                break
            time.sleep(5)
            print("Waiting for eval run to complete...")