Create custom judges with make_judge()

Custom judges are LLM-based scorers that evaluate GenAI agents against specific quality criteria. This tutorial shows how to create custom judges with make_judge() and use them to evaluate a customer support agent.

You will:

  1. Create an example agent to evaluate
  2. Define three custom judges that assess different criteria
  3. Build an evaluation dataset of test cases
  4. Run the evaluation and compare results across agent configurations

Step 1: Create an agent to evaluate

Create a GenAI agent that responds to customer support questions. The agent has a (fake) knob that controls the system prompt, so you can easily compare "good" and "bad" conversations in the judges' outputs.

  1. Initialize an OpenAI client to connect to an LLM hosted by either Databricks or OpenAI.

    Databricks-hosted LLMs

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    import mlflow
    from databricks.sdk import WorkspaceClient
    
    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()
    
    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
    

    OpenAI-hosted LLMs

    Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI SDKs
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Define the customer support agent:

    from mlflow.entities import Document
    from typing import List, Dict, Any, cast
    
    
    # This is a global variable that is used to toggle the behavior of the customer support agent
    RESOLVE_ISSUES = False
    
    
    @mlflow.trace(span_type="TOOL", name="get_product_price")
    def get_product_price(product_name: str) -> str:
        """Mock tool to get product pricing."""
        return f"${45.99}"
    
    
    @mlflow.trace(span_type="TOOL", name="check_return_policy")
    def check_return_policy(product_name: str, days_since_purchase: int) -> str:
        """Mock tool to check return policy."""
        if days_since_purchase <= 30:
            return "Yes, you can return this item within 30 days"
        return "Sorry, returns are only accepted within 30 days of purchase"
    
    
    @mlflow.trace
    def customer_support_agent(messages: List[Dict[str, str]]):
        # We use this toggle to see how the judge handles the issue resolution status
        system_prompt_postfix = (
            f"Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\\n"
            if not RESOLVE_ISSUES
            else ""
        )
    
        # Mock some tool calls based on the user's question
        user_message = messages[-1]["content"].lower()
        tool_results = []
    
        if "cost" in user_message or "price" in user_message:
            price = get_product_price("microwave")
            tool_results.append(f"Price: {price}")
    
        if "return" in user_message:
            policy = check_return_policy("microwave", 60)
            tool_results.append(f"Return policy: {policy}")
    
        messages_for_llm = [
            {
                "role": "system",
                "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
            },
            *messages,
        ]
    
        if tool_results:
            messages_for_llm.append({
                "role": "system",
                "content": f"Tool results: {', '.join(tool_results)}"
            })
    
        # Call LLM to generate a response
        output = client.chat.completions.create(
            model=model_name,  # This example uses Databricks hosted Claude 4 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
            messages=cast(Any, messages_for_llm),
        )
    
        return {
            "messages": [
                {"role": "assistant", "content": output.choices[0].message.content}
            ]
        }
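
    Before moving on to the judges, you can optionally smoke-test the agent to confirm that the client, model name, and tracing are wired up. The call below uses only the functions defined above:

    # Optional smoke test: ask the agent a pricing question and print its reply.
    test_response = customer_support_agent(
        [{"role": "user", "content": "How much does a microwave cost?"}]
    )
    print(test_response["messages"][0]["content"])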
    

Step 2: Define custom judges

Define three custom judges:

  • A judge that evaluates issue resolution using the inputs and outputs.
  • A judge that checks the response against expected behaviors.
  • A trace-based judge that validates tool calls by analyzing the execution trace.

Judges created with make_judge() return mlflow.entities.Feedback objects (see the direct-invocation sketch after the first judge below).

Example judge 1: Evaluate issue resolution

This judge evaluates whether the customer's issue was successfully resolved by analyzing the chat history (inputs) and the agent's responses (outputs).

from mlflow.genai.judges import make_judge
import json


# Create a judge that evaluates issue resolution using inputs and outputs
issue_resolution_judge = make_judge(
    name="issue_resolution",
    instructions="""
Evaluate if the customer's issue was resolved in the conversation.

User's messages: {{ inputs }}
Agent's responses: {{ outputs }}

Rate the resolution status and respond with exactly one of these values:
- 'fully_resolved': Issue completely addressed with clear solution
- 'partially_resolved': Some help provided but not fully solved
- 'needs_follow_up': Issue not adequately addressed

Your response must be exactly one of: 'fully_resolved', 'partially_resolved', or 'needs_follow_up'.
""",
)
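
You can also invoke a judge directly, outside of mlflow.genai.evaluate(), to spot-check its output. This sketch assumes the object returned by make_judge() is callable with inputs and outputs keyword arguments and returns an mlflow.entities.Feedback:

# Minimal hand-written sample conversation; not part of the evaluation dataset.
sample_feedback = issue_resolution_judge(
    inputs={"messages": [{"role": "user", "content": "Can I return my microwave?"}]},
    outputs={
        "messages": [
            {
                "role": "assistant",
                "content": "Yes, you can return it within 30 days of purchase.",
            }
        ]
    },
)
print(sample_feedback.value)      # e.g. 'fully_resolved'
print(sample_feedback.rationale)  # the judge's explanation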

Example judge 2: Check expected behaviors

This judge verifies that the agent's responses exhibit specific expected behaviors (such as providing pricing information or explaining the return policy) by comparing the outputs against predefined expectations.

# Create a judge that checks against expected behaviors
expected_behaviors_judge = make_judge(
    name="expected_behaviors",
    instructions="""
Compare the agent's response in {{ outputs }} against the expected behaviors in {{ expectations }}.

User's question: {{ inputs }}

Determine if the response exhibits the expected behaviors and respond with exactly one of these values:
- 'meets_expectations': Response exhibits all expected behaviors
- 'partially_meets': Response exhibits some but not all expected behaviors
- 'does_not_meet': Response does not exhibit expected behaviors

Your response must be exactly one of: 'meets_expectations', 'partially_meets', or 'does_not_meet'.
""",
)
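
The same direct-invocation pattern applies here; under the same assumption, the only difference is the additional expectations keyword argument, which mirrors the expectations field of the dataset in step 3:

# Illustrative values only; 'expectations' uses the same keys as the dataset below.
behavior_feedback = expected_behaviors_judge(
    inputs={"messages": [{"role": "user", "content": "How much does a microwave cost?"}]},
    outputs={"messages": [{"role": "assistant", "content": "The microwave costs $45.99."}]},
    expectations={"should_provide_pricing": True, "should_offer_alternatives": True},
)
print(behavior_feedback.value)  # e.g. 'meets_expectations'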

Example judge 3: Validate tool calls with a trace-based judge

This judge analyzes the execution trace to validate whether the appropriate tools were called. When you include {{ trace }} in the instructions, the judge becomes trace-based and can explore the trace autonomously.

# Create a trace-based judge that validates tool calls from the trace
tool_call_judge = make_judge(
    name="tool_call_correctness",
    instructions="""
Analyze the execution {{ trace }} to determine if the agent called appropriate tools for the user's request.

Examine the trace to:
1. Identify what tools were available and their purposes
2. Determine which tools were actually called
3. Assess whether the tool calls were reasonable for addressing the user's question

Evaluate the tool usage and respond with a boolean value:
- true: The agent called the right tools to address the user's request
- false: The agent called wrong tools, missed necessary tools, or called unnecessary tools

Your response must be a boolean: true or false.
""",
    # To analyze a full trace with a trace-based judge, a model must be specified
    model="databricks:/databricks-gpt-5-mini",
)
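
To try the trace-based judge on a single trace, run the instrumented agent once and fetch the trace it produced. The sketch below assumes mlflow.get_last_active_trace_id() and mlflow.get_trace() are available and that a trace-based judge accepts a trace keyword argument:

# Produce a trace with the agent defined in step 1, then hand it to the judge.
customer_support_agent([{"role": "user", "content": "How much does a microwave cost?"}])

trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
tool_feedback = tool_call_judge(trace=trace)
print(tool_feedback.value)      # True or False
print(tool_feedback.rationale)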

Step 3: Create a sample evaluation dataset

Each record's inputs is passed to the agent by mlflow.genai.evaluate(). You can optionally include expectations to enable correctness checks.

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
        "expectations": {
            "should_provide_pricing": True,
            "should_offer_alternatives": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
        "expectations": {
            "should_mention_return_policy": True,
            "should_ask_for_receipt": False,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
        "expectations": {
            "should_provide_troubleshooting_steps": True,
            "should_escalate_if_needed": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
        "expectations": {
            "should_remain_calm": True,
            "should_provide_solution": True,
        },
    },
]
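
As a quick sanity check of how these records reach the agent, note that mlflow.genai.evaluate() unpacks each record's inputs dict into keyword arguments for predict_fn, so the call below mirrors what the harness does for the first record:

# Replay the first dataset record against the agent by hand.
first_record = eval_dataset[0]
preview = customer_support_agent(**first_record["inputs"])
print(preview["messages"][0]["content"])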

Step 4: Evaluate your agent with the judges

You can use multiple judges together to evaluate different aspects of the agent. Run the evaluation to compare how the agent behaves when it tries to resolve issues versus when it does not.

import mlflow

# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False

result_unresolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,      # Checks inputs/outputs
        expected_behaviors_judge,    # Checks expected behaviors
        tool_call_judge,             # Validates tool usage
    ],
)

# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True

result_resolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,
        expected_behaviors_judge,
        tool_call_judge,
    ],
)

The evaluation results show how each judge scored the agent:

  • issue_resolution: Classifies conversations as 'fully_resolved', 'partially_resolved', or 'needs_follow_up'
  • expected_behaviors: Checks whether responses exhibit the expected behaviors ('meets_expectations', 'partially_meets', 'does_not_meet')
  • tool_call_correctness: Validates whether the appropriate tools were called (true/false)
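
To compare the two configurations programmatically, here is a minimal sketch; it assumes the object returned by mlflow.genai.evaluate() exposes a run_id and a metrics dict (the full per-row breakdown is available in the MLflow experiment UI):

# Print the aggregate judge metrics for both runs side by side.
for label, result in [
    ("RESOLVE_ISSUES=False", result_unresolved),
    ("RESOLVE_ISSUES=True", result_resolved),
]:
    print(f"--- {label} (run {result.run_id}) ---")
    for metric_name, metric_value in sorted(result.metrics.items()):
        print(f"  {metric_name}: {metric_value}")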

Next steps

Improve judge accuracy:

  • Align judges with human feedback - The base judge is a starting point. As you collect expert feedback on your application's outputs, align the LLM judge with that feedback to further improve its accuracy (a hedged sketch of collecting such feedback follows below).
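
As a heavily hedged sketch of the first step toward alignment, the snippet below records one piece of expert feedback on a trace. It assumes mlflow.log_feedback() and the AssessmentSource entities are available in MLflow 3.x; the trace ID, assessment name, and reviewer ID are placeholders:

from mlflow.entities import AssessmentSource, AssessmentSourceType

# Hypothetical example: attach a human label to an existing trace so it can later
# be used to align the issue_resolution judge with expert judgment.
mlflow.log_feedback(
    trace_id="<TRACE_ID>",  # placeholder: use a real trace ID from your experiment
    name="issue_resolution",
    value="fully_resolved",
    rationale="The agent gave the exact return-policy answer the customer needed.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="support-qa-reviewer",  # placeholder reviewer ID
    ),
)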