

Custom metrics (MLflow 2)

Important

Databricks recommends using MLflow 3 to evaluate and monitor generative AI applications. This page describes MLflow 2 Agent Evaluation.

This guide explains how to evaluate AI applications within the Mosaic AI Agent Framework using custom metrics. Custom metrics give you the flexibility to define evaluation metrics tailored to your specific business use case, whether they are based on simple heuristics, advanced logic, or programmatic evaluations.

Overview

Custom metrics are written in Python and give developers full control to evaluate traces through an AI application. The following metric types are supported: pass/fail metrics, numeric metrics, and boolean metrics.

Custom metrics can use:

  • Any field in the evaluation row.
  • The custom_expected field for additional expected values.
  • Complete access to the MLflow trace, including spans, attributes, and outputs.

Usage

Custom metrics are passed to the evaluation harness using the extra_metrics field in mlflow.evaluate(). Example:

import mlflow
from databricks.agents.evals import metric

@metric
def not_empty(response):
    # "yes" for Pass and "no" for Fail.
    return "yes" if response.choices[0]['message']['content'].strip() != "" else "no"

@mlflow.trace(span_type="CHAT_MODEL")
def my_model(request):
    deploy_client = mlflow.deployments.get_deploy_client("databricks")
    return deploy_client.predict(
        endpoint="databricks-meta-llama-3-3-70b-instruct", inputs=request
    )

with mlflow.start_run(run_name="example_run"):
    eval_results = mlflow.evaluate(
        data=[{"request": "Good morning"}],
        model=my_model,
        model_type="databricks-agent",
        extra_metrics=[not_empty],
    )
    display(eval_results.tables["eval_results"])

The @metric decorator

The @metric decorator lets users pass custom evaluation metrics to the evaluation harness through the extra_metrics argument. The harness invokes the metric function with named arguments based on the signature below, so a metric only needs to declare the parameters it uses:

def my_metric(
  *,  # eval harness will always call it with named arguments
  request: Dict[str, Any],  # The agent's raw input as a serializable object
  response: Optional[Dict[str, Any]],  # The agent's raw output; directly passed from the eval harness
  retrieved_context: Optional[List[Dict[str, str]]],  # Retrieved context, either from input eval data or extracted from the trace
  expected_response: Optional[str],  # The expected output as defined in the evaluation dataset
  expected_facts: Optional[List[str]],  # A list of expected facts that can be compared against the output
  guidelines: Optional[Union[List[str], Dict[str, List[str]]]],  # A list of guidelines, or a mapping from a guideline name to an array of guidelines for that name
  expected_retrieved_context: Optional[List[Dict[str, str]]],  # Expected context for retrieval tasks
  trace: Optional[mlflow.entities.Trace],  # The trace object containing spans and other metadata
  custom_expected: Optional[Dict[str, Any]],  # A user-defined dictionary of extra expected values
  tool_calls: Optional[List[ToolCallInvocation]],
) -> float | bool | str | Assessment

Parameter descriptions

  • request: The input supplied to the agent, formatted as an arbitrary serializable object. This represents the user query or prompt.
  • response: The raw output of the agent, formatted as an optional arbitrary serializable object. It contains the agent's generated response to be evaluated.
  • retrieved_context: A list of dictionaries containing the context retrieved during the task. This context can come from the input evaluation dataset or from the trace, and users can override or customize its extraction through the trace field.
  • expected_response: A string representing the correct or desired response for the task. It serves as the ground truth to compare the agent's response against.
  • expected_facts: A list of facts expected to appear in the agent's response, useful for fact-checking tasks.
  • guidelines: A list of guidelines, or a mapping from a guideline name to an array of guidelines for that name. Guidelines let you provide constraints of any scope, which are then evaluated by the guideline adherence judge.
  • expected_retrieved_context: A list of dictionaries representing the expected retrieved context. This is essential for retrieval-augmented tasks where the correctness of the retrieved data matters.
  • trace: An optional MLflow Trace object containing spans, attributes, and other metadata about the agent's execution. This enables deep inspection of the internal steps the agent took.
  • custom_expected: A dictionary of user-defined expected values. This field provides the flexibility to include additional custom expectations not covered by the standard fields.
  • tool_calls: A list of ToolCallInvocation describing which tools were called and what they returned.

Return values

The return value of a custom metric is a per-row Assessment. If you return a primitive type, it is wrapped in an Assessment with an empty rationale.

  • float: For numeric metrics (for example, similarity scores or accuracy percentages).
  • bool: For binary metrics.
  • Assessment or list[Assessment]: A richer output type that supports adding a rationale. If you return a list of assessments, the same metric function can be reused to return multiple assessments.
    • name: The name of the assessment.
    • value: The value (a float, int, bool, or string).
    • rationale: (Optional) A rationale explaining how this value was computed. This is useful for showing extra reasoning in the UI. For example, this field is useful when providing reasoning from an LLM that generated the assessment.
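
The following is a minimal sketch, not part of the original examples, of a metric that returns an Assessment with a rationale. The 200-character limit is an arbitrary, hypothetical threshold, and response is assumed to be a plain string, as in the other simple examples on this page.

from databricks.agents.evals import metric
from mlflow.evaluation import Assessment

@metric
def concise_response(response):
  # Hypothetical length limit, chosen for illustration only.
  limit = 200
  return Assessment(
    name="concise_response",
    value=len(response) <= limit,
    rationale=f"The response is {len(response)} characters long; the assumed limit is {limit} characters.",
  )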

Pass/fail metrics

傳回 "yes""no" 的任何字串計量都會被視為通過/未通過的計量,並在使用者介面中有特殊處理。

You can also create a pass/fail metric with the callable judge Python SDK. This gives you more control over which parts of the trace to evaluate and which expected fields to use. You can use any of the built-in Mosaic AI Agent Evaluation judges. See Built-in AI judges (MLflow 2).

Ensure the retrieved context has no personally identifiable information (PII)

This example calls the guideline_adherence judge to ensure that the retrieved context contains no PII.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "retrieved_context": [{
      "content": "The email address is noreply@databricks.com",
    }],
  }, {
    "request": "Good afternoon",
    "response": "This is actually the morning!",
    "retrieved_context": [{
      "content": "fake retrieved context",
    }],
  }
]

@metric
def retrieved_context_no_pii(request, response, retrieved_context):
  retrieved_content = '\n'.join([c['content'] for c in retrieved_context])
  return judges.guideline_adherence(
    request=request,
    # You can also pass in per-row guidelines by adding `guidelines` to the signature of your metric
    guidelines=[
      "The retrieved context must not contain personally identifiable information.",
    ],
    # `guidelines_context` requires `databricks-agents>=0.20.0`
    guidelines_context={"retrieved_context": retrieved_content},
  )

with mlflow.start_run(run_name="safety"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[retrieved_context_no_pii],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Numeric metrics

Numeric metrics evaluate ordinal values, such as floats or integers. Numeric metrics are shown for each row in the UI, along with the average value for the evaluation run.

Example: Response similarity

This metric measures the similarity between response and expected_response using SequenceMatcher from the built-in Python difflib library.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from difflib import SequenceMatcher

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "expected_response": "Hello and good morning to you!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question.",
    "expected_response": "Good afternoon to you too!"
  }
]

@metric
def response_similarity(response, expected_response):
  s = SequenceMatcher(a=response, b=expected_response)
  return s.ratio()

with mlflow.start_run(run_name="response_similarity"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_similarity],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Boolean metrics

Boolean metrics evaluate to True or False. They are useful for binary decisions, such as checking whether a response satisfies a simple heuristic. If you want the metric to get special pass/fail treatment in the UI, see Pass/fail metrics.

Example: Check the format of the input request

This metric checks whether the arbitrary input request is in the expected format and returns True if it is.

import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": {"messages": [{"role": "user", "content": "Good morning"}]},
  }, {
    "request": {"inputs": ["Good afternoon"]},
  }, {
    "request": {"inputs": [1, 2, 3, 4]},
  }
]

@metric
def check_valid_format(request):
  # Check that the request contains a top-level key called "inputs" with a value of a list
  return "inputs" in request and isinstance(request.get("inputs"), list)

with mlflow.start_run(run_name="check_format"):
  eval_results = mlflow.evaluate(
      data=pd.DataFrame.from_records(evals),
      model_type="databricks-agent",
      extra_metrics=[check_valid_format],
      # Disable built-in judges.
      evaluator_config={
          'databricks-agent': {
              "metrics": [],
          }
      }
  )
eval_results.tables['eval_results']

Example: Language model self-reference

This metric checks whether the response mentions "LLM" and returns True if it does.

import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question."
  }
]

@metric
def response_mentions_llm(response):
  return "LLM" in response

with mlflow.start_run(run_name="response_mentions_llm"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_mentions_llm],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Using custom_expected

The custom_expected field can be used to pass any additional expected information to a custom metric.

Example: Response length bounds

This example shows how to require that the response length stays within (min_length, max_length) bounds set per example. Use custom_expected to store any row-level information that should be passed to custom metrics when creating an assessment.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good night.",
    "custom_expected": {
      "max_length": 100,
      "min_length": 3
    }
  }, {
    "request": "What is the date?",
    "response": "12/19/2024",
    "custom_expected": {
      "min_length": 10,
      "max_length": 20,
    }
  }
]

# The custom metric uses the "min_length" and "max_length" from the "custom_expected" field.
@metric
def response_len_bounds(
  request,
  response,
  # This is the custom_expected dictionary from your eval dataframe.
  custom_expected
):
  return len(response) <= custom_expected["max_length"] and len(response) >= custom_expected["min_length"]

with mlflow.start_run(run_name="response_len_bounds"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_len_bounds],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Assertions over traces

Custom metrics can evaluate any part of the MLflow trace produced by the agent, including spans, attributes, and outputs.

Example: Request classification & routing

This example builds an agent that determines whether the user query is a question or a statement and returns the classification to the user in plain English. In a more realistic scenario, you might use this technique to route different queries to different functionality.

The evaluation set verifies that the query-type classifier produces the correct results for a set of inputs by using custom metrics that inspect the MLflow trace.

This example uses MLflow Trace.search_spans to find the span named classify_question_answer, the custom span that this agent defines.


import mlflow
import pandas as pd
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace
from mlflow.deployments import get_deploy_client

# This agent is a toy example that classifies the user's request as either a question or a statement.
# It calls an LLM to classify the request, then returns the classification in natural language.

deploy_client = get_deploy_client("databricks")
ENDPOINT_NAME="databricks-meta-llama-3-3-70b-instruct"

@mlflow.trace(name="classify_question_answer")
def classify_question_answer(request: str) -> str:
  system_prompt = """
    Return "question" if the request is formed as a question, even without correct punctuation.
    Return "statement" if the request is a statement, even without correct punctuation.
    Return "unknown" otherwise.

    Do not return a preamble, only return a single word.
  """
  request = {
    "messages": [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": request},
    ],
    "temperature": .01,
    "max_tokens": 1000
  }

  result = deploy_client.predict(endpoint=ENDPOINT_NAME, inputs=request)
  return result.choices[0]['message']['content']

@mlflow.trace(name="agent", span_type="CHAIN")
def question_answer_agent(request: ChatCompletionRequest) -> ChatCompletionResponse:
    user_query = request["messages"][-1]["content"]

    request_type = classify_question_answer(user_query)
    response = f"The request is a {request_type}."

    return {
        "messages": [
            *request["messages"][:-1], # Keep the chat history.
            {"role": "user", "content": response}
        ]
    }

# Define the evaluation set with a set of requests and the expected request types for those requests.
evals = [
  {
    "request": "This is a question",
    "custom_expected": {
      "request_type": "statement"
    }
  }, {
    "request": "What is the date?",
    "custom_expected": {
      "request_type": "question"
    }
  },
]

# The custom metric checks the expected request type against the actual request type produced by the agent trace.
@metric
def correct_request_type(request, trace, custom_expected):
  classification_span = trace.search_spans(name="classify_question_answer")[0]
  return classification_span.outputs == custom_expected['request_type']

with mlflow.start_run(run_name="multiple_assessments_single_metric"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model=question_answer_agent,
        model_type="databricks-agent",
        extra_metrics=[correct_request_type],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

By building on these examples, you can design custom metrics to meet your unique evaluation needs.

Tool call evaluation

Custom metrics are provided with tool_calls, a list of ToolCallInvocation that gives you information about which tools were called and what they returned.

Example: Verify that the correct tool is called

Note

This example is not copy-pasteable because it does not define the LangGraph agent. See the accompanying notebook for a fully runnable example.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

eval_data = pd.DataFrame(
  [
    {
      "request": "what is 3 * 12?",
      "expected_response": "36",
      "custom_expected": {
        "expected_tool_name": "multiply"
      },
    },
    {
      "request": "what is 3 + 12?",
      "expected_response": "15",
      "custom_expected": {
        "expected_tool_name": "add"
      },
    },
  ]
)

@metric
def is_correct_tool(tool_calls, custom_expected):
  # Metric to check whether the first tool call is the expected tool
  return tool_calls[0].tool_name == custom_expected["expected_tool_name"]

@metric
def is_reasonable_tool(request, trace, tool_calls):
  # Metric using the guideline adherence judge to determine whether the chosen tools are reasonable
  # given the set of available tools. Note that `guidelines_context` requires `databricks-agents >= 0.20.0`

  return judges.guideline_adherence(
    request=request["messages"][0]["content"],
    guidelines=[
      "The selected tool must be a reasonable tool call with respect to the request and available tools.",
    ],
    # `guidelines_context` requires `databricks-agents>=0.20.0`
    guidelines_context={
      "available_tools": str(tool_calls[0].available_tools),
      "chosen_tools": str([tool_call.tool_name for tool_call in tool_calls]),
    },
  )

results = mlflow.evaluate(
  data=eval_data,
  model=tool_calling_agent,
  model_type="databricks-agent",
  extra_metrics=[is_correct_tool, is_reasonable_tool]
)
results.tables["eval_results"].display()

Developing custom metrics

As you develop metrics, you need to iterate on them quickly without having to run the agent every time you make a change. To simplify this, use the following strategy:

  1. Generate an answer sheet from the evaluation dataset and the agent. This runs the agent for each entry in the evaluation set, producing responses and traces that you can use to call the metric directly.
  2. Define the metric.
  3. Call the metric directly for each value in the answer sheet and iterate on the metric definition.
  4. When the metric behaves as you expect, run mlflow.evaluate() on the same answer sheet to verify that the results of Agent Evaluation match your expectations. The code in this example does not use the model= field, so the evaluation uses precomputed responses.
  5. When you are satisfied with the performance of the metric, enable the model= field in mlflow.evaluate() to call the agent interactively.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace

evals = [
  {
    "request": "What is Databricks?",
    "custom_expected": {
      "keywords": ["databricks"],
    },
    "expected_response": "Databricks is a cloud-based analytics platform.",
    "expected_facts": ["Databricks is a cloud-based analytics platform."],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "When was Databricks founded?",
    "custom_expected": {
      "keywords": ["when", "databricks", "founded"]
    },
    "expected_response": "Databricks was founded in 2012",
    "expected_facts": ["Databricks was founded in 2012"],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "How do I convert a timestamp_ms to a timestamp in dbsql?",
    "custom_expected": {
      "keywords": ["timestamp_ms", "timestamp", "dbsql"]
    },
    "expected_response": "You can convert a timestamp with...",
    "expected_facts": ["You can convert a timestamp with..."],
    "expected_retrieved_context": [{"content": "You can convert a timestamp with...", "doc_uri": "https://databricks.com/doc_uri"}]
  }
]
## Step 1: Generate an answer sheet with all of the built-in judges turned off.
## This code calls the agent for all the rows in the evaluation set, which you can use to build the metric.
answer_sheet_df = mlflow.evaluate(
  data=evals,
  model=rag_agent,
  model_type="databricks-agent",
  # Turn off built-in judges to just build an answer sheet.
  evaluator_config={"databricks-agent": {"metrics": []}
  }
).tables['eval_results']
display(answer_sheet_df)

answer_sheet = answer_sheet_df.to_dict(orient='records')

## Step 2: Define the metric.
@metric
def custom_metric_consistency(
  request,
  response,
  retrieved_context,
  expected_response,
  expected_facts,
  expected_retrieved_context,
  trace,
  # This is the custom_expected dictionary from your eval dataframe.
  custom_expected
):
  print(f"[custom_metric] request: {request}")
  print(f"[custom_metric] response: {response}")
  print(f"[custom_metric] retrieved_context: {retrieved_context}")
  print(f"[custom_metric] expected_response: {expected_response}")
  print(f"[custom_metric] expected_facts: {expected_facts}")
  print(f"[custom_metric] expected_retrieved_context: {expected_retrieved_context}")
  print(f"[custom_metric] trace: {trace}")

  return True

## Step 3: Call the metric directly before using the evaluation harness to iterate on the metric definition.
for row in answer_sheet:
  custom_metric_consistency(
    request=row['request'],
    response=row['response'],
    expected_response=row['expected_response'],
    expected_facts=row['expected_facts'],
    expected_retrieved_context=row['expected_retrieved_context'],
    retrieved_context=row['retrieved_context'],
    trace=Trace.from_json(row['trace']),
    custom_expected=row['custom_expected']
  )

## Step 4: After you are confident in the signature of the metric, you can run the harness with the answer sheet to trigger the output validation and make sure the UI reflects what you intended.
with mlflow.start_run(run_name="exact_expected_response"):
    eval_results = mlflow.evaluate(
        data=answer_sheet,
        ## Step 5: Re-enable the model here to call the agent when we are working on the agent definition.
        # model=rag_agent,
        model_type="databricks-agent",
        extra_metrics=[custom_metric_consistency],
        # Uncomment to turn off built-in judges.
        # evaluator_config={
        #     'databricks-agent': {
        #         "metrics": [],
        #     }
        # }
    )
    display(eval_results.tables['eval_results'])

Example notebook

The following example notebook illustrates some of the different ways to use custom metrics in Mosaic AI Agent Evaluation.

Agent Evaluation custom metrics example notebook

Get notebook