Build and trace retriever tools for unstructured data

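Use the Mosaic AI Agent Framework to build tools that let AI agents query unstructured data, such as a collection of documents. This page shows how to:

Develop a retriever locally
Create a retriever using Unity Catalog functions
Query external vector indexes
Add MLflow tracing for observability

To learn more about agent tools, see AI agent tools.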

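Locally develop Vector Search retriever tools with AI Bridge

The fastest way to start building a Databricks Vector Search retriever tool is to develop and test it locally using Databricks AI Bridge packages such as databricks-langchain and databricks-openai.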

LangChain/LangGraph

Install the latest version of databricks-langchain, which includes Databricks AI Bridge.

%pip install --upgrade databricks-langchain

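The following code prototypes a retriever tool that queries a hypothetical vector search index and binds it to an LLM locally so you can test its tool-calling behavior.

Provide a descriptive tool_description to help the agent understand the tool and determine when to invoke it.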

from databricks_langchain import VectorSearchRetrieverTool, ChatDatabricks

# Initialize the retriever tool.
vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation."
)

# Run a query against the vector search index locally for testing
vs_tool.invoke("Databricks Agent Framework?")

# Bind the retriever tool to your Langchain LLM of choice
llm = ChatDatabricks(endpoint="databricks-claude-sonnet-4-5")
llm_with_tools = llm.bind_tools([vs_tool])

# Chat with your LLM to test the tool calling functionality
llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")

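For scenarios that use direct-access indexes or Delta Sync indexes with self-managed embeddings, you must configure the VectorSearchRetrieverTool and specify a custom embedding model and text column. See options for providing embeddings.

The following example shows how to configure a VectorSearchRetrieverTool with the columns and embedding keys.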

from databricks_langchain import VectorSearchRetrieverTool
from databricks_langchain import DatabricksEmbeddings

embedding_model = DatabricksEmbeddings(
    endpoint="databricks-bge-large-en",
)

vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
  num_results=5, # Max number of documents to return
  columns=["primary_key", "text_column"], # List of columns to include in the search
  filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
  query_type="ANN", # Query type ("ANN" or "HYBRID").
  tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
  tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
  text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
  embedding=embedding_model # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

For additional details, see the API docs for VectorSearchRetrieverTool.

OpenAI

Install the latest version of databricks-openai, which includes Databricks AI Bridge.

%pip install --upgrade databricks-openai

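The following code prototypes a retriever that queries a hypothetical vector search index and integrates it with OpenAI's GPT models.

Provide a descriptive tool_description to help the agent understand the tool and determine when to invoke it.

For more information on OpenAI recommendations for tools, see the OpenAI Function Calling documentation.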

from databricks_openai import VectorSearchRetrieverTool
from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your_API_key>")  # Replace the placeholder with your OpenAI API key

# Initialize the retriever tool
dbvs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation"
)

messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": "Using the Databricks documentation, answer what is Spark?"
  }
]
first_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)

# Execute function code and parse the model's response and handle function calls.
tool_call = first_response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = dbvs_tool.execute(query=args["query"])  # For self-managed embeddings, optionally pass in openai_client=client

# Supply model with results – so it can incorporate them into its final response.
messages.append(first_response.choices[0].message)
messages.append({
  "role": "tool",
  "tool_call_id": tool_call.id,
  "content": json.dumps(result)
})
second_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)
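
The snippet above indexes tool_calls[0] directly, which fails if the model answers without calling the tool. The following is a minimal sketch of a guard for that case. It uses stand-in objects that only mimic the attribute layout of first_response.choices[0].message, and extract_tool_query is a hypothetical helper name, not part of databricks-openai or the OpenAI SDK.

```python
import json
from types import SimpleNamespace
from typing import Optional


def extract_tool_query(message) -> Optional[str]:
    """Return the `query` argument of the first tool call, or None if there is no tool call."""
    tool_calls = getattr(message, "tool_calls", None)
    if not tool_calls:
        # The model answered directly, so there is nothing to execute.
        return None
    args = json.loads(tool_calls[0].function.arguments)
    return args.get("query")


# Stand-in messages (assumption: same attribute layout as the chat completions message).
with_call = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(function=SimpleNamespace(arguments='{"query": "What is Spark?"}'))
    ]
)
without_call = SimpleNamespace(tool_calls=None)

print(extract_tool_query(with_call))
print(extract_tool_query(without_call))
```

When the helper returns None, the agent can skip the retriever call and use the first response as the final answer.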

The following example shows how to configure a VectorSearchRetrieverTool with the columns and embedding_model_name keys.

from databricks_openai import VectorSearchRetrieverTool

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
    num_results=5, # Max number of documents to return
    columns=["primary_key", "text_column"], # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
    query_type="ANN", # Query type ("ANN" or "HYBRID").
    tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
    tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
    text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
    embedding_model_name="databricks-bge-large-en" # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

For additional details, see the API docs for VectorSearchRetrieverTool.

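Once your local tool is ready, you can productionize it directly as part of your agent code, or migrate it to a Unity Catalog function, which offers better discoverability and governance but has certain limitations.

The following section shows how to migrate the retriever to a Unity Catalog function.

Vector Search retriever tool with Unity Catalog functions

You can create a Unity Catalog function that wraps a Mosaic AI Vector Search index query. This approach:

Supports production use cases with governance and discoverability
Uses the vector_search() SQL function under the hood
Supports automatic MLflow tracing
- You must align the function's output to the MLflow retriever schema by using the page_content and metadata aliases.
- Any additional metadata columns must be added to the metadata column, rather than as top-level output keys.

Run the following code in a notebook or SQL editor to create the function: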

CREATE OR REPLACE FUNCTION main.default.databricks_docs_vector_search (
  -- The agent uses this comment to determine how to generate the query string parameter.
  query STRING
  COMMENT 'The query string for searching Databricks documentation.'
) RETURNS TABLE
-- The agent uses this comment to determine when to call this tool. It describes the types of documents and information contained within the index.
COMMENT 'Executes a search on Databricks documentation to retrieve text documents most relevant to the input query.' RETURN
SELECT
  chunked_text as page_content,
  map('doc_uri', url, 'chunk_id', chunk_id) as metadata
FROM
  vector_search(
    -- Specify your Vector Search index name here
    index => 'catalog.schema.databricks_docs_index',
    query => query,
    num_results => 5
  )

To use this retriever tool in your AI agent, wrap it with UCFunctionToolkit. This enables automatic tracing through MLflow by automatically generating RETRIEVER span types in MLflow logs.

from unitycatalog.ai.langchain.toolkit import UCFunctionToolkit

toolkit = UCFunctionToolkit(
    function_names=[
        "main.default.databricks_docs_vector_search"
    ]
)
tools = toolkit.tools

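Unity Catalog retriever tools have the following caveats:

SQL clients may limit the maximum number of rows or bytes returned. To prevent data truncation, truncate the column values returned by the UDF. For example, you can use substring(chunked_text, 0, 8192) to reduce the size of large content columns and avoid row truncation during execution.
Because this tool is a wrapper for the vector_search() function, it is subject to the same limitations as the vector_search() function. See Limitations.

For more information about UCFunctionToolkit, see the Unity Catalog documentation.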

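Retriever that queries a vector index hosted outside of Databricks

If your vector index is hosted outside of Azure Databricks, you can create a Unity Catalog connection to the external service and use the connection in your agent code. See Connect AI agent tools to external services.

The following example creates a retriever that calls a vector index hosted outside of Databricks for a PyFunc-flavored agent.

Create a Unity Catalog connection to the external service, in this case, Azure.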

CREATE CONNECTION ${connection_name}
TYPE HTTP
OPTIONS (
  host 'https://example.search.windows.net',
  base_path '/',
  bearer_token secret ('<secret-scope>','<secret-key>')
);

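Define the retriever tool in agent code using the Unity Catalog connection. This example uses MLflow decorators to enable agent tracing.

Note

To conform to the MLflow retriever schema, the retriever function should return a List[Document] object and use the metadata field in the Document class to add additional attributes to the returned document, such as doc_uri and similarity_score. See MLflow Document.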

import mlflow
import json

from mlflow.entities import Document
from typing import List, Dict, Any
from dataclasses import asdict

class VectorSearchRetriever:
  """
  Class that queries an externally hosted vector index through a Unity Catalog connection.
  """

  def __init__(self):
    self.azure_search_index = "hotels_vector_index"
    # Placeholder: name of the Unity Catalog connection created above.
    self.connection_name = "<connection_name>"

  @mlflow.trace(span_type="RETRIEVER", name="vector_search")
  def __call__(self, query_vector: List[Any], score_threshold=None) -> List[Document]:
    """
    Performs vector search to retrieve relevant chunks.
    Args:
      query_vector: Embedding vector of the search query.
      score_threshold: Minimum score a result needs to be kept. If None, no results are filtered out.

    Returns:
      List of retrieved Documents.
    """
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import ExternalFunctionRequestHttpMethod

    # Named request_body so that it does not shadow the json module.
    request_body = {
      "count": True,
      "select": "HotelId, HotelName, Description, Category",
      "vectorQueries": [
        {
          "vector": query_vector,
          "k": 7,
          "fields": "DescriptionVector",
          "kind": "vector",
          "exhaustive": True,
        }
      ],
    }

    response_text = (
      WorkspaceClient()
      .serving_endpoints.http_request(
        conn=self.connection_name,
        method=ExternalFunctionRequestHttpMethod.POST,
        path=f"indexes/{self.azure_search_index}/docs/search?api-version=2023-07-01-Preview",
        json=request_body,
      )
      .text
    )

    # The response body is a JSON string, so parse it before reading the results.
    vs_results = json.loads(response_text)
    documents = self.convert_vector_search_to_documents(vs_results, score_threshold)
    # Convert the Document objects to plain dictionaries before returning them.
    return [asdict(doc) for doc in documents]

  @mlflow.trace(span_type="PARSER")
  def convert_vector_search_to_documents(
    self, vs_results, score_threshold
  ) -> List[Document]:
    docs = []

    for item in vs_results.get("value", []):
      score = item.get("@search.score", 0)

      # Keep every result when no threshold is given.
      if score_threshold is None or score >= score_threshold:
        metadata = {
          "score": score,
          "HotelName": item.get("HotelName"),
          "Category": item.get("Category"),
        }

        doc = Document(
          page_content=item.get("Description", ""),
          metadata=metadata,
          id=item.get("HotelId"),
        )
        docs.append(doc)

    return docs

To run the retriever, run the following Python code. You can optionally include Vector Search filters in the request to filter results.

retriever = VectorSearchRetriever()
query = [0.01944167, 0.0040178085 . . .  TRIMMED FOR BREVITY 010858015, -0.017496133]
results = retriever(query, score_threshold=0.1)
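
The optional filter mentioned above can be added to the request body before it is sent. The following is a minimal sketch, assuming the external service is Azure AI Search, whose search request body accepts an OData expression in a filter field. build_search_body is a hypothetical helper, and the Category filter assumes that field is marked filterable in the index.

```python
import json
from typing import Any, Dict, List, Optional


def build_search_body(
    query_vector: List[float],
    k: int = 7,
    odata_filter: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the search request body, adding a filter only when one is given."""
    body: Dict[str, Any] = {
        "count": True,
        "select": "HotelId, HotelName, Description, Category",
        "vectorQueries": [
            {
                "vector": query_vector,
                "k": k,
                "fields": "DescriptionVector",
                "kind": "vector",
                "exhaustive": True,
            }
        ],
    }
    if odata_filter:
        # Assumption: the index marks the filtered field as filterable.
        body["filter"] = odata_filter
    return body


# Short dummy vector for illustration; a real query uses the full embedding.
body = build_search_body([0.1, 0.2, 0.3], odata_filter="Category eq 'Luxury'")
print(json.dumps(body, indent=2))
```

Keeping the filter optional leaves the unfiltered request identical to the one in the retriever example.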

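Add tracing to a retriever

Add MLflow tracing to monitor and debug your retriever. Tracing lets you view inputs, outputs, and metadata for each step of execution.

The previous example adds the @mlflow.trace decorator to both the __call__ and parsing methods. The decorator creates a span that starts when the function is invoked and ends when it returns. MLflow automatically records the function's input and output, and any exceptions raised.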
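
To make the span lifecycle concrete, the following is a simplified stand-in written with the standard library only. It is not how MLflow implements @mlflow.trace. toy_trace and RECORDED_SPANS are hypothetical names that only illustrate what a span captures: a start and end time, the inputs, the outputs, and any exception.

```python
import functools
import time
from typing import Any, Callable, Dict, List, Optional

RECORDED_SPANS: List[Dict[str, Any]] = []  # Stand-in for a trace backend.


def toy_trace(span_type: str, name: Optional[str] = None) -> Callable:
    """Toy decorator that records one span per call of the wrapped function."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span: Dict[str, Any] = {
                "name": name or func.__name__,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "outputs": None,
                "exception": None,
                "start": time.time(),  # The span starts when the function is invoked.
            }
            try:
                span["outputs"] = func(*args, **kwargs)
                return span["outputs"]
            except Exception as exc:
                span["exception"] = repr(exc)
                raise
            finally:
                span["end"] = time.time()  # The span ends on return or on error.
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


@toy_trace(span_type="RETRIEVER", name="vector_search")
def retrieve(query: str) -> List[Dict[str, Any]]:
    return [{"page_content": f"result for {query}", "metadata": {"doc_uri": "doc-1"}}]


retrieve("hotels near the beach")
print(RECORDED_SPANS[0]["name"], RECORDED_SPANS[0]["outputs"])
```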

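Note

LangChain, LlamaIndex, and OpenAI library users can use MLflow autologging in addition to manually defining traces with the decorator. See Add traces to applications: automatic and manual tracing.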

import mlflow
from typing import List
from mlflow.entities import Document

## This code snippet has been truncated for brevity, see the full retriever example above
class VectorSearchRetriever:
  ...

  # Create a RETRIEVER span. The span name must match the retriever schema name.
  @mlflow.trace(span_type="RETRIEVER", name="vector_search")
  def __call__(self, query_vector, score_threshold=None) -> List[Document]:
    ...

  # Create a PARSER span.
  @mlflow.trace(span_type="PARSER")
  def convert_vector_search_to_documents(self, vs_results, score_threshold) -> List[Document]:
    ...

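To ensure that downstream applications such as Agent Evaluation and the AI Playground render the retriever trace correctly, make sure the decorator meets the following requirements:

Use the MLflow retriever span schema (https://mlflow.org/docs/latest/tracing/tracing-schema.html#retriever-spans) and ensure the function returns a List[Document] object.
The trace name and the retriever_schema name must match to configure the trace correctly. See the following section to learn how to set the retriever schema.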

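Set retriever schema to ensure MLflow compatibility

If the trace returned from the retriever or span_type="RETRIEVER" does not conform to MLflow's standard retriever schema, you must manually map the returned schema to MLflow's expected fields. This ensures that MLflow can properly trace your retriever and render traces in downstream applications.

To set the retriever schema manually:

Call mlflow.models.set_retriever_schema when you define your agent. Use set_retriever_schema to map the column names in the returned table to MLflow's expected fields such as primary_key, text_column, and doc_uri.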

# Define the retriever's schema by providing your column names
mlflow.models.set_retriever_schema(
  name="vector_search",
  primary_key="chunk_id",
  text_column="text_column",
  doc_uri="doc_uri",
  # other_columns=["column1", "column2"],
)

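Specify additional columns in your retriever's schema by providing a list of column names with the other_columns field.
If you have multiple retrievers, you can define multiple schemas by using unique names for each retriever schema.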
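
As an illustration of what the mapping expresses, the following sketch renames the columns of one returned row to the expected field names. It is only a mental model and says nothing about MLflow's internals. to_expected_fields and the sample row are hypothetical, and the mapping mirrors the arguments in the set_retriever_schema call above.

```python
from typing import Any, Dict

# Expected field name -> column name in the retriever's output.
RETRIEVER_SCHEMA: Dict[str, str] = {
    "primary_key": "chunk_id",
    "text_column": "text_column",
    "doc_uri": "doc_uri",
}


def to_expected_fields(row: Dict[str, Any], schema: Dict[str, str]) -> Dict[str, Any]:
    """Pick the mapped columns out of a row and key them by the expected field names."""
    return {field: row.get(column) for field, column in schema.items()}


# Hypothetical row returned by a retriever.
row = {
    "chunk_id": "42",
    "text_column": "Sample chunk text",
    "doc_uri": "https://example.com/docs/page",
    "score": 0.87,
}

mapped = to_expected_fields(row, RETRIEVER_SCHEMA)
print(mapped)
```

Columns that are not mapped, such as score here, are the ones you would list in other_columns.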

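The retriever schema set during agent creation affects downstream applications and workflows, such as the review app and evaluation sets. Specifically, the doc_uri column serves as the primary identifier for documents returned by the retriever.

The review app displays the doc_uri to help reviewers assess responses and trace document origins. See Review App UI.
Evaluation sets use doc_uri to compare retriever results against predefined evaluation datasets to determine the retriever's recall and precision. See Evaluation sets (MLflow 2).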
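
The recall and precision comparison can be sketched as a set comparison on doc_uri values. This is a minimal illustration of the two metrics, not the implementation used by Agent Evaluation. doc_uri_precision_recall and the sample URIs are hypothetical.

```python
from typing import Iterable, Tuple


def doc_uri_precision_recall(
    retrieved: Iterable[str], expected: Iterable[str]
) -> Tuple[float, float]:
    """Compare retrieved doc_uri values against the expected ones from an evaluation set."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = retrieved_set & expected_set
    # Precision: share of retrieved documents that were expected.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of expected documents that were retrieved.
    recall = len(hits) / len(expected_set) if expected_set else 0.0
    return precision, recall


precision, recall = doc_uri_precision_recall(
    retrieved=["docs/a", "docs/b", "docs/c", "docs/d"],
    expected=["docs/a", "docs/c", "docs/e"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the four retrieved documents are expected, and two of the three expected documents are retrieved.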

Next steps

After you create a Unity Catalog tool, add it to your agent. See Create an agent tool.

Last updated on 2025-12-09