Evaluate your generative AI application locally with the Azure AI Evaluation SDK (preview)

Note

To learn about the new portal, see the Microsoft Foundry (new) documentation.

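Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

You can thoroughly assess the performance of your generative AI application by applying it to a substantial dataset. Use the Azure AI Evaluation SDK to evaluate the application in your development environment.

When you provide either a test dataset or a target, the outputs of your generative AI application are measured quantitatively with both mathematical metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can give you comprehensive insight into the application's capabilities and limitations.

In this article, you learn how to run evaluators on a single row of data and on a larger test dataset against an application target. You use built-in evaluators that run locally through the Azure AI Evaluation SDK. Then you learn how to track the results and evaluation logs in a Foundry project.

Get started

First, install the evaluators package from the Azure AI Evaluation SDK: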

pip install azure-ai-evaluation

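Note

For more information, see Azure AI Evaluation client library for Python.

Built-in evaluators

Built-in quality and safety metrics accept query and response pairs, along with additional information for specific evaluators.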

Category	Evaluators
General purpose	`CoherenceEvaluator`, `FluencyEvaluator`, `QAEvaluator`
Textual similarity	`SimilarityEvaluator`, `F1ScoreEvaluator`, `BleuScoreEvaluator`, `GleuScoreEvaluator`, `RougeScoreEvaluator`, `MeteorScoreEvaluator`
Retrieval-augmented generation (RAG)	`RetrievalEvaluator`, `DocumentRetrievalEvaluator`, `GroundednessEvaluator`, `GroundednessProEvaluator`, `RelevanceEvaluator`, `ResponseCompletenessEvaluator`
Risk and safety	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator`, `UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ContentSafetyEvaluator`
Agentic	`IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator`
Azure OpenAI	`AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader`

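Data requirements for built-in evaluators

Built-in evaluators can accept query and response pairs, a list of conversations in JSON Lines (JSONL) format, or both.

Evaluator	Conversation and single-turn support for text	Conversation and single-turn support for text and image	Single-turn support for text only	Requires `ground_truth`	Supports agent inputs
Quality evaluators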
`IntentResolutionEvaluator`					✓
`ToolCallAccuracyEvaluator`					✓
`TaskAdherenceEvaluator`					✓
`GroundednessEvaluator`	✓				✓
`GroundednessProEvaluator`	✓
`RetrievalEvaluator`	✓
`DocumentRetrievalEvaluator`	✓			✓
`RelevanceEvaluator`	✓				✓
`CoherenceEvaluator`	✓
`FluencyEvaluator`	✓
`ResponseCompletenessEvaluator`			✓	✓
`QAEvaluator`			✓	✓
Natural language processing (NLP) evaluators
`SimilarityEvaluator`			✓	✓
`F1ScoreEvaluator`			✓	✓
`RougeScoreEvaluator`			✓	✓
`GleuScoreEvaluator`			✓	✓
`BleuScoreEvaluator`			✓	✓
`MeteorScoreEvaluator`			✓	✓
Safety evaluators
`ViolenceEvaluator`		✓
`SexualEvaluator`		✓
`SelfHarmEvaluator`		✓
`HateUnfairnessEvaluator`		✓
`ProtectedMaterialEvaluator`		✓
`ContentSafetyEvaluator`		✓
`UngroundedAttributesEvaluator`			✓
`CodeVulnerabilityEvaluator`			✓
`IndirectAttackEvaluator`	✓
Azure OpenAI Graders
`AzureOpenAILabelGrader`	✓
`AzureOpenAIStringCheckGrader`	✓
`AzureOpenAITextSimilarityGrader`	✓			✓
`AzureOpenAIGrader`	✓

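Note

AI-assisted quality evaluators, except for SimilarityEvaluator, include a reason field. They use techniques such as chain-of-thought reasoning to generate an explanation for the score.

As a result of this improved evaluation quality, they consume more tokens during generation. Specifically, max_token for evaluator generation is set to 800 for most AI-assisted evaluators. To accommodate longer inputs, the value is 1600 for RetrievalEvaluator and 3000 for ToolCallAccuracyEvaluator.

Azure OpenAI graders require a template that describes how their input columns are turned into the real input that the grader uses. For example, if you have two inputs called query and response, and a template formatted as {{item.query}}, only the query is used. Similarly, you could use something like {{item.conversation}} to accept a conversation input, but whether the system can handle it depends on how you configure the rest of the grader to expect that input.

For more information on data requirements for agentic evaluators, see Evaluate your AI agents.

Single-turn support for text

All built-in evaluators take single-turn inputs as query and response pairs in strings. For example: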

from azure.ai.evaluation import RelevanceEvaluator

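query = "What is the capital of France?"
response = "Paris."

# Initialize an evaluator. model_config is an AzureOpenAIModelConfiguration,
# built as in the conversation-mode example later in this article.
relevance_eval = RelevanceEvaluator(model_config)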
relevance_eval(query=query, response=response)

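To run batch evaluations by using local evaluation, or to upload your dataset to run a cloud evaluation, represent the dataset in JSONL format. The previous single-turn data, a query and response pair, corresponds to one line of a dataset. The following example shows three such lines: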

{"query":"What is the capital of France?","response":"Paris."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"Blue."}
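As a minimal sketch, the three rows above could be written to a JSONL file and read back with the Python standard library alone. The helper names `write_jsonl` and `read_jsonl` and the temporary file location are illustrative assumptions, not part of the SDK.

```python
import json
import os
import tempfile


def write_jsonl(path, rows):
    """Write one JSON object per line, which is what JSONL means."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


rows = [
    {"query": "What is the capital of France?", "response": "Paris."},
    {"query": "What atoms compose water?", "response": "Hydrogen and oxygen."},
    {"query": "What color is my shirt?", "response": "Blue."},
]

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.gettempdir(), "data.jsonl")
write_jsonl(path, rows)
loaded = read_jsonl(path)
```

Writing the file through `json.dumps` rather than by hand keeps quoting and escaping correct when a query or response contains quotation marks or line breaks.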

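The evaluation test dataset can contain the following elements, depending on the requirements of each built-in evaluator:

Query: The query sent to the generative AI application.
Response: The response to the query that the generative AI application generates.
Context: The source that the generated response is based on, that is, the grounding documents.
Ground truth: The response that a user or human generates as the true answer.

To see what each evaluator requires, see Evaluators.

Conversation support for text

For evaluators that support conversations for text, you can provide conversation as input. This input is a Python dictionary that contains a list of messages, each of which includes content, role, and optionally context.

See the following two-turn conversation in Python: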

conversation = {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": None
        }
        ]
}

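To run batch evaluations by using local evaluation, or to upload your dataset to run a cloud evaluation, you need to represent the dataset in JSONL format. The previous conversation is equivalent to one line of a dataset in a JSONL file, as in the following example: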

{"conversation":
    {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": null
        }
        ]
    }
}

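The evaluators understand that the first turn of the conversation provides a valid query from user, context from assistant, and response from assistant in the query-response format. Conversations are then evaluated per turn, and the results are aggregated over all turns to produce a conversation score.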
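To make the per-turn interpretation concrete, the following sketch pairs each user message with the assistant message that follows it. It is an illustration of the idea only, not the SDK's implementation, and the function name `conversation_to_turns` is hypothetical.

```python
def conversation_to_turns(conversation):
    """Pair each user message with the next assistant message.

    Returns a list of dicts with query, response, and context keys.
    A missing context key is reported as None.
    """
    turns = []
    pending_query = None
    for message in conversation["messages"]:
        if message["role"] == "user":
            pending_query = message["content"]
        elif message["role"] == "assistant" and pending_query is not None:
            turns.append(
                {
                    "query": pending_query,
                    "response": message["content"],
                    "context": message.get("context"),
                }
            )
            pending_query = None
    return turns


example_conversation = {
    "messages": [
        {"content": "Which tent is the most waterproof?", "role": "user"},
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant",
            "context": "The alpine explorer tent is the most waterproof.",
        },
        {"content": "How much does it cost?", "role": "user"},
        {"content": "The Alpine Explorer Tent is $120.", "role": "assistant", "context": None},
    ]
}

turns = conversation_to_turns(example_conversation)
```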

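Note

In the second turn, even if context is null or the key is missing, the evaluator interprets the context as an empty string instead of failing with an error, which might lead to misleading results.

We strongly recommend that you validate your evaluation data to make sure it complies with the data requirements.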
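One way to do that validation before you call an evaluator is sketched below. The function name `find_turns_missing_context` is hypothetical, and the rule it applies (every assistant message needs a non-empty string context) is an assumption that fits context-dependent evaluators such as GroundednessEvaluator; adjust it for evaluators that don't need context.

```python
def find_turns_missing_context(conversation):
    """Return the indexes of assistant messages whose context is missing, None, or blank."""
    missing = []
    for index, message in enumerate(conversation["messages"]):
        if message.get("role") != "assistant":
            continue
        context = message.get("context")
        if not isinstance(context, str) or not context.strip():
            missing.append(index)
    return missing


conversation_to_check = {
    "messages": [
        {"content": "Which tent is the most waterproof?", "role": "user"},
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant",
            "context": "The alpine explorer tent is the most waterproof.",
        },
        {"content": "How much does it cost?", "role": "user"},
        {"content": "The Alpine Explorer Tent is $120.", "role": "assistant", "context": None},
    ]
}

missing_indexes = find_turns_missing_context(conversation_to_check)
if missing_indexes:
    print(f"Assistant messages without usable context: {missing_indexes}")
```

Failing fast on these rows avoids scores that look valid but were computed against an empty context.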

For conversation mode, here's an example for GroundednessEvaluator:

# Conversation mode:
import json
import os
from azure.ai.evaluation import GroundednessEvaluator, AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

# Initialize the Groundedness evaluator:
groundedness_eval = GroundednessEvaluator(model_config)

conversation = {
    "messages": [
        { "content": "Which tent is the most waterproof?", "role": "user" },
        { "content": "The Alpine Explorer Tent is the most waterproof", "role": "assistant", "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight." },
        { "content": "How much does it cost?", "role": "user" },
        { "content": "$120.", "role": "assistant", "context": "The Alpine Explorer Tent is $120."}
    ]
}

# Alternatively, you can load the same content from a JSONL file.
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(json.dumps(groundedness_conv_score, indent=4))

For conversation outputs, per-turn results are stored in a list, and the overall conversation score 'groundedness': 5.0 is averaged over the turns:

{
    "groundedness": 5.0,
    "gpt_groundedness": 5.0,
    "groundedness_threshold": 3.0,
    "evaluation_per_turn": {
        "groundedness": [
            5.0,
            5.0
        ],
        "gpt_groundedness": [
            5.0,
            5.0
        ],
        "groundedness_reason": [
            "The response accurately and completely answers the query by stating that the Alpine Explorer Tent is the most waterproof, which is directly supported by the context. There are no irrelevant details or incorrect information present.",
            "The RESPONSE directly answers the QUERY with the exact information provided in the CONTEXT, making it fully correct and complete."
        ],
        "groundedness_result": [
            "pass",
            "pass"
        ],
        "groundedness_threshold": [
            3,
            3
        ]
    }
}
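The relationship between the per-turn list and the overall score can be checked with a few lines of Python. This sketch assumes that "averaged" means the arithmetic mean, which is consistent with the numbers in the output above; the dictionary is a trimmed copy of that output.

```python
from statistics import mean

# Trimmed copy of the evaluator output shown above.
groundedness_output = {
    "groundedness": 5.0,
    "evaluation_per_turn": {
        "groundedness": [5.0, 5.0],
    },
}

per_turn_scores = groundedness_output["evaluation_per_turn"]["groundedness"]

# Assumption: the conversation score is the arithmetic mean of the per-turn scores.
conversation_score = mean(per_turn_scores)
```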

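Note

To let your code support more evaluator models, use the key without prefixes. For example, use groundedness.groundedness.

Conversation support for images and multimodal text and image

For evaluators that support conversations for image and multimodal image and text, you can pass image URLs or Base64-encoded images in conversation.

Supported scenarios include:

Multiple images with text input to image or text generation.
Text-only input to image generation.
Image-only input to text generation.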

from pathlib import Path
from azure.ai.evaluation import ContentSafetyEvaluator
import base64

# Create an instance of an evaluator with image and multi-modal support.
safety_evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)

# Example of a conversation with an image URL:
conversation_image_url = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}

# Example of a conversation with base64 encoded images:
base64_image = ""

with Path("Image1.jpg").open("rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

conversation_base64 = {
    "messages": [
        {"content": "create an image of a branded apple", "role": "user"},
        {
            "content": [{"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{base64_image}"}}],
            "role": "assistant",
        },
    ]
}

# Run the evaluation on the conversation to output the result.
safety_score = safety_evaluator(conversation=conversation_image_url)

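Currently, the image and multimodal evaluators support:

Single turn only: a conversation can have only one user message and one assistant message.
Conversations that have only one system message.
Conversation payloads that are smaller than 10 MB, including images.
Absolute URLs and Base64-encoded images.
Multiple images in a single turn.
JPG/JPEG, PNG, and GIF file formats.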
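Some of these constraints can be checked locally before a conversation is sent for evaluation. The following is a rough sketch under two assumptions that the list above doesn't settle: "10 MB" is read as 10 * 1024 * 1024 bytes, and the payload size is measured on the JSON-serialized conversation. The function name `check_multimodal_conversation` is hypothetical.

```python
import json

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024


def check_multimodal_conversation(conversation):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    roles = [message.get("role") for message in conversation["messages"]]

    if roles.count("user") != 1 or roles.count("assistant") != 1:
        problems.append("expected exactly one user message and one assistant message")
    if roles.count("system") > 1:
        problems.append("expected at most one system message")

    # Assumption: size is measured on the UTF-8 encoded JSON serialization.
    payload_size = len(json.dumps(conversation).encode("utf-8"))
    if payload_size >= MAX_PAYLOAD_BYTES:
        problems.append("conversation payload must be smaller than 10 MB")

    return problems


single_turn = {
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "You understand images."}]},
        {"role": "user", "content": [{"type": "text", "text": "Can you describe this image?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "A man smiling."}]},
    ]
}

problems = check_multimodal_conversation(single_turn)
```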

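Setup

For AI-assisted quality evaluators, except for GroundednessProEvaluator (preview), you must specify a GPT model (gpt-35-turbo, gpt-4, gpt-4-turbo, gpt-4o, or gpt-4o-mini) in your model_config. The GPT model acts as a judge that scores the evaluation data. Both Azure OpenAI and OpenAI model configuration schemas are supported. For the best performance and parseable responses from the evaluators, we recommend using GPT models that aren't in preview.

Note

We recommend that you replace gpt-3.5-turbo with gpt-4o-mini for your evaluator model. According to OpenAI, gpt-4o-mini is cheaper, more capable, and fast.

To make inference calls with the API key, make sure that you have at least the Cognitive Services OpenAI User role for the Azure OpenAI resource. For more information about permissions, see Permissions for an Azure OpenAI resource.

For all risk and safety evaluators and GroundednessProEvaluator (preview), you must provide your azure_ai_project information instead of a GPT deployment in model_config. The evaluators then access the back-end evaluation service through your Foundry project.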
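Because the judge model settings typically come from environment variables, it can help to fail early when one is missing. The following sketch collects the four settings used by the conversation-mode example earlier in this article (the environment variable names are taken from that example). The helper `load_model_settings` is hypothetical and only builds a plain dictionary; it doesn't call the SDK.

```python
import os

# Keyword argument name -> environment variable name, as used in the
# conversation-mode example earlier in this article.
REQUIRED_ENV_VARS = {
    "azure_endpoint": "AZURE_ENDPOINT",
    "api_key": "AZURE_API_KEY",
    "azure_deployment": "AZURE_DEPLOYMENT_NAME",
    "api_version": "AZURE_API_VERSION",
}


def load_model_settings(environ=os.environ):
    """Collect the judge model settings, raising KeyError if any are unset or empty."""
    missing = [name for name in REQUIRED_ENV_VARS.values() if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: environ[name] for key, name in REQUIRED_ENV_VARS.items()}


# Example with placeholder values instead of real credentials.
example_environ = {
    "AZURE_ENDPOINT": "https://example.openai.azure.com",
    "AZURE_API_KEY": "placeholder-key",
    "AZURE_DEPLOYMENT_NAME": "gpt-4o-mini",
    "AZURE_API_VERSION": "placeholder-version",
}
settings = load_model_settings(example_environ)
```

The resulting dictionary has the same keys as the keyword arguments in the earlier example, so it could be unpacked as `AzureOpenAIModelConfiguration(**settings)`.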

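Prompts for AI-assisted built-in evaluators

For transparency, the prompts of the quality evaluators are open source in the Evaluator Library and the Azure AI Evaluation Python SDK repository. The exceptions are the safety evaluators and GroundednessProEvaluator, which are powered by Azure AI Content Safety. These prompts serve as instructions for a language model to perform its evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We recommend that you customize the definitions and grading rubrics to fit the specifics of your scenario. For more information, see Custom evaluators.

Composite evaluators

Composite evaluators are built-in evaluators that combine individual quality or safety metrics. They provide a wide range of metrics out of the box for both query and response pairs and chat messages.

Composite evaluator	Contains	Description
`QAEvaluator`	`GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`	Combines all the quality evaluators into a single output of combined metrics for query and response pairs.
`ContentSafetyEvaluator`	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`	Combines all the safety evaluators into a single output of combined metrics for query and response pairs.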

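Local evaluation on test datasets using `evaluate()`

After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the evaluate() API on an entire test dataset.

Prerequisite setup steps for Microsoft Foundry projects

If this session is your first time running evaluations and logging them to your Foundry project, you might need to complete the following setup steps:

Create a storage account and connect it to your Foundry project at the resource level. This bicep template provisions a storage account with key authentication and connects it to your Foundry project.
Make sure that the connected storage account has access to all projects.
If you connected your storage account with Microsoft Entra ID, make sure to grant Microsoft identity permissions for Storage Blob Data Owner to both your account and the Foundry project resource in the Azure portal.

Evaluate on a dataset and log results to Foundry

To make sure that the evaluate() API can correctly parse the data, you must specify column mapping to map the columns of the dataset to the keywords that the evaluators accept. This example specifies the data mapping for query, response, and context.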

from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # Provide your data here:
    evaluators={
        "groundedness": groundedness_eval,
        "answer_length": answer_length
    },
    # Column mapping:
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            } 
        }
    },
    # Optionally, provide your Foundry project information to track your evaluation results in your project portal.
    azure_ai_project = azure_ai_project,
    # Optionally, provide an output path to dump a JSON file of metric summary, row-level data, and the metric and Foundry project URL.
    output_path="./myevalresults.json"
)
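The `answer_length` evaluator passed to evaluate() isn't defined in this article. A minimal sketch of what it could look like follows, assuming it is a custom code-based evaluator: a callable that takes keyword arguments and returns a dictionary. The class name `AnswerLengthEvaluator` is hypothetical, and the `value` key is inferred from the `answer_length.value` metric name in the example output.

```python
class AnswerLengthEvaluator:
    """Hypothetical custom evaluator that reports the character length of a response."""

    def __call__(self, *, response: str, **kwargs):
        # The "value" key is an assumption inferred from the
        # answer_length.value metric in the example output.
        return {"value": len(response)}


answer_length = AnswerLengthEvaluator()
sample_result = answer_length(response="Paris is the capital of France.")
```

A code-based evaluator like this one needs no model configuration, which makes it a convenient sanity check alongside AI-assisted evaluators.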

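Tip

Get the contents of the result.studio_url property for a link to view your logged evaluation results in your Foundry project.

The evaluate() API returns a dictionary that contains aggregate metrics and row-level data and metrics. See the following example output: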

{'metrics': {'answer_length.value': 49.333333333333336,
             'groundedness.gpt_groundedness': 5.0, 'groundedness.groundedness': 5.0},
 'rows': [{'inputs.response': 'Paris is the capital of France.',
           'inputs.context': 'Paris has been the capital of France since '
                                  'the 10th century and is known for its '
                                  'cultural and historical landmarks.',
           'inputs.query': 'What is the capital of France?',
           'outputs.answer_length.value': 31,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.context': 'Albert Einstein developed the theory of '
                                  'relativity, with his special relativity '
                                  'published in 1905 and general relativity in '
                                  '1915.',
           'inputs.query': 'Who developed the theory of relativity?',
           'outputs.answer_length.value': 51,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'The speed of light is approximately 299,792,458 '
                            'meters per second.',
           'inputs.context': 'The exact speed of light in a vacuum is '
                                  '299,792,458 meters per second, a constant '
                                  "used in physics to represent 'c'.",
           'inputs.query': 'What is the speed of light?',
           'outputs.answer_length.value': 66,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'}],
 'traces': {}}
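The aggregate entries under `metrics` can be related back to the row-level values. The sketch below uses a trimmed copy of the example output and assumes that the aggregate is a simple mean of the row values, which matches the numbers shown for `answer_length.value`. The helper name `column_mean` is hypothetical.

```python
from statistics import mean

# Trimmed copy of the example output: only the answer_length values are kept.
example_result = {
    "metrics": {"answer_length.value": 49.333333333333336},
    "rows": [
        {"inputs.query": "What is the capital of France?", "outputs.answer_length.value": 31},
        {"inputs.query": "Who developed the theory of relativity?", "outputs.answer_length.value": 51},
        {"inputs.query": "What is the speed of light?", "outputs.answer_length.value": 66},
    ],
}


def column_mean(result, column):
    """Average one row-level output column, skipping rows where it is absent."""
    values = [row[column] for row in result["rows"] if row.get(column) is not None]
    return mean(values)


row_mean = column_mean(example_result, "outputs.answer_length.value")
```

The same pattern can be used to filter rows, for example to list the queries whose score falls below a threshold you choose.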

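Requirements for `evaluate()`

The evaluate() API requires a specific data format and specific evaluator parameter key names to correctly display the evaluation result charts in your Foundry project.

Data format

The evaluate() API accepts only data in JSONL format. For all built-in evaluators, evaluate() requires data in the following format with the required input fields. See the previous section on the data requirements for built-in evaluators. The following code snippet is a sample of what one line can look like: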

{
  "query":"What is the capital of France?",
  "context":"France is in Europe",
  "response":"Paris is the capital of France.",
  "ground_truth": "Paris"
}
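The sample above is pretty-printed for readability; in an actual JSONL file, each record occupies a single line. A quick pre-check of a dataset could look like the following sketch, which reports lines that aren't valid JSON objects or that lack the fields your chosen evaluators need. The function name `find_invalid_lines` is hypothetical, and the list of required fields is something you supply yourself.

```python
import json


def find_invalid_lines(jsonl_text, required_fields):
    """Return (line_number, problem) tuples for lines that fail basic checks."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # Ignore blank lines.
        try:
            row = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [field for field in required_fields if field not in row]
        if missing:
            problems.append((number, f"missing fields: {', '.join(missing)}"))
    return problems


sample_text = "\n".join(
    [
        '{"query": "What is the capital of France?", "context": "France is in Europe", '
        '"response": "Paris is the capital of France.", "ground_truth": "Paris"}',
        '{"query": "What atoms compose water?", "response": "Hydrogen and oxygen."}',
    ]
)

line_problems = find_invalid_lines(sample_text, ["query", "context", "response"])
```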

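Evaluator parameter format

When you pass in your built-in evaluators, specify the right keyword mapping in the evaluators parameter list. The following table shows the keyword mapping that's required for the results of your built-in evaluators to show up in the UI when they're logged to your Foundry project.

Evaluator	Keyword parameter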
`GroundednessEvaluator`	`"groundedness"`
`GroundednessProEvaluator`	`"groundedness_pro"`
`RetrievalEvaluator`	`"retrieval"`
`RelevanceEvaluator`	`"relevance"`
`CoherenceEvaluator`	`"coherence"`
`FluencyEvaluator`	`"fluency"`
`SimilarityEvaluator`	`"similarity"`
`F1ScoreEvaluator`	`"f1_score"`
`RougeScoreEvaluator`	`"rouge"`
`GleuScoreEvaluator`	`"gleu"`
`BleuScoreEvaluator`	`"bleu"`
`MeteorScoreEvaluator`	`"meteor"`
`ViolenceEvaluator`	`"violence"`
`SexualEvaluator`	`"sexual"`
`SelfHarmEvaluator`	`"self_harm"`
`HateUnfairnessEvaluator`	`"hate_unfairness"`
`IndirectAttackEvaluator`	`"indirect_attack"`
`ProtectedMaterialEvaluator`	`"protected_material"`
`CodeVulnerabilityEvaluator`	`"code_vulnerability"`
`UngroundedAttributesEvaluator`	`"ungrounded_attributes"`
`QAEvaluator`	`"qa"`
`ContentSafetyEvaluator`	`"content_safety"`

Here's an example of how to set the evaluators parameter:

result = evaluate(
    data="data.jsonl",
    evaluators={
        "sexual":sexual_evaluator,
        "self_harm":self_harm_evaluator,
        "hate_unfairness":hate_unfairness_evaluator,
        "violence":violence_evaluator
    }
)
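A mismatch between a key and the table above is easy to miss, so a small local check can help. The following sketch compares the keys of an evaluators dictionary against the expected keyword for each evaluator's class name. Only a subset of the table is copied here, the helper `find_mismatched_keys` is hypothetical, and the stand-in class exists only so that the sketch runs without the SDK.

```python
# Subset of the keyword table above: evaluator class name -> expected key.
EXPECTED_KEYS = {
    "GroundednessEvaluator": "groundedness",
    "RelevanceEvaluator": "relevance",
    "F1ScoreEvaluator": "f1_score",
    "ViolenceEvaluator": "violence",
    "SexualEvaluator": "sexual",
    "SelfHarmEvaluator": "self_harm",
    "HateUnfairnessEvaluator": "hate_unfairness",
}


def find_mismatched_keys(evaluators):
    """Return {used_key: expected_key} for known evaluators registered under another key."""
    mismatches = {}
    for key, evaluator in evaluators.items():
        expected = EXPECTED_KEYS.get(type(evaluator).__name__)
        if expected is not None and key != expected:
            mismatches[key] = expected
    return mismatches


class SelfHarmEvaluator:
    """Stand-in with the same class name as the SDK evaluator, for illustration only."""


mismatches = find_mismatched_keys({"selfharm": SelfHarmEvaluator()})
```

Custom evaluators whose class names aren't in the table are ignored by this check, so they can keep any key you choose.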

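Local evaluation on a target

If you have a list of queries that you want to run and then evaluate, the evaluate() API also supports a target parameter. This parameter sends the queries to an application to collect responses, and then runs your evaluators on the resulting queries and responses.

A target can be any callable class in your directory. In this example, a Python script askwiki.py contains a callable class askwiki() that is set as the target. If you have a dataset of queries that you can send to the simple askwiki app, you can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in "column_mapping". You can use "default" to specify column mapping for all evaluators.

Here's the content of "data.jsonl":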

{"query":"When was United States found ?", "response":"1776"}
{"query":"What is the capital of France?", "response":"Paris"}
{"query":"Who is the best tennis player of all time ?", "response":"Roger Federer"}

from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "groundedness": groundedness_eval
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            } 
        }
    }
)
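The askwiki app itself isn't shown in this article. As a rough illustration of the shape a target can take, the following sketch defines a hypothetical stand-in that answers from a small in-memory dictionary. It assumes that the target receives dataset columns as keyword arguments and returns a dictionary whose keys are then available to column mapping as `${outputs.response}` and `${outputs.context}`.

```python
class AskWikiStub:
    """Hypothetical stand-in for the askwiki app, backed by an in-memory lookup."""

    def __init__(self, knowledge):
        # Maps a query string to the passage used as both context and answer.
        self._knowledge = knowledge

    def __call__(self, *, query: str, **kwargs):
        context = self._knowledge.get(query, "")
        response = context if context else "I don't know."
        # Assumption: these keys become ${outputs.response} and ${outputs.context}.
        return {"response": response, "context": context}


askwiki_stub = AskWikiStub(
    {"What is the capital of France?": "Paris is the capital of France."}
)

known = askwiki_stub(query="What is the capital of France?")
unknown = askwiki_stub(query="Who is the best tennis player of all time ?")
```

Returning the retrieved context alongside the response is what lets a context-dependent evaluator such as GroundednessEvaluator score the target's output.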

Last updated on 2025-11-28