Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Similar to the Azure AI evaluation GitHub Action, an Azure AI evaluation extension is also available in the Azure DevOps Marketplace. The extension enables offline evaluation of AI agents within your CI/CD pipelines.
Features
Automated Evaluation: Integrate offline evaluation into your CI/CD workflows to automate the pre-production assessment of AI models.
Built-in Evaluators: Leverage existing evaluators provided by the Azure AI Evaluation SDK (a local usage sketch follows the table of supported evaluators).
The following evaluators are supported:
| Category | Evaluator class/Metrics | AI Agent evaluations | GenAI evaluations |
|---|---|---|---|
| General purpose (AI-assisted) | QAEvaluator | Not Supported | Supported |
| General purpose (AI-assisted) | CoherenceEvaluator | Supported | Supported |
| General purpose (AI-assisted) | FluencyEvaluator | Supported | Supported |
| Textual similarity | SimilarityEvaluator | Not Supported | Supported |
| Textual similarity | F1ScoreEvaluator | Not Supported | Supported |
| Textual similarity | RougeScoreEvaluator | Not Supported | Not Supported |
| Textual similarity | GleuScoreEvaluator | Not Supported | Supported |
| Textual similarity | BleuScoreEvaluator | Not Supported | Supported |
| Textual similarity | MeteorScoreEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | GroundednessEvaluator | Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | GroundednessProEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | RetrievalEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | RelevanceEvaluator | Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | ResponseCompletenessEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | DocumentRetrievalEvaluator | Not Supported | Not Supported |
| Risk and safety (AI-assisted) | ViolenceEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | SexualEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | SelfHarmEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | HateUnfairnessEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | IndirectAttackEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | ProtectedMaterialEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | CodeVulnerabilityEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | UngroundedAttributesEvaluator | Not Supported | Supported |
| Risk and safety (AI-assisted) | ContentSafetyEvaluator | Supported | Supported |
| Agent (AI-assisted) | IntentResolutionEvaluator | Supported | Supported |
| Agent (AI-assisted) | TaskAdherenceEvaluator | Supported | Supported |
| Agent (AI-assisted) | ToolCallAccuracyEvaluator | Supported | Supported |
| Composite | AgentOverallEvaluator | Not Supported | Not Supported |
| Operational metrics | Client run duration | Supported | Not Supported |
| Operational metrics | Server run duration | Supported | Not Supported |
| Operational metrics | Completion tokens | Supported | Not Supported |
| Operational metrics | Prompt tokens | Supported | Not Supported |
| Custom evaluators | | Not Supported | Not Supported |
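The extension runs these evaluators for you inside the pipeline, but you can also try a built-in evaluator locally with the Azure AI Evaluation SDK. The following is a minimal sketch, assuming the azure-ai-evaluation Python package and an Azure OpenAI deployment; the endpoint, key, deployment name, and sample text are placeholders:

# Minimal local sketch: score one query/response pair with a built-in
# AI-assisted evaluator from the Azure AI Evaluation SDK.
# The endpoint, API key, deployment, and sample text below are placeholders.
from azure.ai.evaluation import RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-azure-openai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

relevance = RelevanceEvaluator(model_config)

# The extension applies the evaluators listed in your dataset file to each row
# in a similar way during the pipeline run.
result = relevance(
    query="Where is Italy?",
    response="Italy is a country in Southern Europe, on the Mediterranean Sea.",
)
print(result)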
Prerequisites
- A Foundry project or hub-based project. To learn more, see Create a project.
- Install the Azure AI evaluation extension:
  - Go to the Azure DevOps Marketplace.
  - Search for Azure AI evaluation and install the extension into your Azure DevOps organization.
Set up YAML configuration file
- Create a new YAML file in your repository. You can use the sample YAML provided in the README or copy it from the GitHub repo.
- Configure the following inputs:
  - Set up the Azure CLI with a service connection and Azure Login.
  - Foundry project connection string.
  - Dataset and evaluators: specify the evaluator names you want to use for this evaluation run.
  - Queries (required).
  - Agent IDs: retrieve the agent identifiers from Foundry (see the sketch after the sample dataset).
See the following sample dataset:
{
  "name": "MyTestData",
  "evaluators": [
    "FluencyEvaluator",
    "ViolenceEvaluator"
  ],
  "data": [
    {
      "query": "Tell me about Tokyo?"
    },
    {
      "query": "Where is Italy?"
    }
  ]
}
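To retrieve the agent IDs for the agent-ids input, you can look them up in your Foundry project or list them programmatically. The following is a minimal sketch, not part of the extension, assuming the azure-ai-projects and azure-identity Python packages; the project endpoint value is a placeholder:

# Minimal sketch: list the agents in a Foundry project and print their identifiers.
# Assumes the azure-ai-projects and azure-identity packages are installed;
# the endpoint below is a placeholder.
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project_client = AIProjectClient(
    endpoint="<your-ai-project-endpoint>",
    credential=DefaultAzureCredential(),
)

# Each agent's ID can be passed to the extension's agent-ids input.
for agent in project_client.agents.list_agents():
    print(agent.id, agent.name)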
A sample YAML file:
trigger:
- main

pool:
  vmImage: 'windows-latest'

steps:
- task: AzureCLI@2
  inputs:
    addSpnToEnvironment: true
    azureSubscription: ${{ variables.Service_Connection_Name }}
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      echo "##vso[task.setvariable variable=ARM_CLIENT_ID]$servicePrincipalId"
      echo "##vso[task.setvariable variable=ARM_ID_TOKEN]$idToken"
      echo "##vso[task.setvariable variable=ARM_TENANT_ID]$tenantId"
- bash: |
    az login --service-principal -u $(ARM_CLIENT_ID) --tenant $(ARM_TENANT_ID) --allow-no-subscriptions --federated-token $(ARM_ID_TOKEN)
  displayName: 'Login Azure'
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.11'
- task: AIAgentEvaluation@0
  inputs:
    azure-ai-project-endpoint: "<your-ai-project-endpoint>"
    deployment-name: "gpt-4o-mini"
    data-path: $(Build.SourcesDirectory)\tests\data\golden-dataset-medium.json
    agent-ids: "<your-ai-agent-ids>"
Set up a new pipeline and trigger an evaluation run
Commit the YAML file to your repository, then create and run the pipeline in Azure DevOps.
View results
- Select the run and go to the Azure AI Evaluation tab.
- The results are shown in the following format:
- The top section gives an overview of the evaluated AI agent variants. Select an agent ID link to go to the agent settings page in the Microsoft Foundry portal, or select the Evaluation Results link to view individual results in detail in the Foundry portal.
- The second section shows evaluation scores, with comparisons between variants based on statistical significance (for multiple agents) or confidence intervals (for a single agent).
Evaluation results and comparisons from multiple AI agents:
Single agent evaluation result: