Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Similar to the Azure AI evaluation GitHub Action, an Azure AI evaluation extension is also available in the Azure DevOps Marketplace. The extension enables offline evaluation of AI agents within your CI/CD pipelines.
Features
Automated Evaluation: Integrate offline evaluation into your CI/CD workflows to automate the pre-production assessment of AI models.
Built-in Evaluators: Leverage existing evaluators provided by the Azure AI Evaluation SDK (a local usage sketch follows the table of supported evaluators).
The following evaluators are supported:
| Category | Evaluator class/Metrics | AI Agent evaluations | GenAI evaluations |
|---|---|---|---|
| General purpose (AI-assisted) | QAEvaluator | Not Supported | Supported |
| General purpose (AI-assisted) | CoherenceEvaluator | Supported | Supported |
| General purpose (AI-assisted) | FluencyEvaluator | Supported | Supported |
| Textual similarity | SimilarityEvaluator | Not Supported | Supported |
| Textual similarity | F1ScoreEvaluator | Not Supported | Supported |
| Textual similarity | RougeScoreEvaluator | Not Supported | Not Supported |
| Textual similarity | GleuScoreEvaluator | Not Supported | Supported |
| Textual similarity | BleuScoreEvaluator | Not Supported | Supported |
| Textual similarity | MeteorScoreEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | GroundednessEvaluator | Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | GroundednessProEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | RetrievalEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | RelevanceEvaluator | Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | ResponseCompletenessEvaluator | Not Supported | Supported |
| Retrieval-augmented generation (RAG) (AI-assisted) | DocumentRetrievalEvaluator | Not Supported | Not Supported |
| Risk and safety (AI-assisted) | ViolenceEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | SexualEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | SelfHarmEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | HateUnfairnessEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | IndirectAttackEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | ProtectedMaterialEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | CodeVulnerabilityEvaluator | Supported | Supported |
| Risk and safety (AI-assisted) | UngroundedAttributesEvaluator | Not Supported | Supported |
| Risk and safety (AI-assisted) | ContentSafetyEvaluator | Supported | Supported |
| Agent (AI-assisted) | IntentResolutionEvaluator | Supported | Supported |
| Agent (AI-assisted) | TaskAdherenceEvaluator | Supported | Supported |
| Agent (AI-assisted) | ToolCallAccuracyEvaluator | Supported | Supported |
| Composite | AgentOverallEvaluator | Not Supported | Not Supported |
| Operational metrics | Client run duration | Supported | Not Supported |
| Operational metrics | Server run duration | Supported | Not Supported |
| Operational metrics | Completion tokens | Supported | Not Supported |
| Operational metrics | Prompt tokens | Supported | Not Supported |
| Custom evaluators | | Not Supported | Not Supported |
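The extension runs these evaluators for you inside the pipeline, but you can also try a built-in evaluator locally with the Azure AI Evaluation SDK. The following is a minimal sketch, assuming the azure-ai-evaluation Python package and an Azure OpenAI deployment; the endpoint, key, deployment name, and sample text are placeholders:

# Minimal local sketch: score one query/response pair with a built-in
# AI-assisted evaluator from the Azure AI Evaluation SDK.
# The endpoint, API key, deployment, and sample text below are placeholders.
from azure.ai.evaluation import RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-azure-openai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o-mini",
}

relevance = RelevanceEvaluator(model_config)

# The extension applies the evaluators listed in your dataset file to each row
# in a similar way during the pipeline run.
result = relevance(
    query="Where is Italy?",
    response="Italy is a country in Southern Europe, on the Mediterranean Sea.",
)
print(result)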
Prerequisites
- A Foundry project or hub-based project. To learn more, see Create a project.
- Install the Azure AI evaluation extension:
  - Go to the Azure DevOps Marketplace.
  - Search for Azure AI evaluation and install the extension into your Azure DevOps organization.
Set up YAML configuration file
- Create a new YAML file in your repository. You can use the sample YAML provided in the README or copy it from the GitHub repo.
- Configure the following inputs:
  - Set up the Azure CLI with a service connection and Azure Login.
  - Foundry project connection string.
  - Dataset and evaluators: specify the evaluator names you want to use for this evaluation run.
  - Queries (required).
  - Agent IDs: retrieve the agent identifiers from Foundry (see the sketch after the sample dataset).
See the following sample dataset:
{
  "name": "MyTestData",
  "evaluators": [
    "FluencyEvaluator",
    "ViolenceEvaluator"
  ],
  "data": [
    {
      "query": "Tell me about Tokyo?"
    },
    {
      "query": "Where is Italy?"
    }
  ]
}
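To retrieve the agent IDs for the agent-ids input, you can look them up in your Foundry project or list them programmatically. The following is a minimal sketch, not part of the extension, assuming the azure-ai-projects and azure-identity Python packages; the project endpoint value is a placeholder:

# Minimal sketch: list the agents in a Foundry project and print their identifiers.
# Assumes the azure-ai-projects and azure-identity packages are installed;
# the endpoint below is a placeholder.
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project_client = AIProjectClient(
    endpoint="<your-ai-project-endpoint>",
    credential=DefaultAzureCredential(),
)

# Each agent's ID can be passed to the extension's agent-ids input.
for agent in project_client.agents.list_agents():
    print(agent.id, agent.name)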
A sample YAML file:
trigger:
- main

pool:
  vmImage: 'windows-latest'

steps:
- task: AzureCLI@2
  inputs:
    addSpnToEnvironment: true
    azureSubscription: ${{ variables.Service_Connection_Name }}
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      echo "##vso[task.setvariable variable=ARM_CLIENT_ID]$servicePrincipalId"
      echo "##vso[task.setvariable variable=ARM_ID_TOKEN]$idToken"
      echo "##vso[task.setvariable variable=ARM_TENANT_ID]$tenantId"
- bash: |
    az login --service-principal -u $(ARM_CLIENT_ID) --tenant $(ARM_TENANT_ID) --allow-no-subscriptions --federated-token $(ARM_ID_TOKEN)
  displayName: 'Login Azure'
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.11'
- task: AIAgentEvaluation@0
  inputs:
    azure-ai-project-endpoint: "<your-ai-project-endpoint>"
    deployment-name: "gpt-4o-mini"
    data-path: $(Build.SourcesDirectory)\tests\data\golden-dataset-medium.json
    agent-ids: "<your-ai-agent-ids>"
Set up a new pipeline and trigger an evaluation run
Commit the YAML file to your repository, then create and run the pipeline in Azure DevOps.
View results
- Select the run and go to the Azure AI Evaluation tab.
- The results are shown in the following format:
- The top section gives an overview of the evaluated AI agent variants. Select an agent ID link to go to the agent settings page in the Microsoft Foundry portal, or select the Evaluation Results link to view individual results in detail in the Foundry portal.
- The second section shows evaluation scores, with comparisons between variants based on statistical significance (for multiple agents) or confidence intervals (for a single agent).
Evaluation results and comparisons from multiple AI agents:
Single agent evaluation result: