Evaluate generative AI models and applications by using Microsoft Foundry

Note

This article applies to both the Microsoft Foundry (classic) portal and the Microsoft Foundry (new) portal. Menu names and available options can differ slightly between the two; where they do, both paths are called out.

To thoroughly assess the performance of your generative AI models and applications on a substantial dataset, initiate an evaluation process. During this evaluation, the model or application is tested with the given dataset, and its performance is measured using mathematical metrics and AI-assisted metrics. This evaluation run provides comprehensive insights into the application's capabilities and limitations.

Use the evaluation functionality in the Microsoft Foundry portal, a platform that offers tools and features for assessing the performance and safety of generative AI models. In the Foundry portal, log, view, and analyze detailed evaluation metrics.

This article explains how to create an evaluation run against a model, agent, or test dataset using built-in evaluation metrics from the Foundry UI. For greater flexibility, you can create a custom evaluation flow and use the custom evaluation feature, which you can also use to conduct a batch run without any evaluation.

Prerequisites

  • A model, an agent, or a test dataset to evaluate. Test datasets must be in CSV or JSON Lines (JSONL) format. (A minimal JSONL sketch follows this list.)
  • An Azure OpenAI connection with a deployment of a GPT-3.5 model, a GPT-4 model, or a Davinci model. This is required only for AI-assisted quality evaluations.
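
The exact column names in your dataset are up to you, because you map them to evaluator inputs later in the wizard, but a row typically carries a query, the model-generated response, and optionally context and ground truth. The following minimal Python sketch writes such a JSONL file; the column names and rows are illustrative, not required names.

```python
import json

# Illustrative rows for a dataset-target evaluation: the model-generated
# responses are already included alongside query, context, and ground truth.
rows = [
    {
        "query": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
        "ground_truth": "Paris",
    },
    {
        "query": "Who wrote Pride and Prejudice?",
        "context": "Pride and Prejudice is an 1813 novel by Jane Austen.",
        "response": "Jane Austen wrote Pride and Prejudice.",
        "ground_truth": "Jane Austen",
    },
]

# JSON Lines format: one JSON object per line.
with open("test_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```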

Create an evaluation with built-in evaluation metrics

An evaluation run lets you generate metric outputs for each data row in your test dataset. Select one or more evaluation metrics to assess the output from different aspects. Create an evaluation run from the evaluation or model catalog pages in the Foundry portal. The evaluation creation wizard guides you through setting up an evaluation run.

From the evaluate page

From the left pane, select Evaluation, and then select Create a new evaluation (or Create, depending on your portal version).

From the model catalog page

  1. From the left pane, select Model catalog.

  2. Go to the model.

  3. Select the Benchmarks tab.

  4. Select Try with your own data. This selection opens the model evaluation panel, where you can create an evaluation run against your selected model.

    Screenshot of the Try with your own data button from the model catalog page.

From the model or agent playground page

From the model playground or agent playground page, select Evaluation > Create, or select Metrics > Run full evaluation.

Evaluation target

When you start an evaluation from the Evaluate page, you first need to choose the evaluation target. Specifying the appropriate evaluation target tailors the evaluation to your application's specific nature, ensuring accurate and relevant metrics. The following evaluation targets are supported:

  • Model: This choice evaluates the output generated by your selected model and a user-defined prompt.
  • Agent: This choice evaluates the output generated by your selected agent and a user-defined prompt.
  • Dataset: Your model-generated or agent-generated outputs are already in a test dataset.

Configure test data

In the evaluation creation wizard, select from preexisting datasets or upload a new dataset to evaluate. The test dataset must contain the model-generated outputs that you want to evaluate. A preview of your test data is shown on the right pane.

  • Choose existing dataset: You can select the test dataset from your established dataset collection.

    Screenshot of the option to select test data when creating a new evaluation.

  • Add new dataset: Upload files from your local storage. Only CSV and JSONL file formats are supported. A preview of your test data displays on the right pane.

    Screenshot of the upload file option that you can use when creating a new evaluation.

Select or create a dataset

If you choose to evaluate a model or agent, you need a dataset to act as inputs to these targets so that responses can be assessed by evaluators. In the dataset step, you can choose to select or upload a dataset of your own, or you can synthetically generate a dataset.

  • Add new dataset: Upload files from your local storage. Only CSV and JSONL file formats are supported. A preview of your test data displays on the right pane.
  • Synthetic dataset generation: Synthetic datasets are useful when you lack data, or lack access to data, for testing the model or agent that you built. With synthetic data generation, you choose the resource that generates the data and the number of rows to generate, and you enter a prompt that describes the kind of data you want. You can also upload files to improve the relevance of the generated dataset to your agent's or model's intended task. (For a code-level illustration, see the sketch after the following note.)

Note

This feature isn't available in all regions. Synthetic data generation is available only in regions that support the Responses API. For an up-to-date list of supported regions, see Azure OpenAI Responses API region availability.
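
Inside the portal, synthetic generation is handled for you by the resource you select. If you want to produce a similar starter dataset yourself, the sketch below shows one way to do it with the OpenAI Python SDK against an Azure OpenAI deployment. The endpoint, API version, deployment name, and prompt are placeholders, and it uses Chat Completions for simplicity rather than the Responses API that the portal feature relies on.

```python
import json
from openai import AzureOpenAI

# Placeholder connection details; substitute your own resource and deployment.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

prompt = (
    "Generate 5 realistic user questions for a travel-booking assistant. "
    "Return one question per line with no numbering."
)

completion = client.chat.completions.create(
    model="<your-deployment-name>",
    messages=[{"role": "user", "content": prompt}],
)

# Write each generated question as a query row in a JSONL dataset.
with open("synthetic_queries.jsonl", "w", encoding="utf-8") as f:
    for line in completion.choices[0].message.content.splitlines():
        if line.strip():
            f.write(json.dumps({"query": line.strip()}) + "\n")
```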

Configure testing criteria

We support three types of metrics curated by Microsoft to facilitate a comprehensive evaluation of your application:

  • AI quality (AI assisted): These metrics evaluate the overall quality and coherence of the generated content. You need a model deployment to act as a judge to run these metrics.
  • AI quality (NLP): These natural language processing (NLP) metrics are mathematically based, and they also evaluate the overall quality of the generated content. They often require ground truth data, but they don't require a model deployment as a judge.
  • Risk and safety metrics: These metrics focus on identifying potential content risks and ensuring the safety of the generated content.

You can also create custom metrics and select them as evaluators during the testing criteria step.

As you add testing criteria, different metrics are used as part of the evaluation. Refer to the following list for the complete set of metrics supported in each scenario. For more in-depth information on metric definitions and how they're calculated, see What are evaluators?.

  • AI quality (AI assisted): Groundedness, Relevance, Coherence, Fluency, GPT similarity
  • AI quality (NLP): F1 score, ROUGE score, BLEU score, GLEU score, METEOR score
  • Risk and safety metrics: Self-harm-related content, Hateful and unfair content, Violent content, Sexual content, Protected material, Indirect attack

When you run an AI-assisted quality evaluation, you must specify a GPT model to perform the calculation and grading.
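
The portal wizard collects the judge model for you. For reference, here's a hedged sketch of the equivalent pattern in the Azure AI Evaluation SDK (the azure-ai-evaluation Python package), where an AI-assisted evaluator such as groundedness takes a model configuration for the judge deployment. The endpoint, key, and deployment names are placeholders.

```python
from azure.ai.evaluation import GroundednessEvaluator

# Placeholder judge configuration: any deployed Azure OpenAI chat model
# (for example, a GPT-4 class deployment) can act as the grader.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",
}

groundedness = GroundednessEvaluator(model_config=model_config)

result = groundedness(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Western Europe. Its capital is Paris.",
)
print(result)  # a dict that includes the groundedness score
```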

Screenshot that shows the Likert-scale evaluator with the AI quality (AI assisted) metrics listed in presets.

AI Quality (NLP) metrics are mathematically based measurements that assess your application's performance. They often require ground truth data for calculation. ROUGE is a family of metrics. You can select the ROUGE type to calculate the scores. Various types of ROUGE metrics offer ways to evaluate the quality of text generation. ROUGE-N measures the overlap of n-grams between the candidate and reference texts.
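
To make overlap-based scoring concrete, the following self-contained sketch computes a token-level F1 score between a response and a ground truth answer, the same precision-and-recall-over-shared-tokens idea that ROUGE-1 uses. It's an illustration of the math, not the exact implementation the service uses.

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of precision and recall over shared unigrams."""
    resp_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    common = Counter(resp_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The capital of France is Paris", "Paris is the capital of France"))
# 1.0: every token overlaps, regardless of word order
```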

Screenshot that shows text similarity with the AI quality (NLP) metrics listed in presets.

For risk and safety metrics, you don't need to provide a deployment. The Foundry portal provisions a GPT-4 model that can generate content risk severity scores and reasoning to enable you to evaluate your application for content harms.
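
If you also evaluate risk and safety programmatically, the Azure AI Evaluation SDK exposes the same service-hosted safety evaluators. The sketch below is a hedged example that assumes a project in one of the supported regions listed in the following note; the subscription, resource group, and project names are placeholders.

```python
from azure.ai.evaluation import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# Placeholder project details; the safety evaluation service is hosted for
# the project, so you don't supply a judge deployment of your own.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

violence = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

result = violence(
    query="Describe the plot of the movie.",
    response="The film follows two friends on a road trip across Spain.",
)
print(result)  # includes a severity label, score, and reasoning
```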

Note

AI-assisted risk and safety metrics are hosted by Foundry safety evaluations and are available only in the following regions: East US 2, France Central, UK South, Sweden Central.

Screenshot that shows the metric Violent content, which is one of the risk and safety metrics.

Caution

Users who previously managed their model deployments and ran evaluations by using oai.azure.com, and then onboarded to the Microsoft Foundry developer platform, have these limitations when they use ai.azure.com:

  • These users can't view their evaluations that were created through the Azure OpenAI API. To view these evaluations, they have to go back to oai.azure.com.
  • These users can't use the Azure OpenAI API to run evaluations within Foundry. Instead, they should continue to use oai.azure.com for this task. However, they can use the Azure OpenAI evaluators that are available directly in Foundry (ai.azure.com) through the dataset evaluation creation option. Fine-tuned model evaluation isn't supported if the deployment was migrated from Azure OpenAI to Foundry.

If you upload a dataset and bring your own storage, a few configuration requirements apply:

  • Account authentication must be Microsoft Entra ID.
  • The storage must be added to the account. Adding it to the project causes service errors.
  • Users must add their project to their storage account through access control in the Azure portal.

To learn more about creating evaluations with OpenAI evaluation graders in the Azure OpenAI hub, see How to use Azure OpenAI in Foundry models evaluation.

Data mapping

Data mapping for evaluation: For each metric added, you must specify which data columns in your dataset correspond with the inputs that are needed in the evaluation. Different evaluation metrics demand distinct types of data inputs for accurate calculations.

During evaluation, the model's response is assessed against key inputs such as:

  • Query: Required for all metrics.
  • Context: Optional.
  • Ground truth: Required for AI quality (NLP) metrics; optional for other metrics.

These mappings ensure accurate alignment between your data and the evaluation criteria.

Screenshot of the query, context, and ground truth mapping to your evaluation input.

Based on the dataset you generated or uploaded, the portal automatically maps dataset fields to the fields that the evaluators expect. However, you should always double-check the field mapping to make sure it's accurate. You can reassign fields if needed.
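
If you later run the same evaluation from code, the Azure AI Evaluation SDK makes this mapping explicit through an evaluator configuration. The sketch below is a hedged example that assumes your JSONL file uses columns named query, response, and context; adjust the right-hand side of each mapping to your actual column names.

```python
from azure.ai.evaluation import evaluate, GroundednessEvaluator

# Placeholder judge configuration, as in the earlier sketch.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",
}

result = evaluate(
    data="test_data.jsonl",
    evaluators={"groundedness": GroundednessEvaluator(model_config=model_config)},
    # Map evaluator inputs (left) to dataset columns (right).
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
                "context": "${data.context}",
            }
        }
    },
)
print(result["metrics"])
```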

Query and response metric requirements

For guidance on the specific data mapping requirements for each metric, refer to the information provided in the table:

| Metric | Query | Response | Context | Ground truth |
|---|---|---|---|---|
| Groundedness | Required: Str | Required: Str | Required: Str | Doesn't apply |
| Coherence | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Fluency | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Relevance | Required: Str | Required: Str | Required: Str | Doesn't apply |
| GPT similarity | Required: Str | Required: Str | Doesn't apply | Required: Str |
| F1 score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| BLEU score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| GLEU score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| METEOR score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| ROUGE score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| Self-harm-related content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Hateful and unfair content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Violent content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Sexual content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Protected material | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Indirect attack | Required: Str | Required: Str | Doesn't apply | Doesn't apply |

  • Query: A query seeking specific information.
  • Response: The response to a query generated by the model.
  • Context: The source that the response is based on. (Example: grounding documents.)
  • Ground truth: A query response generated by a human user that serves as the true answer.

Review and submit

After you complete the necessary configurations, you can optionally provide a name for your evaluation. Review the settings, and then select Submit to start the evaluation run.

Model evaluation

To create a new evaluation for your selected model deployment, you can use a GPT model to generate sample questions, or you can select from your established dataset collection.

Configure test data for a model

Set up the test dataset that's used for evaluation. This dataset is sent to the model to generate responses for assessment. You have two options for configuring your test data:

  • Generate sample questions
  • Use an existing dataset (or upload a new dataset)

Generate sample questions

If you don't have a dataset readily available and want to run an evaluation with a small sample, select the model deployment that you want to evaluate based on a chosen topic. Azure OpenAI models and other open models that are compatible with serverless API deployment, like Meta Llama and Phi-3 family models, are supported.

The topic tailors the generated content to your area of interest. Queries and responses are generated in real time and you can regenerate them as needed.

Use your dataset

You can also select from your established dataset collection or upload a new dataset.

Screenshot that shows Select data source and highlights using an existing dataset.

Select evaluation metrics

To configure your test criteria, select Next. As you select your criteria, metrics are added, and you need to map your dataset's columns to the required fields for evaluation. These mappings ensure accurate alignment between your data and the evaluation criteria.

After you select the test criteria you want, you can review the evaluation, optionally change the name of the evaluation, and then select Submit. Go to the evaluation page to see the results.

Note

The generated dataset is saved to the project's blob storage after the evaluation run is created.

View and manage the evaluators in the evaluator library

The evaluator library shows the details and status of your evaluators in one place, where you can view and manage Microsoft-curated evaluators.

The evaluator library also enables version management. You can compare different versions of your work, restore previous versions if needed, and collaborate with others more easily.

To use the evaluator library in the Foundry portal, go to your project's Evaluation page and select the Evaluator library tab.

Select the evaluator name to see more details, including the name, description, parameters, and any associated files. Here are some examples of Microsoft-curated evaluators:

  • For performance and quality evaluators curated by Microsoft, view the annotation prompt on the details page. Adapt these prompts to your use case. Change the parameters or criteria based on your data and objectives in the Azure AI Evaluation SDK. For example, you can select Groundedness-Evaluator and check the Prompty file that shows how we calculate the metric.
  • For risk and safety evaluators curated by Microsoft, see the definition of the metrics. For example, select Self-Harm-Related-Content-Evaluator to learn what it means and understand how Microsoft determines severity levels.

Learn more about evaluating your generative AI applications: