[This article is prerelease documentation and is subject to change.]
In Copilot Studio, you can create a test set of test cases to evaluate the performance of your agents. Test cases let you simulate real-world scenarios for your agent so that you can measure the accuracy, relevance, and quality of the agent's answers to the questions it's asked, based on the information the agent can access. By using the results from the test set, you can optimize your agent's behavior and validate that your agent meets your business and quality requirements.
Important
This article contains Microsoft Copilot Studio preview documentation and is subject to change.
Preview features aren't meant for production use and may have restricted functionality. These features are available before an official release so that you can get early access and provide feedback.
If you're building a production-ready agent, see Microsoft Copilot Studio Overview.
Test methods
When creating test sets, you can choose from different test methods to evaluate your agent's responses: text match, similarity, and quality. Each test method has its own strengths and is suited for different types of evaluations.
Text match test methods
Text match test methods compare the agent's responses to expected responses that you define in the test set. There are two text match tests, illustrated in the sketch that follows their descriptions:
Exact match checks whether the agent’s answer exactly matches the expected response in the test: character for character, word for word. If it’s the same, it passes. If anything differs, it fails. Exact match is useful for short, precise answers such as numbers, codes, or fixed phrases. It doesn't suit answers that people can phrase in multiple correct ways.
Partial match checks whether the agent’s answer contains some of the words or phrases from the expected response that you define. If it does, it passes. If it doesn’t, it fails. Partial match is useful when an answer can be phrased in different correct ways, but key terms or ideas still need to be included in the response.
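To make the two checks concrete, here's a minimal, hypothetical Python sketch of exact and partial match logic. The function names and the case-insensitive containment rule are assumptions for illustration, not Copilot Studio's actual implementation.

```python
# Hypothetical sketch of text match checks; not Copilot Studio's actual logic.

def exact_match(agent_answer: str, expected: str) -> bool:
    """Pass only if the answer matches the expected response character for character."""
    return agent_answer == expected

def partial_match(agent_answer: str, expected_terms: list[str]) -> bool:
    """Pass if the answer contains the key words or phrases from the expected response
    (assumed rule: case-insensitive containment of every key term)."""
    answer = agent_answer.lower()
    return all(term.lower() in answer for term in expected_terms)

# Example usage
print(exact_match("The refund window is 30 days.", "The refund window is 30 days."))  # True
print(partial_match("You can request a refund within 30 days.", ["refund", "30 days"]))  # True
```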
Similarity test methods
The similarity test method measures how similar the agent's responses are to the expected responses defined in your test set. It's useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.
It uses a cosine similarity metric to score how closely the agent's answer matches the wording and meaning of the expected response. The score ranges from 0 to 1, where 1 indicates the answer closely matches the expected response and 0 indicates it doesn't. You can set a passing score threshold to determine what constitutes a passing score for an answer.
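The following Python sketch illustrates the idea of cosine similarity scoring against a threshold. It's a hedged example only: the embedding vectors are made up, and the actual embedding model, preprocessing, and threshold semantics that Copilot Studio uses aren't specified here.

```python
# Illustrative sketch of cosine similarity scoring; not Copilot Studio's implementation.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means the vectors point the same way, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, these vectors would be text embeddings of the agent's answer and the
# expected response; the values below are invented to show the arithmetic.
answer_embedding = np.array([0.12, 0.87, 0.45])
expected_embedding = np.array([0.10, 0.90, 0.40])

score = cosine_similarity(answer_embedding, expected_embedding)
threshold = 0.8  # example passing score threshold
print(f"score={score:.2f}, passed={score >= threshold}")
```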
Quality test methods
Quality test methods help you decide whether your agent's responses meet your standards, producing results that are both reliable and easy to explain.
These methods use a large language model (LLM) to assess how effectively an agent answers user questions. They're especially helpful when there's no exact answer expected, offering a flexible and scalable way to evaluate responses based on the retrieved documents and the conversation flow.
There are two quality test methods:
General quality evaluates agent responses. It uses these key criteria and applies a consistent prompt to guide scoring:
Relevance: To what extent the agent's response addresses the question. For example, does the agent's response stay on the subject and directly answer the question?
Groundedness: To what extent the agent's response is based on the provided context. For example, does the agent's response reference or rely on the information given in the context, rather than introducing unrelated or unsupported information?
Completeness: To what extent the agent's response provides all necessary information. For example, does the agent's response cover all aspects of the question and provide sufficient detail?
Abstention: Whether the agent attempted to answer the question.
To be considered high quality, a response must meet all these key criteria. If even one criterion isn't met, the response is flagged for improvement. This scoring method ensures that only responses that are both complete and well-supported receive top marks, while answers that are incomplete or lack supporting evidence receive lower scores. A sketch of this rule follows the two method descriptions.
Compare meaning evaluates how well the agent's answer reflects the intended meaning of the expected response. Instead of focusing on exact wording, it uses intent similarity: it compares the ideas and meaning behind the words to judge how closely the response aligns with what was expected.
You can set a passing score threshold to determine what constitutes a passing score for an answer. The default passing score is 50. The compare meaning test method is useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.
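As a rough sketch of the "all key criteria must be met" rule for the general quality method, the following Python example shows how a response with even one unmet criterion would be flagged. The field names and data structure are assumptions for illustration; the actual evaluation is performed by an LLM with a consistent scoring prompt.

```python
# Hypothetical sketch of the "all key criteria must be met" rule.
from dataclasses import dataclass

@dataclass
class QualityJudgment:
    relevance: bool         # Does the response directly address the question?
    groundedness: bool      # Is the response based on the provided context?
    completeness: bool      # Does the response cover all aspects of the question?
    attempted_answer: bool  # Abstention check: did the agent attempt to answer?

def is_high_quality(judgment: QualityJudgment) -> bool:
    """A response is high quality only if every key criterion is met."""
    return all([judgment.relevance, judgment.groundedness,
                judgment.completeness, judgment.attempted_answer])

# Example: one unmet criterion (completeness) flags the response for improvement.
print(is_high_quality(QualityJudgment(True, True, False, True)))  # False
```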
Thresholds and pass rates
The success of a test case depends on the test method you select and the threshold you set for passing scores.
Each test method, except exact match, produces a numeric score that reflects how well the agent's answer meets that method's evaluation criteria. The threshold is the cutoff score that separates a pass from a fail. You can set the passing scores for similarity and compare meaning test cases.
Exact match is a strict test method that doesn't produce a numeric score; the answer must match exactly to pass. By choosing the threshold for a test case, you decide how strict or lenient the evaluation is. Each test method evaluates the agent's answer differently, so it's important to choose the one that best fits your evaluation criteria.
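For illustration, here's a small Python sketch of how a threshold could turn scores into pass or fail results, and how a pass rate for a whole test set could be computed. The greater-than-or-equal comparison and the sample scores are assumptions for the example.

```python
# Hypothetical sketch of threshold-based pass/fail and a test-set pass rate.
# Scales differ by method: similarity scores range from 0 to 1, while compare
# meaning uses a default passing score of 50.

def passes(score: float, threshold: float) -> bool:
    """Assume a test case passes when its score meets or exceeds the threshold."""
    return score >= threshold

def pass_rate(scores: list[float], threshold: float) -> float:
    """Fraction of test cases in the set that pass at the given threshold."""
    if not scores:
        return 0.0
    return sum(passes(s, threshold) for s in scores) / len(scores)

# Example: compare meaning scores for five test cases at the default threshold of 50.
scores = [82, 47, 65, 50, 30]
print(pass_rate(scores, threshold=50))  # 0.6 -> three of five cases pass
```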