[This article is prerelease documentation and is subject to change.]
In Copilot Studio, you can create a test set of test cases to evaluate the performance of your agents. Test cases let you simulate real-world scenarios for your agent so that you can measure the accuracy, relevance, and quality of the agent's answers to the questions it's asked, based on the information the agent can access. By using the results from the test set, you can optimize your agent's behavior and validate that your agent meets your business and quality requirements.
Important
This article contains Microsoft Copilot Studio preview documentation and is subject to change.
Preview features aren't meant for production use and may have restricted functionality. These features are available before an official release so that you can get early access and provide feedback.
If you're building a production-ready agent, see Microsoft Copilot Studio Overview.
Test methods
When creating test sets, you can choose from different test methods to evaluate your agent's responses: text match, similarity, and quality. Each test method has its own strengths and is suited for different types of evaluations.
Text match test methods
Text match test methods compare the agent's responses to expected responses that you define in the test set. There are two text match tests, illustrated in the sketch that follows their descriptions:
Exact match checks whether the agent’s answer exactly matches the expected response in the test: character for character, word for word. If it’s the same, it passes. If anything differs, it fails. Exact match is useful for short, precise answers such as numbers, codes, or fixed phrases. It doesn't suit answers that people can phrase in multiple correct ways.
Partial match checks whether the agent’s answer contains some of the words or phrases from the expected response that you define. If it does, it passes. If it doesn’t, it fails. Partial match is useful when an answer can be phrased in different correct ways, but key terms or ideas still need to be included in the response.
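To make the two checks concrete, here's a minimal, hypothetical Python sketch of exact and partial match logic. The function names and the case-insensitive containment rule are assumptions for illustration, not Copilot Studio's actual implementation.

```python
# Hypothetical sketch of text match checks; not Copilot Studio's actual logic.

def exact_match(agent_answer: str, expected: str) -> bool:
    """Pass only if the answer matches the expected response character for character."""
    return agent_answer == expected

def partial_match(agent_answer: str, expected_terms: list[str]) -> bool:
    """Pass if the answer contains the key words or phrases from the expected response
    (assumed rule: case-insensitive containment of every key term)."""
    answer = agent_answer.lower()
    return all(term.lower() in answer for term in expected_terms)

# Example usage
print(exact_match("The refund window is 30 days.", "The refund window is 30 days."))  # True
print(partial_match("You can request a refund within 30 days.", ["refund", "30 days"]))  # True
```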
Similarity test methods
The similarity test method measures how similar the agent's responses are to the expected responses defined in your test set. It's useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.
It uses a cosine similarity metric to score how closely the agent's answer matches the wording and meaning of the expected response. The score ranges from 0 to 1, where 1 indicates the answer closely matches the expected response and 0 indicates it doesn't. You can set a passing score threshold to determine what constitutes a passing score for an answer.
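The following Python sketch illustrates the idea of cosine similarity scoring against a threshold. It's a hedged example only: the embedding vectors are made up, and the actual embedding model, preprocessing, and threshold semantics that Copilot Studio uses aren't specified here.

```python
# Illustrative sketch of cosine similarity scoring; not Copilot Studio's implementation.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means the vectors point the same way, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, these vectors would be text embeddings of the agent's answer and the
# expected response; the values below are invented to show the arithmetic.
answer_embedding = np.array([0.12, 0.87, 0.45])
expected_embedding = np.array([0.10, 0.90, 0.40])

score = cosine_similarity(answer_embedding, expected_embedding)
threshold = 0.8  # example passing score threshold
print(f"score={score:.2f}, passed={score >= threshold}")
```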
Quality test methods
Quality test methods help you decide whether your agent's responses meet your standards, producing results that are both reliable and easy to explain.
These methods use a large language model (LLM) to assess how effectively an agent answers user questions. They're especially helpful when there's no exact answer expected, offering a flexible and scalable way to evaluate responses based on the retrieved documents and the conversation flow.
There are two quality test methods:
General quality evaluates agent responses. It uses these key criteria and applies a consistent prompt to guide scoring:
Relevance: To what extent the agent's response addresses the question. For example, does the agent's response stay on the subject and directly answer the question?
Groundedness: To what extent the agent's response is based on the provided context. For example, does the agent's response reference or rely on the information given in the context, rather than introducing unrelated or unsupported information?
Completeness: To what extent the agent's response provides all necessary information. For example, does the agent's response cover all aspects of the question and provide sufficient detail?
Abstention: Whether the agent attempted to answer the question.
To be considered high quality, a response must meet all these key criteria. If even one criterion isn't met, the response is flagged for improvement. This scoring method ensures that only responses that are both complete and well-supported receive top marks, while answers that are incomplete or lack supporting evidence receive lower scores. A sketch of this rule follows the two method descriptions.
Compare meaning evaluates how well the agent's answer reflects the intended meaning of the expected response. Instead of focusing on exact wording, it uses intent similarity: it compares the ideas and meaning behind the words to judge how closely the response aligns with what was expected.
You can set a passing score threshold to determine what constitutes a passing score for an answer. The default passing score is 50. The compare meaning test method is useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.
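As a rough sketch of the "all key criteria must be met" rule for the general quality method, the following Python example shows how a response with even one unmet criterion would be flagged. The field names and data structure are assumptions for illustration; the actual evaluation is performed by an LLM with a consistent scoring prompt.

```python
# Hypothetical sketch of the "all key criteria must be met" rule.
from dataclasses import dataclass

@dataclass
class QualityJudgment:
    relevance: bool         # Does the response directly address the question?
    groundedness: bool      # Is the response based on the provided context?
    completeness: bool      # Does the response cover all aspects of the question?
    attempted_answer: bool  # Abstention check: did the agent attempt to answer?

def is_high_quality(judgment: QualityJudgment) -> bool:
    """A response is high quality only if every key criterion is met."""
    return all([judgment.relevance, judgment.groundedness,
                judgment.completeness, judgment.attempted_answer])

# Example: one unmet criterion (completeness) flags the response for improvement.
print(is_high_quality(QualityJudgment(True, True, False, True)))  # False
```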
Thresholds and pass rates
The success of a test case depends on the test method you select and the threshold you set for passing scores.
Each test method, except exact match, produces a numeric score that reflects how well the agent's answer meets that method's evaluation criteria. The threshold is the cutoff score that separates a pass from a fail. You can set the passing scores for similarity and compare meaning test cases.
Exact match is a strict test method that doesn't produce a numeric score; the answer must match exactly to pass. By choosing the threshold for a test case, you decide how strict or lenient the evaluation is. Each test method evaluates the agent's answer differently, so it's important to choose the one that best fits your evaluation criteria.
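For illustration, here's a small Python sketch of how a threshold could turn scores into pass or fail results, and how a pass rate for a whole test set could be computed. The greater-than-or-equal comparison and the sample scores are assumptions for the example.

```python
# Hypothetical sketch of threshold-based pass/fail and a test-set pass rate.
# Scales differ by method: similarity scores range from 0 to 1, while compare
# meaning uses a default passing score of 50.

def passes(score: float, threshold: float) -> bool:
    """Assume a test case passes when its score meets or exceeds the threshold."""
    return score >= threshold

def pass_rate(scores: list[float], threshold: float) -> float:
    """Fraction of test cases in the set that pass at the given threshold."""
    if not scores:
        return 0.0
    return sum(passes(s, threshold) for s in scores) / len(scores)

# Example: compare meaning scores for five test cases at the default threshold of 50.
scores = [82, 47, 65, 50, 30]
print(pass_rate(scores, threshold=50))  # 0.6 -> three of five cases pass
```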