Run automated tests for agent quality and reliability

Enabled for: Admins, makers, marketers, or analysts, automatically
Public preview: Sep 21, 2025
General availability: -

Business value

The evaluation framework enhances agent validation by enabling automated testing workflows, minimizing manual effort, and providing clear execution results. It ensures consistent and reliable agent responses and allows Makers to identify potential issues early in the development cycle. With run results and evaluation indicators, Makers can better assess test coverage, verify execution integrity, and improve overall agent performance, leading to faster deployment and increased reliability.

Feature details

The evaluation framework in Copilot Studio introduces a structured and automated approach to testing AI agents, ensuring high-quality deployments and continuous improvement. It is built around three core workstreams:

  1. Initiating automated evaluation processes

     Makers can initiate automated evaluation tests seamlessly, either directly from the agent or through the test pane. This enables structured validation workflows, ensuring consistent and repeatable testing.

  2. Advanced test query editing

     The evaluation framework allows Makers to refine and customize test queries to maximize validation accuracy:
     • Dynamically modify test queries to adapt to different testing needs
     • Manually enter custom test questions for expanded scenario coverage
     • Leverage AI-generated test queries to enhance evaluation depth

  3. Automated test execution and results display

     The evaluation framework provides a structured and automated testing workflow, ensuring reliable execution and clear validation results:
     • Execute automated tests to assess agent responses across multiple scenarios
     • Provide an overall performance summary, helping users quickly gauge evaluation results
     • Break down results by session to track execution details and agent behavior
     • Provide detailed question-level feedback, including:
       - Evaluation of answers and correctness
       - Explanations for failed tests
       - Identification of the question source for better traceability
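
Conceptually, an evaluation run takes a set of test questions, collects the agent's answers, judges each one for correctness, and rolls the verdicts up into a summary. The sketch below illustrates that flow with hypothetical names (TestCase, QuestionResult, run_evaluation, summarize); it is not the Copilot Studio API or test-case format, which is managed through the product's UI and the docs linked under Additional resources.

    # Illustrative sketch only -- hypothetical names, not the Copilot Studio API.
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class TestCase:
        question: str          # test query (manually entered or AI-generated)
        expected_answer: str   # the answer the agent is expected to give
        source: str            # where the question came from, for traceability

    @dataclass
    class QuestionResult:
        case: TestCase
        actual_answer: str
        passed: bool
        explanation: str       # why a failed test failed

    def run_evaluation(agent: Callable[[str], str],
                       judge: Callable[[str, str], Tuple[bool, str]],
                       cases: List[TestCase]) -> List[QuestionResult]:
        """Ask the agent every test question and judge each answer."""
        results = []
        for case in cases:
            answer = agent(case.question)
            passed, explanation = judge(case.expected_answer, answer)
            results.append(QuestionResult(case, answer, passed, explanation))
        return results

    def summarize(results: List[QuestionResult]) -> str:
        """Overall performance summary, analogous to the run-level view."""
        passed = sum(r.passed for r in results)
        return f"{passed}/{len(results)} test questions passed"

In the product, the judging, per-session breakdown, and failure explanations are produced by the evaluation framework itself; the judge callable above simply stands in for that step.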

Geographic areas

Visit the Explore Feature Geography report to see the Microsoft Azure geographies where this feature is planned or available.

Language availability

Visit the Explore Feature Language report for the languages in which this feature is available.

Additional resources

Create test cases to evaluate your agent (docs)