
Analyze test results using Copilot Studio Kit

The Copilot Studio Kit provides a comprehensive interface for analyzing test results.

Test run details

The Agent Test Run interface shows the status of test runs.

  • Run Status: Main process that runs each individual agent test against the agent configuration by using the Direct Line API, and creates a corresponding Agent Test Result record.
  • App Insights Enrichment Status: Runs only if Enrich With Azure Application Insights is enabled on the related Agent Configuration record.
  • Generated Answers Analysis: Runs only if Analyze Generated Answers is enabled on the related Agent Configuration record.
  • Dataverse Enrichment Status: Runs only if Enrich With Conversation Transcripts is enabled on the related Agent Configuration record.
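
As background, the Run Status step exchanges messages with the agent through the Direct Line API. The following Python sketch shows roughly what such an exchange looks like; the secret, user ID, and utterance are placeholder values, and the kit's own test runner may implement the calls differently.

```python
import time

import requests

# Placeholder values; substitute your own Direct Line secret and test utterance.
DIRECT_LINE_SECRET = "<your-direct-line-secret>"
BASE_URL = "https://directline.botframework.com/v3/directline"
headers = {"Authorization": f"Bearer {DIRECT_LINE_SECRET}"}

# 1. Start a conversation.
conversation = requests.post(f"{BASE_URL}/conversations", headers=headers).json()
conversation_id = conversation["conversationId"]

# 2. Send the test utterance as a message activity.
activity = {
    "type": "message",
    "from": {"id": "test-user"},           # placeholder user ID
    "text": "What are your store hours?",  # placeholder test utterance
}
requests.post(
    f"{BASE_URL}/conversations/{conversation_id}/activities",
    headers=headers,
    json=activity,
)

# 3. Wait briefly, then read the activities and print the agent's replies.
#    A real test harness would poll with the watermark instead of sleeping.
time.sleep(3)
activities = requests.get(
    f"{BASE_URL}/conversations/{conversation_id}/activities",
    headers=headers,
).json()["activities"]
for a in activities:
    if a["type"] == "message" and a["from"]["id"] != "test-user":
        print(a.get("text", ""))
```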

Learn more about Agent Configuration settings in Configure agents in Copilot Studio Kit.

The following image shows the Test Runs interface, where you can view details of the test run.

Screenshot of the Test Runs interface in Copilot Studio Kit, showing details such as Run Status, Success Rate, Average Latency, and more.

Aggregated results

After a cloud flow runs, the system calculates the aggregated results.

  • # Tests: Number of test results.
  • Success Rate (%): Percentage of test result records with a Success result compared to the total number of test results.
  • Average Latency (ms): Average time, in milliseconds, for the agent to send the message after it receives the test utterance.
  • # Success: Number of test result records with a Success result.
  • # Failed: Number of test result records with a Failed result.
  • # Pending: Number of test result records with a Pending result.
  • # Unknown: Number of test result records with an Unknown result.
  • # Error: Number of test result records with an Error result.
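
To see how these aggregates relate to individual test results, the following sketch computes them from a small list of sample records. The field names, and the choice to exclude results without a response from the latency average, are illustrative assumptions rather than the kit's actual implementation.

```python
from statistics import mean

# Hypothetical per-test result records; the field names are illustrative only.
results = [
    {"result": "Success", "latency_ms": 820},
    {"result": "Failed", "latency_ms": 1310},
    {"result": "Success", "latency_ms": 640},
    {"result": "Error", "latency_ms": None},  # no response received
]

statuses = ("Success", "Failed", "Pending", "Unknown", "Error")
counts = {s: sum(1 for r in results if r["result"] == s) for s in statuses}
latencies = [r["latency_ms"] for r in results if r["latency_ms"] is not None]

aggregates = {
    "# Tests": len(results),
    "Success Rate (%)": round(100 * counts["Success"] / len(results), 1),
    "Average Latency (ms)": round(mean(latencies)) if latencies else None,
    **{f"# {s}": n for s, n in counts.items()},
}
print(aggregates)
```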

Detailed results

Analyze results after each step completes, because some results are only available once the steps finish. For example, Topic Match tests need the Dataverse enrichment step to complete, because only that step provides the name of the topic that was triggered.

You can edit individual results directly from the results view.

Each result has a Result Reason section that's automatically populated with an explanation for the result. For AI-generated assessments, it recommends a human review: "AI-generated assessment of the response. Please review." Testers can use this attribute to add their own comments and notes on a test.

Screenshot of an Agent Test Run record showing the Result Reason column on the right-hand side of the interface.

Use the Results filter to view only the results of a specific test type:

  • Generative Answers Results
  • Response Match Results
  • Topic Match Results
  • Attachment Results

Screenshot of the System View options available for Results.

Agent Test Result details

The Agent Test Result form provides details on each individual test execution. The system automatically creates these records.

  • Conversation ID: ID of the conversation that the Direct Line API provides.
  • Agent Test Run: Test run that the record relates to.
  • Agent Test: Test that the record relates to. You can see the test details in a Quick View form.
  • Result: Success, Failed, Unknown, Error, or Pending.
  • Explanation: Autogenerated explanation of the result.
  • Latency (ms): Time, in milliseconds, that the agent takes to send the message back after receiving the test utterance.
  • Message Sent: Timestamp of the message that the user sends.
  • Response Received: Timestamp of the message that the agent sends.
  • Response: Text message the agent sends.
  • App Insights Result: Generative answer results from Azure Application Insights (when Enrich With Azure Application Insights is enabled).
  • Triggered Topic ID: Unique identifier of the Chatbot Subcomponent record for the triggered topic in Dataverse (when Enrich With Conversation Transcripts is enabled).
  • Triggered Topic / Event: Name of the triggered topic (when Enrich With Conversation Transcripts is enabled). If multiple topics matched, IntentCandidates. For Conversational Boosting and Fallback, UnknownIntent.
  • Recognized Intent Score: If intent recognition occurs, the score of the top intent.
  • Conversation Transcript: File attachment of the full conversation transcript JSON (when Enrich With Conversation Transcripts is enabled and Copy Full Transcript is set to yes).
  • Suggested Actions: When available, JSON of the suggested actions that the agent returns and associates with its response.
  • Attachments: When available, JSON of the attachments array that the agent returns and associates with its response.
  • Citations: For generated answers, JSON array of the citations that the agent uses to generate the answer (when Enrich With Conversation Transcripts is enabled).
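
If you want to pull these records out of Dataverse for your own analysis, for example to recompute latency from the Message Sent and Response Received timestamps, a Dataverse Web API query might look like the following sketch. The environment URL, token, and table and column logical names are placeholders, not the kit's documented schema; check the actual schema in your environment before using them.

```python
from datetime import datetime

import requests

# Placeholder environment URL, access token, and logical names; these are
# illustrative assumptions, not the kit's documented schema.
ORG_URL = "https://yourorg.crm.dynamics.com"
ACCESS_TOKEN = "<access-token-from-microsoft-entra-id>"
TEST_RESULT_TABLE = "cat_copilottestresults"  # hypothetical entity set name

url = (
    f"{ORG_URL}/api/data/v9.2/{TEST_RESULT_TABLE}"
    "?$select=cat_result,cat_messagesent,cat_responsereceived"  # hypothetical columns
    "&$top=50"
)
response = requests.get(
    url,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Accept": "application/json",
    },
)
response.raise_for_status()

# Recompute latency as the difference between the two timestamps.
for record in response.json()["value"]:
    sent = datetime.fromisoformat(record["cat_messagesent"].replace("Z", "+00:00"))
    received = datetime.fromisoformat(record["cat_responsereceived"].replace("Z", "+00:00"))
    latency_ms = (received - sent).total_seconds() * 1000
    print(record["cat_result"], round(latency_ms))
```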

Inspect the transcript

If you enable Enrich With Conversation Transcripts and set Copy Full Transcript to yes, the test result includes the full transcript. When you analyze a test result, go to the Transcript tab for a detailed transcript view in JSON format with an accompanying visualization.
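
If you prefer to inspect a downloaded transcript outside the kit, the following sketch walks the activities in the JSON file and prints each message turn. It assumes the transcript follows the Bot Framework activity schema (a list of activities with type, from, and text fields); the file name is a placeholder, and the access path might differ if the kit nests the activities differently.

```python
import json

# Path to a transcript downloaded from the Conversation Transcript attachment;
# the file name is a placeholder.
with open("conversation_transcript.json", encoding="utf-8") as f:
    transcript = json.load(f)

# Assumes a Bot Framework-style list of activities; adjust the access path
# if your transcript nests the activities differently.
activities = transcript.get("activities", transcript) if isinstance(transcript, dict) else transcript

for activity in activities:
    if activity.get("type") == "message":
        sender = activity.get("from", {}).get("role") or activity.get("from", {}).get("id", "unknown")
        print(f"{sender}: {activity.get('text', '')}")
```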

Screenshot of the Transcript analysis interface of an Agent Test Result.

Analyze multi-turn test results

The results view shows multi-turn tests along with other test types. You see their overall result (Success or Failed) in the Result column. Select the Conversation ID value to view details for the multi-turn test and a list of child tests that make up the test.

Screenshot of the Multiturn Test Results detail view of an Agent Test Result.

In the detailed view of Multiturn Test Results, you can see the results of individual child tests and drill down into their details. The result of a multi-turn test depends on the results of its child tests that are marked as critical. Noncritical child tests can fail, and the multi-turn test case continues to the next test case. If any critical child test fails, test execution for that multi-turn test stops and the test is marked as Failed. If all the critical child tests pass, the result of the multi-turn test is Success.

Multi-turn test cases can include noncritical tests because they provide information to the generative orchestrator. For such a test case, the exact response doesn't matter; only the critical tests that follow do.
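
The following sketch illustrates that aggregation logic; the child-test structure and field names are hypothetical, not the kit's internal implementation. Execution stops at the first failing critical child test, while noncritical failures are ignored.

```python
def evaluate_multiturn(child_tests):
    """Illustrative aggregation of child test results into a multi-turn result.

    Each child test is a dict with hypothetical keys 'name', 'critical',
    and 'passed'; these names are not the kit's actual schema.
    """
    for child in child_tests:
        if child["critical"] and not child["passed"]:
            # A critical child test failed: execution stops, the multi-turn test fails.
            return "Failed"
        # Noncritical failures are ignored and the conversation continues.
    return "Success"  # every critical child test passed


children = [
    {"name": "Provide order number", "critical": False, "passed": False},
    {"name": "Agent confirms order status", "critical": True, "passed": True},
    {"name": "Agent offers a return label", "critical": True, "passed": True},
]
print(evaluate_multiturn(children))  # Success
```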

The multi-turn test (and the Multiturn Test Result) can include any of the regular test types: Response match, Attachments, Topic Match, and Generative Answers.

Where to get help

If you experience issues, review the troubleshooting guidance or raise a support request on GitHub.