Agent Review¶
Overview¶
Agent Review is a collaborative framework designed to validate agent performance. It enables Agent Builders and Subject Matter Experts (SMEs) to define test cases, execute them against specific agent versions, and perform a detailed audit of the agent’s logic, tool usage, and output quality through both automated LLM-as-a-judge “traits” and human feedback.
Note
The “Agent Review” feature is available to customers with the Advanced LLM Mesh add-on.
Tests¶
Adding tests¶
There are two methods for adding tests in Agent Review:
Manual: Manually define individual test cases. Each case consists of a Query (the user prompt), an optional Reference Answer (the “golden” response), and optional Expectations (specific behavioral guidance or constraints the agent should follow).
Import from dataset: Import an existing dataset and map its columns to the Query, Reference Answer, and Expectations fields (see the illustrative sketch below).
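For illustration, a test case, whether defined manually or imported, reduces to these three fields. The snake_case names below are hypothetical and only mirror the mapping targets:

```python
# Hypothetical representation of test cases; field names are illustrative only.
test_cases = [
    {
        "query": "What is the refund policy for damaged items?",  # the user prompt (required)
        "reference_answer": "Damaged items are refundable within 30 days "
                            "with proof of purchase.",            # optional "golden" response
        "expectations": "The agent should consult the policy knowledge "
                        "base and mention the 30-day window.",    # optional behavioral guidance
    },
    {
        "query": "Cancel my latest order.",
        "reference_answer": "",  # no golden answer available
        "expectations": "The agent should call the order-management tool rather than guess.",
    },
]
```

When importing from a dataset, each column is mapped to the corresponding field; only the query is required.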
Quick Test¶
Available in the side panel, this lightweight mode allows for rapid iteration. It’s useful for refining and automatically populating Reference Answer and Expectations fields before finalizing a test case.
Running tests¶
Select tests to start a new run. The tests will be executed as many times as configured in the settings. Execution results can be reviewed and annotated in the Results tab. Detailed logs for each run are available in the Logs tab.
Traits¶
For automated evaluation, Builders can set up LLM-as-a-judge traits. A trait uses the agent’s answer, along with the test’s reference answer and expectations, to compute a PASS/FAIL outcome. Traits can be configured in settings.
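To make the idea concrete, here is a minimal sketch of how an LLM-as-a-judge trait could work. The function name, parameters, and prompt wording are illustrative assumptions, not the actual Agent Review internals:

```python
# Conceptual sketch of an LLM-as-a-judge trait. Names and prompt wording are
# illustrative assumptions, not the actual Agent Review implementation.
from typing import Callable

def judge_trait(evaluation_llm: Callable[[str], str], criterion: str, query: str,
                agent_answer: str, reference_answer: str = "", expectations: str = "") -> str:
    """Ask an evaluation LLM for a binary verdict on one criterion."""
    prompt = (
        "You are grading an AI agent's answer.\n"
        f"Criterion: {criterion}\n"
        f"User query: {query}\n"
        f"Agent answer: {agent_answer}\n"
        f"Reference answer: {reference_answer or 'N/A'}\n"
        f"Expectations: {expectations or 'N/A'}\n"
        "Reply with exactly PASS or FAIL."
    )
    verdict = evaluation_llm(prompt)  # any callable that sends the prompt to an LLM
    return "PASS" if "PASS" in verdict.upper() else "FAIL"
```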
Review run results¶
The Results tab is the central hub for performance analysis and human evaluation. It allows Builders and Reviewers to audit agent behavior through a combination of traits and manual validation.
Analyze results¶
Click on each individual test to get more details about how the agent performed, including the agent’s answer and the trajectory for each execution of the test.
Test Annotation & Feedback¶
Note
To review run results, the user must have write permission on the project.
There are two types of interactions:
Reviewers can provide a binary rating (PASS/FAIL) and add a comment to a test.
Reviewers can manually override a trait outcome with a binary rating (PASS/FAIL).
Warning
Human feedback always overrides LLM-as-judge evaluation.
Test status¶
Test status aggregates trait outcomes and feedback from team reviews. Human feedback always overrides LLM-as-judge evaluation.
| Test Status | Condition (If) |
|---|---|
| Pass | All traits passed OR Team review overrides: “Passed” |
| Fail | One trait failed OR Team review overrides: “Failed” |
| Conflict | At least two team members’ reviews differ |
| Empty | Test completed but no trait or human review |
| Skipped | Test did not complete (encountered an error) |
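A minimal sketch of these rules, assuming simple PASS/FAIL strings and hypothetical argument names (not the product’s internal data model), could look like this:

```python
# Illustrative sketch of the test status rules above; statuses and argument
# names are assumptions, not the product's internal data model.
def test_status(completed: bool, trait_statuses: list[str], reviews: list[str]) -> str:
    if not completed:
        return "Skipped"   # the test errored before finishing
    if len(set(reviews)) > 1:
        return "Conflict"  # at least two reviewers disagree
    if reviews:            # human feedback overrides LLM-as-a-judge
        return "Pass" if reviews[0] == "PASS" else "Fail"
    if not trait_statuses:
        return "Empty"     # completed, but no trait or human review
    return "Fail" if "FAIL" in trait_statuses else "Pass"
```

For example, `test_status(True, ["PASS", "FAIL"], ["PASS"])` returns “Pass”: the human review overrides the failed trait.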
Trait status¶
Trait status aggregates the trait outcome in the different executions of the test as well as the team reviewers’ overrides. Human feedback always overrides LLM-as-judge evaluation.
| Trait Status | Condition (If) |
|---|---|
| Pass | All trait outcomes passed OR All team review overrides: “Passed” |
| Fail | One trait outcome failed OR At least one team review override: “Failed” |
| Skipped | Trait computation did not complete (encountered an error) |
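Complementing the test-status sketch above, trait-level aggregation can be pictured as follows (again a hypothetical sketch, not the actual implementation): reviewer overrides take precedence, and a single failing execution fails the trait.

```python
# Illustrative sketch of the trait status rules above (not the product's code).
def trait_status(completed: bool, outcomes: list[str], overrides: list[str]) -> str:
    if not completed:
        return "Skipped"  # trait computation encountered an error
    if overrides:         # reviewer overrides win over per-execution LLM outcomes
        return "Fail" if "FAIL" in overrides else "Pass"
    return "Fail" if "FAIL" in outcomes else "Pass"
```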
Compare runs¶
The Compare View is designed for benchmarking agent performance across different versions or configurations. It provides a side-by-side analysis to help users:
Track progress and identify regressions.
Add reviews or override traits directly while comparing runs.
Settings¶
Multiple executions of a test¶
Since Agents are non-deterministic, tests can be configured to run multiple times for variance analysis.
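To illustrate why repeated executions are useful, a pass rate computed over several runs of the same test exposes this variance; `run_test` below is a hypothetical callable returning “PASS” or “FAIL”, not a product API:

```python
# Hypothetical sketch: execute the same test several times and measure consistency.
from typing import Callable

def pass_rate(run_test: Callable[[], str], n_executions: int = 5) -> float:
    outcomes = [run_test() for _ in range(n_executions)]  # each call returns "PASS" or "FAIL"
    return outcomes.count("PASS") / n_executions
```

A rate near 0 or 1 indicates stable behavior, while an intermediate rate flags a test whose outcome changes from run to run.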
Trait setup¶
Each trait represents criteria evaluated by an LLM. Traits can be added, edited, or deleted. The instance’s default GenAI Evaluation LLM is used unless otherwise specified.
Permissions¶
There are different access levels on an Agent Review:
- View: Users with the Read project content permission can fully access the information of an agent review.
- Perform actions: To perform actions (create tests, start a run, perform a review, etc.), the user needs both the Write project content permission and a profile that allows them to review agents. For example, a user with an AI Consumer profile and the Write project content permission can perform actions on a review, whereas a user with a Reader profile and the Write project content permission cannot.
- Manage configuration: To manage the configuration (change the agent, modify traits, etc.), the user needs both the Write project content permission and a profile that allows them to modify the project configuration. For example, a user with a Designer profile and the Write project content permission can manage the agent review configuration, whereas a user with an AI Consumer profile and the Write project content permission cannot.