Agent Review

Overview

Agent Review is a collaborative framework designed to validate agent performance. It enables Agent Builders and Subject Matter Experts (SMEs) to define test cases, execute them against specific agent versions, and perform a detailed audit of the agent’s logic, tool usage, and output quality through both automated LLM-as-a-judge “traits” and human feedback.

Note

The “Agent Review” feature is available to customers with the Advanced LLM Mesh add-on.

Tests

Adding tests

There are two ways to add tests to an Agent Review:

  • Manual: Manually define individual test cases. Each case consists of a Query (the user prompt), an optional Reference Answer (the “golden” response), and optional Expectations (specific behavioral guidance or constraints the agent should follow).

  • Import from dataset: Import an existing dataset and map its columns to the Query, Reference Answer, and Expectations fields (see the example sketched below).
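
As an illustration of what such a dataset could contain, the sketch below builds a small CSV whose columns can then be mapped during import. The column names, file name, and contents are hypothetical, not a format required by the product.

  # Illustrative only: a small test dataset whose columns can be mapped to the
  # Query, Reference Answer, and Expectations fields during import.
  import csv

  rows = [
      {
          "query": "What is the refund policy for orders older than 30 days?",
          "reference_answer": "Orders older than 30 days are not eligible for a refund.",
          "expectations": "Cite the refund policy and do not invent exceptions.",
      },
      {
          "query": "Summarize the latest support ticket for customer #4521.",
          "reference_answer": "",  # no golden answer; rely on expectations only
          "expectations": "Call the ticketing tool before answering; keep the summary under 100 words.",
      },
  ]

  with open("agent_review_tests.csv", "w", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=["query", "reference_answer", "expectations"])
      writer.writeheader()
      writer.writerows(rows)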

Quick Test

Available in the side panel, this lightweight mode allows for rapid iteration. It’s useful for refining and automatically populating Reference Answer and Expectations fields before finalizing a test case.

Running tests

Select tests to start a new run. The tests will be executed as many times as configured in the settings. Execution results can be reviewed and annotated in the Results tab. Detailed logs for each run are available in the Logs tab.

Traits

For automated evaluation, Builders can set up LLM-as-a-judge traits. A trait can use the agent’s answer together with the test’s reference answer and expectations to compute a PASS/FAIL outcome. Traits can be configured in the settings.
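
Conceptually, such a trait boils down to a judge prompt that combines the trait’s criteria with the test data and asks an evaluation LLM for a verdict. The following is a minimal sketch of that idea, not the product’s API; the prompt wording is illustrative, and call_llm stands for any callable that sends a prompt to an evaluation LLM and returns its text completion.

  # Conceptual sketch of an LLM-as-a-judge trait (illustrative, not the product's API).
  def evaluate_trait(call_llm, criteria, agent_answer, reference_answer=None, expectations=None):
      prompt = (
          "You are judging an agent's answer against the following criteria:\n"
          f"{criteria}\n\nAgent answer:\n{agent_answer}\n"
      )
      if reference_answer:
          prompt += f"\nReference (golden) answer:\n{reference_answer}\n"
      if expectations:
          prompt += f"\nExpectations the agent should follow:\n{expectations}\n"
      prompt += "\nReply with exactly one word: PASS or FAIL."

      verdict = call_llm(prompt).strip().upper()
      return "PASS" if verdict.startswith("PASS") else "FAIL"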

Review run results

The Results tab is the central hub for performance analysis and human evaluation. It allows Builders and Reviewers to audit agent behavior through a combination of traits and manual validation.

Analyze results

Click on each individual test to get more details about how the agent performed, including the agent’s answer and the trajectory for each execution of the test.

Test Annotation & Feedback

Note

To review run results, the user must have write permission on the project.

There are two types of interactions:

  1. Reviewers can provide a binary rating (PASS/FAIL) and add a comment to a test.

  2. Reviewers can manually override a trait outcome with a binary rating (PASS/FAIL).

Warning

Human feedback always overrides LLM-as-judge evaluation.

Test status

Test status aggregates trait outcomes and feedback from team reviews. Human feedback always overrides LLM-as-judge evaluation.

  • Pass: All traits passed, or a team review overrides the result as “Passed”.

  • Fail: At least one trait failed, or a team review overrides the result as “Failed”.

  • Conflict: At least two team members’ reviews disagree.

  • Empty: The test completed but has no trait outcome or human review.

  • Skipped: The test did not complete (it encountered an error).
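
Read together, these rules mean that human reviews take precedence over traits and that disagreements between reviewers are surfaced explicitly. A minimal sketch of the same logic, assuming simple lists of PASS/FAIL values (illustrative, not product code):

  # trait_outcomes: "PASS"/"FAIL" outcomes from LLM-as-a-judge traits
  # reviews: "PASS"/"FAIL" ratings from team reviewers
  # completed: whether the test execution finished without error
  def test_status(trait_outcomes, reviews, completed=True):
      if not completed:
          return "Skipped"
      if reviews:  # human feedback always overrides LLM-as-judge evaluation
          if len(set(reviews)) > 1:
              return "Conflict"  # at least two reviewers disagree
          return "Pass" if reviews[0] == "PASS" else "Fail"
      if not trait_outcomes:
          return "Empty"  # completed, but nothing to aggregate
      return "Pass" if all(o == "PASS" for o in trait_outcomes) else "Fail"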

Trait status

Trait status aggregates the trait’s outcomes across the different executions of the test, as well as the team reviewers’ overrides. Human feedback always overrides LLM-as-judge evaluation.

  • Pass: All trait outcomes passed, or all team review overrides are “Passed”.

  • Fail: At least one trait outcome failed, or at least one team review override is “Failed”.

  • Skipped: The trait computation did not complete (it encountered an error).
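
As with the test status, these rules can be read as reviewer overrides taking precedence over the computed outcomes. A minimal sketch under the same assumptions as above (illustrative, not product code):

  # outcomes: the trait's "PASS"/"FAIL" outcome for each execution of the test
  # overrides: reviewers' manual "PASS"/"FAIL" overrides of this trait, if any
  def trait_status(outcomes, overrides, computed=True):
      if not computed:
          return "Skipped"
      if overrides:  # human feedback always overrides LLM-as-judge evaluation
          return "Fail" if "FAIL" in overrides else "Pass"
      return "Pass" if all(o == "PASS" for o in outcomes) else "Fail"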

Compare runs

The Compare View is designed for benchmarking agent performance across different versions or configurations. It provides a side-by-side analysis to help users:

  • Track progress and identify regressions.

  • Add reviews or override traits directly while comparing runs.

Settings

Multiple executions of a test

Since Agents are non-deterministic, tests can be configured to run multiple times for variance analysis.

Trait setup

Each trait represents criteria evaluated by an LLM. Traits can be added, edited, or deleted. The instance’s default GenAI Evaluation LLM is used unless otherwise specified.

Permissions

There are different access levels on an Agent Review:

View

Users with the Read project content permission have full read access to the information in an agent review.

Perform action

To perform actions (create tests, start a run, perform a review, etc.), the user needs both the Write project content permission and a profile that allows them to review agents.

Example: Users with an AI Consumer profile and the Write project content permission will be able to perform actions on a review. However, users with a Reader profile and the Write project content permission will not.

Manage configuration

To manage configuration (change the agent, modify traits, etc.), the user needs both the Write project content permission and a profile that allows them to modify the project configuration.

Example: A user with a Designer profile and the Write project content permission will be able to manage the agent review configuration. However, a user with an AI Consumer profile and the Write project content permission will not.