# Agent Evaluation
Agent Evaluation allows you to automatically test your Weni agents by defining test plans with steps and expected results. An evaluator interacts with your agent and judges whether the responses meet the expected criteria.
## How it works
The evaluation flow follows these stages:
- Initialization: The evaluator reads your test plan (`agent_evaluation.yml`)
- Test execution: For each test case, the evaluator sends prompts to your agent and collects responses
- Judgment: The evaluator analyzes the conversation and determines if the expected results were observed
- Report: Results are displayed in a summary table and a markdown report is saved
## Getting started
### Prerequisites
Before running evaluations, make sure you have:
- Weni CLI installed and authenticated (`weni login`)
- A project selected (`weni project use <project-uuid>`)
- An active agent configured in your Weni project
### Initialize an evaluation plan
Create a default `agent_evaluation.yml` file in the current directory:
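Use the `eval init` subcommand (referenced again in Troubleshooting below):

```shell
weni eval init
```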
You can also specify a directory:
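For example (passing the target directory as an argument is an assumption, and `./my-agent-tests` is a hypothetical path; check `weni eval init --help` for the exact form):

```shell
weni eval init ./my-agent-tests
```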
This generates a starter plan like:
```yaml
tests:
  greeting:
    steps:
      - Send a greeting message to the agent
    expected_results:
      - Agent responds with a friendly greeting
```
You only need to define your test scenarios. The evaluator model and authentication are automatically handled by the Weni CLI.
## Plan file structure
The `agent_evaluation.yml` file contains your test definitions. Each test has a unique key and contains:
| Field | Type | Required | Description |
|---|---|---|---|
| `steps` | list of strings | Yes | The sequence of actions/messages to send to the agent. |
| `expected_results` | list of strings | Yes | The criteria used to judge the agent's responses. |
## Writing tests
### Single-turn test
A simple test with one message and one expected result:
```yaml
tests:
  greeting:
    steps:
      - Send a greeting "Hello!"
    expected_results:
      - Agent responds with a friendly greeting
```
### Multi-turn test
A test with multiple messages to verify conversation context:
```yaml
tests:
  multi_turn_conversation:
    steps:
      - Ask "What are your business hours?"
      - Follow up with "And on weekends?"
    expected_results:
      - Agent provides business hours for weekdays
      - Agent maintains context and provides weekend hours
```
### Multiple expected results
You can define multiple criteria that must all be met:
```yaml
tests:
  product_inquiry:
    steps:
      - Ask "What products do you offer?"
    expected_results:
      - Agent provides information about available products
      - Response includes clear product descriptions
      - Agent offers to help with specific product questions
```
### Error handling test
Test how the agent handles unexpected inputs:
```yaml
tests:
  error_handling:
    steps:
      - Send an unclear message "xyz123 !!!"
    expected_results:
      - Agent handles the unclear input gracefully
      - Agent asks for clarification or provides guidance
```
## Running evaluations
### Run all tests
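Assuming the run subcommand is `weni eval run` (only `weni eval init` appears verbatim in this guide):

```shell
weni eval run
```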
### Run specific tests
Use `--filter` to run only selected tests (comma-separated):
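For example, to run only the `greeting` and `error_handling` tests (assuming the `weni eval run` form):

```shell
weni eval run --filter greeting,error_handling
```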
### Verbose output
Use `--verbose` to see detailed reasoning for each test result:
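For example (assuming the `weni eval run` form):

```shell
weni eval run --verbose
```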
### Custom plan directory
If your plan file is in a different directory:
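Point the CLI at it with `--plan-dir` (the `weni eval run` form is an assumption, and `./my-agent-tests` is a hypothetical path):

```shell
weni eval run --plan-dir ./my-agent-tests
```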
## Understanding results
After running, the CLI displays a results table:
| Test | Status |
|---|---|
| greeting | PASS |
| product_inquiry | PASS |
| error_handling | FAIL |
A markdown summary report is automatically saved to the `evaluation_results/` directory with a timestamp (e.g., `summary_20260326_190242.md`).
When using `--verbose`, the reasoning column shows the evaluator's explanation for each test verdict.
## Complete example
```yaml
tests:
  greeting:
    steps:
      - Send "Hello, good morning!"
    expected_results:
      - Agent responds with a friendly greeting
      - Agent introduces itself or explains its capabilities

  product_inquiry:
    steps:
      - Ask "What products do you have available?"
    expected_results:
      - Agent provides information about available products
      - Response includes clear product descriptions or categories

  multi_turn_conversation:
    steps:
      - Ask "What are your business hours?"
      - Follow up with "And on weekends?"
    expected_results:
      - Agent provides business hours for weekdays
      - Agent maintains context and provides weekend hours
      - Responses are coherent and contextual

  error_handling:
    steps:
      - Send an unclear message "xyz123 !!!"
    expected_results:
      - Agent handles the unclear input gracefully
      - Agent asks for clarification or provides guidance
```
## Troubleshooting
### "Could not find agent_evaluation.yml"
- Make sure you ran `weni eval init` first, or specify the correct directory with `--plan-dir`.
### Authentication errors (401)
- Run `weni login` to refresh your token.
- Verify you have a project selected with `weni project current`.
### Evaluation timeout
- Your agent may need more time to respond. Check if the agent is active and properly configured in the Weni platform.
### Tests failing unexpectedly
- Use `--verbose` to see the evaluator's reasoning.
- Make sure your `expected_results` are clear and specific.
- Verify your agent is responding correctly by testing it manually in the Weni platform first.