# Agent Evaluation - Weni Fork

> **Note:** This is a fork of the original Agent Evaluation framework by AWS Labs. This fork adds support for testing Weni conversational AI agents while maintaining all of the original functionality for AWS services.
Agent Evaluation is a generative AI-powered framework for testing virtual agents. Internally, it implements an LLM agent (the evaluator) that orchestrates conversations with your own agent (the target) and evaluates its responses throughout the conversation.
## ✨ Key features
- **🤖 Weni Agent Support**: built-in support for testing Weni conversational AI agents through their API and WebSocket interface.
- Built-in support for popular AWS services, including Amazon Bedrock, Amazon Q Business, and Amazon SageMaker. You can also bring your own agent to test with Agent Evaluation.
- Orchestrate concurrent, multi-turn conversations with your agent while evaluating its responses.
- Define hooks to perform additional tasks such as integration testing.
- Can be incorporated into CI/CD pipelines to shorten delivery time while keeping production agents stable.
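The upstream Agent Evaluation framework implements hooks as classes with `pre_evaluate` and `post_evaluate` methods that run before and after a test. The sketch below follows that convention, but the names, signatures, and argument types here are illustrative; check the installed package's documentation for the actual base class to subclass.

```python
# Illustrative hook sketch. The real framework passes its own test,
# test result, and trace objects; plain dicts stand in for them here
# so the example is self-contained.

class IntegrationHook:
    """Runs extra checks around a single evaluation test."""

    def pre_evaluate(self, test, trace):
        # Seed fixtures or reset external state before the conversation.
        return f"seeded fixtures for {test['name']}"

    def post_evaluate(self, test, test_result, trace):
        # Verify an external side effect (e.g. a ticket or database row
        # created by the agent) after the conversation finishes.
        return test_result.get("passed", False)
```

A hook like this is where integration assertions live: the conversational checks stay in `expected_results`, while side effects on external systems are verified in `post_evaluate`.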
## 🚀 Quick Start with Weni
### Installation

Install the package from PyPI:

```shell
pip install weni-agenteval
```
### Prerequisites

> **Important:** You need both AWS and Weni credentials to run evaluations!

To test Weni agents, you'll need:

1. **AWS Credentials**: required for the evaluator (Claude model via Bedrock)
   - AWS Access Key ID
   - AWS Secret Access Key
   - AWS Session Token
2. **A Weni Project**: an active project in the Weni platform
3. **Weni Authentication**: choose one of the following methods:
#### 🔑 Option 1: Weni CLI (Recommended)

Install and authenticate with the Weni CLI:

```shell
# Install the Weni CLI
pip install weni-cli

# Authenticate with Weni
weni login

# Select your project
weni project use [your-project-uuid]
```
#### 🔧 Option 2: Environment Variables

Set these environment variables manually:

- `WENI_PROJECT_UUID`: your project's unique identifier
- `WENI_BEARER_TOKEN`: your authentication bearer token
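For example, in a shell session (placeholder values shown; substitute your own project UUID and token):

```shell
# Placeholder values; replace with your real project UUID and bearer token
export WENI_PROJECT_UUID="00000000-0000-0000-0000-000000000000"
export WENI_BEARER_TOKEN="replace-with-your-token"
```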
### Basic Usage

Create a test configuration file `agenteval.yml`:

```yaml
evaluator:
  model: claude-haiku-4_5-global # or claude-sonnet-4_5-global, claude-haiku-3_5-us
  aws_region: us-east-1

target:
  type: weni

tests:
  greeting:
    steps:
      - Send the greeting "Olá, bom dia!"
      - Ask "com oq vc pode me ajudar?"
    expected_results:
      - Agent responds with a friendly greeting
      - Agent shows a menu with options to help the user
  purchase_outside_postal_code:
    steps:
      - Say "quero comprar arroz"
      - Give the postal code "04538-132"
    expected_results:
      - Agent asks for the postal code
      - Agent says it doesn't deliver to this postal code
```
Run the evaluation:

```shell
weni-agenteval run
```

For real-time monitoring of conversations, use watch mode:

```shell
weni-agenteval run --watch
```
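Since the CLI is non-interactive, evaluations can run in CI/CD. Below is one possible GitHub Actions job, assuming the AWS and Weni credentials are stored as repository secrets and `agenteval.yml` sits at the repository root; the workflow name and secret names are illustrative.

```yaml
name: agent-evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}
      WENI_PROJECT_UUID: ${{ secrets.WENI_PROJECT_UUID }}
      WENI_BEARER_TOKEN: ${{ secrets.WENI_BEARER_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install weni-agenteval
      - run: weni-agenteval run  # reads agenteval.yml at the repo root
```

A failing evaluation fails the job, so regressions in agent behavior block the pull request.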
- **📖 Getting started**: create your first Weni agent test.
- **🎯 Weni Target Configuration**: learn how to configure your Weni agent for testing.
- **✍️ Writing test cases**: learn how to write effective test cases for conversational AI.
- **Contribute**: review the contributing guidelines to get started!