# Agent Evaluation - Weni Fork

> **Note:** This is a fork of the original Agent Evaluation framework by AWS Labs. This fork adds support for testing Weni conversational AI agents while maintaining all of the original functionality for AWS services.
Agent Evaluation is a generative AI-powered framework for testing virtual agents. Internally, it implements an LLM agent (the evaluator) that orchestrates conversations with your own agent (the target) and evaluates its responses throughout the conversation.
## ✨ Key features
- **🤖 Weni Agent Support**: built-in support for testing Weni conversational AI agents through their API and WebSocket interface.
- Built-in support for popular AWS services, including Amazon Bedrock, Amazon Q Business, and Amazon SageMaker. You can also bring your own agent to test with Agent Evaluation.
- Orchestrate concurrent, multi-turn conversations with your agent while evaluating its responses.
- Define hooks to perform additional tasks such as integration testing.
- Can be incorporated into CI/CD pipelines to shorten delivery time while keeping production agents stable.
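The upstream Agent Evaluation framework implements hooks as classes with `pre_evaluate` and `post_evaluate` methods that run before and after a test. The sketch below follows that convention, but the names, signatures, and argument types here are illustrative; check the installed package's documentation for the actual base class to subclass.

```python
# Illustrative hook sketch. The real framework passes its own test,
# test result, and trace objects; plain dicts stand in for them here
# so the example is self-contained.

class IntegrationHook:
    """Runs extra checks around a single evaluation test."""

    def pre_evaluate(self, test, trace):
        # Seed fixtures or reset external state before the conversation.
        return f"seeded fixtures for {test['name']}"

    def post_evaluate(self, test, test_result, trace):
        # Verify an external side effect (e.g. a ticket or database row
        # created by the agent) after the conversation finishes.
        return test_result.get("passed", False)
```

A hook like this is where integration assertions live: the conversational checks stay in `expected_results`, while side effects on external systems are verified in `post_evaluate`.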
## 🚀 Quick Start with Weni
### Installation

Install the package from PyPI:

```shell
pip install weni-agenteval
```
### Prerequisites

> **Important:** You need both AWS and Weni credentials to run evaluations!

To test Weni agents, you'll need:

1. **AWS Credentials**: required for the evaluator (Claude model via Bedrock)
   - AWS Access Key ID
   - AWS Secret Access Key
   - AWS Session Token
2. **A Weni Project**: an active project in the Weni platform
3. **Weni Authentication**: choose one of the following methods:
#### 🔑 Option 1: Weni CLI (Recommended)

Install and authenticate with the Weni CLI:

```shell
# Install the Weni CLI
pip install weni-cli

# Authenticate with Weni
weni login

# Select your project
weni project use [your-project-uuid]
```
#### 🔧 Option 2: Environment Variables

Set these environment variables manually:

- `WENI_PROJECT_UUID`: your project's unique identifier
- `WENI_BEARER_TOKEN`: your authentication bearer token
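For example, in a shell session (placeholder values shown; substitute your own project UUID and token):

```shell
# Placeholder values; replace with your real project UUID and bearer token
export WENI_PROJECT_UUID="00000000-0000-0000-0000-000000000000"
export WENI_BEARER_TOKEN="replace-with-your-token"
```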
### Basic Usage

Create a test configuration file `agenteval.yml`:

```yaml
evaluator:
  model: claude-haiku-4_5-global # or claude-sonnet-4_5-global, claude-haiku-3_5-us
  aws_region: us-east-1

target:
  type: weni

tests:
  greeting:
    steps:
      - Send the greeting "Olá, bom dia!"
      - Ask "com oq vc pode me ajudar?"
    expected_results:
      - Agent responds with a friendly greeting
      - Agent shows a menu with options to help the user
  purchase_outside_postal_code:
    steps:
      - Say "quero comprar arroz"
      - Give the postal code "04538-132"
    expected_results:
      - Agent asks for the postal code
      - Agent says it doesn't deliver to this postal code
```
Run the evaluation:

```shell
weni-agenteval run
```

For real-time monitoring of conversations, use watch mode:

```shell
weni-agenteval run --watch
```
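Since the CLI is non-interactive, evaluations can run in CI/CD. Below is one possible GitHub Actions job, assuming the AWS and Weni credentials are stored as repository secrets and `agenteval.yml` sits at the repository root; the workflow name and secret names are illustrative.

```yaml
name: agent-evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}
      WENI_PROJECT_UUID: ${{ secrets.WENI_PROJECT_UUID }}
      WENI_BEARER_TOKEN: ${{ secrets.WENI_BEARER_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install weni-agenteval
      - run: weni-agenteval run  # reads agenteval.yml at the repo root
```

A failing evaluation fails the job, so regressions in agent behavior block the pull request.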
- **📖 Getting started**: create your first Weni agent test.
- **🎯 Weni Target Configuration**: learn how to configure your Weni agent for testing.
- **✍️ Writing test cases**: learn how to write effective test cases for conversational AI.
- **Contribute**: review the contributing guidelines to get started!