awesome-ai-agent-testing

github.com/chaosync-org/awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents: frameworks, methodologies, benchmarks, tools, and best practices for building reliable, safe, and effective autonomous AI systems.

155 Curated Resources

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me videos and courses resources from awesome-ai-agent-testing"


What's inside

Practical Resources

Advanced Testing Techniques (Videos and Courses)
MIT OpenCourseWare
Agent Testing Best Practices (Tutorials and Guides)
Industry guidelines
Agent Testing Examples (Code Repositories)
Collection of test cases
AI Agents Testing 101 (Tutorials and Guides)
Beginner's guide to agent testing
AI Agent Testing Fundamentals (Videos and Courses)
6-hour comprehensive course
Air Canada Chatbot Hallucination Case (Case Studies)
Chatbot provided an incorrect refund policy, leading to legal liability.

Safety and Security Testing

Adversarial Robustness Toolbox (Adversarial Testing)
IBM's toolkit for ML security
AI Safety Benchmark (Red Teaming)
Comprehensive safety evaluation
AI Safety Gridworlds (Safety Evaluation)
DeepMind's safety testing environments
Alignment Research Center Evals (Safety Evaluation)
Alignment-focused evaluations
Anthropic Red Team Dataset (Red Teaming)
Curated red team prompts
CleverHans (Adversarial Testing)
Library for adversarial example generation

Benchmarks and Evaluation

AgentBench (Datasets)
Comprehensive benchmark across 8 distinct environments with 27+ models tested.
AgentBench Leaderboard (Leaderboards)
Multi-environment agent rankings
ALFWorld (Datasets)
Text-based embodied agents in interactive environments.
BIG-bench (Leaderboards)
Beyond the Imitation Game benchmark
EleutherAI LM Evaluation Harness (Evaluation Frameworks)
Framework for few-shot evaluation
GAIA Benchmark (Datasets)
General AI Assistant benchmark for fundamental agent capabilities.

Foundations

AgentBench: Evaluating LLMs as Agents (Academic Papers)
Comprehensive benchmark suite for evaluating LLM-based agents across diverse environments.
Artificial Intelligence: A Modern Approach (Books and Textbooks)
Classic textbook with chapters on agent testing and evaluation.
A Survey of LLM-based Autonomous Agents (Surveys and Reviews)
Extensive survey covering construction, application, and evaluation of LLM-based autonomous agents.
A Survey on Evaluation of Large Language Model Based Agents (Surveys and Reviews)
Systematic review of evaluation methods for LLM-based agents.
Benchmarking of AI Agents: A Perspective (Surveys and Reviews)
Industry perspective on the critical role of benchmarking in accelerating AI agent adoption.
Evaluating AI Agent Performance With Benchmarks (Academic Papers)
Comprehensive guide on evaluating AI agents in real-world scenarios with practical examples and metrics.

Testing Frameworks

Agent-Testing-Library (Language-Specific Tools)
Testing utilities for JS agents.
AgentTestKit (Language-Specific Tools)
Comprehensive testing toolkit for Java agents.
AgentVerse (Open Source Frameworks)
Framework for building and testing multi-agent systems.
API-Bank (Category-Specific Testing Tools)
Tool-augmented LLM evaluation
Athina AI (Commercial Solutions)
Specialized platform for LLM and agent evaluation.
AutoGen (Open Source Frameworks)
Microsoft's framework for building conversational agents with comprehensive testing tools.

Simulation Environments

AI2-THOR (Virtual Worlds)
Interactive 3D environments
CARLA (Virtual Worlds)
Autonomous driving simulation
Dota 2 Bot API (Game-Based Environments)
Complex multi-agent environment
Habitat (Virtual Worlds)
Platform for embodied AI research
Meta-World (Dynamic Testing Environments)
Benchmark for multi-task RL
MineDojo (Virtual Worlds)
Minecraft-based agent environment

Performance Testing

Apache JMeter (Load Testing)
Comprehensive testing tool
Jaeger (Latency Analysis)
Distributed tracing system
K6 (Load Testing)
Modern load testing tool
Kubernetes (Scalability Testing)
Container orchestration
Locust (Load Testing)
Scalable load testing framework
OpenTelemetry (Latency Analysis)
Observability framework

Observability and Monitoring

Arize AI (Production Monitoring Platforms)
ML observability with LLM support.
Galileo (Production Monitoring Platforms)
LLM observability and evaluation.
LangFuse (Production Monitoring Platforms)
Open-source LLM observability platform.
OpenTelemetry GenAI Convention (Logging Standards)
Emerging standard for AI observability.
WhyLabs (Production Monitoring Platforms)
AI observability platform.

Showing a sample of 155 resources. View the full list on GitHub.