awesome-ai-eval
github.com/Vvkmnn/awesome-ai-eval
A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me safety resources from awesome-ai-eval"
What's inside
Benchmarks
- AdvBench (Safety)
Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
- AgentBench (Agent)
Evaluates LLMs acting as agents across simulated domains like games and coding.
- AGIEval (General)
Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
- ARC-AGI-2 (Reasoning)
Next-generation reasoning benchmark where pure LLMs score 0% but humans can solve every task.
- AstaBench (Agent)
AI2 benchmark for scientific research AI agents covering literature review, experiment replication, and data analysis.
- BBQ (Safety)
Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.
Platforms
- Agenta (Open Source Platforms)
End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
- Amazon Bedrock Evaluations (Cloud Platforms)
Managed service for scoring foundation models and RAG pipelines.
- Amazon Bedrock Guardrails (Cloud Platforms)
Safety layer that evaluates prompts and responses for policy compliance.
- Arize Phoenix (Open Source Platforms)
OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents.
- Azure AI Foundry Evaluations (Cloud Platforms)
Evaluation flows and risk reports wired into Prompt Flow projects.
- ChatIntel (Hosted Platforms)
Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.
Tools
- Agentrial (Evaluators and Test Harnesses)
Statistical evaluation framework that runs AI agents N times, computes confidence intervals, and detects regressions in CI/CD.
- Aleph Alpha Eval Framework (Evaluators and Test Harnesses)
Production-ready evaluation framework with 90+ pre-loaded benchmarks for reasoning, coding, and safety.
- AlpacaEval (Prompt Evaluation & Safety)
Automated instruction-following evaluator with length-controlled LLM judge scoring.
- Anthropic Model Evals (Evaluators and Test Harnesses)
Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
- ARTKIT (Red Teaming & Adversarial Testing)
Automated multi-turn red teaming framework that simulates attacker-target interactions for jailbreak testing.
- Athina AI (Evaluators and Test Harnesses)
SOC-2 compliant LLM evaluation and monitoring platform with 50+ preset evaluations and VPC deployment.
Resources
- AI Evals for Engineers & PMs (Guides & Training)
Cohort course from Hamel & Shreya with lifetime reader, Discord, AI Eval Assistant, and live office hours.
- AlignEval (Guides & Training)
Eugene Yan's guide to building LLM judges through a methodical alignment process.
- Applied LLMs (Guides & Training)
Practical lessons from a year of building with LLMs, emphasizing evaluation as a core practice.
- Arize Phoenix AI Chatbot (Examples)
Next.js chatbot with Phoenix tracing, dataset replays, and evaluation jobs.
- Awesome ChainForge (Related Collections)
Ecosystem list centered on ChainForge experiments and extensions.
- Awesome-LLM-Eval (Related Collections)
Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.
Leaderboards
- ARC Prize Leaderboard
AGI reasoning leaderboard tracking ARC-AGI-2 performance across frontier models and open submissions.
- CompassRank
OpenCompass leaderboard comparing frontier and research models across multi-domain suites.
- LLM Agents Benchmark Collections
Aggregated leaderboard comparing multi-agent safety and reliability suites.
- LMArena
Crowdsourced LLM comparison platform (formerly LMSYS Chatbot Arena) that derives Elo ratings from 6M+ user votes.
- OpenAI Evals Registry
Community suites and scores covering accuracy, safety, and instruction following.
- Open LLM Leaderboard
Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.
Showing a sample of 186 resources. The full list is on GitHub at github.com/Vvkmnn/awesome-ai-eval.