awesome-ai-eval

☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications

186 Curated Resources

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me safety resources from awesome-ai-eval"

AdvBench (Safety)
Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
AgentBench (Agent)
Evaluates LLMs acting as agents across simulated domains like games and coding.
AGIEval (General)
Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
ARC-AGI-2 (Reasoning)
Next-generation reasoning benchmark where pure LLMs score 0% but humans can solve every task.
AstaBench (Agent)
AI2 benchmark for scientific research AI agents covering literature review, experiment replication, and data analysis.
BBQ (Safety)
Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.

Agenta (Open Source Platforms)
End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
Amazon Bedrock Evaluations (Cloud Platforms)
Managed service for scoring foundation models and RAG pipelines.
Amazon Bedrock Guardrails (Cloud Platforms)
Safety layer that evaluates prompts and responses for policy compliance.
Arize Phoenix (Open Source Platforms)
OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents.
Azure AI Foundry Evaluations (Cloud Platforms)
Evaluation flows and risk reports wired into Prompt Flow projects.
ChatIntel (Hosted Platforms)
Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.

Agentrial (Evaluators and Test Harnesses)
Statistical evaluation framework that runs AI agents N times, computes confidence intervals, and detects regressions in CI/CD.
Aleph Alpha Eval Framework (Evaluators and Test Harnesses)
Production-ready evaluation framework with 90+ pre-loaded benchmarks for reasoning, coding, and safety.
AlpacaEval (Prompt Evaluation & Safety)
Automated instruction-following evaluator with length-controlled LLM judge scoring.
Anthropic Model Evals (Evaluators and Test Harnesses)
Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
ARTKIT (Red Teaming & Adversarial Testing)
Automated multi-turn red teaming framework that simulates attacker-target interactions for jailbreak testing.
Athina AI (Evaluators and Test Harnesses)
SOC-2 compliant LLM evaluation and monitoring platform with 50+ preset evaluations and VPC deployment.

AI Evals for Engineers & PMs (Guides & Training)
Cohort course from Hamel & Shreya with lifetime reader, Discord, AI Eval Assistant, and live office hours.
AlignEval (Guides & Training)
Eugene Yan's guide on building LLM judges by following methodical alignment processes.
Applied LLMs (Guides & Training)
Practical lessons from a year of building with LLMs, emphasizing evaluation as a core practice.
Arize Phoenix AI Chatbot (Examples)
Next.js chatbot with Phoenix tracing, dataset replays, and evaluation jobs.
Awesome ChainForge (Related Collections)
Ecosystem list centered on ChainForge experiments and extensions.
Awesome-LLM-Eval (Related Collections)
Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.

ARC Prize Leaderboard
AGI reasoning leaderboard tracking ARC-AGI-2 performance across frontier models and open submissions.
CompassRank
OpenCompass leaderboard comparing frontier and research models across multi-domain suites.
LLM Agents Benchmark Collections
Aggregated leaderboard comparing multi-agent safety and reliability suites.
LMArena
Crowdsourced LLM comparison platform (formerly LMSYS Chatbot Arena) with 6M+ user votes for Elo ratings.
OpenAI Evals Registry
Community suites and scores covering accuracy, safety, and instruction following.
Open LLM Leaderboard
Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.

Showing a sample of 186 resources. View the full list on GitHub →