☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications

83 GitHub Stars
186 Curated Resources
5 Categories
Last Refreshed: 4 hours ago

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me safety resources from awesome-ai-eval"

Installation instructions →

What's inside

Benchmarks

  • AdvBench (Safety)

    Adversarial prompt benchmark for jailbreak and misuse resistance measurement.

  • AgentBench (Agent)

    Evaluates LLMs acting as agents across simulated domains like games and coding.

  • AGIEval (General)

    Human-centric standardized exams spanning entrance tests, legal, and math scenarios.

  • ARC-AGI-2 (Reasoning)

    Next-generation reasoning benchmark where pure LLMs score 0% but humans can solve every task.

  • AstaBench (Agent)

    AI2 benchmark for scientific research AI agents covering literature review, experiment replication, and data analysis.

  • BBQ (Safety)

    Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.

Platforms

  • Agenta (Open Source Platforms)

    End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.

  • Amazon Bedrock Evaluations (Cloud Platforms)

    Managed service for scoring foundation models and RAG pipelines.

  • Amazon Bedrock Guardrails (Cloud Platforms)

    Safety layer that evaluates prompts and responses for policy compliance.

  • Arize Phoenix (Open Source Platforms)

    OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents.

  • Azure AI Foundry Evaluations (Cloud Platforms)

    Evaluation flows and risk reports wired into Prompt Flow projects.

  • ChatIntel (Hosted Platforms)

    Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.

Tools

  • Agentrial (Evaluators and Test Harnesses)

    Statistical evaluation framework that runs AI agents N times, computes confidence intervals, and detects regressions in CI/CD.

  • Aleph Alpha Eval Framework (Evaluators and Test Harnesses)

    Production-ready evaluation framework with 90+ pre-loaded benchmarks for reasoning, coding, and safety.

  • AlpacaEval (Prompt Evaluation & Safety)

    Automated instruction-following evaluator with length-controlled LLM judge scoring.

  • Anthropic Model Evals (Evaluators and Test Harnesses)

    Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.

  • ARTKIT (Red Teaming & Adversarial Testing)

    Automated multi-turn red teaming framework that simulates attacker-target interactions for jailbreak testing.

  • Athina AI (Evaluators and Test Harnesses)

    SOC-2 compliant LLM evaluation and monitoring platform with 50+ preset evaluations and VPC deployment.

Resources

  • AI Evals for Engineers & PMs (Guides & Training)

    Cohort course from Hamel Husain & Shreya Shankar with lifetime reader, Discord, AI Eval Assistant, and live office hours.

  • AlignEval (Guides & Training)

    Eugene Yan's guide on building LLM judges by following methodical alignment processes.

  • Applied LLMs (Guides & Training)

    Practical lessons from a year of building with LLMs, emphasizing evaluation as a core practice.

  • Arize Phoenix AI Chatbot (Examples)

    Next.js chatbot with Phoenix tracing, dataset replays, and evaluation jobs.

  • Awesome ChainForge (Related Collections)

    Ecosystem list centered on ChainForge experiments and extensions.

  • Awesome-LLM-Eval (Related Collections)

    Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.

Leaderboards

  • ARC Prize Leaderboard

    AGI reasoning leaderboard tracking ARC-AGI-2 performance across frontier models and open submissions.

  • CompassRank

    OpenCompass leaderboard comparing frontier and research models across multi-domain suites.

  • LLM Agents Benchmark Collections

    Aggregated leaderboard comparing multi-agent safety and reliability suites.

  • LMArena

    Crowdsourced LLM comparison platform (formerly LMSYS Chatbot Arena) that computes Elo ratings from 6M+ user votes.

  • OpenAI Evals Registry

    Community suites and scores covering accuracy, safety, and instruction following.

  • Open LLM Leaderboard

    Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.

Showing a sample of 186 resources. View the full list on GitHub →