awesome-llm-eval

github.com/onejune2018/awesome-llm-eval

Awesome-LLM-Eval: a curated list of tools, datasets and benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs. It focuses on evaluating foundation large models and aims to explore the technical boundaries of generative AI.

653 GitHub Stars

333 Curated Resources

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me typical emotional quotient (eq)-alignment ability evaluation benchmarks resources from awesome-llm-eval"


What's inside

Frameworks-for-Training

Accelerate
🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
Apache MXNet
Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
Caffe
A fast open framework for deep learning.
ColossalAI
An integrated large-scale model training system with efficient parallelization techniques.
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Anthropomorphic-Taxonomy

AdvBench
Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks, 2023
AgentHarm
Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks, 2024
AGIEval
Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks, 2023
AIME
Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks, 2024
AIR-Bench
Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks, 2024
AlignBench
Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks, 2023

LLMOps

agenta
An LLMOps platform for building powerful LLM applications. It allows for easy experimentation and evaluation of different prompts, models, and workflows to construct robust applications.
Arize-Phoenix
ML observability for LLMs, vision, language, and tabular models.
BudgetML
Deploy ML inference services on a limited budget with less than 10 lines of code.
Byzer-LLM
Byzer-LLM is a comprehensive large model infrastructure that supports capabilities related to large models, such as pre-training, fine-tuning, deployment, and serving. Byzer-Retrieval is a storage infrastructure specifically developed for large models, supporting batch import of various data sources, real-time single-item updates, and full-text, vector, and hybrid searches to facilitate data usage for Byzer-LLM. Byzer-SQL/Python offers user-friendly interactive APIs with a low barrier to entry for utilizing the aforementioned products.
CometLLM
An open-source LLMOps platform for logging, managing, and visualizing LLM prompts and chains. It tracks prompt templates, variables, duration, token usage, and other metadata. It also scores prompt outputs and visualizes chat history in a single UI.
deeplake
Stream large multimodal datasets to achieve near 100% GPU utilization. Query, visualize, and version control data. Access data without recalculating embeddings for model fine-tuning.

Datasets or Benchmarks

AgentBench Reasoning and Decision-making Evaluation Leaderboard
Agent-Capabilities, THUDM
ARES
RAG-Evaluation, Stanford
BERGEN
RAG-Evaluation, NAVER
BLURB
Domain, Mindrank AI
ChartVLM
Multi-modal/Cross-modal, Shanghai AI Lab
CRAG
RAG-Evaluation, Meta Reality Labs

LLM-List

Alpaca
Open-LLM. A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
Alpaca-7B
Popular-LLM, alpaca
Alpaca-LoRA-7B
Popular-LLM, 2023alpacalora
Baichuan
Open-LLM. An open-source large language model, available for commercial use, developed by Baichuan Intelligent Technology as the successor to Baichuan-7B, with 13 billion parameters. (20230715)
Baize
Open-LLM. Baize is an open-source chat model trained with
Baize-13B
Popular-LLM, xu2023baize

News

Other-Awesome-Lists

Awesome ChatGPT
Curated list of resources for ChatGPT and GPT-3 from OpenAI.
Awesome ChatGPT Prompts
A collection of prompt examples to be used with the ChatGPT model.
awesome-chatgpt-prompts-zh
A Chinese collection of prompt examples to be used with the ChatGPT model.
Awesome-Efficient-LLM
A curated list for Efficient Large Language Models.
Awesome GPT
A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
Awesome GPT-3
a collection of demos and articles about the

Demos

Chat Arena: chat with anonymous models side-by-side and vote for which one is better
An open-source AI LLM "anonymous" arena! Here, you can become a judge, score two model responses without knowing their identities, and after scoring, the true identities of the models will be revealed. Participants include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more.

Showing a sample of 333 resources. View the full list on GitHub.