awesome-llm-eval
github.com/onejune2018/awesome-llm-eval ↗
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of foundation LLMs, aiming to explore the technical boundaries of generative AI.
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me typical emotional quotient (eq)-alignment ability evaluation benchmarks resources from awesome-llm-eval"
Installation instructions →
What's inside
Frameworks-for-Training
- Accelerate
🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
- Apache MXNet
Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
- Caffe
A fast open framework for deep learning.
- ColossalAI
An integrated large-scale model training system with efficient parallelization techniques.
- DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- Horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Anthropomorphic-Taxonomy
- AdvBench: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks
2023
- AgentHarm: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks
2024
- AGIEval: Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks
2023
- AIME: Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks
2024
- AIR-Bench: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks
2024
- AlignBench: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks
2023
LLMOps
- agenta
An LLMOps platform for building powerful LLM applications. It allows for easy experimentation and evaluation of different prompts, models, and workflows to construct robust applications.
- Arize-Phoenix
ML observability for LLMs, vision, language, and tabular models.
- BudgetML
Deploy ML inference services on a limited budget with less than 10 lines of code.
- Byzer-LLM
Byzer-LLM is a comprehensive large model infrastructure that supports capabilities related to large models, such as pre-training, fine-tuning, deployment, and serving. Byzer-Retrieval is a storage infrastructure specifically developed for large models, supporting batch import of various data sources, real-time single-item updates, and full-text, vector, and hybrid searches to facilitate data usage for Byzer-LLM. Byzer-SQL/Python offers user-friendly interactive APIs with a low barrier to entry for utilizing the aforementioned products.
- CometLLM
An open-source LLMOps platform for logging, managing, and visualizing LLM prompts and chains. It tracks prompt templates, variables, duration, token usage, and other metadata. It also scores prompt outputs and visualizes chat history in a single UI.
- deeplake
Stream large multimodal datasets to achieve near 100% GPU utilization. Query, visualize, and version control data. Access data without recalculating embeddings for model fine-tuning.
Datasets or Benchmarks
LLM-List
- Alpaca: Open-LLM
A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
- Alpaca-7B: Popular-LLM
alpaca
- Alpaca-LoRA-7B: Popular-LLM
2023
alpacalora
- Baichuan: Open-LLM
An open-source, commercially available large-scale language model developed by Baichuan Intelligent Technology following Baichuan-7B, containing 13 billion parameters. (20230715)
- Baize: Open-LLM
Baize is an open-source chat model trained with
- Baize-13B: Popular-LLM
xu2023baize
Other-Awesome-Lists
- Awesome ChatGPT
Curated list of resources for ChatGPT and GPT-3 from OpenAI.
- Awesome ChatGPT Prompts
A collection of prompt examples to be used with the ChatGPT model.
- awesome-chatgpt-prompts-zh
A Chinese collection of prompt examples to be used with the ChatGPT model.
- Awesome-Efficient-LLM
A curated list for Efficient Large Language Models.
- Awesome GPT
A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
- Awesome GPT-3
A collection of demos and articles about the
Demos
- Chat Arena: compare anonymous models side-by-side and vote for which one is better
An open-source AI LLM "anonymous" arena! Here, you can become a judge, score two model responses without knowing their identities, and after scoring, the true identities of the models will be revealed. Participants include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more.
Showing a sample of 333 resources. View the full list on GitHub →