
Awesome-LLM-Eval: a curated list of tools, datasets and benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating LLMs. It focuses on foundation-model evaluation and aims to explore the technical boundaries of generative AI.

GitHub Stars: 642
Curated Resources: 333
Categories: 10
Last Refreshed: 22 hours ago
News | Anthropomorphic-Taxonomy | Datasets or Benchmarks | Demos | Papers | LLM-List | Frameworks-for-Training | LLMOps | Courses | Other-Awesome-Lists

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me typical emotional quotient (eq)-alignment ability evaluation benchmarks resources from awesome-llm-eval"


What's inside

Frameworks-for-Training

  • Accelerate

    🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.

  • Apache MXNet

    Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.

  • Caffe

    A fast open framework for deep learning.

  • ColossalAI

    An integrated large-scale model training system with efficient parallelization techniques.

  • DeepSpeed

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

  • Horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Anthropomorphic-Taxonomy

  • AdvBench: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks

    2023

  • AgentHarm: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks

    2024

  • AGIEval: Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks

    2023

  • AIME: Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks

    2024

  • AIR-Bench: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks

    2024

  • AlignBench: Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks

    2023

LLMOps

  • agenta

    An LLMOps platform for building powerful LLM applications. It allows for easy experimentation and evaluation of different prompts, models, and workflows to construct robust applications.

  • Arize-Phoenix

    ML observability for LLMs, vision, language, and tabular models.

  • BudgetML

    Deploy ML inference services on a limited budget with less than 10 lines of code.

  • Byzer-LLM

    Byzer-LLM is a full-stack infrastructure for large models, covering pre-training, fine-tuning, deployment, and serving. Byzer-Retrieval is a storage layer built for large models; it supports batch import from various data sources, real-time single-record updates, and full-text, vector, and hybrid search, making data available to Byzer-LLM. Byzer-SQL/Python provides interactive, low-barrier APIs for using these products.

  • CometLLM

    An open-source LLMOps platform for logging, managing, and visualizing LLM prompts and chains. It tracks prompt templates, variables, duration, token usage, and other metadata. It also scores prompt outputs and visualizes chat history in a single UI.

  • deeplake

    Stream large multimodal datasets to achieve near 100% GPU utilization. Query, visualize, and version control data. Access data without recalculating embeddings for model fine-tuning.

Datasets or Benchmarks

LLM-List

  • Alpaca: Open-LLM

    A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.

  • Alpaca-7B: Popular-LLM

    alpaca

  • Alpaca-LoRA-7B: Popular-LLM

    2023alpacalora

  • Baichuan: Open-LLM

    An open-source large language model, available for commercial use, developed by Baichuan Intelligent Technology as the successor to Baichuan-7B. It contains 13 billion parameters. (2023-07-15)

  • Baize: Open-LLM

    Baize is an open-source chat model trained with

  • Baize-13B: Popular-LLM

    xu2023baize

Other-Awesome-Lists

Demos

  • Chat Arena: compare anonymous models side by side and vote for the better one

    An open-source "anonymous" LLM arena. You act as the judge: score two model responses without knowing which models produced them, and the models' identities are revealed after you score. Participants include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more.
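
The blind side-by-side voting described for Chat Arena can be sketched in a few lines of Python. This is only an illustration of the idea, not the arena's actual implementation: the function names `blind_pair` and `record_vote`, the placeholder response texts, and the simple win tally are all assumptions made for this example.

```python
import random
from collections import Counter


def blind_pair(responses, rng):
    """Pick two distinct models and expose their responses under anonymous labels."""
    model_a, model_b = rng.sample(sorted(responses), 2)
    shown = {"A": responses[model_a], "B": responses[model_b]}  # what the judge sees
    hidden = {"A": model_a, "B": model_b}  # kept secret until after the vote
    return shown, hidden


def record_vote(tally, hidden, choice):
    """Reveal the identity behind the chosen label and credit that model with a win."""
    winner = hidden[choice]
    tally[winner] += 1
    return winner


# Hypothetical data: the model names appear in the list above,
# the response texts are placeholders.
responses = {
    "Vicuna": "response text 1",
    "Koala": "response text 2",
    "Alpaca": "response text 3",
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tally = Counter()
shown, hidden = blind_pair(responses, rng)
# The judge sees only `shown` (labels "A" and "B") and picks one label.
winner = record_vote(tally, hidden, "A")
print(f"Revealed: A={hidden['A']}, B={hidden['B']}; vote credited to {winner}")
```

Keeping `shown` and `hidden` as separate mappings makes the anonymity explicit: the judge-facing data never contains a model name, and identities are only looked up once a vote has been cast.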

Showing a sample of 333 resources. View the full list on GitHub →