awesome-production-machine-learning

A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning

21k GitHub Stars · 545 Curated Resources

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me computation and communication optimisation resources from awesome-production-machine-learning"

Accelerate
Accelerate abstracts exactly and only the boilerplate code related to multi-GPU/TPU/mixed-precision and leaves the rest of your code unchanged.
Adapters
Adapters is a unified library for parameter-efficient and modular transfer learning.
BitBLAS
BitBLAS is a library to support mixed-precision BLAS operations on GPUs.
bitsandbytes
The bitsandbytes library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8-bit and 4-bit quantization functions.
Cache-DiT
Cache-DiT is built on top of Diffusers and supports nearly all DiTs, providing hybrid cache acceleration (DBCache, TaylorSeer, SCM, etc.) and comprehensive parallelism optimizations including Context Parallelism, Tensor Parallelism, and hybrid 2D/3D parallelism, with compatibility for compilation, CPU offloading, and quantization.
Colossal-AI
A unified deep learning system for the big-model era that helps users deploy large AI model training and inference efficiently and quickly.

Acme
Acme is a library of reinforcement learning (RL) building blocks that strives to expose simple, efficient, and readable agents.
AReaL
AReaL is a reinforcement learning library.
ChatLearn
ChatLearn is a flexible and efficient reinforcement learning training framework for large language models, supporting distributed training engines (FSDP2, Megatron) and inference engines (vLLM, SGLang) with modern RL algorithms such as GRPO and GSPO.
CleanRL
CleanRL is a deep reinforcement learning library that provides high-quality single-file implementations with research-friendly features. The implementations are clean and simple, yet they can scale to run thousands of experiments using AWS Batch.
CompilerGym
CompilerGym is a library of easy-to-use and performant reinforcement learning environments for compiler tasks.
d3rlpy
d3rlpy is an offline deep reinforcement learning library for practitioners and researchers.

Aequitas
An open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive risk-assessment tools.
AI Explainability 360
Interpretability and explainability of data and machine learning models including a comprehensive set of algorithms that cover different dimensions of explanations along with proxy explainability metrics.
AI Fairness 360
A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
Alibi
Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The initial focus of the library is on black-box, instance-based model explanations.
captum
Model interpretability and understanding library for PyTorch developed by Facebook. It contains general-purpose implementations of integrated gradients, saliency maps, SmoothGrad, VarGrad and others for PyTorch models.

Agenta
Agenta provides end-to-end tools for the entire LLMOps workflow: building (LLM playground, evaluation), deploying (prompt and configuration management), and monitoring (LLM observability and tracing).
AirLLM
AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation or pruning.
AITemplate
AITemplate (AIT) is a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving.
BentoML
BentoML is an open source framework for high performance ML model serving.
BISHENG
BISHENG is an open LLM application devops platform, focusing on enterprise scenarios.
DeepDetect
Machine learning production server for TensorFlow, XGBoost and Caffe models, written in C++ and maintained by Jolibrain.

AI2-THOR
AI2-THOR is a near photo-realistic interactable framework for AI agents.

AIDE
AIDE is an open-source ML engineering agent that uses a tree search algorithm to autonomously explore, implement, and evaluate solution strategies for machine learning tasks.
AutoGluon
Automated feature, model, and hyperparameter selection for tabular, image, and text data on top of popular machine learning libraries (Scikit-Learn, LightGBM, CatBoost, PyTorch, MXNet).
Autokeras
AutoML library for Keras based on
auto-sklearn
Framework to automate algorithm and hyperparameter tuning for sklearn.
Ax
Ax is an accessible, general-purpose platform for understanding, managing, deploying, and automating adaptive experiments.
BoTorch
BoTorch is a library for Bayesian Optimization built on PyTorch.

AI Gateway
The AI Gateway is a blazing fast AI Gateway with integrated guardrails.
ART
ART (Adversarial Robustness Toolbox) provides tools that enable developers and researchers to defend and evaluate Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference.
Awesome Agentic Engineering Resources
A curated collection of resources, tools, and references for building agentic AI systems.
Awesome AI Gateway
Curated, bilingual (EN/zh-CN) list of AI gateways and LLM proxies (LiteLLM, OpenRouter, Portkey, Kong, Higress, new-api) compared by cost, compliance, self-hosting and routing, with a decision tree, reproducible cost benchmarks and a selection scorecard.
Awesome AI Regulation
Covers governance, compliance, and regulatory frameworks essential for responsible ML system deployment across different jurisdictions.
Awesome Production GenAI
Focuses specifically on generative AI deployment, including LLM operations, prompt engineering, and GenAI-specific monitoring and safety tools.

Aim
A super-easy way to record, search and compare AI experiments.
ClearML
Auto-Magical Experiment Manager & Version Control for AI (previously Trains).
DataHub
DataHub is an open-source data catalog for the modern data stack.
Dolt
Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository.
DVC
DVC (Data Version Control) is a version control tool that works alongside Git to manage versions of data and models.

Showing a sample of 545 resources.