awesome-ml-model-compression

Awesome machine learning model compression research papers, quantization, tools, and learning material.

GitHub Stars: 546
Curated Resources: 116


8-bit Optimizers via Block-wise Quantization (Quantization)
Accelerating Very Deep Convolutional Networks for Classification and Detection (Low Rank Approximation)
AddressNet: Shift-based Primitives for Efficient Convolutional Neural Networks (Architecture)
AMC: AutoML for model compression and acceleration on mobile devices (Pruning)
And the bit goes down: Revisiting the quantization of neural networks (Quantization)
A Simple and Effective Pruning Approach for Large Language Models (Pruning)
The popular approach known as magnitude pruning removes the smallest weights in a network, on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs, even weights close to zero, could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance. Why it matters: The ability to compress models without affecting their performance is becoming more important as mobile devices and personal computers become powerful enough to run them. [Code:
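To make the idea concrete, here is a minimal sketch of pruning that weighs each connection by both its weight and the size of the activation it multiplies. The scoring rule (absolute weight times the L2 norm of the matching input feature, compared within each output row) is an assumption chosen for illustration, as are the function names and the toy numbers; it is not taken from the entry's code link.

```python
from typing import List

Matrix = List[List[float]]


def column_norms(activations: Matrix) -> List[float]:
    """L2 norm of each input feature across a batch of calibration samples."""
    n_features = len(activations[0])
    return [
        sum(row[j] ** 2 for row in activations) ** 0.5
        for j in range(n_features)
    ]


def prune_by_weight_and_activation(
    weights: Matrix, activations: Matrix, sparsity: float
) -> Matrix:
    """Zero the lowest-scoring fraction of weights in each output row.

    Assumed score: |weights[i][j]| * L2 norm of input feature j.
    """
    norms = column_norms(activations)
    pruned = []
    for row in weights:
        scores = [abs(w) * n for w, n in zip(row, norms)]
        n_prune = int(len(row) * sparsity)
        # Indices of the n_prune smallest scores in this row.
        drop = set(sorted(range(len(row)), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy layer: 2 outputs, 4 inputs. Input feature 3 carries activations
# 20x larger than the others, mirroring the outlier outputs described above.
weights = [[0.9, -0.5, 0.4, 0.1],
           [0.2, 0.8, -0.3, -0.05]]
activations = [[1.0, 1.0, 1.0, 20.0],
               [-1.0, 1.0, -1.0, 20.0]]

pruned = prune_by_weight_and_activation(weights, activations, sparsity=0.5)
print(pruned)
```

In this toy case, plain magnitude pruning would drop the small weights 0.1 and -0.05 first. Under the assumed score they survive, because they multiply the large-activation feature.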

A foolproof way to shrink deep learning models (Assorted)
A pruning algorithm: train to completion, globally prune the 20% of weights with the lowest magnitudes (the weakest connections), retrain with
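The global pruning step in that description can be sketched as follows. This covers only the "prune the lowest 20% across the whole network" step, not training or retraining; the layer names, weights, and function name are hypothetical.

```python
from typing import Dict, List


def global_magnitude_prune(
    layers: Dict[str, List[float]], fraction: float = 0.2
) -> Dict[str, List[float]]:
    """Zero the `fraction` of weights with the smallest magnitude across all layers."""
    # Rank every weight in the network together, not layer by layer.
    ranked = sorted(
        (abs(w), name, i)
        for name, ws in layers.items()
        for i, w in enumerate(ws)
    )
    n_prune = int(len(ranked) * fraction)
    drop = {(name, i) for _, name, i in ranked[:n_prune]}
    return {
        name: [0.0 if (name, i) in drop else w for i, w in enumerate(ws)]
        for name, ws in layers.items()
    }


# Hypothetical trained weights for a tiny two-layer network (10 weights total).
trained = {
    "fc1": [0.8, -0.02, 0.5, -0.6, 0.3],
    "fc2": [0.01, -0.9, 0.4, 0.7, -0.2],
}
pruned = global_magnitude_prune(trained, fraction=0.2)
print(pruned)
```

Because the ranking is global, a layer whose weights are all small can lose more than 20% of its connections while another layer loses none.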
All The Ways You Can Compress BERT (Blogs)
An overview of compression methods for large NLP models (BERT) that groups them by their characteristics and compares their results.
A Visual Guide to Quantization (Blogs)
Demystifying the Compression of Large Language Models by Maarten Grootendorst (Jul 2024) - An approachable primer on quantization and on the quantization methods widely supported in tools and libraries, including GPTQ, GGUF, and BitNet (1-bit).
Breakdown of Nvidia H100s for Transformer Inferencing (Blogs)
Comparing Quantized Performance in Llama Models (Blogs)
8-bit quantization seems fine; for 4-bit, it depends. The post covers different quantization schemes including GGUF,
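As a rough illustration of why bit width matters, the sketch below rounds a few weights to signed integers and back, then measures the worst-case error at 8 and 4 bits. It uses plain symmetric absmax rounding, which is an assumption for illustration and not the GGUF scheme; the weights are made up.

```python
from typing import List


def quantize_roundtrip(values: List[float], bits: int) -> List[float]:
    """Symmetric absmax quantization to signed `bits`-bit integers, then back."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)  # all zeros: nothing to quantize
    return [round(v / scale) * scale for v in values]


def max_abs_error(values: List[float], bits: int) -> float:
    """Largest absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up weights, for illustration only.
weights = [0.91, -0.42, 0.07, -0.66, 0.18, 0.33, -0.05, 1.0]
err8 = max_abs_error(weights, bits=8)
err4 = max_abs_error(weights, bits=4)
print(f"8-bit max error: {err8:.4f}, 4-bit max error: {err4:.4f}")
```

The rounding error is bounded by half a quantization step, and that step is roughly 18x coarser at 4 bits (1/7 of the largest weight) than at 8 bits (1/127). Whether the extra error hurts a given model is what the comparison in the post is about.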
Comparison between quantization techniques and formats for LLMs (Blogs)

Bitsandbytes (Libraries)
facebookresearch/kill-the-bits (Paper Implementations)
Code and compressed models for the paper "And the bit goes down: Revisiting the quantization of neural networks" by Facebook AI Research.
NNCP (Libraries)
An experiment to build a practical lossless data compressor with neural networks. The latest version uses a Transformer model (slower, but with the best compression ratio). An LSTM model (faster) is also available.
TensorFlow Model Optimization Toolkit (Libraries)
XNNPACK (Libraries)

This page shows a sample of the 116 resources; the full list is on GitHub.