awesome-ml-model-compression
github.com/cedrickchee/awesome-ml-model-compression
Awesome machine learning model compression research papers, quantization, tools, and learning material.
What's inside
Papers
- 8-bit Optimizers via Block-wise Quantization (Quantization)
- Accelerating Very Deep Convolutional Networks for Classification and Detection (Low Rank Approximation)
- AddressNet: Shift-based Primitives for Efficient Convolutional Neural Networks (Architecture)
- AMC: AutoML for model compression and acceleration on mobile devices (Pruning)
- And the bit goes down: Revisiting the quantization of neural networks (Quantization)
- A Simple and Effective Pruning Approach for Large Language Models (Pruning)
Magnitude pruning, the popular approach, removes the smallest weights in a network on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, however, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs, even weights close to zero, could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance. Why it matters: The ability to compress models without affecting their performance is becoming more important as mobile devices and personal computers become powerful enough to run them. [Code:
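The contrast between magnitude pruning and pruning that also considers intermediate-layer outputs can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's released code: it assumes each weight is scored by its absolute value times the L2 norm of the input feature it multiplies, and that weights are compared within each output row. The function names and the toy data are hypothetical.

```python
import numpy as np

def _prune_rows(W, scores, sparsity):
    """Zero the lowest-scoring fraction `sparsity` of weights in each output row."""
    k = int(W.shape[1] * sparsity)  # number of weights to drop per row
    pruned = W.copy()
    if k > 0:
        drop = np.argsort(scores, axis=1)[:, :k]  # column indices of the lowest scores
        np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

def magnitude_prune(W, sparsity):
    """Baseline: score each weight by |w| alone."""
    return _prune_rows(W, np.abs(W), sparsity)

def activation_aware_prune(W, X, sparsity):
    """Assumed scoring rule: |w| times the L2 norm of the matching input feature."""
    feature_norms = np.linalg.norm(X, axis=0)  # one norm per input feature
    return _prune_rows(W, np.abs(W) * feature_norms, sparsity)

# Toy layer: W has shape (outputs, inputs); X has shape (samples, inputs).
# Input feature 0 is 20x larger than the others, and its weight is the smallest.
W = np.array([[0.1, 0.5, -0.4, 0.3]])
X = np.array([[20.0, 1.0, 1.0, 1.0],
              [20.0, 1.0, 1.0, 1.0]])

dense_out = X @ W.T
err_magnitude = np.abs(dense_out - X @ magnitude_prune(W, 0.5).T).max()
err_activation = np.abs(dense_out - X @ activation_aware_prune(W, X, 0.5).T).max()
print("magnitude pruning error:", err_magnitude)
print("activation-aware pruning error:", err_activation)
```

In this toy case magnitude pruning drops the small weight attached to the large input and changes the layer output far more than the activation-aware variant, which keeps that weight and drops two weights attached to small inputs instead.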
Articles
- A foolproof way to shrink deep learning models (Assorted)
A pruning algorithm: train to completion, globally prune the 20% of weights with the lowest magnitudes (the weakest connections), retrain with
- All The Ways You Can Compress BERT (Blogs)
An overview of compression methods for large NLP models (BERT) that groups them by their characteristics and compares their results.
- A Visual Guide to Quantization (Blogs)
Demystifying the Compression of Large Language Models by Maarten Grootendorst (Jul 2024) - An approachable primer on quantization and on the quantization methods widely supported by tools and libraries, including GPTQ, GGUF, and BitNet (1-bit).
- Breakdown of Nvidia H100s for Transformer Inferencing (Blogs)
- Comparing Quantized Performance in Llama Models (Blogs)
8-bit quantization seems fine; for 4-bit, it depends. It covers different quantization schemes including GGUF,
- Comparison between quantization techniques and formats for LLMs (Blogs)
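To give a feel for why 8-bit quantization tends to be safer than 4-bit, the sketch below applies plain symmetric absmax round-to-nearest quantization to random weights and compares the reconstruction error at both bit widths. It is an illustration under simple assumptions, not an implementation of GGUF, GPTQ, or any other scheme named in the articles above; the function names and the random test tensor are hypothetical.

```python
import numpy as np

def quantize_absmax(w, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Stand-in for a weight tensor: normally distributed values (an assumption).
rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)

for bits in (8, 4):
    q, scale = quantize_absmax(w, bits)
    print(f"{bits}-bit: scale={scale:.5f}, rmse={rmse(w, dequantize(q, scale)):.5f}")
```

With a single scale per tensor, the 4-bit grid has only 15 levels, so the rounding error is much larger than at 8 bits. Practical schemes reduce that gap with per-block scales and other refinements, which is one reason results at 4 bits vary between formats.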
Tools
- Bitsandbytes (Libraries)
- facebookresearch/kill-the-bits (Paper Implementations)
Code and compressed models for the paper "And the bit goes down: Revisiting the quantization of neural networks" by Facebook AI Research.
- NNCP (Libraries)
An experiment to build a practical lossless data compressor with neural networks. The latest version uses a Transformer model (slower but best ratio). LSTM (faster) is also available.
- TensorFlow Model Optimization Toolkit (Libraries)
- XNNPACK (Libraries)
Showing a sample of 116 resources. View the full list on GitHub →