awesome-llm-quantization

github.com/pprp/awesome-llm-quantization ↗

Awesome list for LLM quantization

435

GitHub Stars

Curated Resources

What's inside

Papers

AAAI2024 What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
This paper investigates the challenges of quantizing large language models (LLMs) by viewing quantization as adding perturbations to weights and activations. The authors empirically analyze the impact of uniform quantization on different LLM families (BLOOM, OPT, LLAMA) and sizes, finding varying robustness. They propose the "lens of perturbation," artificially introducing perturbations to analyze their effect on performance. This analysis reveals connections between perturbation properties and LLM performance degradation, offering insights into uniform quantization failures. Based on these insights, a simple non-uniform quantization approach is implemented, demonstrating minimal performance degradation with 4-bit weight quantization and 8-bit weight and activation quantization. #Quantization #Perturbation #LLM
AAAI24 Oral OWQ: Outlier-Aware Quantization for Efficient Fine-tuning and Inference of Large Language Models (GitHub)
Outlier-aware weight quantization (OWQ) aims to minimize the memory footprint of LLMs through low-precision representation. It uses Hessian matrices to prioritize a small subset of structured weights and keeps that subset in high precision, which makes it a mixed-precision quantization method. The resulting 3.1-bit model achieves performance comparable to 4-bit OPTQ. It also incorporates Weak Column Tuning, which uses PEFT, to further improve quality on zero-shot tasks. #PTQ #Mixed-Precision #3-bit #PEFT
ACL24 BitDistiller: Unleashing the Potential of Sub 4-bit LLMs via Self-Distillation (GitHub)
BitDistiller is a QAT framework that uses knowledge distillation to boost performance at sub-4-bit precision. It (1) incorporates a tailored asymmetric quantization and clipping technique to preserve the fidelity of the quantized weights and (2) proposes a Confidence-Aware Kullback-Leibler Divergence (CAKLD) as the self-distillation loss. Experiments cover 3-bit and 2-bit configurations. #QAT #2-bit #3-bit #KD
ACL24 DB-LLM: Accurate Dual-Binarization for Efficient LLMs
DB-LLM introduces Flexible Dual Binarization (FDB), which splits 2-bit quantized weights into two independent sets of binaries (similar to BiLLM). It also proposes Deviation-Aware Distillation to focus differently on various samples. DB-LLM is a QAT framework that targets the W2A16 setting. #QAT #Binarization #2-bit
ACL24 Findings IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
This paper unveils a previously overlooked type of outlier in LLMs. These outliers are found to allocate most of the attention scores to the initial tokens of the input, termed pivot tokens, which are crucial to the performance of quantized LLMs. Based on this observation, the paper proposes IntactKV, which generates the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple, combines easily with existing quantization solutions, and adds no extra inference overhead. In addition, IntactKV can be calibrated as additional LLM parameters to further improve quantized LLMs at minimal training cost. #Weight #Outliers
Arxiv2025 ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models
This paper introduces ADAMIX, a novel adaptive mixed-precision delta-compression framework for LLMs that optimizes quantization error during delta parameter compression. By formulating bit allocation as a 0/1 integer linear programming problem, ADAMIX minimizes quantization error under a given compression ratio, leading to significant performance improvements over existing delta-compression methods, especially on tasks where the delta parameters are large and the base model's capabilities are limited. #LLMCompression #Quantization #MixedPrecision #DeltaCompression #IntegerProgramming
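
As a rough illustration of the "quantization as perturbation" view described in the first entry of this list, the sketch below quantizes a random weight matrix with symmetric uniform round-to-nearest quantization and measures the relative size of the perturbation it adds at several bit widths. It is not taken from any of the listed papers: the function name `uniform_quantize`, the per-tensor scaling, and the Gaussian toy weights are all assumptions made for the example.

```python
import numpy as np

def uniform_quantize(w, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Hypothetical helper for illustration only; assumes w is not all zeros.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit signed levels
    scale = np.max(np.abs(w)) / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, scale

# Toy "weights": a Gaussian matrix standing in for one LLM layer (assumption).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256))

relative_perturbation = {}
max_error_over_scale = {}
for bits in (8, 4, 3):
    w_q, scale = uniform_quantize(w, bits)
    delta = w_q - w                       # the perturbation added by quantization
    relative_perturbation[bits] = float(np.linalg.norm(delta) / np.linalg.norm(w))
    max_error_over_scale[bits] = float(np.max(np.abs(delta)) / scale)
    print(f"{bits}-bit: relative perturbation = {relative_perturbation[bits]:.4f}")
```

With round-to-nearest, each weight moves by at most half a quantization step, so the perturbation grows as the bit width shrinks. Because a single large weight sets the scale for the whole tensor, outliers inflate the step size for every other weight, which is one intuition for why several entries above treat outliers or sensitive weights separately.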

Showing a sample of 86 resources. View the full list on GitHub →