Awesome list for LLM quantization

426 GitHub stars · 86 curated resources · 1 category (Papers) · last refreshed 5 hours ago

What's inside

Papers

  • AAAI2024 What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation

    This paper investigates the challenges of quantizing large language models (LLMs) by viewing quantization as adding perturbations to weights and activations. The authors empirically analyze the impact of uniform quantization on different LLM families (BLOOM, OPT, LLAMA) and sizes, finding varying robustness. They propose the "lens of perturbation," artificially introducing perturbations to analyze their effect on performance. This analysis reveals connections between perturbation properties and LLM performance degradation, offering insights into uniform quantization failures. Based on these insights, a simple non-uniform quantization approach is implemented, demonstrating minimal performance degradation with 4-bit weight quantization and 8-bit weight and activation quantization. #Quantization #Perturbation #LLM

  • AAAI24 Oral OWQ: Outlier-Aware Quantization for Efficient Fine-tuning and Inference of Large Language Models (GitHub)

    Outlier-aware weight quantization (OWQ) aims to minimize the memory footprint through low-precision representation. It uses Hessian matrices to identify a small, structured subset of weights and keeps that subset in high precision, which makes it a mixed-precision quantization method. The resulting 3.1-bit model achieves performance comparable to 4-bit OPTQ. It also incorporates Weak Column Tuning, based on PEFT, to further improve quality on zero-shot tasks. #PTQ #Mixed-Precision #3-bit #PEFT

  • ACL24 BitDistiller: Unleashing the Potential of Sub 4-bit LLMs via Self-Distillation (GitHub)

    BitDistiller is a QAT framework that uses knowledge distillation to boost performance at sub-4-bit precision. It (1) incorporates a tailored asymmetric quantization and clipping technique to preserve the fidelity of quantized weights and (2) proposes a Confidence-Aware Kullback-Leibler Divergence (CAKLD) as the self-distillation loss. Experiments cover 3-bit and 2-bit configurations. #QAT #2-bit #3-bit #KD

  • ACL24 DB-LLM: Accurate Dual-Binarization for Efficient LLMs

    DB-LLM introduces Flexible Dual Binarization (FDB), which splits 2-bit quantized weights into two independent sets of binaries (similar to BiLLM). It also proposes Deviation-Aware Distillation so that training focuses differently on different samples. DB-LLM is a QAT framework that targets W2A16 settings. #QAT #Binarization #2-bit

  • ACL24 Findings IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact

    This paper unveils a previously overlooked type of outlier in LLMs. These outliers are found to allocate most of the attention scores to the initial tokens of the input, termed pivot tokens, which are crucial to the performance of quantized LLMs. Given that, the paper proposes IntactKV, which generates the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions, with no extra inference overhead. In addition, IntactKV can be calibrated as additional LLM parameters to further boost quantized LLMs at minimal training cost. #Weight #Outliers

  • Arxiv2025 ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models

    This paper introduces ADAMIX, a novel adaptive mixed-precision delta-compression framework for LLMs that optimizes quantization error during delta parameter compression. By formulating bit allocation as a 0/1 integer linear programming problem, ADAMIX minimizes quantization error under a given compression ratio, leading to significant performance improvements over existing delta-compression methods, especially on tasks where the delta parameters are large and the base model's capabilities are limited. #LLMCompression #Quantization #MixedPrecision #DeltaCompression #IntegerProgramming
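
The bit-allocation idea in the ADAMIX entry above can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes min-max uniform quantization, squared error as the objective, and exhaustive search in place of an integer linear programming solver. The names `quantize_uniform`, `quant_error`, and `allocate_bits`, as well as the random "delta" tensors, are hypothetical.

```python
from itertools import product

import numpy as np


def quantize_uniform(w, bits):
    """Asymmetric uniform (min-max) quantization followed by dequantization."""
    levels = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, levels)
    return q * scale + lo


def quant_error(w, bits):
    """Squared error introduced by quantizing w to the given bit-width."""
    return float(np.sum((w - quantize_uniform(w, bits)) ** 2))


def allocate_bits(tensors, choices, avg_bit_budget):
    """Pick one bit-width per tensor to minimize total error under a budget.

    This matches a 0/1 program with x[i, b] = 1 iff tensor i uses b bits,
    exactly one b per tensor, and a size-weighted average bit-width limit.
    It is solved by exhaustive search, which only scales to a few tensors.
    """
    sizes = [t.size for t in tensors]
    errors = [{b: quant_error(t, b) for b in choices} for t in tensors]
    budget = avg_bit_budget * sum(sizes)
    best_cost, best_assign = float("inf"), None
    for assign in product(choices, repeat=len(tensors)):
        # Skip assignments that exceed the total bit budget.
        if sum(b * n for b, n in zip(assign, sizes)) > budget:
            continue
        cost = sum(errors[i][b] for i, b in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost


rng = np.random.default_rng(0)
# Hypothetical "delta" tensors with very different magnitudes.
deltas = [rng.normal(0, s, size=256) for s in (0.001, 0.01, 0.1)]
assignment, total_error = allocate_bits(
    deltas, choices=(2, 4, 8), avg_bit_budget=4
)
print(assignment, total_error)
```

In this toy setting the search tends to spend the budget on the tensor with the largest magnitude, since that is where coarse quantization costs the most squared error. A real system would hand the same formulation to an ILP solver rather than enumerate every assignment.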

Showing a sample of 86 resources. View the full list on GitHub →