awesome-gemm
github.com/yuninxia/awesome-gemm
📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries, and software.
GitHub Stars: 65
Curated Resources: 130
Categories: 8
Last Refreshed: 21 hours ago
Quickstart & Highlights 🌱 · Fundamental Theories and Concepts 🧠 · General Optimization Techniques 🚀 · Frameworks and Development Tools 🛠️ · Libraries 🗂️ · Debugging and Profiling Tools 🔍 · Learning Resources 📚 · Example Implementations 💡
What's inside
Learning Resources 📚
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library [Blogs 🖋️]
- Anatomy of High-Performance Many-Threaded Matrix Multiplication (2014) [Selected Papers 📝]
- Anatomy of High-Performance Matrix Multiplication (2008) [Selected Papers 📝]
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality (2015) [Selected Papers 📝]
- Building a FAST Matrix Multiplication Algorithm [Blogs 🖋️]
- CUDA GEMM Optimization [Blogs 🖋️]
Libraries 🗂️
- Armadillo [Language-Specific Libraries 🔤]
- ARM Compute Library: Optimized for ARM platforms [Cross-Platform Libraries 🌍]
- BitBLAS-Benchmark [GPU Libraries ⚡]
- BitBLAS: Mixed-precision BLAS operations on GPUs [GPU Libraries ⚡]
- BLASFEO: Optimized for small- to medium-sized dense matrices [CPU Libraries 💻]
- Blaze [Language-Specific Libraries 🔤]
Frameworks and Development Tools 🛠️
- BLIS: A modular framework for building high-performance BLAS-like libraries
- BLISlab: Educational framework for experimenting with BLIS-like GEMM algorithms
- Tensile: AMD ROCm JIT compiler for GPU kernels, specializing in GEMM and tensor contractions
- Tile Language: A concise DSL designed to streamline development of high-performance GPU/CPU kernels like GEMM
Example Implementations 💡
Debugging and Profiling Tools 🔍
General Optimization Techniques 🚀
- GEMM: From Pure C to SSE Optimized Micro Kernels
Detailed tutorial on going from naive to vectorized implementations.
- How To Optimize GEMM
Hands-on optimization guide.
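As a rough illustration of the kind of transformation these guides walk through, the sketch below contrasts a naive triple loop with a cache-blocked (tiled) variant. It is plain Python chosen for readability, not speed, and it is not taken from either tutorial: the names `gemm_naive` and `gemm_blocked` and the `block` parameter are hypothetical.

```python
def gemm_naive(A, B):
    """Reference C = A * B using the textbook i-j-k triple loop."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C


def gemm_blocked(A, B, block=2):
    """Same product computed tile by tile.

    `block` is a hypothetical tile size; real kernels tune it to the
    cache hierarchy of the target machine.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    # Outer loops walk over tiles of C, A, and B.
    for i0 in range(0, m, block):
        for p0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Inner loops accumulate one tile's contribution.
                for i in range(i0, min(i0 + block, m)):
                    row_c = C[i]
                    for p in range(p0, min(p0 + block, k)):
                        a = A[i][p]
                        row_b = B[p]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += a * row_b[j]
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(gemm_naive(A, B))    # [[58.0, 64.0], [139.0, 154.0]]
print(gemm_blocked(A, B))  # same values, computed tile by tile
```

Both functions do the same arithmetic; only the loop order and traversal pattern change. Compiled implementations use that reordering to keep tiles in cache and registers.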
Fundamental Theories and Concepts 🧠
- General Matrix Multiply (Intel)
Intel's introduction to GEMM.
- Spatial-lang GEMM
High-level overview.
- Strassen's Algorithm
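Lowers the asymptotic cost of multiplying two n × n matrices from O(n^3) to roughly O(n^2.81) by using seven recursive block products instead of eight.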
- Winograd's Algorithm
Reduced multiplication count for improved performance.
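To make the "fewer multiplications" idea concrete, here is a minimal sketch of Strassen's recursion, assuming square matrices whose size is a power of two and recursing all the way down to 1 × 1 blocks. The helper names (`mat_add`, `mat_sub`, `split`, `strassen`) are hypothetical, and the sketch covers Strassen's scheme only, not Winograd's.

```python
def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Split an n x n matrix (n even) into four n/2 x n/2 quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply n x n matrices (n a power of two) with 7 recursive products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven block products instead of the eight a direct block multiply needs.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    # Recombine the products into the four quadrants of C.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom


print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Practical implementations usually switch to a conventional kernel below some block size, since the extra additions outweigh the saved multiplication on small blocks.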
Showing a sample of 130 resources. View the full list on GitHub →