awesome-gemm

github.com/yuninxia/awesome-gemm ↗

📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software

GitHub Stars: 67
Curated Resources: 130
Categories: 8
Last Refreshed: 17 hours ago

Quickstart & Highlights 🌱
Fundamental Theories and Concepts 🧠
General Optimization Techniques 🚀
Frameworks and Development Tools 🛠️
Libraries 🗂️
Debugging and Profiling Tools 🔍
Learning Resources 📚
Example Implementations 💡

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me blogs 🖋️ resources from awesome-gemm"

Installation instructions →

What's inside

Learning Resources 📚

A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library [Blogs 🖋️]
Anatomy of High-Performance Many-Threaded Matrix Multiplication (2014) [Selected Papers 📝]
Anatomy of High-Performance Matrix Multiplication (2008) [Selected Papers 📝]
BLIS: A Framework for Rapidly Instantiating BLAS Functionality (2015) [Selected Papers 📝]
Building a FAST Matrix Multiplication Algorithm [Blogs 🖋️]
CUDA GEMM Optimization [Blogs 🖋️]

Libraries 🗂️

Armadillo [Language-Specific Libraries 🔤]
ARM Compute Library: Optimized for ARM platforms [Cross-Platform Libraries 🌍]
BitBLAS-Benchmark [GPU Libraries ⚡]
BitBLAS: Mixed-precision BLAS operations on GPUs [GPU Libraries ⚡]
BLASFEO: Optimized for small- to medium-sized dense matrices [CPU Libraries 💻]
Blaze [Language-Specific Libraries 🔤]

Frameworks and Development Tools 🛠️

Example Implementations 💡

Debugging and Profiling Tools 🔍

General Optimization Techniques 🚀

GEMM: From Pure C to SSE Optimized Micro Kernels
Detailed tutorial on going from naive to vectorized implementations.
How To Optimize GEMM
Hands-on optimization guide.
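
As a rough illustration of the starting point these guides work from, here is a minimal sketch of a naive triple-loop GEMM next to a cache-blocked (tiled) variant. It is not taken from either linked resource. The function names `gemm_naive` and `gemm_blocked` and the `tile` parameter are hypothetical, and pure Python is used only to show the loop structure, not to demonstrate performance.

```python
def gemm_naive(A, B):
    """Compute C = A * B with the textbook i, j, p loop order."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C


def gemm_blocked(A, B, tile=2):
    """Compute C = A * B one tile at a time (hypothetical tile size).

    The three outer loops pick a tile of C, A and B; the three inner
    loops do the same multiply-accumulate work restricted to that tile,
    which is the access pattern that improves cache reuse in compiled code.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(p0, min(p0 + tile, k)):
                        a = A[i][p]  # reused across the whole j loop
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(gemm_naive(A, B))
    print(gemm_blocked(A, B, tile=2))
```

Both functions do the same arithmetic; only the traversal order differs. Vectorized micro kernels of the kind the tutorials above describe would typically replace the innermost loops.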

Fundamental Theories and Concepts 🧠

General Matrix Multiply (Intel)
Intro from Intel.
Spatial-lang GEMM
High-level overview.
Strassen's Algorithm
Lowers asymptotic complexity from O(n^3) to about O(n^2.81), which pays off for large matrices.
Winograd's Algorithm
Reduced multiplication count for improved performance.
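
To make the Strassen entry above concrete, here is a minimal sketch of its base case: multiplying two 2x2 matrices with seven scalar multiplications instead of eight. The function name `strassen_2x2` is hypothetical, and a full implementation would apply the same formulas recursively to matrix blocks rather than scalars.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using Strassen's seven products."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    # Seven multiplications (the naive method needs eight).
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Recombine the products with additions and subtractions only.
    c11 = m1 + m4 - m5 + m7
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 - m2 + m3 + m6
    return [[c11, c12], [c21, c22]]


if __name__ == "__main__":
    print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Applied recursively to n/2 x n/2 blocks, saving one multiplication per level is what gives the O(n^log2(7)) bound, at the cost of extra additions.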

Showing a sample of 130 resources. View the full list on GitHub →