ml-systems-papers

github.com/byungsoo-oh/ml-systems-papers

Curated collection of papers in machine learning systems

GitHub Stars: 637
Curated Resources: 1.4k
Last Refreshed: 20 hours ago

Categories

Data Processing
Training System
Inference System
Attention Optimization
Mixture of Experts (MoE)
Communication Optimization & Network Infrastructure for Distributed ML
Fault tolerance & Straggler mitigation
GPU Memory Management & Optimization
GPU Sharing
Compiler
GPU Kernel Optimization
LLM Long Context
Model Compression
Federated Learning
Privacy-Preserving ML
ML APIs & Application-Side Optimization
ML for Systems
Energy Efficiency
Retrieval-Augmented Generation (RAG)
Simulation
Systems for Agentic AI
RL Post-Training
Multimodal
Hybrid LLMs
Others

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me distributed training resources from ml-systems-papers"


What's inside

Model Compression

Inference System

Simulation

Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation

Training System

Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching (Distributed training)
Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control (Distributed training)
Accelerating Parallel Sampling of Diffusion Models (Distributed training)
Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (Distributed training)
Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation (Distributed training)
AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning (Distributed training)