ml-systems-papers
github.com/byungsoo-oh/ml-systems-papers
Curated collection of papers in machine learning systems
583 GitHub Stars
1.4k Curated Resources
25 Categories
Last Refreshed: 2 hours ago
- Data Processing
- Training System
- Inference System
- Attention Optimization
- Mixture of Experts (MoE)
- Communication Optimization & Network Infrastructure for Distributed ML
- Fault tolerance & Straggler mitigation
- GPU Memory Management & Optimization
- GPU Sharing
- Compiler
- GPU Kernel Optimization
- LLM Long Context
- Model Compression
- Federated Learning
- Privacy-Preserving ML
- ML APIs & Application-Side Optimization
- ML for Systems
- Energy Efficiency
- Retrieval-Augmented Generation (RAG)
- Simulation
- Systems for Agentic AI
- RL Post-Training
- Multimodal
- Hybrid LLMs
- Others
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me distributed training resources from ml-systems-papers"
Installation instructions →
What's inside
Model Compression
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
- Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
- AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- BitNet: 1-bit Pre-training for Large Language Models
- Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
Inference System
- Accelerated Diffusion Models via Speculative Sampling
- Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
- Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
- Accelerating Sparse Transformer Inference on GPU
- AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
- AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
Training System
- Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching (Distributed training)
- Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control (Distributed training)
- Accelerating Parallel Sampling of Diffusion Models (Distributed training)
- Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (Distributed training)
- Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation (Distributed training)
- AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning (Distributed training)
Mixture of Experts (MoE)
- Accelerating Distributed MoE Training and Inference with Lina
- Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
- Accelerating Mixture-of-Experts Training with Adaptive Expert Replication
- Accelerating MoE Model Inference with Expert Sharding
- Ada-K Routing: Boosting the Efficiency of MoE-based LLMs
- AdaMOE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
GPU Memory Management & Optimization
Communication Optimization & Network Infrastructure for Distributed ML
- Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
- Analyzing Communication Predictability in LLM Training
- An Extensible Software Transport Layer for GPU Networking
- An in-network architecture for accelerating shared-memory multiprocessor collectives
- ARK: GPU-driven Code Execution for Distributed Deep Learning
Showing a sample of 1.4k resources. View the full list on GitHub →