A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.

GitHub Stars: 59
Curated Resources: 51
Categories: 10
Last Refreshed: 58 min ago
Distributed Training · Model Serving and Deployment · MLOps and Automation · Data Management · Optimization Tools · Infrastructure as Code · Cloud Platforms · Learning Resources · Books · Community

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me mlops and automation resources from awesome-ai-infrastructure"

Installation instructions →

What's inside

MLOps and Automation

  • Airflow

    A platform for orchestrating complex workflows, commonly used in machine learning pipelines.

  • DVC (Data Version Control)

    A tool for version control and reproducibility in machine learning projects.

  • Kubeflow

    A platform for orchestrating machine learning workflows on Kubernetes.

  • Metaflow

    A human-centric framework for building and managing real-life data science projects, developed by Netflix.

  • MLflow

    An open-source platform for managing the end-to-end machine learning lifecycle.

  • ZenML

    An extensible MLOps framework for creating portable, production-ready machine learning pipelines.

Infrastructure as Code

  • Ansible

    An open-source automation tool for provisioning and managing infrastructure.

  • AWS CloudFormation

    A service for automating AWS resource deployment and management.

  • Google Deployment Manager

    An infrastructure management tool for Google Cloud Platform.

  • Pulumi

    Infrastructure as code for deploying and managing cloud infrastructure using programming languages.

  • Terraform

    A tool for building, changing, and versioning infrastructure safely and efficiently.

Data Management

  • Apache Hudi

    A data management framework that simplifies incremental data processing and streaming analytics.

  • Delta Lake

    An open-source storage layer that brings reliability to data lakes.

  • Feast

    An open-source feature store for managing and serving machine learning features.

  • Great Expectations

    A tool for data validation and testing in machine learning workflows.

  • lakeFS

    An open-source data versioning platform for managing data lakes.

Optimization Tools

  • Apache TVM

    A deep learning compiler stack for optimizing models on various hardware backends.

  • Intel OpenVINO

    A toolkit for optimizing and deploying AI inference on Intel hardware.

  • NVIDIA TensorRT

    A high-performance deep learning inference optimizer and runtime.

  • OctoML

    An AI model optimization platform for efficient deployment on edge and cloud.

  • Quantization-Aware Training (QAT)

    Tools for shrinking models and speeding up inference by training with simulated low-precision arithmetic, so accuracy holds up after quantization.

Cloud Platforms

  • AWS SageMaker

    A comprehensive platform for building, training, and deploying machine learning models on AWS.

  • Azure Machine Learning

    A cloud-based platform for training, deploying, and managing machine learning models.

  • Google AI Platform

    Google Cloud’s integrated environment for AI development and deployment, since succeeded by Vertex AI.

  • IBM Watson Studio

    A suite of tools for data science, machine learning, and AI model development.

  • KubeStellar Console

    A CNCF Sandbox multi-cluster Kubernetes management dashboard for deploying and observing workloads across edge and cloud infrastructure.

  • Paperspace Gradient

    A cloud platform for developing, training, and deploying machine learning models.

Learning Resources

Books

Distributed Training

  • DeepSpeed

    A deep learning optimization library from Microsoft that reduces the memory and communication cost of distributed training.

  • Horovod

    A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.

  • MPI for Machine Learning

    Using the Message Passing Interface (MPI) standard for distributed machine learning.

  • Ray

    A framework for building scalable distributed applications, including distributed AI and reinforcement learning.

Showing a sample of the 51 curated resources. View the full list on GitHub →