awesome-sre

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me 3 DevOps resources from awesome-sre"

Accelerate: State of DevOps
Google Cloud's annual State of DevOps report and resources.
DevOps Roadmap
Community-driven roadmap for DevOps practitioners.
DORA Metrics
Research program that identifies the capabilities that drive software delivery and operations performance.
The DevOps Handbook
Practical guide for implementing DevOps in any organization.
The Phoenix Project
A novel about IT, DevOps, and helping your business win.

A collection of post-mortems
Curated collection of post-mortems from various companies and incidents.
A collection of postmortem templates
Collection of templates for writing effective post-mortems.
AlertOps
Transforms real-time operational intelligence into automated incident response.
BigPanda
AIOps event correlation and automation platform.
Blameless
SRE platform for incident management, retrospectives, and reliability insights.
Cabot
Get alerted when services go down or metrics go crazy.

Ansible
Simple, agentless IT automation platform for configuration management, application deployment, and orchestration.
Eliminating Toil - Google SRE Book
Chapter on identifying and reducing toil in SRE practice.
Pulumi
Infrastructure as Code using familiar programming languages like Python, Go, JavaScript, TypeScript, and C#.
Shoreline
Incident automation platform that enables on-call engineers to debug and repair production issues with real-time automation.
StackStorm
Open source event-driven platform for runbook automation, ChatOps, and auto-remediation.
Terraform
Infrastructure as Code tool for building, changing, and versioning infrastructure safely and efficiently.

Awesome On-Call
Collection of articles on how companies handle on-call.
Being On-Call - Google SRE Book
Google's guide on on-call best practices and sustainable workloads.
Grafana OnCall
Open source on-call management tool with calendar integration, escalation chains, and ChatOps.
PagerDuty On-Call
Automated scheduling, escalation policies, and on-call reporting.

Blameless Post-Mortems
Chapter from the Google SRE book on creating a blameless post-mortem culture.
Love DevOps? Wait until you meet SRE
Atlassian's take on the relationship between SRE and DevOps culture.
SRE vs DevOps: What's the Difference?
Google Cloud blog post explaining the relationship between SRE and DevOps.
What is the role of a Site Reliability Engineer?
Overview of the SRE role and responsibilities.

Building Secure and Reliable Systems
Combines security and reliability practices for designing systems.
Chaos Engineering
System resiliency in practice by Casey Rosenthal and Nora Jones.
Implementing Service Level Objectives
Step-by-step guide to creating SLIs, SLOs, and error budgets by Alex Hidalgo.
Observability Engineering
Practical approach to achieving observability in distributed systems by Charity Majors, Liz Fong-Jones, and George Miranda.
Seeking SRE
Conversations about running production systems at scale, edited by David N. Blank-Edelman.
Site Reliability Engineering
The original Google SRE book, free to read online.

Capacity Planning - Google SRE Book
Google's approach to capacity planning in SRE.
KEDA
Kubernetes Event-driven Autoscaling component that provides fine-grained autoscaling for any container workload.
Kubernetes Horizontal Pod Autoscaler
Automatically scales the number of pods in a workload such as a Deployment or StatefulSet based on observed metrics like CPU utilization.
Kubernetes Vertical Pod Autoscaler
Automatically adjusts the amount of CPU and memory requested by pods.
KubeStellar Console
Multi-cluster Kubernetes dashboard with AI-powered operations, real-time observability, and CNCF project integrations across edge and cloud clusters.

CNCF TAG Observability
CNCF Technical Advisory Group for observability topics.
Google SRE Resources
Official talks, blog posts, and educational content from Google SRE teams.
SREcon
USENIX conference dedicated to Site Reliability Engineering.
SRE Weekly
Weekly newsletter curating the best SRE news and articles.

Showing a sample of 91 resources. View the full list on GitHub →