awesome-sre
github.com/adriannovegil/awesome-sre
What's inside
3. DevOps
- Accelerate: State of DevOps
Google Cloud's annual State of DevOps report and resources.
- DevOps Roadmap
Community-driven roadmap for DevOps practitioners.
- DORA Metrics
Research program that identifies the capabilities that drive software delivery and operations performance.
- The DevOps Handbook
Practical guide for implementing DevOps in any organization.
- The Phoenix Project
A novel about IT, DevOps, and helping your business win.
6. Incident Response and Post-Mortem
- A collection of post-mortems
Curated collection of post-mortems from various companies and incidents.
- A collection of postmortem templates
Collection of templates for writing effective post-mortems.
- AlertOps
Transforms real-time operational intelligence into automated incident response.
- BigPanda
AIOps event correlation and automation platform.
- Blameless
SRE platform for incident management, retrospectives, and reliability insights.
- Cabot
Get alerted when services go down or metrics go crazy.
9. Automation and Toil Reduction
- Ansible
Simple, agentless IT automation platform for configuration management, application deployment, and orchestration.
- Eliminating Toil - Google SRE Book
Chapter on identifying and reducing toil in SRE practice.
- Pulumi
Infrastructure as Code using familiar programming languages like Python, Go, JavaScript, TypeScript, and C#.
- Shoreline
Incident automation platform that enables on-call engineers to debug and repair production issues with real-time automation.
- StackStorm
Open source event-driven platform for runbook automation, ChatOps, and auto-remediation.
- Terraform
Infrastructure as Code tool for building, changing, and versioning infrastructure safely and efficiently.
7. On-Call
- Awesome On-Call
Collection of articles on how companies handle on-call.
- Being On-Call - Google SRE Book
Google's guide on on-call best practices and sustainable workloads.
- Grafana OnCall
Open source on-call management tool with calendar integration, escalation chains, and ChatOps.
- PagerDuty On-Call
Automated scheduling, escalation policies, and on-call reporting.
2. SRE Culture
- Blameless Post-Mortems
Chapter from the Google SRE book on creating a blameless post-mortem culture.
- Love DevOps? Wait until you meet SRE
Atlassian's take on the relationship between SRE and DevOps culture.
- SRE vs DevOps: What's the Difference?
Google Cloud blog post explaining the relationship between SRE and DevOps.
- What is the role of a Site Reliability Engineer?
Overview of the SRE role and responsibilities.
14. Books
- Building Secure and Reliable Systems
Combines security and reliability practices for designing systems.
- Chaos Engineering
System resiliency in practice by Casey Rosenthal and Nora Jones.
- Implementing Service Level Objectives
Step-by-step guide to creating SLIs, SLOs, and error budgets by Alex Hidalgo.
- Observability Engineering
Practical approach to achieving observability in distributed systems by Charity Majors, Liz Fong-Jones, and George Miranda.
- Seeking SRE
Conversations about running production systems at scale, edited by David N. Blank-Edelman.
- Site Reliability Engineering
The original Google SRE book, free to read online.
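The error budget mentioned under Implementing Service Level Objectives comes down to simple arithmetic: an availability SLO of X over a window leaves (1 - X) of that window as permitted unavailability. The Python sketch below is only an illustration under that assumption (a time-based availability SLO); the function names and the 30-day default window are hypothetical and are not taken from the book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a time-based availability SLO.

    `slo` is a fraction such as 0.999; `window_days` is an assumed default.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget
```

For example, a 99.9% SLO over 30 days leaves about 43.2 minutes of budget, so 21.6 minutes of downtime would consume roughly half of it.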
10. Capacity Planning
- Capacity Planning - Google SRE Book
Google's approach to capacity planning in SRE.
- KEDA
Kubernetes Event-driven Autoscaling component that provides fine-grained autoscaling for any container workload.
- Kubernetes Horizontal Pod Autoscaler
Automatically scales the number of pods in a deployment based on observed metrics.
- Kubernetes Vertical Pod Autoscaler
Automatically adjusts the amount of CPU and memory requested by pods.
- KubeStellar Console
Multi-cluster Kubernetes dashboard with AI-powered operations, real-time observability, and CNCF project integrations across edge and cloud clusters.
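To make the Horizontal Pod Autoscaler entry above more concrete: the Kubernetes documentation describes its core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The Python sketch below models only that rule. The function name, parameters, and replica bounds are hypothetical, and the real controller also accounts for stabilization windows, unready pods, and missing metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified model of the HPA scaling rule (illustrative only)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas averaging 90% CPU against a 60% target, this sketch asks for 6 replicas; at 62% it leaves the count at 4 because the ratio is inside the tolerance band.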
16. Community and Forums
- CNCF TAG Observability
CNCF Technical Advisory Group for observability topics.
- Google SRE Resources
Official talks, blog posts, and educational content from Google SRE teams.
- SREcon
USENIX conference dedicated to Site Reliability Engineering.
- SRE Weekly
Weekly newsletter curating the best SRE news and articles.
Showing a sample of 91 resources. View the full list on GitHub.