awesome-sre
github.com/dastergon/awesome-sre
A curated list of Site Reliability and Production Engineering resources.
13k GitHub Stars
473 Curated Resources
20 Categories
Last Refreshed: 20 hours ago
Categories: Culture, Education, Books, Hiring, Reliability, Monitoring & Observability & Alerting, On-Call, Post-Mortem, Capacity Planning, Service Level Agreement, Performance, Programming, Misc Articles, Real-time Messaging, Blogs, Newsletters, Conferences & Meetups, Twitter, SRE Tools, Podcasts
What's inside
On-Call
- 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
- 3 Ways to Minimize the Impact of High Severity Incidents
- Advice to Management Teams While Enrolling Changes to On-Call Systems
- An Incident Command Training Handbook
- Atlassian Incident Handbook
- Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
Reliability
Books
- 97 Things Every SRE Should Know
- Antifragile Systems and Teams
- Building Secure and Reliable Systems
- Chaos Engineering: Building Confidence in System Behavior through Experiment
- Chaos Engineering: Crash test your applications
- Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications
Culture
Post-Mortem
Conferences & Meetups
- ADDO - All Day DevOps
A free 24-hour conference held entirely online.
Misc Articles
- Adventures in SRE-land: Welcome to Google Mission Control
- Are site reliability engineers the next data scientists?
- Book Review: Site Reliability Engineering - How Google Runs Production Systems
- Building blameless working environment
- Commentary on Site Reliability Engineering
- Here’s How Google Makes Sure It (Almost) Never Goes Down
Showing a sample of 473 resources. View the full list on GitHub →