
Awesome SRE page

GitHub Stars: 12
Curated Resources: 91
Categories: 17
Last Refreshed: 23 hours ago

1. Site Reliability Engineering
2. SRE Culture
3. DevOps
4. Monitoring and Observability
5. Alerting
6. Incident Response and Post-Mortem
7. On-Call
8. Chaos Engineering
9. Automation and Toil Reduction
10. Capacity Planning
11. Runbooks and Playbooks
12. Performance
13. SLOs and SLIs Tools
14. Books
15. Examples and Sandboxes
16. Community and Forums
17. References

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me the DevOps resources from awesome-sre"

Installation instructions →

What's inside

3. DevOps

6. Incident Response and Post-Mortem

  • A collection of post-mortems

    Curated collection of post-mortems from various companies and incidents.

  • A collection of postmortem templates

    Collection of templates for writing effective post-mortems.

  • AlertOps

    Transforms real-time operational intelligence into automated incident response.

  • BigPanda

    AIOps event correlation and automation platform.

  • Blameless

    SRE platform for incident management, retrospectives, and reliability insights.

  • Cabot

    Get alerted when services go down or metrics go crazy.

9. Automation and Toil Reduction

  • Ansible

    Simple, agentless IT automation platform for configuration management, application deployment, and orchestration.

  • Eliminating Toil - Google SRE Book

    Chapter on identifying and reducing toil in SRE practice.

  • Pulumi

    Infrastructure as Code using familiar programming languages like Python, Go, JavaScript, TypeScript, and C#.

  • Shoreline

    Incident automation platform that enables on-call engineers to debug and repair production issues with real-time automation.

  • StackStorm

    Open source event-driven platform for runbook automation, ChatOps, and auto-remediation.

  • Terraform

    Infrastructure as Code tool for building, changing, and versioning infrastructure safely and efficiently.

7. On-Call

2. SRE Culture

14. Books

10. Capacity Planning

16. Community and Forums

  • CNCF TAG Observability

    CNCF Technical Advisory Group for observability topics.

  • Google SRE Resources

    Official talks, blog posts, and educational content from Google SRE teams.

  • SREcon

    USENIX conference dedicated to Site Reliability Engineering.

  • SRE Weekly

    Weekly newsletter curating the best SRE news and articles.

Showing a sample of 91 resources. View the full list on GitHub →