awesome-failure-diagnosis
github.com/phamquiluan/awesome-failure-diagnosis
Awesome resources for failure diagnosis research.
What's inside
Other Papers
- 2020 - Loghub: a large collection of system log datasets towards automated log analytics.
- 2022 - Constructing Large-Scale Real-World Benchmark Datasets for AIOps
- 2024 - A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends
- 2024 - Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis
- ASE'22 - Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems
- ASE'22 - WOLFFI: A fault injection platform for learning AIOps models.
Root Cause Analysis / Fault Localization
- 2025 - LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis
- ASE'24 - Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization
- ASE'24 - MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data
- ASE'24 - Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?
- FSE'19 - Latent error prediction and fault localization for microservice applications by learning from system trace logs
- FSE'20 - Graph-based trace analysis for microservice architecture understanding and problem diagnosis.
Researcher
- Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE
- Assoc. Prof. Dan Pei - Tsinghua University
- Assoc. Prof. Pengfei Chen - Sun Yat-sen University
- Banking on Continuous Delivery - Capital One
- Causal Inference Course Lectures - Brady Neal
- Dr. Dongmei Zhang - Microsoft Research Asia
Misc
- Alibaba Cloud - https://status.alibabacloud.com/
- A list of security log data.
- Apache log files.
- Atlassian
- AWS Health Dashboard - https://health.aws.amazon.com/health/status
- AWS Observability Recipes
Anomaly Detection
- ATC'21 - Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems
- CCS'17 - DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
- ICSE'16 - Behavioral Log Analysis with Statistical Guarantees
- ICSE'21 - Log-based Anomaly Detection with Deep Learning: How Far Are We?
- IMC'15 - Opprentice: Towards practical and automatic anomaly detection through machine learning.
- Robust multimodal failure detection for microservice systems
Metrics
- cAdvisor (Container Advisor)
- https://prometheus.io/docs
- Prometheus - Blackbox prober exporter
- Prometheus - Node Exporter
- tsfresh
Chaos Engineering / Fault Injection
Showing a sample of 151 resources. View the full list on GitHub →