├── LICENSE
└── README.md


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 ARISE Lab, CUHK

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# awesome-AIOps
A curated list of awesome academic research and industrial materials about Artificial Intelligence for IT Operations (AIOps).

- [Researchers](#researchers)
- [Industrial Materials](#industrial-materials)
  - [Competitions](#competitions)
  - [White Papers](#white-papers)
  - [Blogs & Tutorials & Magazines](#blogs--tutorials--magazines)
  - [Benchmarks](#benchmarks)
  - [Tools](#tools)
  - [Companies](#companies)
- [Academic Materials](#academic-materials)
  - [Talks](#talks)
  - [Workshops](#workshops)
- [Papers](#papers)
  - [Survey & Empirical Study](#survey--empirical-study)
  - [Benchmarks](#benchmarks)
  - [(Large) Language Models for IT Operations](#large-language-models-for-it-operations)
  - [Knowledge Graph for AIOps](#knowledge-graph-for-aiops)
  - [Microservices and Serverless](#microservices-and-serverless)
  - [Dependency and Tracing](#dependency-and-tracing)
  - [Anomaly/Failure Detection](#anomalyfailure-detection)
  - [Root Cause Analysis](#root-cause-analysis)
  - [Incident and Alarm Management](#incident-and-alarm-management)
  - [Node, Disk, and Storage](#node-disk-and-storage)
  - [VM Analysis and Management](#vm-analysis-and-management)
  - [Deployment](#deployment)
- [Datasets](#datasets)
- [Others](#others)
  - [Courses](#courses)

## Researchers
| China (& HK SAR) | |||
| :---------| :------ | :------ | :------ |
| [Michael R. Lyu](http://www.cse.cuhk.edu.hk/lyu/), CUHK | [Dongmei Zhang](https://www.microsoft.com/en-us/research/people/dongmeiz/), Microsoft | [Pengfei Chen](http://sdcs.sysu.edu.cn/content/3747), SYSU | [Dan Pei](https://netman.aiops.org/~peidan/), Tsinghua |
| [Xin Peng](https://cspengxin.github.io/), Fudan ||||
| **USA** ||||
| [Ryan Huang](https://www.cs.jhu.edu/~huang/), JHU | [Yingnong Dang](https://scholar.google.com.hk/citations?user=InqtwxcAAAAJ&hl=en), Microsoft | [Christina Delimitrou](https://www.csl.cornell.edu/~delimitrou/), MIT EECS ||
| **Europe** ||||
| [Odej Kao](https://www.cit.tu-berlin.de/kao/), TU Berlin ||||
| **Australia** ||||
| [Hongyu Zhang](http://hongyujohn.github.io/), UON ||||


## Industrial Materials
### Competitions
- [AIOps Challenge] [A series of AIOps competitions hosted by Tsinghua University](https://competition.aiops-challenge.com/home/competition)
- [PAKDD2020] [Alibaba AIOps Competition](https://tianchi.aliyun.com/competition/entrance/231775/introduction?lang=en-us)

### White Papers
- [VMware] [Proactive Incident and Problem Management](https://docplayer.net/8854482-Proactive-incident-and-problem-management.html)
- [GREATOPS 高效运维社区] [White Paper: Enterprise AIOps Implementation Recommendations](https://pic.huodongjia.com/ganhuodocs/2018-04-16/1523873064.74.pdf) (in Chinese)
- [Awesome Open Source] [AIOps Handbook](https://awesomeopensource.com/project/chenryn/aiops-handbook)

### Blogs & Tutorials & Magazines
- [Moogsoft] [What is AIOps?](https://www.moogsoft.com/resources/aiops/guide/everything-aiops/)
- [Tsinghua University] [Dan Pei of Tsinghua: 15 Principles for Putting AIOps into Practice](https://mp.weixin.qq.com/s/Ov1gQlQ0mRpk58cNL_YlVg) (in Chinese)
- [Tsinghua University] [Dan Pei of Tsinghua: The Last Mile to Delivering AIOps Results](https://mp.weixin.qq.com/s/VhaRfvjc839bAXBMfzv11g) (in Chinese)
- [Alibaba Cloud] [Intelligent Network Analysis Based on Big Data - Qitian (齐天)](https://developer.aliyun.com/article/590290) (in Chinese)
- [Microsoft] [Advancing Azure service quality with artificial intelligence: AIOps](https://azure.microsoft.com/en-us/blog/advancing-azure-service-quality-with-artificial-intelligence-aiops/)
- [Grafana] [ObservabilityCON: Grafana Observability Conference 2022](https://grafana.com/about/events/observabilitycon/2022/)
- [InfoQ] [2023: Will Demand for Observability See Its "Breakout Year"?](https://mp.weixin.qq.com/s/6na952N3c5RzcopanZGs6w) (in Chinese)
- [Alibaba] [Alibaba Cloud's Zhang Jianfeng on the New Computing Architecture: The Cloud Is Reshaping the World of Hardware, Software, and Devices](https://mp.weixin.qq.com/s/IQvurZ_9Vm0SufV1K0sK1A) (in Chinese)

### Benchmarks
- [Cornell] [DeathStarBench (An open-source benchmark suite for cloud microservices)](https://github.com/delimitrou/DeathStarBench/tree/master)
- [Google Cloud] [Online Boutique (A microservices demo application)](https://github.com/GoogleCloudPlatform/microservices-demo)
- [Fudan] [Train Ticket (A benchmark microservice system)](https://github.com/FudanSELab/train-ticket)
- [Weaveworks] [Sock Shop (A microservices demo application)](https://microservices-demo.github.io/)

### Tools
- [Log Analytics] [LogPAI](https://github.com/logpai)
- [AI for Cloud Operation] [OpsPAI](https://github.com/OpsPAI)
- [Outlier Detection] [PyOD](https://github.com/yzhao062/pyod)
- [Anomaly Detection] [ADTK](https://github.com/arundo/adtk)
- [Anomaly Detection] [PySAD](https://github.com/selimfirat/pysad)
- [Online Machine Learning] [River](https://riverml.xyz/)
- [Online Machine Learning] [scikit-multiflow](https://scikit-multiflow.readthedocs.io/)
- [Fault Injection] [Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh)
- [Fault Injection] [ChaosBlade](https://github.com/chaosblade-io/chaosblade)
- [Container Monitoring] [cAdvisor](https://github.com/google/cadvisor)
- [Performance Monitoring] [Netdata](https://www.netdata.cloud/)
- [Anomaly Detection Labeling Tool] [Microsoft TagAnomaly](https://github.com/Microsoft/TagAnomaly)
- [Serverless App Dev. Framework] [AWS Serverless Application Model (AWS SAM)](https://github.com/aws/serverless-application-model)
- [Performance Testing Tool] [Locust](https://locust.io/)
- [Alibaba Java Diagnostic Tool] [Arthas](https://arthas.aliyun.com/)

### Companies
- [Datadog](https://www.datadoghq.com/): A monitoring and security platform for cloud applications
- [必示 bizseer](https://www.bizseer.com/)
- [日志易](https://www.rizhiyi.com/)
- [博睿数据](https://www.bonree.com/)
- [听云 TINGYUN](https://www.tingyun.com/lp.html): An end-to-end, full-platform application performance management system
- [Loom Systems](https://www.loomsystems.com/)
- [Keep](https://www.keephq.dev): Open-source alert management and AIOps platform


## Academic Materials

### Talks
- [Michael R. Lyu] [Reliability-Driven AIOps for Cloud Resilience (Keynote talk at ICSE '21)](http://ariselab.cse.cuhk.edu.hk/assets/files/ICSE2021_keynote_lyu.pdf)

### Workshops
- [ICSE21 Workshop on Cloud Intelligence](http://cloudintelligenceworkshop.org/index.html)
- [AAAI-20 Workshop on Cloud Intelligence](http://cloudintelligenceworkshop.org/2020/index.html)
- [AIOPS 2020 (International Workshop on Artificial Intelligence for IT Operations)](https://aiopsworkshop.github.io/)


## Papers

### Survey & Empirical Study
- [arXiv '24] [A Survey on Failure Analysis and Fault Injection in AI Systems](https://arxiv.org/abs/2407.00125)
- [arXiv '23] [AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges](https://arxiv.org/abs/2304.04661)
- [CSUR '22] [Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey](https://dl.acm.org/doi/full/10.1145/3501297)
- [ASE '22] [Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling](https://github.com/IntelligentDDS/Post-mortems-Analysis)
- [arXiv '21] [Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection](https://arxiv.org/abs/2107.05908)
- [CSUR '21] [A Survey on Automated Log Analysis for Reliability Engineering](https://arxiv.org/abs/2009.07237)
- [ESEC/FSE '20] [Towards intelligent incident management: why we need it and how we make it](https://dl.acm.org/doi/abs/10.1145/3368089.3417055)
- [arXiv '20] [A Systematic Mapping Study in AIOps](https://arxiv.org/abs/2012.09108)
- [ICSE '19] [AIOps: Real-World Challenges and Research Innovations](https://ieeexplore.ieee.org/document/8802836)
- [HotOS '19] [What bugs cause production cloud incidents?](https://dl.acm.org/doi/10.1145/3317550.3321438)
- [ISSRE '16] [Experience Report: System Log Analysis for Anomaly Detection](https://ieeexplore.ieee.org/abstract/document/7774521)
- [ASE '13] [Software analytics for incident management of online services: An experience report](https://ieeexplore.ieee.org/document/6693105)

### Benchmarks
- [arXiv '22] [Constructing Large-Scale Real-World Benchmark Datasets for AIOps](https://arxiv.org/abs/2208.03938)
- [ASPLOS '19] [An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems](https://dl.acm.org/doi/10.1145/3297858.3304013)


### (Large) Language Models for IT Operations
- [ISSTA '24] [LILAC: Log Parsing using LLMs with Adaptive Parsing Cache](https://arxiv.org/abs/2310.01796)
- [arXiv '24] [Exploring LLM-based Agents for Root Cause Analysis](https://arxiv.org/abs/2403.04123)
- [arXiv '24] [Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides](https://arxiv.org/abs/2402.17531)
- [arXiv '24] [Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4](https://arxiv.org/abs/2401.13810)
- [arXiv '23] [Automatic Root Cause Analysis via Large Language Models for Cloud Incidents](https://arxiv.org/abs/2305.15778)
- [arXiv '23] [OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models](https://arxiv.org/abs/2310.07637)
- [arXiv '23] [Xpert: Empowering Incident Management with Query Recommendations via Large Language Models](https://arxiv.org/abs/2312.11988)
- [arXiv '23] [Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study](https://arxiv.org/abs/2307.05950)
- [arXiv '23] [Assess and Summarize: Improve Outage Understanding with Large Language Models](https://arxiv.org/pdf/2305.18084.pdf)
- [arXiv '23] [Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering](https://arxiv.org/pdf/2305.11541.pdf)
- [arXiv '23] [Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models](https://arxiv.org/abs/2301.03797)
- [SoCC '19] [A System-Wide Debugging Assistant Powered by Natural Language Processing](https://dl.acm.org/doi/10.1145/3357223.3362701)


### Knowledge Graph for AIOps
- [ICSE-SEIP '22] [Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps](https://arxiv.org/abs/2204.11598)
- [ICSE-SEIP '21] [Neural knowledge extraction from cloud service incidents](https://dl.acm.org/doi/abs/10.1109/ICSE-SEIP52600.2021.00031)
- [arXiv '21] [SoftNER: Mining Knowledge Graphs From Cloud Incidents](https://arxiv.org/abs/2101.05961)
- [APPLSCI '20] [A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications](https://www.mdpi.com/2076-3417/10/6/2166)


### Microservices and Serverless
- [ASPLOS '21] [Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices](https://dl.acm.org/doi/abs/10.1145/3445814.3446700)
- [ICDCS '21] [Defuse: A Dependency-Guided Function Scheduler to Mitigate Cold Starts on FaaS Platforms](https://ieeexplore.ieee.org/document/9546470)
- [FSE '20] [Graph-based trace analysis for microservice architecture understanding and problem diagnosis](https://dl.acm.org/doi/10.1145/3368089.3417066)
- [OSDI '20] [FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices](https://www.usenix.org/conference/osdi20/presentation/qiu)
- [ESEC/FSE '19] [Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs](https://dl.acm.org/doi/10.1145/3338906.3338961)
- [TSE '18] [Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study](https://ieeexplore.ieee.org/document/8580420/)


### Dependency and Tracing
- [ASE '21] [AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems](https://arxiv.org/abs/2109.04893) [[code](https://github.com/OpsPAI/aid)]
- [NSDI '07] [X-Trace: A Pervasive Network Tracing Framework](https://www.usenix.org/conference/nsdi-07/x-trace-pervasive-network-tracing-framework)
- [HotNets '06] [Discovering Dependencies for Network Management](https://www.microsoft.com/en-us/research/wp-content/uploads/2006/11/hotnets06.pdf)


### Anomaly/Failure Detection
- [POMACS '24] [The Tale of Errors in Microservices](https://dl.acm.org/doi/pdf/10.1145/3700436) [[data](https://zenodo.org/records/13947828)]
- [ICSE '23] [CONAN: Diagnosing Batch Failures for Cloud Systems](http://windows-microsoft-en.com/research/uploads/prod/2022/12/Conan_ICSE23_CR.pdf)
- [ISSRE '22] [Share or Not Share? Towards the Practicability of Deep Models for Unsupervised Anomaly Detection in Modern Online Systems](https://ieeexplore.ieee.org/document/9978953) [[code](https://github.com/IntelligentDDS/Uni-AD)]
- [ICSE '22] [Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching](https://arxiv.org/abs/2201.02944) [[code](https://github.com/OpsPAI/ADSketch)]
- [KDD '19] [Time-Series Anomaly Detection Service at Microsoft](https://dl.acm.org/doi/10.1145/3292500.3330680)
- [ESEC/FSE '18] [Identifying Impactful Service System Problems via Log Analysis](https://dl.acm.org/doi/10.1145/3236024.3236083)
- [CCS '17] [DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning](https://dl.acm.org/doi/10.1145/3133956.3134015)


### Root Cause Analysis
- [SIGCOMM '23] [Murphy: Performance Diagnosis of Distributed Cloud Applications](https://dl.acm.org/doi/abs/10.1145/3603269.3604877)
- [FSE '23] [Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data](https://dl.acm.org/doi/10.1145/3611643.3616249)
- [OSDI '18] [Capturing and Enhancing In Situ System Observability for Failure Detection](https://www.usenix.org/conference/osdi18/presentation/huang)

### Incident and Alarm Management
- [ATC '23] [AutoARTS: Taxonomy, Insights and Tools for Root Cause Labelling of Incidents in Microsoft Azure](https://www.usenix.org/conference/atc23/presentation/dogga)
- [ICSE '23] [Incident-aware Duplicate Ticket Aggregation for Cloud Systems](https://arxiv.org/abs/2302.09520)
- [SoCC '22] [How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service](https://dl.acm.org/doi/10.1145/3542929.3563482)
- [DSN '22] [Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems](https://arxiv.org/abs/2204.09670)
- [USENIX ATC '21] [Fighting the Fog of War: Automated Incident Detection for Cloud Systems](https://www.usenix.org/conference/atc21/presentation/li-liqun)
- [ASE '21] [Graph-based Incident Aggregation for Large-Scale Online Service Systems](https://arxiv.org/abs/2108.12179)
- [ASE '21] [Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings](https://arxiv.org/abs/2108.00344)
- [SIGCOMM '20] [Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing](https://dl.acm.org/doi/10.1145/3387514.3405867)
- [ASE '20] [How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems](https://dl.acm.org/doi/10.1145/3324884.3416624)
- [ESEC/FSE '20] [Identifying linked incidents in large-scale online service systems](https://dl.acm.org/doi/10.1145/3368089.3409768)
- [ESEC/FSE '20] [Efficient incident identification from multi-dimensional issue reports via meta-heuristic search](https://dl.acm.org/doi/abs/10.1145/3368089.3409741)
- [ESEC/FSE '20] [Real-time incident prediction for online service systems](https://dl.acm.org/doi/abs/10.1145/3368089.3409672)
- [ESEC/FSE '20] [How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems](https://dl.acm.org/doi/abs/10.1145/3368089.3417054)
- [ICSE '20] [Understanding and Handling Alert Storm for Online Service Systems](https://dl.acm.org/doi/10.1145/3377813.3381363)
- [HotOS '19] [What bugs cause production cloud incidents?](https://dl.acm.org/doi/10.1145/3317550.3321438)
- [ASE '19] [Continuous Incident Triage for Large-Scale Online Service Systems](https://dl.acm.org/doi/10.1109/ASE.2019.00042)
- [ICSE '19] [An empirical investigation of incident triage for online service systems](https://dl.acm.org/doi/10.1109/ICSE-SEIP.2019.00020)
- [WWW '19] [Outage Prediction and Diagnosis for Cloud Service Systems](https://dl.acm.org/doi/10.1145/3308558.3313501)
- [KDD '14] [Correlating Events with Time Series for Incident Diagnosis](https://dl.acm.org/doi/10.1145/2623330.2623374)


### Node, Disk, and Storage
- [FAST '23] [Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems](https://www.usenix.org/conference/fast23/presentation/lu) [[data](https://tianchi.aliyun.com/dataset/144479)]
- [DSN '21] [General Feature Selection for Failure Prediction in Large-scale SSD Deployment](https://ieeexplore.ieee.org/document/9505157)
- [TOSEM '20] [Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution](https://dl.acm.org/doi/10.1145/3385187)
- [ICDCS '20] [Toward Adaptive Disk Failure Prediction via Stream Mining](https://ieeexplore.ieee.org/document/9355640)
- [VLDB '20] [Diagnosing root causes of intermittent slow queries in cloud databases](https://dl.acm.org/doi/abs/10.14778/3389133.3389136)
- [USENIX ATC '19] [IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services](https://www.usenix.org/conference/atc19/presentation/panda)
- [NSDI '18] [Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure](https://www.usenix.org/conference/nsdi18/presentation/zhang-qiao)
- [ESEC/FSE '18] [Predicting Node Failure in Cloud Service Systems](https://dl.acm.org/doi/10.1145/3236024.3236060)
- [USENIX ATC '18] [Improving Service Availability of Cloud Systems by Predicting Disk Error](https://www.usenix.org/conference/atc18/presentation/xu-yong)


### VM Analysis and Management
- [NSDI '22] [CloudCluster: Unearthing the Functional Structure of a Cloud Service](https://www.usenix.org/conference/nsdi22/presentation/pang)
- [OSDI '20] [Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions](https://www.usenix.org/conference/osdi20/presentation/levy)


### Deployment
- [SOSP '21] [Understanding and Detecting Software Upgrade Failures in Distributed Systems](https://dl.acm.org/doi/10.1145/3477132.3483577)
- [NSDI '20] [Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure](https://www.usenix.org/conference/nsdi20/presentation/li)


## Datasets
- [CUHK] [Loghub](https://github.com/logpai/loghub)
- [Microsoft Azure] [Azure Public Dataset](https://github.com/Azure/AzurePublicDataset)
- [Tsinghua] [AIOps Challenge Dataset](http://iops.ai/dataset_list/)
- [Google] [Cluster Traces](https://github.com/google/cluster-data)
- [Backblaze] [Hard Drive Dataset](https://www.backblaze.com/b2/hard-drive-test-data.html)
- [Baidu] [SMART Dataset of PAKDD CUP 2020](https://pan.baidu.com/share/link?shareid=189977&uk=4278294944#list/path=%2FS.M.A.R.T.dataset)
- [Red Hat] [Ceph Device Telemetry Dataset](https://ceph.io/en/users/telemetry/device-telemetry/)
- [Alibaba] [SSD SMART logs and failure data](https://github.com/alibaba-edu/dcbrain/tree/master/ssd_open_data)
- [Alibaba] [Alibaba Cluster Trace Program](https://github.com/alibaba/clusterdata)
- [CloudWise] [GAIA Dataset](https://github.com/CloudWise-OpenSource/GAIA-DataSet)
- [Huawei Cloud] [Serverless traces](https://github.com/sir-lab/data-release?tab=readme-ov-file)


## Others

### Courses
- [Coursera] [Cloud-Based Network Design & Management Techniques](https://www.coursera.org/learn/cloud-based-network-design-and-management)
- [Tsinghua] [AIOps Course of Tsinghua](https://netman.aiops.org/courses/advanced-network-management-spring2021-course/)

--------------------------------------------------------------------------------