This repository contains the source code for measuring server failures caused by DRAM errors and for predicting such failures, based on the dataset of DRAM errors and server failures collected at Alibaba.
Zhinan Cheng, Shujie Han, Patrick P. C. Lee, Xin Li, Jiongzhou Liu, and Zhan Li. "An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers." Proceedings of the 41st International Symposium on Reliable Distributed Systems (SRDS 2022), September 2022.
- data/: the dataset downloaded from Alibaba, spanning eight months, including:
  - mcelog.tar.gz: the collected mcelog that describes details of DRAM errors.
  - inventory.tar.gz: the inventory log that describes the hardware configurations of DIMMs and servers.
  - trouble_ticket.tar.gz: the trouble tickets that describe server failures due to DRAM errors.
- measurement/: includes the source code of the predictable analysis of server failures due to DRAM errors.
- prediciton/: includes the source code of our workflow for server failure prediction.
- model/: includes all the trained models we used in our paper.
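The three archives under data/ need to be unpacked before the logs can be inspected. The snippet below is a minimal sketch of one way to do that with Python's standard library; it is not part of the repository's own scripts. The function name `extract_archives` and the output directory `data/extracted` are assumptions made for illustration, and the internal layout of each archive is not assumed.

```python
import tarfile
from pathlib import Path

# Archive names as listed in this README.
ARCHIVES = ("mcelog.tar.gz", "inventory.tar.gz", "trouble_ticket.tar.gz")
SUFFIX = ".tar.gz"


def extract_archives(data_dir="data", out_dir="data/extracted"):
    """Unpack each archive found in data_dir into its own subdirectory.

    Returns the list of directories that were extracted. Both default
    paths are hypothetical; adjust them to your local checkout.
    """
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    extracted = []
    for name in ARCHIVES:
        archive = data_dir / name
        if not archive.is_file():
            print(f"skipping {name}: not found in {data_dir}")
            continue
        # e.g. mcelog.tar.gz -> <out_dir>/mcelog
        target = out_dir / name[: -len(SUFFIX)]
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            # Use the safer "data" extraction filter where this Python supports it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(target, **kwargs)
        extracted.append(target)
    return extracted


if __name__ == "__main__":
    for path in extract_archives():
        print(f"extracted to {path}")
```

Extracting each archive into a separate subdirectory keeps the mcelog, inventory, and trouble-ticket files apart, which may help if file names overlap between archives.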
Please email Zhinan Cheng ([email protected]) if you have any questions.