Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale Systems

Name: Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale Systems
Creator: figshare
Published: 2025-05-01 10:33:37
License: No description available

DataCite Commons | Updated 2025-05-01 | Indexed 2025-05-07

Download link:

https://figshare.com/articles/dataset/Synthetic_Data_Generation_for_Hard_Drive_Failure_Prediction_in_Large-scale_Systems/28878830/1

Description:

Accurate failure prediction is critical for the reliability of storage systems in HPC facilities and data centers. This study addresses data scarcity, privacy concerns, and class imbalance in HDD failure datasets by leveraging synthetic data generation. We propose an end-to-end framework that generates synthetic storage data using Generative Adversarial Networks and diffusion models. We implement a data segmentation approach that accounts for temporal variation in disk access, producing high-fidelity synthetic data that replicates the nuanced temporal and feature-specific patterns of disk failures. Experimental results show that the synthetic data achieves similarity scores of 0.81–0.89 and enhances failure prediction performance, with up to 3% improvement in accuracy and 2% in ROC-AUC. With only minor performance drops versus real-data training, synthetically trained models prove viable for predictive maintenance.

Provider:

figshare

Created:

2025-04-27
