
S1 Datasets -

Source: Figshare. Updated 2025-02-10. Indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/S1_Datasets_-/28383029
Description:
In recent years, imbalanced data has become an increasingly prominent challenge in machine learning because it degrades the performance of classification algorithms. This study proposes a novel data-level oversampling method, Cluster-Based Reduced Noise SMOTE (CRN-SMOTE), to address this issue. CRN-SMOTE combines SMOTE, which oversamples the minority classes, with a novel cluster-based noise reduction technique. This noise reduction approach requires that the samples of each category form one or two clusters, a property that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, the Matthews correlation coefficient (MCC), F1-score, precision, and recall. The results show that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements on the QSAR and Maternal Health Risk datasets, indicating that it is effective at improving imbalanced classification performance. Overall, CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, with average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall when SMOTE's number of neighbors is set to 5.
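For readers unfamiliar with the five evaluation metrics named in the description, the following is a minimal sketch of how each can be computed from a single binary confusion matrix. It is not the authors' evaluation code: the function name `binary_metrics` and the example counts are hypothetical, and the sketch covers only the binary case, so a multi-class dataset would need a per-class or averaged variant.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, Cohen's kappa, and MCC.

    Hypothetical helper for illustration; inputs are the four cells of a
    binary confusion matrix (positive class = minority class).
    """
    n = tp + fp + fn + tn

    # Precision and recall, guarding against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F1 is the harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, not results from the dataset.
    for name, value in binary_metrics(tp=40, fp=10, fn=20, tn=30).items():
        print(f"{name}: {value:.4f}")
```

Kappa and MCC both account for agreement expected by chance, which is presumably why they are reported alongside F1, precision, and recall for imbalanced data, where a classifier that mostly predicts the majority class can still score well on accuracy alone.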

Created: 2025-02-10