Experimental Dataset for Imbalanced Classification: Application of Relabeling & Ranking Algorithm
DataCite Commons | Updated 2025-05-01 | Indexed 2025-05-17
Download link:
https://data.mendeley.com/datasets/pb3jd7vz9z
Description:
The datasets in the study "Relabeling & Ranking Algorithm for Imbalanced Classification" were sourced from several public repositories, including 1) the Knowledge Extraction based on Evolutionary Learning (KEEL) data repository (J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, 2011), 2) the UCI machine learning repository (Dua and Graff, 2017), 3) the HDDT collection (Cieslak et al., 2012), and 4) previous studies (Radivojac et al., 2004; Kubat et al., 1998; Woods et al., 1993). These datasets are notable for their class imbalance and are widely recognized in the academic literature for this feature.
Two main criteria were used to select these datasets:
Large-Scale Focus: Preference was given to large-scale datasets, a category often overlooked in previous studies. Here, large-scale means more than 1,000 instances: 10 of the 16 real-world datasets exceed this threshold, and four have more than 10,000 instances.
High Imbalance Ratio (IR): The primary focus was on highly imbalanced datasets, specifically those with an IR greater than 9.
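The two selection criteria can be checked with a few lines of Python. This is only a sketch: the description does not define IR, so the code assumes the common convention of majority-class count divided by minority-class count. The function names and the toy label list are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count / minority-class count.

    Assumption: this is the IR definition intended by the dataset
    description; the description itself does not spell it out.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an IR")
    return max(counts.values()) / min(counts.values())


def is_highly_imbalanced(labels, min_ir=9.0):
    """Criterion 2: IR strictly greater than 9."""
    return imbalance_ratio(labels) > min_ir


def is_large_scale(labels, min_instances=1000):
    """Criterion 1: more than 1,000 instances."""
    return len(labels) > min_instances


# Hypothetical toy labels: 950 negatives and 50 positives.
labels = [0] * 950 + [1] * 50
print(imbalance_ratio(labels))       # 19.0
print(is_highly_imbalanced(labels))  # True
print(is_large_scale(labels))        # False (exactly 1,000 instances)
```

Both thresholds are treated as strict inequalities, matching the wording "more than 1,000" and "greater than 9".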
The datasets were categorized based on the types of feature variables they contain:
Continuous datasets: All feature variables are continuous.
Categorical datasets: All feature variables are categorical.
Mixed datasets: A combination of continuous and categorical feature variables.
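The three-way categorization above can be illustrated with a small sketch. How the study decided whether a feature is continuous or categorical is not stated, so the rule below (numeric values count as continuous, everything else as categorical) is an assumed heuristic, and the function names and toy columns are hypothetical.

```python
from numbers import Real


def feature_kind(values):
    """Classify one feature column.

    Assumed heuristic: a column whose values are all real numbers
    (excluding booleans) is continuous; anything else is categorical.
    """
    numeric = all(isinstance(v, Real) and not isinstance(v, bool) for v in values)
    return "continuous" if numeric else "categorical"


def dataset_category(columns):
    """Map a dataset (feature name -> list of values) to its category."""
    kinds = {feature_kind(values) for values in columns.values()}
    if not kinds:
        raise ValueError("dataset has no feature columns")
    if kinds == {"continuous"}:
        return "continuous"
    if kinds == {"categorical"}:
        return "categorical"
    return "mixed"


# Hypothetical toy dataset with one numeric and one string-valued feature.
toy = {"age": [31.0, 45.5, 28.2], "colour": ["red", "blue", "red"]}
print(dataset_category(toy))  # mixed
```

Integer-coded categorical features would be misread as continuous under this heuristic, so in practice the feature types are better taken from each dataset's own documentation.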
Provider:
Mendeley Data
Created:
2024-01-16



