sourander/yskills

Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/sourander/yskills
Description:
---
license: cc-by-nc-4.0
language:
- en
---

## Data Source

The original data was downloaded from the Data in Brief data article [Digital skills among youth: A dataset from a three-wave longitudinal survey in six European countries](https://www.sciencedirect.com/science/article/pii/S2352340924003652).

## Dataset Processing Summary

Modifications applied:

1. **Wave Consolidation**: The original wide-format data (separate columns for waves 1-3) was converted to long format, so that each row holds one wave and all three waves are concatenated as rows.
2. **Target Variable**: Only rows with valid `RISK101` values (experience with cyberhate in the past year) were retained.
3. **Feature Selection**: 22 derived/aggregated columns were hand-picked, plus 3 demographic columns (`country`, `Age_year`, `GENDER`), avoiding multicollinear features (e.g., `lit_inf_pro` was kept but `skill_inf_pro` was dropped).
4. **Missing Value Handling**: Negative values (which represent various "not available" codes in ySKILLS) were replaced with NaN and imputed using the mode (most frequent value).
5. **Encoding & Scaling**:
   * Categorical columns (`country`, `Age_year`, `GENDER`): one-hot encoded (with the first category dropped)
   * Ordinal/continuous columns: standardized using `StandardScaler`
   * Binary columns (`skill_progr`, `RISK101`): left unchanged

### Research Task

Binary classification: Predict `RISK101` (cyberhate experience) from digital skills and demographic features.
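The processing steps above could be sketched roughly as follows. This is a minimal illustration on toy data, not the author's actual pipeline: the `_w1`/`_w2`/`_w3` column suffixes, the `id` column, the specific negative codes (`-99`, `-98`), and all values are assumptions made for the example. Only `RISK101`, `lit_inf_pro`, `country`, and `GENDER` are names taken from the description.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical wide-format toy data: one row per respondent, one column per wave.
# The "_w<N>" suffix and the codes -99 / -98 are assumptions for illustration.
wide = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["EE", "FI", "EE", "IT"],
    "GENDER": [1, 2, 2, 1],
    "lit_inf_pro_w1": [3, -99, 4, 2],
    "lit_inf_pro_w2": [4, 3, 4, -98],
    "lit_inf_pro_w3": [5, 3, -99, 3],
    "RISK101_w1": [0, 1, -99, 0],
    "RISK101_w2": [1, 1, 0, 0],
    "RISK101_w3": [0, -98, 1, 1],
})


def preprocess(df, target="RISK101", scaled=("lit_inf_pro",),
               categorical=("country", "GENDER")):
    # 1. Wave consolidation: wide -> long, one row per (respondent, wave).
    long = pd.wide_to_long(
        df, stubnames=[*scaled, target], i="id", j="wave", sep="_w"
    ).reset_index()

    # 2. Keep only rows whose target value is valid (non-negative).
    long = long[long[target] >= 0].copy()

    # 4. Negative feature values are "not available" codes: set them to NaN,
    #    then impute with the most frequent value of the column.
    for col in scaled:
        long[col] = long[col].where(long[col] >= 0)
        long[col] = long[col].fillna(long[col].mode().iloc[0])

    # 5a. One-hot encode categorical columns, dropping the first category.
    encoded = pd.get_dummies(long, columns=list(categorical), drop_first=True)

    # 5b. Standardize ordinal/continuous columns; the binary target is untouched.
    cols = list(scaled)
    encoded[cols] = StandardScaler().fit_transform(encoded[cols])
    return encoded


encoded = preprocess(wide)
print(encoded.head())
```

In this sketch the scaler is fitted on all rows for brevity. When the data is used for the classification task, fitting the imputer and scaler on the training split only would avoid leaking information from the test split.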

Provider:
sourander