sourander/yskills
Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/sourander/yskills
Resource description:
---
license: cc-by-nc-4.0
language:
- en
---
## Data Source
The original data were downloaded from the dataset accompanying the Data in Brief article [Digital skills among youth: A dataset from a three-wave longitudinal survey in six European countries](https://www.sciencedirect.com/science/article/pii/S2352340924003652).
## Dataset Processing Summary
Modifications Applied:
1. Wave Consolidation: The original wide-format data (separate columns for waves 1-3) was converted to long format, concatenating all three waves into rows.
2. Target Variable: Only rows with a valid `RISK101` value (experience with cyberhate in the past year) were retained.
3. Feature Selection: 22 derived/aggregated columns were selected, plus 3 demographic columns (`country`, `Age_year`, `GENDER`). Where features were multicollinear, only one was kept (e.g., `lit_inf_pro` was kept and `skill_inf_pro` was dropped).
4. Missing Value Handling: Negative values (which represent various "not available" codes in ySKILLS) were replaced with NaN and then imputed with the mode (most frequent value) of the column.
5. Encoding & Scaling:
   * Categorical columns (`country`, `Age_year`, `GENDER`): One-hot encoded, with the first category dropped
   * Ordinal/continuous columns: Standardized using StandardScaler
   * Binary columns (`skill_progr`, `RISK101`): Left unchanged
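The steps above can be sketched on toy data as follows. This is a minimal illustration, not the script used to build the dataset: the `w1_`/`w2_` column prefixes, the `id` column, the `-99` missing code, and the order of the steps are all assumptions made for the example, and only one skill column is shown.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy wide-format data. The "w1_"/"w2_" prefixes, the "id" column and the
# -99 missing code are hypothetical, chosen only for illustration.
wide = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["EE", "FI", "DE", "FI"],
    "w1_RISK101": [0, 1, -99, 0],
    "w2_RISK101": [1, -99, 0, 0],
    "w1_lit_inf_pro": [3.0, -99.0, 4.0, 2.0],
    "w2_lit_inf_pro": [4.0, 3.0, -99.0, 2.0],
})
wave_vars = ["RISK101", "lit_inf_pro"]   # variables measured in every wave
feature_cols = ["lit_inf_pro"]           # ordinal/continuous features

# Step 1: wide to long, one block of rows per wave, stacked vertically.
parts = []
for wave in (1, 2):
    part = wide[["id", "country"]].copy()
    for var in wave_vars:
        part[var] = wide[f"w{wave}_{var}"]
    parts.append(part)
long_df = pd.concat(parts, ignore_index=True)

# Step 4 (first half): negative "not available" codes become NaN.
long_df[wave_vars] = long_df[wave_vars].mask(long_df[wave_vars] < 0)

# Step 2: keep only rows with a valid target value.
long_df = long_df.dropna(subset=["RISK101"]).reset_index(drop=True)
long_df["RISK101"] = long_df["RISK101"].astype(int)

# Step 4 (second half): impute remaining gaps with the column mode.
for col in feature_cols:
    long_df[col] = long_df[col].fillna(long_df[col].mode().iloc[0])

# Step 5: one-hot encode categoricals (first category dropped) and
# standardize the ordinal/continuous columns; the binary target is untouched.
processed = pd.get_dummies(long_df, columns=["country"], drop_first=True)
processed[feature_cols] = StandardScaler().fit_transform(processed[feature_cols])
```

With `drop_first=True`, the alphabetically first category (here `DE`) becomes the implicit reference level, so a row with all country dummies set to zero belongs to that category.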
### Research Task
Binary classification: Predict `RISK101` (cyberhate experience) from digital skills and demographic features.
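A baseline for this task might look like the sketch below. The dataset card does not prescribe a model, so logistic regression is only an illustrative choice, and the data here are synthetic: the two feature columns borrow names from the card, but their values and their relation to `RISK101` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset (values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "lit_inf_pro": rng.normal(size=n),          # standardized skill score
    "skill_progr": rng.integers(0, 2, size=n),  # binary feature
})
logit = 0.8 * df["lit_inf_pro"] - 0.5
df["RISK101"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Separate the features from the binary target.
X = df.drop(columns="RISK101")
y = df["RISK101"]

# Stratified hold-out split keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the baseline classifier and score it on the held-out rows.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted P(RISK101 = 1)
auc = roc_auc_score(y_test, proba)
print(f"Hold-out ROC AUC: {auc:.3f}")
```

Because the three waves are stacked as rows, the same respondent may appear more than once. If a respondent identifier is available, a grouped split (for example scikit-learn's `GroupShuffleSplit`) may give a more honest estimate than a purely random one.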
Provider:
sourander



