A Multi-Source Integrated Benchmark for Student Academic Risk Prediction (Education)
Source: Mendeley Data (indexed 2026-05-21)
Download link:
https://data.mendeley.com/datasets/8tvbwh3gvb
Description:
The ARPS Integrated Dataset is a multidimensional benchmark for Educational Data Mining (EDM), consolidating 11,523 records across 40 features. Built with a SQL Server 2022 ETL pipeline, it integrates six heterogeneous sources, including LMS behavioral logs (xAPI-Edu-Data), secondary-school performance records (UCI), and multi-factor performance datasets (Kaggle), into a single analytical repository.
Organized as a star schema, the dataset has a 0% null rate. It profiles each student across seven thematic domains: Identifiers, Demographics, Academic Performance, Behavioral Engagement, Family/Social Background, Study Characteristics, and Environmental Factors. This integration enables correlation analysis between digital behaviors, socioeconomic status, and academic outcomes. By providing a unified view of these disparate factors, the repository supports the development of predictive frameworks that categorize students into tiered risk levels (Low, Medium, and High).
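As a rough illustration of how a star schema flattens into one analytical table, the sketch below joins a fact table to a dimension table with pandas. The table names (`fact_performance`, `dim_student`), the column names, and the rows are all hypothetical, since the description does not list the actual schema.

```python
import pandas as pd

# Hypothetical fact table: one row per student with measured outcomes.
fact_performance = pd.DataFrame({
    "student_id": [1, 2, 3],
    "gpa": [3.4, 2.1, 2.9],
    "absences": [2, 14, 6],
})

# Hypothetical dimension table: descriptive attributes keyed by student_id.
dim_student = pd.DataFrame({
    "student_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "parental_education": ["bachelor", "secondary", "master"],
})

# validate="many_to_one" makes pandas raise if the dimension key is not unique,
# which is the property a star-schema dimension is expected to have.
flat = fact_performance.merge(
    dim_student, on="student_id", how="left", validate="many_to_one"
)

# Fraction of missing cells in the flattened table.
null_rate = flat.isna().to_numpy().mean()
print(flat.shape, null_rate)
```

A left join keeps every fact row, so an unmatched key would show up as missing values in the dimension columns rather than as a silently dropped record.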
The dataset includes mixed data types: numerical (grades, GPA, absences), categorical (gender, parental education), and identifier fields. It is distributed in multiple formats, including SQL Server tables and UTF-8 encoded CSV files, ensuring compatibility with Python (pandas), R, and standard statistical tools.
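A minimal sketch of how the CSV distribution might be checked against the figures stated above (11,523 records, 40 features, no nulls). The helper `profile` and the inline sample columns are assumptions for illustration; the real file name and column names are not given in the description, so the example reads a small in-memory CSV instead.

```python
import io

import pandas as pd


def profile(df):
    """Summarize row/column counts, missing cells, and column kinds."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "null_cells": int(df.isna().sum().sum()),
        "numeric": df.select_dtypes(include="number").columns.tolist(),
        "non_numeric": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Hypothetical sample standing in for the real CSV; for the actual download,
# pass the file path to pd.read_csv (UTF-8 is the default encoding).
sample_csv = io.StringIO(
    "student_id,gender,parental_education,gpa,absences\n"
    "S001,F,bachelor,3.4,2\n"
    "S002,M,secondary,2.1,14\n"
)

df = pd.read_csv(sample_csv)
report = profile(df)
print(report)
```

Run against the real file, the report would be expected to show 11,523 rows, 40 columns, and 0 null cells if the description holds.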
Created:
2026-04-27



