
krishal07/student-performance

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/krishal07/student-performance
Resource description:
---
license: cc0-1.0
size_categories:
- 10M<n<100M
---

# Student Performance Dataset

## Dataset Description

This dataset contains **ten million** synthetically generated student performance records, designed to mimic real-world educational data at the high-school level. It includes detailed demographic, socioeconomic, academic, behavioral, and school-context features for each student, suitable for benchmarking, machine learning, educational research, and exploratory data analysis.

## File Information

| Split          | File Name        | Number of Records | Approx. Size |
| -------------- | ---------------- | ----------------- | ------------ |
| **Train**      | `train.csv`      | 8,000,774         | ~1.23 GB     |
| **Validation** | `validation.csv` | 999,229           | ~158 MB      |
| **Test**       | `test.csv`       | 999,997           | ~158 MB      |

## Columns & Descriptions

| Column Name           | Description                                                                      |
| --------------------- | -------------------------------------------------------------------------------- |
| **Age**               | Student’s age in years (14–18).                                                  |
| **Grade**             | Grade level (9–12), derived from age.                                            |
| **Gender**            | Student gender (`Female`, `Male`).                                               |
| **Race**              | Race/ethnicity (`White`, `Hispanic`, `Black`, `Asian`, `Two-or-more`, `Other`).  |
| **SES_Quartile**      | Socioeconomic status quartile (1 = lowest, 4 = highest).                         |
| **ParentalEducation** | Highest education of parent/guardian (`<HS`, `HS`, `SomeCollege`, `Bachelors+`). |
| **SchoolType**        | Type of school attended (`Public`, `Private`).                                   |
| **Locale**            | School location (`Suburban`, `City`, `Rural`, `Town`).                           |
| **TestScore_Math**    | Math achievement score (0–100).                                                  |
| **TestScore_Reading** | Reading achievement score (0–100).                                               |
| **TestScore_Science** | Science achievement score (0–100).                                               |
| **GPA**               | Cumulative Grade Point Average on a 0.0–4.0 scale.                               |
| **AttendanceRate**    | Fraction of school days attended (0.70–1.00).                                    |
| **StudyHours**        | Average self-reported homework/study hours per day (0–4).                        |
| **InternetAccess**    | Home internet access (1 = yes, 0 = no).                                          |
| **Extracurricular**   | Participation in clubs/sports (1 = yes, 0 = no).                                 |
| **PartTimeJob**       | Holds a part-time job (1 = yes, 0 = no).                                         |
| **ParentSupport**     | Regular parental help with homework (1 = yes, 0 = no).                           |
| **Romantic**          | Currently in a romantic relationship (1 = yes, 0 = no).                          |
| **FreeTime**          | Amount of free time after school on a scale from 1 (low) to 5 (high).            |
| **GoOut**             | Frequency of going out with friends on a scale from 1 (low) to 5 (high).         |

## Usage

This dataset is ideal for:

- **Educational Research:** Model how demographics and SES impact academic outcomes.
- **Machine Learning:** Train and evaluate predictive models (regression, classification, ordinal) on GPA, test scores, or attendance.
- **Clustering & Segmentation:** Identify student subgroups (e.g., high-achievers, at-risk) for targeted interventions.
- **Fairness Analysis:** Examine performance disparities across sensitive groups (race, gender, SES).
- **Policy Simulation:** Estimate the effects of interventions (e.g., increased study time, universal internet access).

## Example Workflows

1. **Supervised Regression:** Predict GPA from study hours, attendance rate, and parental education.
2. **Classification:** Identify students at risk of chronic absenteeism (AttendanceRate < 0.90).
3. **Clustering:** Segment students into performance-based clusters for personalized learning paths.
4. **Bias Mitigation:** Compare model performance across race/ethnicity groups and apply fairness techniques.

## Data Preprocessing Tips

- **One-Hot Encoding:** For categorical features (`Gender`, `Race`, `SchoolType`, `Locale`, `ParentalEducation`).
- **Ordinal Encoding:** Map `SES_Quartile`, `FreeTime`, `GoOut` directly to integers.
- **Scaling:** Standardize continuous features (`TestScore_*`, `GPA`, `AttendanceRate`, `StudyHours`).
- **Shuffle Splits:** Although splits are random, consider re-shuffling training data before each epoch.
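
The preprocessing tips above could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not the dataset author's pipeline. The helper names (`fit_scaler`, `encode_row`, `shuffled_epochs`) are hypothetical, and the code assumes each record arrives as a dict of strings keyed by the column names in the table (as `csv.DictReader` would produce) and that the ordinal columns are stored as numeric strings.

```python
import random
import statistics

# Category lists copied from the column table. Fixing them up front keeps the
# one-hot layout identical across the train, validation, and test splits.
NOMINAL = {
    "Gender": ["Female", "Male"],
    "Race": ["White", "Hispanic", "Black", "Asian", "Two-or-more", "Other"],
    "SchoolType": ["Public", "Private"],
    "Locale": ["Suburban", "City", "Rural", "Town"],
    "ParentalEducation": ["<HS", "HS", "SomeCollege", "Bachelors+"],
}
ORDINAL = ["SES_Quartile", "FreeTime", "GoOut"]
CONTINUOUS = [
    "TestScore_Math", "TestScore_Reading", "TestScore_Science",
    "GPA", "AttendanceRate", "StudyHours",
]


def fit_scaler(rows):
    """Compute (mean, population std) per continuous column from training rows."""
    stats = {}
    for col in CONTINUOUS:
        values = [float(row[col]) for row in rows]
        std = statistics.pstdev(values)
        # Fall back to 1.0 so a constant column does not divide by zero.
        stats[col] = (statistics.fmean(values), std or 1.0)
    return stats


def encode_row(row, stats):
    """Turn one record (dict of strings) into a flat dict of numeric features."""
    out = {}
    # One-hot encode each nominal column against its documented categories.
    for col, categories in NOMINAL.items():
        for cat in categories:
            out[f"{col}={cat}"] = 1.0 if row[col] == cat else 0.0
    # Ordinal columns map directly to integers (assumes numeric strings).
    for col in ORDINAL:
        out[col] = int(float(row[col]))
    # Standardize continuous columns with statistics fitted on training data.
    for col in CONTINUOUS:
        mean, std = stats[col]
        out[col] = (float(row[col]) - mean) / std
    return out


def shuffled_epochs(rows, n_epochs, seed=0):
    """Yield a freshly shuffled copy of the rows for each epoch."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = list(rows)
        rng.shuffle(order)
        yield order
```

One way to use this is to read `train.csv` with `csv.DictReader`, call `fit_scaler` on the training rows (or on a sample that fits in memory), and then apply `encode_row` with those same statistics to the validation and test rows, so that nothing from the held-out splits leaks into the scaling.
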
## License

This dataset is entirely synthetic and contains no real personal data. It is released under the [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/) license for any research, educational, or commercial use.

## Acknowledgements

Generated to support reproducible educational data science and machine-learning research. Please cite this project if you use the data in your work.
Provider:
krishal07