krishal07/student-performance
Source: Hugging Face (updated 2026-04-07, indexed 2026-04-12)
Download link:
https://hf-mirror.com/datasets/krishal07/student-performance
Resource description:
---
license: cc0-1.0
size_categories:
- 10M<n<100M
---
# Student Performance Dataset
## Dataset Description
This dataset contains **ten million** synthetically generated student performance records, designed to mimic real-world educational data at the high-school level. It includes detailed demographic, socioeconomic, academic, behavioral, and school-context features for each student, suitable for benchmarking, machine learning, educational research, and exploratory data analysis.
## File Information
| Split | File Name | Number of Records | Approx. Size |
|-------------|--------------------------------------|---------------------------|--------------|
| **Train** | `train.csv` | 8,000,774 | ~1.23 GB |
| **Validation** | `validation.csv` | 999,229 | ~158 MB |
| **Test** | `test.csv` | 999,997 | ~158 MB |
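
The training file is over a gigabyte, so a quick integrity check is easier to run as a stream than as a full in-memory load. The sketch below counts data rows per file and compares them with the counts in the table above. It assumes each CSV starts with a single header row; the helper names `count_records` and `check_split` are illustrative and not part of the dataset.

```python
import csv
from pathlib import Path

# Record counts as documented in the File Information table.
EXPECTED_ROWS = {
    "train.csv": 8_000_774,
    "validation.csv": 999_229,
    "test.csv": 999_997,
}


def count_records(path):
    """Count data rows in a CSV file, streaming so memory use stays flat."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for _ in reader)


def check_split(path):
    """Return (actual, expected) row counts for one split file."""
    path = Path(path)
    return count_records(path), EXPECTED_ROWS[path.name]


# The documented split sizes add up to exactly ten million records.
assert sum(EXPECTED_ROWS.values()) == 10_000_000
```

Calling `check_split("train.csv")` on a downloaded copy returns a pair that should match if the download is complete.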
## Columns & Descriptions
| Column Name | Description |
| --------------------- | -------------------------------------------------------------------------------- |
| **Age** | Student’s age in years (14–18). |
| **Grade** | Grade level (9–12), derived from age. |
| **Gender** | Student gender (`Female`, `Male`). |
| **Race** | Race/ethnicity (`White`, `Hispanic`, `Black`, `Asian`, `Two-or-more`, `Other`). |
| **SES_Quartile** | Socioeconomic status quartile (1 = lowest, 4 = highest). |
| **ParentalEducation** | Highest education of parent/guardian (`<HS`, `HS`, `SomeCollege`, `Bachelors+`). |
| **SchoolType** | Type of school attended (`Public`, `Private`). |
| **Locale** | School location (`Suburban`, `City`, `Rural`, `Town`). |
| **TestScore_Math** | Math achievement score (0–100). |
| **TestScore_Reading** | Reading achievement score (0–100). |
| **TestScore_Science** | Science achievement score (0–100). |
| **GPA** | Cumulative Grade Point Average on a 0.0–4.0 scale. |
| **AttendanceRate** | Fraction of school days attended (0.70–1.00). |
| **StudyHours** | Average self-reported homework/study hours per day (0–4). |
| **InternetAccess** | Home internet access (1 = yes, 0 = no). |
| **Extracurricular** | Participation in clubs/sports (1 = yes, 0 = no). |
| **PartTimeJob** | Holds a part-time job (1 = yes, 0 = no). |
| **ParentSupport** | Regular parental help with homework (1 = yes, 0 = no). |
| **Romantic** | Currently in a romantic relationship (1 = yes, 0 = no). |
| **FreeTime** | Amount of free time after school on a scale from 1 (low) to 5 (high). |
| **GoOut** | Frequency of going out with friends on a scale from 1 (low) to 5 (high). |
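
The documented categories and ranges can double as a lightweight schema check. The following is a minimal sketch, assuming the CSV headers match the column names above exactly and that each row arrives as a dict of strings (for example from `csv.DictReader`). `validate_row` is a hypothetical helper, and the range check does not enforce that binary or scale columns hold whole numbers.

```python
# Allowed values and ranges, transcribed from the column table above.
CATEGORIES = {
    "Gender": {"Female", "Male"},
    "Race": {"White", "Hispanic", "Black", "Asian", "Two-or-more", "Other"},
    "ParentalEducation": {"<HS", "HS", "SomeCollege", "Bachelors+"},
    "SchoolType": {"Public", "Private"},
    "Locale": {"Suburban", "City", "Rural", "Town"},
}
RANGES = {
    "Age": (14, 18),
    "Grade": (9, 12),
    "SES_Quartile": (1, 4),
    "TestScore_Math": (0, 100),
    "TestScore_Reading": (0, 100),
    "TestScore_Science": (0, 100),
    "GPA": (0.0, 4.0),
    "AttendanceRate": (0.70, 1.00),
    "StudyHours": (0, 4),
    "InternetAccess": (0, 1),
    "Extracurricular": (0, 1),
    "PartTimeJob": (0, 1),
    "ParentSupport": (0, 1),
    "Romantic": (0, 1),
    "FreeTime": (1, 5),
    "GoOut": (1, 5),
}


def validate_row(row):
    """Return a list of problems found in one row (a dict of strings)."""
    problems = []
    for column, allowed in CATEGORIES.items():
        if row.get(column) not in allowed:
            problems.append(f"{column}: unexpected value {row.get(column)!r}")
    for column, (low, high) in RANGES.items():
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            problems.append(f"{column}: missing or non-numeric")
            continue
        if not low <= value <= high:
            problems.append(f"{column}: {value} outside [{low}, {high}]")
    return problems
```

An empty list means the row is consistent with the documented schema; anything else names the offending column.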
## Usage
This dataset is ideal for:
- **Educational Research:** Model how demographics and SES impact academic outcomes.
- **Machine Learning:** Train and evaluate predictive models (regression, classification, ordinal regression) on GPA, test scores, or attendance.
- **Clustering & Segmentation:** Identify student subgroups (e.g., high-achievers, at-risk) for targeted interventions.
- **Fairness Analysis:** Examine performance disparities across sensitive groups (race, gender, SES).
- **Policy Simulation:** Estimate the effects of interventions (e.g., increased study time, universal internet access).
## Example Workflows
1. **Supervised Regression:** Predict GPA from study hours, attendance rate, and parental education.
2. **Classification:** Identify students at risk of chronic absenteeism (AttendanceRate < 0.90).
3. **Clustering:** Segment students into performance-based clusters for personalized learning paths.
4. **Bias Mitigation:** Compare model performance across race/ethnicity groups and apply fairness techniques.
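
Workflows 2 and 4 share a first step: derive the absenteeism label, then see how often it occurs in each group. A minimal sketch of that step follows, assuming rows are dicts keyed by the column names above; the function names are illustrative, and comparing raw rates is only a starting point for a fairness analysis, not a mitigation technique.

```python
from collections import defaultdict

ABSENCE_THRESHOLD = 0.90  # cutoff stated in workflow 2


def is_at_risk(row):
    """Binary label: True when AttendanceRate is below the threshold."""
    return float(row["AttendanceRate"]) < ABSENCE_THRESHOLD


def at_risk_rate_by_group(rows, group_column="Race"):
    """Fraction of at-risk students within each value of group_column."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for row in rows:
        group = row[group_column]
        totals[group] += 1
        flagged[group] += is_at_risk(row)  # True counts as 1
    return {group: flagged[group] / totals[group] for group in totals}
```

Passing `group_column="SES_Quartile"` or `"Gender"` gives the same breakdown for the other sensitive attributes.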
## Data Preprocessing Tips
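- **One-Hot Encoding:** Apply to categorical features (`Gender`, `Race`, `SchoolType`, `Locale`, `ParentalEducation`). The `ParentalEducation` levels are ordered, so ordinal encoding is a reasonable alternative for that column.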
- **Ordinal Encoding:** Treat `SES_Quartile`, `FreeTime`, and `GoOut` as ordered integers; their numeric codes already reflect rank.
- **Scaling:** Standardize continuous features (`TestScore_*`, `GPA`, `AttendanceRate`, `StudyHours`).
- **Shuffling:** The splits are already random, but consider re-shuffling the training data before each epoch.
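
The encoding and scaling tips can be sketched without any third-party dependency. The example below is illustrative only: in practice a library such as pandas or scikit-learn would usually handle this, and the helper names are hypothetical. It fits the scaling statistics on training values and reuses them for other splits, a common practice that these tips do not state explicitly.

```python
from statistics import fmean, pstdev

# Category order is an arbitrary but fixed choice for the indicator columns.
LOCALES = ["Suburban", "City", "Rural", "Town"]


def one_hot(value, categories):
    """Encode one categorical value as a list of 0/1 indicators."""
    return [1 if value == category else 0 for category in categories]


def fit_scaler(train_values):
    """Compute the mean and population standard deviation of a column."""
    mean = fmean(train_values)
    spread = pstdev(train_values) or 1.0  # constant column: avoid dividing by zero
    return mean, spread


def apply_scaler(values, mean, spread):
    """Standardize values using previously fitted statistics."""
    return [(value - mean) / spread for value in values]
```

For example, `one_hot("Rural", LOCALES)` yields `[0, 0, 1, 0]`, and the pair returned by `fit_scaler` on training `GPA` values can be passed to `apply_scaler` for the validation and test splits.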
## License
This dataset is entirely synthetic and contains no real personal data. It is released under the [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/) license for any research, educational, or commercial use.
## Acknowledgements
Generated to support reproducible educational data science and machine-learning research. Please cite this project if you use the data in your work.
Provider:
krishal07



