Mystic777/employee-burnout-turnover-prediction-800k
Source: Hugging Face. Updated 2026-03-16. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Mystic777/employee-burnout-turnover-prediction-800k
Resource description:
---
language:
- en
tags:
- hr-analytics
- employee-dataset
- workforce
- attrition-prediction
- turnover-risk
- skills-dataset
- human-resources
- salary
- tabular-data
- non-commercial
- wages
- payroll
- compensation
- employee-turnover
- employee-retention
- attrition
- attrition-analysis
- turnover
- workforce-analytics
- workforce-planning
- workforce-data
- hr
- hr-data
- hr-metrics
- human-capital
- employee-performance
- employee-engagement
- employee-satisfaction
- job-satisfaction
- job-level
- job-role
- career-progression
- promotion-analysis
- career-paths
- staff-data
- staff-engagement
- workplace
- workplace-data
- workplace-analytics
- organizational-behavior
- organizational-data
- company-data
- employee-feedback
- team-performance
- team-sentiment
- collaboration
- skills
- technical-skills
- soft-skills
- training
- training-participation
- burnout
- burnout-risk
- stress-level
- well-being
- salary-data
- salary-distribution
- salary-analysis
- salary-range
- salary-structure
- wages-data
- wages-analysis
- pay
- pay-data
- pay-gap
- pay-analysis
- compensation-data
- compensation-analysis
- compensation-structure
- payroll-analysis
- payroll-dataset
- employee-data
- employee-analysis
- employee-attrition
- employee-churn
- employee-risks
- employee-stats
- employee-tracking
- employee-burnout
- employee-wellbeing
- employee-metrics
- employee-behavior
- employee-outcomes
- workforce-dataset
- workforce-risk
- workforce-behavior
- workforce-trends
- workforce-performance
- workforce-stats
- workforce-churn
- workforce-outcomes
- hr-dataset
- hr-insights
- hr-prediction
- hr-trends
- hr-forecasting
- hr-statistics
- hr-ml
- hr-ai
- hr-research
- hr-knowledge
- human-resources-data
- human-resources-analysis
- human-resources-insights
- workplace-attrition
- workplace-risk
- workplace-behavior
- workplace-outcomes
- workplace-performance
- workplace-stress
- workplace-turnover
- workplace-wellbeing
- organizational-analysis
- organizational-performance
- organizational-risk
- organizational-stats
- organizational-trends
- staff-analysis
- staff-turnover
- staff-retention
- staff-performance
- staff-satisfaction
- staff-metrics
- staff-wellbeing
- staff-trends
- burnout-analysis
- burnout-dataset
- burnout-prediction
- burnout-metrics
- burnout-trends
- satisfaction
- satisfaction-analysis
- satisfaction-data
- satisfaction-metrics
- job-metrics
- job-analysis
- job-turnover
- job-risk
- job-forecasting
- job-prediction
- attrition-risk
- attrition-data
- attrition-metrics
- attrition-trends
- attrition-forecast
- turnover-dataset
- turnover-data
- turnover-analysis
- turnover-trends
- turnover-forecast
- turnover-metrics
- retention
- retention-data
- retention-analysis
- retention-metrics
size_categories:
- 100K<n<1M
task_categories:
- tabular-regression
- text-classification
---
[License: CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)
[Dataset on Hugging Face](https://huggingface.co/datasets/BrotherTony/synthetic-employee-dataset)
[Hugging Face Datasets](https://huggingface.co/datasets)
---
<div align="center">
<h1>Synthetic Employee Dataset</h1>
<p><i>800,000+ employee records with real-world distributions for burnout prediction, turnover analysis, and HR analytics</i></p>
</div>
---
## What You Get
This isn't just another CSV dump. You're looking at **800K+ carefully engineered employee profiles** that mirror actual workforce dynamics, complete with performance metrics, burnout indicators, skill matrices, and behavioral personas. Think of it as a production-ready HR database that never existed but feels like it should have.
**The sweet spot?** All the statistical realism of real employee data, zero privacy concerns, zero compliance headaches.
---
## Quick Stats
<table>
<tr>
<td><b>Total Records</b></td>
<td>800,000+</td>
</tr>
<tr>
<td><b>Departments</b></td>
<td>38 unique divisions</td>
</tr>
<tr>
<td><b>Job Roles</b></td>
<td>300+ distinct positions</td>
</tr>
<tr>
<td><b>Employee Personas</b></td>
<td>12 behavioral archetypes</td>
</tr>
<tr>
<td><b>Features per Record</b></td>
<td>30+ attributes</td>
</tr>
<tr>
<td><b>Salary Range</b></td>
<td>$27K - $384K</td>
</tr>
</table>
---
## Why This Dataset Is Useful
### For Data Scientists
- **Turnover prediction models** ready to train out of the box
- **Multi-class classification** on employee personas
- **Regression tasks** for salary, performance, and burnout scoring
- **Time-series potential** with tenure and progression metrics
### For HR Analytics
- Benchmark your workforce against 800K profiles
- Identify burnout patterns before they become exits
- Map skill gaps across departments and roles
- Test retention strategies without risking real employee data
### For Researchers
- Study work-life balance dynamics at scale
- Analyze communication patterns and team sentiment
- Explore correlations between workload, satisfaction, and performance
- Validate your models on realistic synthetic populations
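
One of the HR analytics ideas above, mapping skill gaps across departments, can be sketched with pandas. This is a minimal illustration, not the author's method: the tiny `demo` frame is made up, and only the column names `employee_id`, `department`, and `technical_skills` come from the schema. It assumes `technical_skills` holds a list per employee.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the dataset schema.
demo = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "department": ["Engineering", "Engineering", "Finance", "Finance"],
    "technical_skills": [["Python", "SQL"], ["Python"], ["SQL", "Excel"], ["Excel"]],
})

# One row per (employee, skill) pair.
long_form = demo.explode("technical_skills")

# Number of distinct employees per department who hold each skill.
holders = (
    long_form.groupby(["department", "technical_skills"])["employee_id"]
    .nunique()
    .unstack(fill_value=0)
)

# Share of each department's headcount that holds each skill.
headcount = demo.groupby("department")["employee_id"].nunique()
coverage = holders.div(headcount, axis=0)
print(coverage)
```

Low values in `coverage` point at skills that few people in a department hold, which is one possible working definition of a skill gap.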
## 🌟 Community Use Cases
1. **DSE 220: Employee Burnout Analysis**
*By Grant Wagener, Duy Nguyen, and Thomas Brehme (UCSD)*
This is a great example of an end-to-end ML project using this dataset. The team did a full analysis, fixed some class imbalances, and built a segmented Random Forest model.
They actually managed to get **91% recall** on identifying preventable turnover (burnout/salary issues). Worth a look if you want to see how to handle the data cleaning and feature engineering steps.
*https://github.com/Harry-Wagener/DSE-220*
2. **iMML Library Tutorial**
The official tutorial for this Python library uses this dataset to teach Multimodal Learning.
*https://github.com/ocbe-uio/imml*
---
## Dataset Anatomy
<details>
<summary><b>Click to expand full schema (30+ fields)</b></summary>

### Core Identity
- `employee_id`: Unique synthetic identifier
- `role`: Job title (300+ varieties)
- `job_level`: Entry | Mid | Senior | Manager | Lead
- `department`: Business unit (38 departments)
- `tenure_months`: Time with company (1-357 months)
### Compensation
- `salary`: Annual USD ($27K-$384K range)
### Performance Indicators
- `performance_score`: Normalized 0-1 rating
- `project_completion_rate`: Delivery success metric
- `goal_achievement_rate`: Objective completion
- `role_complexity_score`: Job difficulty rating
- `career_progression_score`: Growth trajectory
### Wellbeing & Engagement
- `satisfaction_score`: Employee happiness (0-1)
- `workload_score`: Workload burden indicator
- `stress_level`: Stress assessment (0-1)
- `burnout_risk`: Burnout probability (0-1)
- `training_participation`: L&D engagement
### Collaboration & Communication
- `team_sentiment`: Team morale (0-1)
- `email_sentiment`: Email tone analysis
- `slack_activity`: Chat platform usage
- `meeting_participation`: Meeting engagement
- `collaboration_score`: Teamwork rating
- `communication_patterns`: Persona-based style
### Skills & Competencies
- `technical_skills`: Array of tech skills (4-9 per employee)
- `soft_skills`: Array of soft skills (3-7 per employee)
### Attrition & Risk
- `left_company`: Boolean exit flag
- `turnover_reason`: Exit category (if applicable)
- `turnover_probability_generated`: ML-predicted churn risk
- `risk_factors_summary`: Low | Medium | High | Severe risk
### Behavioral Profiling
- `persona_name`: 12 archetypes:
  - OverachievingSprinter
  - BurntOutStar
  - QuietAchiever
  - NewEnthusiast
  - StrugglingLearner
  - SteadyEddy
  - ChangeResistor
  - SeasonedExpert
  - SocialCatalyst
  - 9-to-5Clockwatcher
  - AmbitiousClimber
  - CorporateNavigator

(Persona names are AI-generated.)
### Qualitative Data
- `recent_feedback`: Anonymized employee comments (1000+ unique snippets)
- `overtime_hours`: Extra hours logged (0-74 range)
</details>
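
Before modeling, it can help to confirm that a loaded frame matches the ranges documented in the schema. The sketch below is one hedged way to do that: the helper `find_out_of_range` and the constants are hypothetical names, the bounds are copied from the schema text rather than measured on the data, and the `sample` frame is made up.

```python
import pandas as pd

# Bounds taken from the schema description; treat them as assumptions
# and adjust if the real data differs.
UNIT_INTERVAL_COLUMNS = [
    "performance_score", "satisfaction_score", "stress_level",
    "burnout_risk", "team_sentiment",
]
RANGE_BOUNDS = {"tenure_months": (1, 357), "overtime_hours": (0, 74)}


def find_out_of_range(df: pd.DataFrame) -> dict:
    """Count values outside the documented range for each known column in df.

    Missing values (NaN) are counted as out of range.
    """
    bounds = {col: (0.0, 1.0) for col in UNIT_INTERVAL_COLUMNS}
    bounds.update(RANGE_BOUNDS)
    violations = {}
    for col, (low, high) in bounds.items():
        if col not in df.columns:
            continue  # skip columns that this frame does not have
        outside = ~df[col].between(low, high)  # between() is inclusive
        violations[col] = int(outside.sum())
    return violations


# Tiny made-up frame, only to demonstrate the check.
sample = pd.DataFrame({
    "performance_score": [0.42, 0.91, 1.20],
    "burnout_risk": [0.10, 0.55, 0.80],
    "tenure_months": [1, 120, 400],
    "overtime_hours": [0, 74, 12],
})
report = find_out_of_range(sample)
print(report)
```

A non-zero count is a prompt to inspect those rows, not proof of a data error, since the documented ranges are rounded summaries.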
---
## Sample Use Cases
```python
# Imports
from datasets import load_dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Load the dataset and convert the train split to a pandas DataFrame
dataset = load_dataset("BrotherTony/employee-burnout-turnover-prediction-800k")
df = dataset["train"].to_pandas()

# Predict who is about to quit (simple logistic regression baseline;
# it is fitted on all rows, so it reports no held-out performance)
X = df[["satisfaction_score", "burnout_risk", "workload_score"]]
y = df["left_company"]
model = LogisticRegression()
model.fit(X, y)

# Find burnout hotspots: mean burnout risk per department, highest first
burnout_hotspots = df.groupby("department")["burnout_risk"].mean().sort_values(ascending=False)
print(burnout_hotspots.head())

# Identify skill gaps by role: count how often each technical skill appears per role
skill_gaps = df.explode("technical_skills").groupby(["role", "technical_skills"]).size()
print(skill_gaps.head())

# Correlation deep dive: heatmap of pairwise correlations between key scores
sns.heatmap(
    df[["performance_score", "satisfaction_score",
        "workload_score", "burnout_risk"]].corr(),
    annot=True, cmap="coolwarm",
)
plt.show()
```
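
The logistic regression above is fitted on every row, so it gives no estimate of how well it generalizes. A minimal sketch of a held-out evaluation follows. It runs on made-up data so that it is self-contained: the `demo` frame and its label rule are assumptions for illustration, and only the column names match the dataset. Using `class_weight="balanced"` and recall as the metric is one reasonable choice when leavers are the minority class, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real DataFrame; swap in `df` from the dataset.
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "satisfaction_score": rng.uniform(0, 1, n),
    "burnout_risk": rng.uniform(0, 1, n),
    "workload_score": rng.uniform(0, 1, n),
})
# Hypothetical label rule: higher burnout and lower satisfaction raise exit odds.
logit = 4 * demo["burnout_risk"] - 4 * demo["satisfaction_score"] - 1.5
demo["left_company"] = rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))

X = demo[["satisfaction_score", "burnout_risk", "workload_score"]]
y = demo["left_company"].astype(int)

# Stratify so both splits keep the same share of leavers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Reweight classes so the minority class (leavers) is not ignored.
model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)

# Recall on leavers: the share of actual exits that the model flags.
test_recall = recall_score(y_test, model.predict(X_test))
print(f"held-out recall on leavers: {test_recall:.2f}")
```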
---
## Technical Details
- **Format:** Parquet (optimized for analytics)
- **Encoding:** UTF-8
- **Generation Method:** Synthetic-real hybrid using distribution matching and constraint satisfaction
---
## Licensing & Usage
<table>
<tr>
<td><b>License</b></td>
<td><a href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC-BY-NC-ND 4.0</a></td>
</tr>
<tr>
<td><b>Allowed</b></td>
<td>Research • Education • Academic papers • Internal analysis • Model training • Portfolio projects</td>
</tr>
<tr>
<td><b>Not Allowed</b></td>
<td>Commercial products • Selling the dataset • Redistributing modified versions</td>
</tr>
<tr>
<td><b>Attribution</b></td>
<td>Required (cite this dataset)</td>
</tr>
</table>
**TL;DR:** Free for learning, research, and experimentation. Just give credit and don't sell it.
---
## Pro Tips
1. **Start with personas:** The 12 behavioral archetypes are your cheat code for segmentation
2. **Burnout + satisfaction isn't linear:** High performers can have high burnout (see: OverachievingSprinter)
3. **Feedback text is gold:** 1000+ unique anonymized comments, perfect for sentiment analysis
4. **Skills are arrays:** Don't forget to explode those columns for proper analysis
5. **Turnover probability ≠ actual turnover:** Two different signals, both useful
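
Tips 1 and 5 combine naturally: segment by persona, then compare the generated turnover probability with what actually happened. The sketch below uses a made-up `demo` frame, where only the column names `persona_name`, `turnover_probability_generated`, and `left_company` come from the schema. The `gap` column is a hypothetical name for the difference between the two signals.

```python
import pandas as pd

# Made-up rows for illustration; replace with the real DataFrame.
demo = pd.DataFrame({
    "persona_name": [
        "BurntOutStar", "BurntOutStar",
        "SteadyEddy", "SteadyEddy",
        "QuietAchiever", "QuietAchiever",
    ],
    "turnover_probability_generated": [0.80, 0.60, 0.10, 0.30, 0.20, 0.40],
    "left_company": [True, True, False, False, False, True],
})

# Per persona: average generated probability vs. observed exit rate.
by_persona = demo.groupby("persona_name").agg(
    mean_predicted=("turnover_probability_generated", "mean"),
    observed_rate=("left_company", "mean"),
    employees=("left_company", "size"),
)

# Positive gap: more people left than the generated probability suggested.
by_persona["gap"] = by_persona["observed_rate"] - by_persona["mean_predicted"]
print(by_persona.sort_values("gap", ascending=False))
```

Personas with a large positive or negative `gap` are where the two signals disagree most, which makes them a sensible starting point for closer inspection.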
---
## FAQ
**Q: Is this real employee data?**
A: No. Zero real people. Synthetic generation based on real-world distributions and patterns.
**Q: Can I use this for my startup's HR tool?**
A: Not commercially. Research and internal testing only.
**Q: Why are some salaries weird?**
A: Reflects real market variance. Geography, experience, and role complexity create wide ranges.
**Q: What's with the personas?**
A: Behavioral archetypes based on performance + satisfaction + workload combos. Useful for clustering and prediction.
**Q: Can I redistribute this?**
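A: No derivatives, and nothing commercial. CC BY-NC-ND 4.0 does permit sharing unmodified copies with attribution for non-commercial purposes, but linking back here is preferred.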
---
## Citation
If you use this dataset in your research or project, please cite:
```bibtex
@dataset{employee-burnout-turnover-prediction800k,
author = {BrotherTony},
title = {Synthetic Employee Dataset: 800K+ Records for HR Analytics},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/BrotherTony/synthetic-employee-dataset}
}
```
---
## Why Does This Dataset Exist?
### The Story Behind This Dataset
This wasn't built in a weekend sprint or thrown together for clout. This dataset exists because of a very specific mission that evolved into something bigger.
### The Original Goal
I started this whole thing to build something commercially viable: specifically, an AI product called the "Prevue Explainer." The stakes? Nothing less than securing a comfortable retirement for my parents, supporting my brothers, and eventually stepping away from the commercial world to live as a full-time pilgrim dedicated to service and learning.
High stakes meant high standards. Every data point, every structural decision, every edge case - all of it got the obsessive attention you'd give something when your family's future depends on it. That's why the psychological depth and strategic architecture here go deeper than your typical synthetic dataset.
### The Pivot
Somewhere in the trenches - debugging impossible errors at 3 AM, architecting logic systems, wrestling with data integrity - something clicked. The process itself became the point. Building this thing awakened a genuine passion for the craft that went beyond the original mission.
Then came the bigger realization: the most valuable thing I'd created wasn't the potential product. It was the knowledge base itself. The reasoning frameworks, the ethical guardrails, the expert-level structures - all of it could do more good as a public resource than locked behind a commercial wall.
### Why Open Source?
So here we are. The mission didn't die - it evolved. Instead of one proprietary tool helping a limited audience, this dataset (and others like it) can help researchers, students, data scientists, and organisations anywhere. More direct impact. Wider reach. Better outcomes.
This is one of several datasets from that original project. It might be an earlier version, since I'm still hunting down the others, but the core quality and intentionality? That's all here.
If you use this and it helps your work, that's the whole point. That's the mission fulfilled.
## Feedback & Contact
Spot something weird? Have ideas for v2? Open an issue or reach out through Hugging Face.
Built for the data science and HR analytics community.
---
## Citation
If you use this dataset, please cite:
BrotherTony (2025). Employee Burnout & Turnover Prediction 800k [Dataset]. Hugging Face.
License: CC-BY-NC-ND 4.0
<div align="center">
<i>Made with <3 | Employee harm Free™</i>
</div>
Provider:
Mystic777



