developer-productivity-simulated-behavioral-data
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-09-06
Download link:
https://modelscope.cn/datasets/syncora/developer-productivity-simulated-behavioral-data
Resource description:
# *Synthetic AI Developer Productivity Dataset — Behavioral + Cognitive Simulation*
*A synthetic data generation resource for modeling behavioral and cognitive dynamics in developers.*
---
## 📘 **About This Dataset**
This dataset simulates productivity data from **AI-assisted software developers**. It blends behavioral signals, physiological inputs, and productivity metrics to explore the nuanced relationships between **deep work, distractions, caffeine, AI usage, and cognitive strain**.
Built for **applied machine learning in behavioral productivity modeling**, this **synthetically generated** dataset is suited to:
✔ **Binary Classification**
✔ **Regression**
✔ **Clustering**
✔ **Time-Series Analysis**
✔ **Exploratory Data Analysis (EDA)**
**Keyword Focus:** generate synthetic dataset, developer productivity, cognitive modeling, behavioral simulation for LLM training.
---
## ✅ **What’s in This Repo?**
This repository contains:
- ✅ **Synthetic Developer Productivity Dataset (CSV)** → [Download Here](https://huggingface.co/datasets/syncora/developer-productivity-simulated-behavioral-data/blob/main/Developer_Productivity_Synthetic_Syncora.csv)
- ✅ **Example Jupyter Notebook for Analysis & Modeling** → [Open Notebook](https://huggingface.co/datasets/syncora/developer-productivity-simulated-behavioral-data/blob/main/developer-productivity%20(1)%20(2)_clean.ipynb)
- ✅ **Documentation & Use Cases** for building behavioral ML models.
---
## 🧠 **Use Case Context**
Modern developers balance AI tools, caffeine, sleep, and coding time while navigating digital distractions. This dataset allows you to simulate and model these real-world trade-offs:
- **How much does AI assistance impact productivity?**
- **What’s the relationship between caffeine, sleep, and bugs?**
- **Can we cluster different developer working styles over time?**
This is a **prime example of synthetic data generation applied to real-world behavioral modeling**.
---
## 🔧 **Dataset Format & Details**
- **Records:** 500 synthetic daily logs
- **Structure:** Tabular CSV (one row per day)
- **Target column:** `task_success` (0 = goal not achieved, 1 = goal achieved)
- **Date simulation:** Optional. The feature table lists no date column, so rolling analysis can treat the row index as the day order.
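
The rolling analysis mentioned above can be sketched without a date column by treating file order as day order. This is a minimal illustration using only the standard library; the `rolling_mean` helper, the 3-day window, and the sample `commits` values are assumptions made for demonstration and are not taken from the dataset.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_mean(values: Iterable[float], window: int = 7) -> Iterator[float]:
    """Yield the trailing mean over the last `window` values (fewer at the start)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # the oldest value is dropped automatically
    total = 0.0
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # this value is about to leave the window
        recent.append(value)
        total += value
        yield total / len(recent)


# Hypothetical `commits` values for five consecutive rows (not real data).
commits = [4, 6, 5, 9, 1]
smoothed = list(rolling_mean(commits, window=3))
print(smoothed)
```

The first `window - 1` outputs average over fewer points; drop them if a strict full-window average is needed.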
---
## 📁 **Feature Descriptions**
| Column | Description |
|----------------------|--------------------------------------------------------------|
| **hours_coding** | Focused hours of raw coding per day (0–12) |
| **coffee_intake_mg** | Caffeine intake in milligrams (0–600) |
| **distractions** | Daily distractions count (Slack, meetings, etc.) (0–10) |
| **sleep_hours** | Hours of sleep the previous night (3–10) |
| **commits** | Number of commits pushed during the day (0–20) |
| **bugs_reported** | Bugs reported in the day’s code (0–10) |
| **ai_usage_hours** | Hours using AI tools (e.g., ChatGPT, Copilot) (0–12) |
| **cognitive_load** | Self-reported mental load/stress (1–10 scale) |
| **task_success** | Target: Whether goal was achieved (1 = Yes, 0 = No) |
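
The documented ranges make a quick sanity check possible before modeling. The sketch below flags cells that are missing, non-numeric, or outside the ranges in the table. It assumes the CSV headers match the column names listed above; the `find_violations` helper and the two sample rows are hypothetical.

```python
import csv
import io

# Documented (min, max) bounds per column, copied from the feature table.
BOUNDS = {
    "hours_coding": (0, 12),
    "coffee_intake_mg": (0, 600),
    "distractions": (0, 10),
    "sleep_hours": (3, 10),
    "commits": (0, 20),
    "bugs_reported": (0, 10),
    "ai_usage_hours": (0, 12),
    "cognitive_load": (1, 10),
    "task_success": (0, 1),
}


def find_violations(rows):
    """Return (row_number, column, raw_value) for each bad or out-of-range cell."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, (low, high) in BOUNDS.items():
            raw = row.get(column)
            try:
                value = float(raw)
            except (TypeError, ValueError):
                # Missing column or non-numeric text.
                problems.append((row_number, column, raw))
                continue
            if not low <= value <= high:
                problems.append((row_number, column, raw))
    return problems


# Two made-up rows standing in for the real file; the second has 14 hours of sleep.
sample = io.StringIO(
    "hours_coding,coffee_intake_mg,distractions,sleep_hours,commits,"
    "bugs_reported,ai_usage_hours,cognitive_load,task_success\n"
    "5.5,300,2,7.0,6,1,1.5,4.0,1\n"
    "8.0,450,5,14.0,3,2,0.5,7.5,0\n"
)
violations = find_violations(csv.DictReader(sample))
print(violations)
```

To check the real file, pass `csv.DictReader(open_file)` for a locally downloaded copy of `Developer_Productivity_Synthetic_Syncora.csv` in place of the in-memory sample.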
---
## 🔍 **Suggested ML Tasks**
- 🟢 **Binary Classification** — Predict `task_success` using behavioral signals
- 📈 **Regression** — Model `cognitive_load` or `commits`
- 🌀 **Clustering** — Identify developer work-style clusters (e.g., high caffeine + low bugs)
- 📊 **Correlation Analysis** — What drives productivity or burnout?
- 📆 **Time Series** — Simulate trends with moving averages
- 🧼 **Feature Engineering** — Scale, normalize, encode for pipelines
**This dataset is an excellent resource for experimenting with synthetic data generation in applied ML workflows.**
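
For the binary classification task, a useful first step is the majority-class baseline: the accuracy of always predicting the most common `task_success` value. Any model should beat it. The sketch below is an illustration under stated assumptions; the sample rows and the `low_load_rule` threshold are invented for the example and say nothing about the actual data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (label, accuracy) for always predicting the most common label."""
    label, hits = Counter(labels).most_common(1)[0]
    return label, hits / len(labels)


def accuracy(predict, rows, target="task_success"):
    """Share of rows where predict(row) equals the row's target value."""
    return sum(predict(row) == row[target] for row in rows) / len(rows)


# Made-up rows for illustration only; real rows would come from the CSV.
rows = [
    {"cognitive_load": 3.0, "task_success": 1},
    {"cognitive_load": 8.0, "task_success": 0},
    {"cognitive_load": 4.5, "task_success": 1},
    {"cognitive_load": 6.0, "task_success": 1},
    {"cognitive_load": 9.0, "task_success": 0},
]


def low_load_rule(row):
    """Hypothetical rule: predict success when cognitive load is at most 5."""
    return 1 if row["cognitive_load"] <= 5 else 0


baseline_label, baseline_acc = majority_baseline([r["task_success"] for r in rows])
rule_acc = accuracy(low_load_rule, rows)
print(baseline_label, baseline_acc, rule_acc)
```

Rows read with `csv.DictReader` arrive as strings, so convert `cognitive_load` to `float` and `task_success` to `int` before applying a rule like this one.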
---
## 💡 **Modeling Inspiration**
- Does more AI usage mean more commits or fewer bugs?
- Is there a sweet spot for **caffeine intake and sleep** that maximizes output?
- Can you build a model to **flag burnout risk before it happens** using cognitive load?
- What’s the real impact of distractions on coding effectiveness?
**Leverage this dataset to explore novel approaches in synthetic data generation for cognitive and productivity modeling.**
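
Questions like these usually start with pairwise correlations. Below is a minimal Pearson correlation sketch in plain Python; the `pearson` helper and the sample `distractions` and `commits` values are assumptions for illustration, not values from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: more distractions paired with fewer commits.
distractions = [1, 3, 5, 7, 9]
commits = [12, 10, 7, 5, 2]
r = pearson(distractions, commits)
print(round(r, 3))
```

A coefficient near -1 or +1 indicates a strong linear relationship; it does not capture a non-linear "sweet spot" such as the caffeine and sleep question, which would need binning or a curve fit.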
---
## ✅ **Notes**
✔ This dataset is **100% synthetic**, generated to reflect realistic developer behavior based on tech industry trends, research literature, and productivity heuristics.
✔ Ideal for **safe, public, and exploratory use** in:
- Workplace analytics
- Developer productivity tools
- Human-centered AI research
**Build smarter productivity models. Understand the cognitive rhythms of modern developers. All without the privacy risks of real-world logs.**
---
## 🚀 **Generate your own synthetic data**
[**👉 Use Our API**](https://huggingface.co/spaces/syncora/synthetic-generation)
---
Provider: maas
Created: 2025-08-31



