sii-research/HACHIMI-1M
Source: Hugging Face (updated 2026-03-27, indexed 2026-04-05)
Download link: https://hf-mirror.com/datasets/sii-research/HACHIMI-1M
---
license: mit
task_categories:
- text-classification
language:
- zh
- en
---
# HACHIMI Student Profile Dataset
> **HACHIMI (Human-centric Agent-based Character and Holistic Individual Modeling Infrastructure)**
> A comprehensive student profile dataset generated using multi-agent collaboration
---
## 📊 Dataset Overview
**Size**: 10,000 student profile records
**Format**: JSONL (one JSON object per line)
**Language**: Chinese (`merged_students_10k.jsonl`), English (`merged_students_10k_EN.jsonl`)
**Encoding**: UTF-8
---
## 🎯 Dataset Features
This dataset contains **10,000 comprehensive, validated, and diverse Chinese student profiles**, covering different age groups from elementary to high school (ages 6-18). The profiles have been rigorously tested against real-world educational datasets (CEPS/PISA) to ensure authenticity and validity.
Each record includes the following complete information:
### 1️⃣ Basic Attributes
- **Age** & **Gender** ⚠️ *Note: Names have been removed for privacy protection*
- **Grade** (Grades 1-12)
- **Developmental Stages** (Piaget cognitive, Erikson psychosocial, Kohlberg moral development)
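Age and grade are expected to agree with each other (age-grade consistency is one of the validation rules listed under Quality Assurance). The card does not give the exact rule, so the following is only a minimal sketch. The mapping "grade g corresponds to age g + 5 or g + 6" is an assumption chosen to match the stated 6-18 age range, and `is_age_grade_consistent` and `grade_number` are hypothetical helper names.

```python
def is_age_grade_consistent(age: int, grade: int, slack: int = 0) -> bool:
    """Check that an age is plausible for a grade (1-12).

    Assumption (not from the dataset card): a student in grade g is
    normally g + 5 or g + 6 years old, so Grade 1 maps to ages 6-7
    and Grade 12 maps to ages 17-18. `slack` widens the window.
    """
    if not 1 <= grade <= 12:
        return False
    return (grade + 5 - slack) <= age <= (grade + 6 + slack)


def grade_number(label: str) -> int:
    """Parse an English grade label such as 'Grade 7' into the integer 7."""
    return int(label.split()[-1])


# Age and grade values as they appear in the English sample record.
record = {"age": 13, "grade": "Grade 7"}
print(is_age_grade_consistent(record["age"], grade_number(record["grade"])))
```

The Chinese file uses labels such as "初一" for the grade, so it would need its own label-to-number mapping before a check like this could be applied.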
### 2️⃣ Academic Profile
- **Academic Level** (High/Medium/Low/Poor - four fixed categories)
- **Strong Subjects** & **Weak Subjects**
  - Subjects include: Chinese, Mathematics, English, Physics, Chemistry, Biology, History, Geography, Politics, Arts, Music, P.E., Information Technology, etc.
### 3️⃣ Psychological & Personality Traits
- **Personality** (detailed character description)
- **Values** (7 dimensions)
  - Moral Character
  - Physical & Mental Health
  - Legal Awareness
  - Social Responsibility
  - Political Identity
  - Cultural Literacy
  - Family Values
- **Mental Health** (structured psychological assessment)
### 4️⃣ Social & Creativity
- **Social Relationships** (peer interactions, family support, social support)
- **Creativity** (8 dimensions)
  - Fluency, Novelty, Flexibility, Feasibility
  - Problem Discovery, Problem Analysis, Solution Proposal, Solution Improvement
---
## 📝 Sample Data-CN
```json
{
  "id": 1,
  "年龄": 13,
  "性别": "女",
  "年级": "初一",
  "发展阶段": {
    "皮亚杰认知发展阶段": "形式运算阶段",
    "埃里克森心理社会发展阶段": "身份与角色混淆",
    "科尔伯格道德发展阶段": "习俗水平"
  },
  "擅长科目": ["美术"],
  "薄弱科目": ["数学", "英语", "物理"],
  "学术水平": "差:成绩全校排名后50%",
  "人格": "该生性格偏内向,乐于与熟悉的同学分享美术创作...",
  "价值观": "在道德修养上,尊重同学,偶尔愿意帮助身边朋友...",
  "社交关系": "在初一的学习生活中与同班同学关系较为平和...",
  "创造力": "流畅性表现中等,绘画时能较为顺畅地表达个人主题...",
  "心理健康": "整体心理状态偏向敏感,习惯自我观察..."
}
```
---
## 📝 Sample Data-EN
```json
{
  "id": 1,
  "age": 13,
  "gender": "Female",
  "grade": "Grade 7",
  "developmental_stage": {
    "piaget_cognitive_stage": "Formal Operational Stage",
    "erikson_psychosocial_stage": "Identity vs. Role Confusion",
    "kohlberg_moral_stage": "Conventional Level"
  },
  "strengths": ["Art"],
  "weaknesses": ["Mathematics", "English", "Physics"],
  "academic_level": "Poor: ranked in the bottom 50% of the school",
  "personality": "This student is relatively introverted and enjoys sharing her artwork with familiar classmates...",
  "values": "In terms of moral character, she respects her classmates and is occasionally willing to help friends around her...",
  "social_relationships": "During her first year of middle school, she maintains relatively harmonious relationships with her classmates...",
  "creativity": "Her fluency is at a moderate level, and when drawing, she is generally able to express her personal themes smoothly...",
  "mental_health": "Overall, her psychological state tends to be somewhat sensitive, and she is accustomed to self-observation..."
}
```
---
## 🏗️ Generation Methodology
This dataset is generated by the **HACHIMI Multi-Agent Collaboration System** using the following technologies:
### System Architecture
- **5 Specialized Agents** working collaboratively
  1. Enrollment & Development Agent
  2. Academic Profile Agent
  3. Personality & Values Agent
  4. Social & Creativity Agent
  5. Mental Health Agent
### Quality Assurance
- **Two-Stage Validation** (Fast Validator + Deep Validator)
- **15 Validation Rules** (R1-R15) covering:
  - Age-grade consistency
  - Developmental stage alignment
  - Cross-field consistency
  - Academic level distribution
  - Structural integrity
- **SimHash Deduplication** (Hamming distance threshold: 3)
- **Multi-Round Negotiation** (up to 3 rounds of revision)
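The card names SimHash with a Hamming distance threshold of 3 but does not describe the features, hash function, or fingerprint width used. The following is therefore only a minimal sketch of the general technique, assuming 64-bit fingerprints built from character 3-grams hashed with MD5; `simhash`, `hamming`, and `deduplicate` are hypothetical names, not the authors' code.

```python
import hashlib


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Compute a SimHash fingerprint over character n-grams (assumed features)."""
    weights = [0] * bits
    # Texts shorter than `ngram` contribute a single (short) gram.
    grams = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each gram votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, threshold: int = 3):
    """Keep a text only if its fingerprint is more than `threshold` bits
    away from every fingerprint kept so far."""
    kept, fingerprints = [], []
    for text in texts:
        fp = simhash(text)
        if all(hamming(fp, other) > threshold for other in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```

This version compares each candidate against all kept fingerprints, which is quadratic in the number of records. At 10,000 records that is workable, but a larger corpus would usually index fingerprints by bit blocks instead.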
### Sampling Strategy
- Strictly follows **sampling constraints** (target academic level, grade, gender, subject preferences)
- Covers **12 grades** × **2 genders** × **4 academic levels**
- Balanced **academic level distribution** (avoiding optimism bias)
---
## 🔍 Data Quality Metrics
- ✅ **Structural Compliance Rate**: 100% (all records pass validation rules)
- ✅ **SimHash Deduplication**: Hamming distance > 3 (effectively avoiding duplicates)
- ✅ **Text Diversity**: Distinct-1 in the range of 0.3-0.5
- ✅ **Jaccard Template Similarity**: Low templating (natural language generation)
- ✅ **Paragraph Length Distribution**: Conforms to natural paragraph characteristics
- ✅ **Cross-Consistency**: 95%+ (no contradictions across fields)
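Two of the metrics above have simple standard definitions that can be sketched in a few lines: Distinct-1 is the ratio of unique unigrams to total unigrams, and Jaccard similarity is the size of the token-set intersection over the size of the union. The card does not state how text was tokenized, so character-level tokens are an assumption here (convenient for Chinese text), and `distinct_n` and `jaccard` are hypothetical helper names.

```python
def distinct_n(texts, n: int = 1, tokenize=list) -> float:
    """Distinct-n: unique n-grams divided by total n-grams across all texts.

    `tokenize=list` splits a string into characters (an assumption);
    pass `str.split` for whitespace-separated English tokens.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = tokenize(text)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


def jaccard(a: str, b: str, tokenize=list) -> float:
    """Jaccard similarity between the token sets of two texts."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

For example, `distinct_n(profile_texts)` over one text field gives a corpus-level Distinct-1 value, and averaging `jaccard` over pairs of records gives a rough indicator of templating. The resulting numbers depend heavily on the tokenizer, so they are not directly comparable to the figures reported above.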
---
## 🚀 Use Cases
### Education
- **Personalized Education**: Customize teaching plans for different student profiles
- **Mental Health Assessment**: Analyze and monitor student psychological states
- **Educational Research**: Study student development patterns and individual characteristics
### AI & NLP
- **Agent Simulation**: Create authentic student agents validated against real-world data
- **Evaluation Benchmark**: Benchmark LLM quality in generating character profiles
- **Psychological Measurement**: Explore feasibility of LLM-generated personality profiles
- **Diversity Evaluation**: Test diversity and consistency of generated text
### Social Science Research
- **Student Behavior Analysis**
- **Values Evolution Research**
- **Creativity Assessment & Development**
- **Adolescent Development Characteristics Analysis**
---
## 📂 File Structure
```
sample_data/
├── merged_students_10k.jsonl     # 10,000 student profile records in Chinese
└── merged_students_10k_EN.jsonl  # 10,000 student profile records in English
```
---
## 💻 Quick Start
### Python Reading Example
```python
import json

# Read the dataset line by line (one JSON object per line)
with open('merged_students_10k.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        student = json.loads(line)
        # '代理名' (agent name) is not present in the sample record, so fall back to 'N/A'
        agent_name = student.get('代理名', 'N/A')
        print(f"Agent: {agent_name}, Age: {student['年龄']}, Grade: {student['年级']}")
        print(f"Academic Level: {student['学术水平']}")
        print("-" * 50)
```
### Statistical Analysis Example
```python
import json
from collections import Counter

# Count academic level distribution
academic_levels = []
with open('merged_students_10k.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        student = json.loads(line)
        academic_levels.append(student['学术水平'])

# Output statistics
print("Academic Level Distribution:")
for level, count in Counter(academic_levels).most_common():
    print(f"{level}: {count}")
```
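The examples above use the Chinese file and its Chinese field names. The same pattern should carry over to the English file; the sketch below assumes every record uses the keys shown in the English sample (such as `weaknesses`), and `iter_profiles` and `weak_in` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_profiles(path) -> Iterator[dict]:
    """Yield one profile dict per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def weak_in(profiles, subject: str) -> list:
    """Return the profiles whose 'weaknesses' list contains `subject`.

    The key name follows the English sample record; records without
    the key are treated as having no weak subjects.
    """
    return [p for p in profiles if subject in p.get("weaknesses", [])]
```

A typical call would be `weak_in(iter_profiles("merged_students_10k_EN.jsonl"), "Mathematics")`, which streams the file once and keeps only the matching records in memory.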
---
## 📊 Data Statistics
### Grade Distribution
- **Elementary School**: Grades 1-6 (~60%)
- **Middle School**: Grades 7-9 (~30%)
- **High School**: Grades 10-12 (~10%)
### Gender Distribution
- **Female**: ~50%
- **Male**: ~50%
### Academic Level Distribution
- **High**: Top 10% in school (~20%)
- **Medium**: Top 10-30% in school (~30%)
- **Low**: Top 30-50% in school (~30%)
- **Poor**: Bottom 50% in school (~20%)
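To compare a loaded file against the approximate shares listed above, the category label first has to be separated from its description. In the samples the value looks like "Poor: ranked in the bottom 50% of the school" (English) or "差:成绩全校排名后50%" (Chinese, with a full-width colon). The sketch below assumes every record follows that "label: description" shape; `level_label`, `level_shares`, and `max_deviation` are hypothetical helper names.

```python
from collections import Counter

# Approximate shares as stated in the list above.
STATED_SHARES = {"High": 0.20, "Medium": 0.30, "Low": 0.30, "Poor": 0.20}


def level_label(value: str) -> str:
    """Return the category before the colon (ASCII ':' or full-width ':')."""
    return value.replace(":", ":").split(":", 1)[0].strip()


def level_shares(records, key: str = "academic_level") -> dict:
    """Fraction of records per academic level label."""
    counts = Counter(level_label(r[key]) for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def max_deviation(shares: dict, stated: dict = STATED_SHARES) -> float:
    """Largest absolute gap between observed and stated shares."""
    return max(abs(shares.get(label, 0.0) - p) for label, p in stated.items())
```

For the Chinese file, pass `key="学术水平"` and supply a `stated` dictionary keyed by the Chinese labels, since `STATED_SHARES` uses the English ones.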
---
## ⚠️ Important Notes
1. **Data Source**: This dataset is generated by an **LLM-based multi-agent system** for research and evaluation purposes
2. **Anonymization**: ⚠️ **All personal names have been removed** for privacy protection. Each student is identified by an agent name (pinyin identifier, e.g., "wang2_shi1han2") instead of a real name.
3. **Ethical Use**: Do not use the data for discriminatory assessment or other harmful purposes
4. **Diversity Assurance**: Multiple mechanisms (SimHash, 15 validation rules, multi-round validation) ensure data quality
---
## 📄 Citation
This dataset is associated with the following paper:
**Generating Authentic Student Profiles: A Multi-Agent Collaboration Approach**
**Authors:**
- **Yilin Jiang**¹²
- **Fei Tan**¹* (Corresponding author)
- **Xuanyu Yin**¹
- **Jing Leng**¹
- **Aimin Zhou**¹³
**Affiliations:**
- ¹ East China Normal University, Shanghai, China
- ² The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
- ³ Shanghai Innovation Institute, Shanghai, China
📧 **Contact:** ftan@mail.ecnu.edu.cn
### Validation Results
Our generated student profiles have been validated against real-world educational datasets:
**CEPS (China Education Panel Survey) Validation Results:**

**PISA Validation Results:**

> 🔔 **Note:** The full dataset and source code will be open-sourced soon. Stay tuned!
---
## 📄 License
This dataset is released under the **MIT License**.
---
## 📮 Contact
**Correspondence:**
- **Yilin Jiang**: jiangyilin021104@gmail.com
- **Fei Tan**: ftan@mail.ecnu.edu.cn
If you use this dataset, please cite:
```bibtex
@article{jiang2026hachimi,
title={HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents},
author={Jiang, Yilin and Tan, Fei and Yin, Xuanyu and Leng, Jing and Zhou, Aimin},
journal={arXiv preprint arXiv:2603.04855},
year={2026}
}
```
---
**Last Updated**: 2026-02-12
Provider: sii-research



