sii-research/HACHIMI-1M

Source: Hugging Face · Updated 2026-03-27 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/sii-research/HACHIMI-1M
Description:
---
license: mit
task_categories:
- text-classification
language:
- zh
- en
---

# HACHIMI Student Profile Dataset

> **HACHIMI (Human-centric Agent-based Character and Holistic Individual Modeling Infrastructure)**
> A comprehensive student profile dataset generated using multi-agent collaboration

---

## 📊 Dataset Overview

- **Size**: 10,000 student profile records
- **Format**: JSONL (one JSON object per line)
- **Language**: Chinese (`merged_students_10k.jsonl`), English (`merged_students_10k_EN.jsonl`)
- **Encoding**: UTF-8

---

## 🎯 Dataset Features

This dataset contains **10,000 comprehensive, validated, and diverse Chinese student profiles**, covering age groups from elementary to high school (ages 6-18). The profiles have been rigorously tested against real-world educational datasets (CEPS/PISA) to ensure authenticity and validity.

Each record includes the following information:

### 1️⃣ Basic Attributes

- **Age** & **Gender** ⚠️ *Note: Names have been removed for privacy protection*
- **Grade** (Grades 1-12)
- **Developmental Stages** (Piaget cognitive, Erikson psychosocial, Kohlberg moral development)

### 2️⃣ Academic Profile

- **Academic Level** (High/Medium/Low/Poor - four fixed categories)
- **Strong Subjects** & **Weak Subjects**
  - Subjects include: Chinese, Mathematics, English, Physics, Chemistry, Biology, History, Geography, Politics, Arts, Music, P.E., Information Technology, etc.
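Since the academic level is one of four fixed categories, downstream code will often want the bare label rather than the full string. The sketch below assumes the value has the shape "label, colon, description", as in the sample record shown in this card (for example `Poor: ranked in the bottom 50% of the school`). The helper names `split_level` and `normalize_level` are hypothetical, and only the Chinese label `差` is mapped, because it is the only one visible in the published sample.

```python
import re
from typing import Optional, Tuple

# The four fixed categories named in the Academic Profile description.
LEVELS = ("High", "Medium", "Low", "Poor")

# Only the "差" -> "Poor" pair appears in the published sample record.
# Extend this mapping after inspecting the Chinese file yourself.
CN_TO_EN = {"差": "Poor"}


def split_level(value: str) -> Tuple[str, str]:
    """Split 'Label: description' on the first ASCII or full-width colon."""
    parts = re.split(r"[::]", value, maxsplit=1)
    label = parts[0].strip()
    description = parts[1].strip() if len(parts) > 1 else ""
    return label, description


def normalize_level(value: str) -> Optional[str]:
    """Return one of LEVELS, or None when the label is not recognized."""
    label, _ = split_level(value)
    label = CN_TO_EN.get(label, label)
    return label if label in LEVELS else None
```

Returning `None` for unrecognized labels, rather than raising, makes it easy to count how many records fall outside the assumed format before relying on the mapping.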
### 3️⃣ Psychological & Personality Traits

- **Personality** (detailed character description)
- **Values** (7 dimensions)
  - Moral Character
  - Physical & Mental Health
  - Legal Awareness
  - Social Responsibility
  - Political Identity
  - Cultural Literacy
  - Family Values
- **Mental Health** (structured psychological assessment)

### 4️⃣ Social & Creativity

- **Social Relationships** (peer interactions, family support, social support)
- **Creativity** (8 dimensions)
  - Fluency, Novelty, Flexibility, Feasibility
  - Problem Discovery, Problem Analysis, Solution Proposal, Solution Improvement

---

## 📝 Sample Data-CN

```json
{
  "id": 1,
  "年龄": 13,
  "性别": "女",
  "年级": "初一",
  "发展阶段": {
    "皮亚杰认知发展阶段": "形式运算阶段",
    "埃里克森心理社会发展阶段": "身份与角色混淆",
    "科尔伯格道德发展阶段": "习俗水平"
  },
  "擅长科目": ["美术"],
  "薄弱科目": ["数学", "英语", "物理"],
  "学术水平": "差:成绩全校排名后50%",
  "人格": "该生性格偏内向,乐于与熟悉的同学分享美术创作...",
  "价值观": "在道德修养上,尊重同学,偶尔愿意帮助身边朋友...",
  "社交关系": "在初一的学习生活中与同班同学关系较为平和...",
  "创造力": "流畅性表现中等,绘画时能较为顺畅地表达个人主题...",
  "心理健康": "整体心理状态偏向敏感,习惯自我观察..."
}
```

---

## 📝 Sample Data-EN

```json
{
  "id": 1,
  "age": 13,
  "gender": "Female",
  "grade": "Grade 7",
  "developmental_stage": {
    "piaget_cognitive_stage": "Formal Operational Stage",
    "erikson_psychosocial_stage": "Identity vs. Role Confusion",
    "kohlberg_moral_stage": "Conventional Level"
  },
  "strengths": ["Art"],
  "weaknesses": ["Mathematics", "English", "Physics"],
  "academic_level": "Poor: ranked in the bottom 50% of the school",
  "personality": "This student is relatively introverted and enjoys sharing her artwork with familiar classmates...",
  "values": "In terms of moral character, she respects her classmates and is occasionally willing to help friends around her...",
  "social_relationships": "During her first year of middle school, she maintains relatively harmonious relationships with her classmates...",
  "creativity": "Her fluency is at a moderate level, and when drawing, she is generally able to express her personal themes smoothly...",
  "mental_health": "Overall, her psychological state tends to be somewhat sensitive, and she is accustomed to self-observation..."
}
```

---

## 🏗️ Generation Methodology

This dataset is generated by the **HACHIMI Multi-Agent Collaboration System** using the following technologies:

### System Architecture

- **5 Specialized Agents** working collaboratively
  1. Enrollment & Development Agent
  2. Academic Profile Agent
  3. Personality & Values Agent
  4. Social & Creativity Agent
  5. Mental Health Agent

### Quality Assurance

- **Two-Stage Validation** (Fast Validator + Deep Validator)
- **15 Validation Rules** (R1-R15) covering:
  - Age-grade consistency
  - Developmental stage alignment
  - Cross-field consistency
  - Academic level distribution
  - Structural integrity
- **SimHash Deduplication** (Hamming distance threshold: 3)
- **Multi-Round Negotiation** (up to 3 rounds of revision)

### Sampling Strategy

- Strictly follows **sampling constraints** (target academic level, grade, gender, subject preferences)
- Covers **9 grades** × **2 genders** × **4 academic levels**
- Balanced **academic level distribution** (avoiding optimism bias)

---

## 🔍 Data Quality Metrics

- ✅ **Structural Compliance Rate**: 100% (all records pass validation rules)
- ✅ **SimHash Deduplication**: Hamming distance > 3 (effectively avoiding duplicates)
- ✅ **Text Diversity**: Distinct-1 in the range of 0.3-0.5
- ✅ **Jaccard Template Similarity**: Low templating (natural language generation)
- ✅ **Paragraph Length Distribution**: Conforms to natural paragraph characteristics
- ✅ **Cross-Consistency**: 95%+ (no contradictions across fields)

---

## 🚀 Use Cases

### Education

- **Personalized Education**: Customize teaching plans for different student profiles
- **Mental Health Assessment**: Analyze and monitor student psychological states
- **Educational Research**: Study student development patterns and individual characteristics

### AI & NLP

- **Agent Simulation**: Create authentic student agents validated against real-world data
- **Evaluation Benchmark**: Benchmark LLM quality in generating character profiles
- **Psychological Measurement**: Explore feasibility of LLM-generated personality profiles
- **Diversity Evaluation**: Test diversity and consistency of generated text

### Social Science Research

- **Student Behavior Analysis**
- **Values Evolution Research**
- **Creativity Assessment & Development**
- **Adolescent Development Characteristics Analysis**

---

## 📂 File Structure

```
sample_data/
├── merged_students_10k.jsonl      # 10,000 student profile records in Chinese
└── merged_students_10k_EN.jsonl   # 10,000 student profile records in English
```

---

## 💻 Quick Start

### Python Reading Example

```python
import json

# Read the dataset one record per line (JSONL)
with open('merged_students_10k.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        student = json.loads(line)
        # '代理名' (agent name) may be absent, so fall back to 'N/A'
        agent_name = student.get('代理名', 'N/A')
        print(f"Agent: {agent_name}, Age: {student['年龄']}, Grade: {student['年级']}")
        print(f"Academic Level: {student['学术水平']}")
        print("-" * 50)
```

### Statistical Analysis Example

```python
import json
from collections import Counter

# Collect the academic level of every record
academic_levels = []
with open('merged_students_10k.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        student = json.loads(line)
        academic_levels.append(student['学术水平'])

# Output statistics, most frequent level first
print("Academic Level Distribution:")
for level, count in Counter(academic_levels).most_common():
    print(f"{level}: {count}")
```

---

## 📊 Data Statistics

### Grade Distribution

- **Elementary School**: Grades 1-6 (~60%)
- **Middle School**: Grades 7-9 (~30%)
- **High School**: Grades 10-12 (~10%)

### Gender Distribution

- **Female**: ~50%
- **Male**: ~50%

### Academic Level Distribution

- **High**: Top 10% in school (~20%)
- **Medium**: Top 10-30% in school (~30%)
- **Low**: Top 30-50% in school (~30%)
- **Poor**: Bottom 50% in school (~20%)

---

## ⚠️ Important Notes

1. **Data Source**: This dataset is generated by an **LLM-based multi-agent system** for research and evaluation purposes.
2. **Anonymization**: ⚠️ **All personal names have been removed** for privacy protection. Each student is identified by an agent name (pinyin identifier, e.g., "wang2_shi1han2") instead of a real name.
3. **Ethical Use**: Do not use the data for discriminatory assessment or harmful purposes.
4. **Diversity Assurance**: Multiple mechanisms (SimHash, 15 validation rules, multi-round validation) ensure data quality.

---

## 📄 Citation

This dataset is associated with the following paper:

**Generating Authentic Student Profiles: A Multi-Agent Collaboration Approach**

**Authors:**

- **Yilin Jiang**¹²
- **Fei Tan**¹* (Corresponding author)
- **Xuanyu Yin**¹
- **Jing Leng**¹
- **Aimin Zhou**¹³

**Affiliations:**

- ¹ East China Normal University, Shanghai, China
- ² The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
- ³ Shanghai Innovation Institute, Shanghai, China

📧 **Contact:** ftan@mail.ecnu.edu.cn

### Validation Results

Our generated student profiles have been validated against real-world educational datasets:

**CEPS (China Education Panel Survey) Validation Results:**

![CEPS Results](ceps_results.jpg)

**PISA Validation Results:**

![PISA Results](pisa_results.png)

> 🔔 **Note:** The full dataset and source code will be open-sourced soon. Stay tuned!

---

## 📄 License

This dataset is released under the **MIT License**.

---

## 📮 Contact

**Correspondence:**

- **Yilin Jiang**: jiangyilin021104@gmail.com
- **Fei Tan**: ftan@mail.ecnu.edu.cn

If you use this dataset, please cite:

```bibtex
@article{jiang2026hachimi,
  title={HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents},
  author={Jiang, Yilin and Tan, Fei and Yin, Xuanyu and Leng, Jing and Zhou, Aimin},
  journal={arXiv preprint arXiv:2603.04855},
  year={2026}
}
```

---

**Last Updated**: 2026-02-12
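The Quality Assurance section lists SimHash deduplication with a Hamming distance threshold of 3, but the card does not say how fingerprints are computed. The following is a minimal sketch of the general technique, not the authors' implementation: the character-bigram tokenization, the 64-bit fingerprint width, the BLAKE2b token hash, and all function names are assumptions chosen for illustration. A text is kept only if its fingerprint differs from every previously kept fingerprint by more than the threshold, which matches the "Hamming distance > 3" wording in the quality metrics.

```python
import hashlib
from typing import Iterable, List

BITS = 64        # assumed fingerprint width
THRESHOLD = 3    # Hamming distance threshold quoted in the card


def tokens(text: str, n: int = 2) -> List[str]:
    """Character n-grams with whitespace removed (assumed tokenization)."""
    text = "".join(text.split())
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text: str) -> int:
    """Per-bit vote over token hashes; positive totals set the bit."""
    counts = [0] * BITS
    for tok in tokens(text):
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts: Iterable[str], threshold: int = THRESHOLD) -> List[str]:
    """Keep a text only if it is farther than `threshold` from all kept ones."""
    kept: List[str] = []
    prints: List[int] = []
    for text in texts:
        fp = simhash(text)
        if all(hamming(fp, other) > threshold for other in prints):
            kept.append(text)
            prints.append(fp)
    return kept
```

This pairwise comparison is quadratic in the number of kept records, which is fine for a sketch on a few thousand texts; a production pipeline would typically index fingerprints by bit blocks to avoid comparing every pair.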
Provider:
sii-research