
lumees/age-specific-text-simplification

Hugging Face | Updated 2025-08-13 | Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/lumees/age-specific-text-simplification
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- children
- simplification
- age-appropriate
- educational
- text-adaptation
- developmental-stages
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: original_text
    dtype: string
  - name: simplified_age_3
    dtype: string
  - name: simplified_age_4
    dtype: string
  - name: simplified_age_5
    dtype: string
  - name: original_word_count
    dtype: int64
  - name: original_grade_level
    dtype: float64
  - name: age_3_word_count
    dtype: int64
  - name: age_4_word_count
    dtype: int64
  - name: age_5_word_count
    dtype: int64
  config_name: default
  splits:
  - name: train
    num_bytes: 38847126
    num_examples: 15459
  - name: validation
    num_bytes: 4318431
    num_examples: 1718
  download_size: 15642789
  dataset_size: 43165557
---

# Age-Specific Text Simplification Dataset

## Dataset Description

This dataset contains complex texts simplified into age-appropriate versions for children aged 3, 4, and 5. Each original text has been professionally adapted to match the cognitive development, vocabulary, and comprehension abilities of each specific age group.

### Dataset Summary

- **Total Examples**: 17,177
- **Training Split**: 15,459 examples
- **Validation Split**: 1,718 examples
- **Languages**: English
- **Task**: Multi-target text simplification
- **Age Groups**: 3, 4, and 5 years old
- **Domain**: Cross-domain (scientific, news, educational, general knowledge)

## Dataset Creation

### Source Data

The original complex texts were collected from multiple high-quality sources:

1. **Wikipedia Articles** (40% - ~8,000 texts)
   - Standard Wikipedia articles
   - Focus on educational and encyclopedic content
   - Grade levels 9-15+ (Flesch-Kincaid)
2. **CNN/DailyMail News** (35% - ~7,000 texts)
   - News articles from CNN and DailyMail
   - Current events and factual reporting
   - Grade levels 9-13 (Flesch-Kincaid)
3. **Scientific Papers (arXiv)** (25% - ~5,000 texts)
   - Academic abstracts from arXiv
   - STEM fields and research content
   - Grade levels 12-20+ (Flesch-Kincaid)

### Selection Criteria

Original texts were filtered using strict quality criteria:

- **Word count**: 50-200 words
- **Reading grade**: Minimum 9.0 (Flesch-Kincaid)
- **Content quality**: Factual, educational, and appropriate for adaptation
- **Language**: Well-formed English prose
- **Exclusions**: Lists, tables, fragments, or low-quality text

### Simplification Methodology

#### Large Language Model Processing

- **Model**: Lumees 8B (32K context length)
- **Provider**: Lumees (Modal deployment)
- **Processing**: Batch processing with 5 texts per batch
- **Rate Limiting**: 60 requests/minute, 2M tokens/minute
- **Quality Control**: Multi-stage parsing with fallback mechanisms
- **Success Rate**: 17,177 of 20,000 attempted texts (about 86%) produced a usable simplification; every entry kept in the dataset parsed successfully

#### Age-Specific Guidelines

**For 3-Year-Olds:**

- Vocabulary: Only the simplest words (big, small, happy, sad)
- Sentence length: 3-5 words maximum
- Focus: Basic concepts, emotions, familiar comparisons
- Average output: ~16 words

**For 4-Year-Olds:**

- Vocabulary: Simple words with some new terms
- Sentence length: 4-7 words
- Focus: Basic cause-effect, slightly complex ideas
- Average output: ~22 words

**For 5-Year-Olds:**

- Vocabulary: Broader but still simple
- Sentence length: 8-10 words maximum
- Focus: Sequences, simple explanations, basic "why/how"
- Average output: ~28 words

#### Content Safety

Special attention was given to age-appropriate content handling:

- **Sensitive Topics**: Violence, tragedy, and adult themes filtered out
- **Vocabulary Filtering**: No inappropriate terms for young children
- **Emotional Safety**: Scary or disturbing content made gentle and reassuring
- **Educational Value**: Maintained factual accuracy while ensuring age-appropriateness

### Quality Metrics

- **Processing Success Rate**: Processing stopped once enough high-quality examples (17,177) had been obtained
- **Quality Control**: Only successful entries included in final dataset
- **Word Reduction**:
  - Age 3: 86-91% average reduction
  - Age 4: 84-86% average reduction
  - Age 5: 75-89% average reduction
- **Consistency**: All samples contain exactly 3 age-specific versions
- **Validation**: Manual review of 500+ samples confirmed quality

## Dataset Structure

### Data Fields

- `original_text`: Source complex text
- `simplified_age_3`: Version appropriate for 3-year-olds
- `simplified_age_4`: Version appropriate for 4-year-olds
- `simplified_age_5`: Version appropriate for 5-year-olds
- `original_word_count`: Word count of source text
- `original_grade_level`: Flesch-Kincaid grade level of source
- `age_3_word_count`: Word count of 3-year-old version
- `age_4_word_count`: Word count of 4-year-old version
- `age_5_word_count`: Word count of 5-year-old version

### Example

```json
{
  "original_text": "Kobellite is a gray, fibrous, metallic mineral with the chemical formula Pb22Cu4(Bi,Sb)30S69. It is also a sulfide mineral consisting of antimony, bismuth, and lead. It is a member of the izoklakeite-berryite series...",
  "simplified_age_3": "Kobellite is a shiny gray rock found in special places. It has parts of silver, copper, and other metals. People named it after a scientist.",
  "simplified_age_4": "Kobellite is a sparkly rock with metals like silver and copper. It grows in shapes like tiny pyramids. Scientists found it in Sweden, Colorado, and North Carolina.",
  "simplified_age_5": "Kobellite is a gray, fibrous mineral made of antimony, bismuth, and lead. It belongs to a special group of rocks and is named after a German scientist who studied minerals.",
  "original_word_count": 120,
  "original_grade_level": 13.2,
  "age_3_word_count": 25,
  "age_4_word_count": 27,
  "age_5_word_count": 30
}
```

## Use Cases

### Primary Applications

1. **Educational Content Creation**: Automatically adapt complex material for young learners
2. **Child-Friendly AI Systems**: Train models to communicate appropriately with children
3. **Developmental Research**: Study language complexity preferences across age groups
4. **Accessibility Tools**: Create reading aids for children with different comprehension levels
5. **Content Moderation**: Develop systems that can assess age-appropriateness

### Model Training

This dataset is ideal for training:

- **Multi-target text simplification models**
- **Age-aware language models**
- **Educational content generation systems**
- **Child-safe AI assistants**
- **Reading comprehension tools**

## Evaluation Metrics

When using this dataset, consider these evaluation approaches:

- **BLEU/ROUGE**: For measuring similarity to reference simplifications
- **Readability Scores**: Flesch-Kincaid Grade Level (FKGL) for age-appropriateness
- **Human Evaluation**: Age-appropriate vocabulary and comprehension
- **Safety Metrics**: Content appropriateness for target age groups
- **Semantic Preservation**: Maintaining core meaning while simplifying

## Dataset Statistics

| Metric | Age 3 | Age 4 | Age 5 | Original |
|--------|-------|-------|-------|----------|
| Avg Words | 16.2 | 22.1 | 27.8 | 142.3 |
| Avg Sentences | 2.1 | 2.8 | 3.2 | 8.7 |
| Vocabulary Size | 1,243 | 1,891 | 2,547 | 28,934 |
| Avg Grade Level | 2.8 | 4.1 | 5.3 | 13.1 |

## Limitations and Considerations

### Dataset Limitations

- **Language**: English only
- **Cultural Context**: Primarily Western/American cultural references
- **Domain Balance**: Scientific content slightly overrepresented
- **Temporal**: Reflects knowledge and language patterns from 2024-2025

### Ethical Considerations

- **Child Safety**: All content reviewed for age-appropriateness
- **Educational Bias**: May reflect adult assumptions about child comprehension
- **Accessibility**: Designed for neurotypical development patterns
- **Cultural Sensitivity**: Limited cultural diversity in examples and references

### Model Limitations

- **Automated Generation**: Some nuances may be lost in LLM processing
- **Consistency**: While high-quality, automated simplification may miss subtle context
- **Evaluation**: Automated metrics may not fully capture child comprehension

## Technical Implementation

### Processing Pipeline

1. **Data Collection**: Multi-source streaming with quality filters (20,000 texts collected)
2. **Batch Processing**: 5 texts per batch for efficiency
3. **LLM Simplification**: Lumees 8B with structured prompting via Modal
4. **Quality Assurance**: Multi-stage parsing with fallback mechanisms
5. **Quality Filtering**: Only successful simplifications retained (17,177 final examples)
6. **Validation**: Automated and manual quality checks

### Reproducibility

The dataset creation process is fully documented and reproducible:

- Source data collection scripts available
- LLM prompting strategies documented
- Quality control mechanisms specified
- Processing pipeline open-sourced

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{age_specific_simplification_2025,
  title={Age-Specific Text Simplification Dataset: Complex Content Adapted for Children Ages 3-5},
  author={Hasan Kurşun and Kerem Berkay Yanık},
  organization={Lumees},
  year={2025},
  publisher={Lumees},
  url={https://huggingface.co/datasets/lumees/age-specific-text-simplification}
}
```

## License

This dataset is released under the Apache License 2.0, which allows for both research and commercial use, modification, and distribution with proper attribution.

The Apache 2.0 license provides:

- **Freedom to use**: For any purpose, including commercial applications
- **Freedom to modify**: Adapt and build upon the dataset
- **Freedom to distribute**: Share original or modified versions
- **Patent protection**: Explicit patent rights grant
- **Attribution requirement**: Must include license and attribution notices

See the full Apache 2.0 license text for complete terms and conditions.
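As a usage sketch for the fields listed under Data Fields, the snippet below expands one record into three (prompt, target) pairs, one per age group, and computes the word reduction from the stored counts. The names `to_pairs`, `word_reduction`, and `load_split` and the prompt template are illustrative assumptions, not the prompts or tooling used to build the dataset. `EXAMPLE` is an abbreviated copy of the record shown in the Example section. `load_split` relies on `load_dataset` from the Hugging Face `datasets` library and needs network access, so it is defined but not called here.

```python
from typing import Dict, List, Tuple

AGES = (3, 4, 5)

# Abbreviated copy of the example record from the dataset card.
EXAMPLE: Dict[str, object] = {
    "original_text": "Kobellite is a gray, fibrous, metallic mineral...",
    "simplified_age_3": "Kobellite is a shiny gray rock found in special places. "
                        "It has parts of silver, copper, and other metals. "
                        "People named it after a scientist.",
    "simplified_age_4": "Kobellite is a sparkly rock with metals like silver and copper. "
                        "It grows in shapes like tiny pyramids. "
                        "Scientists found it in Sweden, Colorado, and North Carolina.",
    "simplified_age_5": "Kobellite is a gray, fibrous mineral made of antimony, bismuth, "
                        "and lead. It belongs to a special group of rocks and is named "
                        "after a German scientist who studied minerals.",
    "original_word_count": 120,
    "age_3_word_count": 25,
    "age_4_word_count": 27,
    "age_5_word_count": 30,
}


def to_pairs(record: Dict[str, object]) -> List[Tuple[str, str]]:
    """Expand one record into (prompt, target) pairs, one per age group.

    The prompt template is a hypothetical choice for illustration only.
    """
    pairs = []
    for age in AGES:
        prompt = f"Simplify for a {age}-year-old: {record['original_text']}"
        pairs.append((prompt, str(record[f"simplified_age_{age}"])))
    return pairs


def word_reduction(record: Dict[str, object], age: int) -> float:
    """Percentage of words removed relative to the original text."""
    original = int(record["original_word_count"])
    if original <= 0:
        raise ValueError("original_word_count must be positive")
    simplified = int(record[f"age_{age}_word_count"])
    return 100.0 * (1 - simplified / original)


def load_split(split: str = "train"):
    """Load one split from the Hub (requires `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset("lumees/age-specific-text-simplification", split=split)


if __name__ == "__main__":
    for age in AGES:
        print(age, f"{word_reduction(EXAMPLE, age):.1f}%")
```

For the example record this gives reductions of 79.2%, 77.5%, and 75.0% for ages 3, 4, and 5. Keeping the age in the prompt lets a single model serve all three targets; training three separate models would be an equally valid design.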
## Contact

For questions, suggestions, or collaborations, please contact hello@lumees.io or open an issue in the dataset repository.

---

**Keywords**: text simplification, children education, age-appropriate content, developmental linguistics, educational AI, child-safe AI, reading comprehension, accessibility
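The selection criteria and the `original_grade_level` field both rely on the Flesch-Kincaid grade level. The card does not say which tool computed it, so the following is only a minimal sketch using the standard formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, with a crude vowel-group syllable counter. The helper names (`count_syllables`, `fk_grade`, `passes_selection`) are hypothetical, and scores from this heuristic will not exactly match the values stored in the dataset.

```python
import re

_WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
_SENTENCE_END_RE = re.compile(r"[.!?]+")
_VOWEL_GROUP_RE = re.compile(r"[aeiouy]+")


def count_syllables(word: str) -> int:
    """Rough syllable estimate: vowel groups, minus a likely silent final 'e'."""
    word = word.lower()
    count = len(_VOWEL_GROUP_RE.findall(word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level using the heuristic syllable counter."""
    words = _WORD_RE.findall(text)
    if not words:
        return 0.0
    # Treat each run of ., ! or ? as one sentence end (decimals will overcount).
    sentences = max(len(_SENTENCE_END_RE.findall(text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


def passes_selection(text: str, min_words: int = 50, max_words: int = 200,
                     min_grade: float = 9.0) -> bool:
    """Apply the card's word-count and minimum-grade thresholds to one text."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words and fk_grade(text) >= min_grade


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = ("It is also a sulfide mineral consisting of "
             "antimony, bismuth, and lead.")
    print(f"simple: {fk_grade(simple):.2f}")
    print(f"dense:  {fk_grade(dense):.2f}")
```

The same `fk_grade` function could be applied to model outputs as the readability check suggested under Evaluation Metrics, provided the heuristic's limits are kept in mind for very short texts.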
Provider:
lumees