
Ultra-FineWeb-EDU

ModelScope community | Updated 2025-12-05 | Indexed 2025-09-06
Download link:
https://modelscope.cn/datasets/ProCreations/Ultra-FineWeb-EDU
Resource description:
# Ultra FineWeb EDU

<div align="center">

**High-Quality Educational Content from Ultra-FineWeb**

*Filtered for Maximum Educational Value*

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Dataset](https://img.shields.io/badge/🤗%20Dataset-Ultra--FineWeb--EDU-yellow)](https://huggingface.co/datasets/)
[![Quality](https://img.shields.io/badge/Quality-Premium%20Educational-green)]()

</div>

## 📚 Overview

Ultra FineWeb EDU is a premium educational dataset created by applying advanced educational content filtering to the exceptional [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) dataset. This work builds directly upon two foundational achievements: the rigorous data curation methodology of Ultra-FineWeb and the sophisticated educational classification capabilities of the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). We extract only the highest quality educational content with a strict threshold of **3.5+ educational score**.
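The score-and-threshold idea described above can be sketched as follows. This is a minimal illustration, not the project's actual processing script: the function names (`batched`, `filter_educational`, `load_classifier_scorer`, `toy_scorer`) are hypothetical, the batch size of 512 and the `content`-only output are taken from this README, and `toy_scorer` is a stand-in so the example runs without downloading the model. `load_classifier_scorer` assumes `torch` and `transformers` are installed and uses the standard `transformers` sequence-classification calls with the classifier linked above.

```python
from typing import Callable, Iterable, Iterator, List

THRESHOLD = 3.5  # educational score cut-off quoted in this README


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` items so memory use stays bounded."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def filter_educational(
    texts: Iterable[str],
    score_batch: Callable[[List[str]], List[float]],
    threshold: float = THRESHOLD,
    batch_size: int = 512,
) -> Iterator[dict]:
    """Keep only texts scoring >= threshold, emitted as {"content": text}."""
    for batch in batched(texts, batch_size):
        for text, score in zip(batch, score_batch(batch)):
            if score >= threshold:
                yield {"content": text}


def load_classifier_scorer(model_id: str = "HuggingFaceFW/fineweb-edu-classifier"):
    """Build a scorer backed by the FineWeb-Edu classifier (assumed setup)."""
    # Imported lazily so the rest of the sketch runs without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

    def score_batch(batch: List[str]) -> List[float]:
        inputs = tokenizer(batch, return_tensors="pt", padding="longest", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits  # shape: (batch, 1)
        return logits.squeeze(-1).float().tolist()

    return score_batch


def toy_scorer(batch: List[str]) -> List[float]:
    """Hypothetical stand-in for offline demos; NOT the real classifier."""
    return [4.0 if "photosynthesis" in text.lower() else 1.0 for text in batch]


if __name__ == "__main__":
    docs = [
        "Photosynthesis converts light into chemical energy.",
        "Buy cheap watches now!",
    ]
    for sample in filter_educational(docs, toy_scorer):
        print(sample)
```

Passing the scorer in as a function keeps the thresholding logic testable on its own; swapping `toy_scorer` for `load_classifier_scorer()` would, under the assumptions above, score texts with the real classifier.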
## ⭐ Key Features

- **🎯 Premium Quality**: Only content scoring 3.5+ on educational value (top ~10% of Ultra-FineWeb)
- **📖 Pure Content**: Metadata stripped, contains only the essential text content
- **🔍 Rigorous Filtering**: Multi-stage filtering pipeline ensures exceptional quality
- **⚡ Optimized Processing**: High-performance GPU-accelerated filtering pipeline
- **🤝 Community Driven**: Open-source processing code for reproducibility and extension

## 📊 Dataset Statistics

### Filtering Pipeline Overview

```
Raw Web Content (Trillions of pages)
    ↓ (Heavy filtering)
FineWeb (24.99B examples)
    ↓ (94.83% filtered out)
Ultra-FineWeb (1.29B examples)
    ↓ (90% filtered out - Educational threshold 3.5+)
Ultra FineWeb EDU (64,000+ examples) ← This Dataset
```

### Quality Metrics

- **Educational Threshold**: 3.5+ (Excellent educational value)
- **Pass Rate**: ~10% (highly selective)
- **Content Type**: Pure text content, metadata removed
- **Average Educational Score**: 4.2+ (estimated for passed content)
- **Language**: English (with potential for multilingual expansion)
- **Current Release**: 64,000+ premium educational samples

## 🏗️ Creation Methodology

**Building on Proven Excellence**: This dataset leverages the battle-tested methodologies from Ultra-FineWeb's efficient verification-based filtering and FineWeb-Edu's expert-validated educational classification.

### Educational Classification

We used the proven [HuggingFace FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), trained on 450k expert annotations, to score each sample:

- **Score 0-1**: Not educational / Low educational value → **Filtered out**
- **Score 2-3**: Some to good educational value → **Filtered out**
- **Score 3.5+**: High to excellent educational value → **✅ Included**

### Processing Pipeline

1. **Stream Ultra-FineWeb** in batches for memory efficiency
2. **Extract content** field only (remove metadata)
3. **Educational scoring** using BERT-based classifier
4. **Threshold filtering** at 3.5+ educational score
5. **Quality validation** and dataset compilation

## 🚀 Performance Optimizations

Our processing pipeline achieves **350+ samples/second** using:

- ⚡ FP16 precision for 2x speed boost
- 🔥 Large batch processing (512+ samples)
- 🎯 GPU memory optimization
- 💾 Automatic checkpointing every 30 minutes
- 🔄 Smart memory management and cleanup

## 📁 Dataset Structure

```json
{
  "content": "High-quality educational text content..."
}
```

Each sample contains only the `content` field with educational text, optimized for training language models focused on educational applications.

## 🛠️ Processing Code

The complete processing pipeline is open-sourced to enable community scaling and reproduction. The code includes optimizations for high-speed GPU processing, automatic checkpointing, and educational quality filtering.

### Requirements

```bash
pip install torch transformers datasets tqdm numpy pandas
```

*Complete processing script and documentation will be available in the repository.*

## 📈 Quality Analysis

### Educational Score Distribution (Based on 64,000+ Samples)

- **Score 3.5-4.0**: Solid educational content (60% of passed samples)
- **Score 4.0-4.5**: High-quality educational material (30% of passed samples)
- **Score 4.5-5.0**: Exceptional educational resources (10% of passed samples)

## 🎯 Use Cases

- **Educational AI Training**: Train models specifically for educational applications
- **Content Quality Research**: Study high-quality web content characteristics
- **Educational Content Generation**: Fine-tune models for creating educational materials
- **Knowledge Distillation**: Transfer educational knowledge to smaller models
- **Curriculum Development**: Analyze educational content patterns and structures

## 🤝 Community & Contributions

This initial release of 64,000+ premium educational samples demonstrates the effectiveness of our filtering pipeline.
The dataset represents a proof-of-concept for community-driven scaling.

**How you can contribute:**

- **Scale the processing**: Use our code to process additional Ultra-FineWeb data
- **Quality improvements**: Suggest enhanced filtering techniques
- **Multilingual expansion**: Apply similar filtering to other languages
- **Research applications**: Share findings and use cases with the community

**Next Steps:** The processing pipeline is designed for easy scaling. With access to larger compute resources, the complete Ultra-FineWeb dataset can be processed to yield an estimated 130M+ premium educational samples.

## 🚀 More Examples Coming Soon

This initial release represents just the beginning! We're actively working to expand Ultra FineWeb EDU with additional high-quality educational content.

**📈 Upcoming Releases:**

- **Extended English Dataset**: Processing continues on the full Ultra-FineWeb English corpus
- **Multilingual Support**: Chinese educational content from Ultra-FineWeb-zh
- **Quality Improvements**: Enhanced filtering techniques and threshold optimization
- **Community Contributions**: Datasets processed by community members with larger compute resources

**🔄 Release Schedule:**

- **Phase 1** (Current): 64,000+ samples - Proof of concept ✅
- **Phase 2** (Coming Soon): 500,000+ samples - Extended initial release
- **Phase 3** (Future): 10M+ samples - Major expansion
- **Phase 4** (Goal): 130M+ samples - Complete Ultra-FineWeb processing

**📊 Stay Updated:** Follow this repository for announcements about new releases, expanded datasets, and community contributions. Each release will maintain the same rigorous 3.5+ educational quality threshold.

*Processing speed: ~350 samples/second on consumer hardware. Community members with enterprise GPUs can significantly accelerate the timeline.*

## 📄 Citation

If you use Ultra FineWeb EDU in your research or applications, please cite:

```bibtex
@dataset{procreations2025ultrafineweb_edu,
  title={Ultra FineWeb EDU: High-Quality Educational Content from Ultra-FineWeb},
  author={ProCreations},
  year={2025},
  url={https://huggingface.co/datasets/[dataset-url]},
  note={Filtered from Ultra-FineWeb using educational quality threshold 3.5+}
}
```

## 🙏 Acknowledgments

This dataset stands on the shoulders of giants and would not be possible without the groundbreaking work of several teams:

### Core Foundations

- **🏆 Ultra-FineWeb Team ([openbmb](https://huggingface.co/openbmb))**: For creating the exceptional Ultra-FineWeb dataset through their innovative efficient verification-based filtering pipeline. Their work represents a quantum leap in data quality, reducing 25B samples to 1.3B through rigorous curation. This dataset directly builds upon their outstanding research and methodology. ([Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb), [Technical Report](https://arxiv.org/abs/2505.05427))
- **🧠 FineWeb-Edu Team ([HuggingFaceFW](https://huggingface.co/HuggingFaceFW))**: For developing the sophisticated educational content classifier that makes this work possible. Their BERT-based model, trained on 450k expert annotations, provides the critical educational quality assessment that enables precise filtering.
  ([FineWeb-Edu Classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier))

### Additional Thanks

- **FineWeb Team**: For the original high-quality web corpus that serves as the foundation for all subsequent work
- **Llama3 Team**: For providing the annotations that trained the educational classifier
- **Snowflake Arctic Team**: For the embedding model that powers the classifier
- **Open Source Community**: For the tools, libraries, and collaborative spirit that enable this research

### Special Recognition

The methodologies, quality standards, and technical innovations developed by the Ultra-FineWeb and FineWeb-Edu teams form the core foundation of this dataset. This work is essentially an application and extension of their remarkable contributions to the field of high-quality dataset curation.

## 📜 License

This dataset is released under the **Apache 2.0 License**, consistent with the source Ultra-FineWeb dataset. Please ensure compliance with the original dataset licenses when using this data.

## 🔗 Related Resources

- [Ultra-FineWeb Dataset](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
- [FineWeb-Edu Classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
- [Original FineWeb Dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
- [Processing Code Repository](https://github.com/[your-repo])

---

<div align="center">

**Created by ProCreations** | **Powered by Community Collaboration**

*Building better educational AI, one dataset at a time* 🚀📚

</div>
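## Appendix: Sanity-Checking the Scale Figures

The sizes, pass rate, and throughput quoted in this README can be cross-checked with a short calculation. The sketch below uses only the rounded figures stated above (24.99B, 1.29B, ~10%, ~350 samples/second) and assumes a constant throughput on a single machine and a uniform pass rate across the corpus; both are simplifying assumptions, and the helper names are illustrative.

```python
FINEWEB_EXAMPLES = 24.99e9       # FineWeb size quoted in the README
ULTRA_FINEWEB_EXAMPLES = 1.29e9  # Ultra-FineWeb size quoted in the README
PASS_RATE = 0.10                 # ~10% of samples score 3.5+
THROUGHPUT = 350.0               # samples/second on consumer hardware
SECONDS_PER_DAY = 86_400


def filtered_out_fraction(before: float, after: float) -> float:
    """Fraction of examples removed when going from `before` to `after`."""
    return 1.0 - after / before


def expected_yield(source_examples: float, pass_rate: float) -> float:
    """Expected number of samples kept at the given pass rate."""
    return source_examples * pass_rate


def processing_days(examples: float, samples_per_second: float) -> float:
    """Days needed to score `examples` at a constant throughput."""
    return examples / samples_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    removed = filtered_out_fraction(FINEWEB_EXAMPLES, ULTRA_FINEWEB_EXAMPLES)
    kept = expected_yield(ULTRA_FINEWEB_EXAMPLES, PASS_RATE)
    days = processing_days(ULTRA_FINEWEB_EXAMPLES, THROUGHPUT)
    print(f"FineWeb -> Ultra-FineWeb removed: {removed:.1%}")
    print(f"Expected samples from full Ultra-FineWeb: {kept / 1e6:.0f}M")
    print(f"Single-machine scoring time: {days:.0f} days")
```

With these rounded inputs the removal rate comes out at roughly 94.8%, the full-corpus yield at about 129M samples (in line with the 130M+ estimate), and single-machine scoring at a little over 42 days, which is why the release schedule leans on contributors with larger compute.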

Provider: maas
Created: 2025-08-20