Yxanul/fineweb-edu-highest-quality-2025

Source: Hugging Face. Updated: 2025-08-23. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/Yxanul/fineweb-edu-highest-quality-2025
Description:
The FineWeb-Edu Highest Quality Dataset (2025 Collection) contains the highest-quality educational content filtered from the 2025 Common Crawl snapshots of the FineWeb-Edu dataset, totaling about 4.18 billion tokens (4.1767 billion). Only about 2% of documents pass the filter. Every document meets three criteria: a length of at least 1,000 tokens, an educational quality score of at least 3.5 (the top 15% by educational quality), and a language score of at least 0.95 (high-confidence English). The dataset is stored in Parquet format with Snappy compression, supports streaming reads, and is split into multiple compressed files, each holding one batch of snapshot data. It is suitable for pre-training small language models, fine-tuning existing models for educational tasks, continued pre-training to improve educational capabilities, and research on high-quality educational text.
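The three thresholds in the description can be expressed as a simple predicate over records. The following is a minimal sketch, not the publisher's actual filtering code: the field names `token_count`, `score`, and `language_score` are assumed to follow the upstream FineWeb-Edu schema, and the sample records are made up for illustration.

```python
from typing import Iterable, Iterator, Mapping

# Thresholds taken from the dataset description.
MIN_TOKENS = 1000
MIN_EDU_SCORE = 3.5
MIN_LANGUAGE_SCORE = 0.95


def meets_criteria(doc: Mapping) -> bool:
    """Return True if a record passes all three thresholds.

    Field names are an assumption (upstream FineWeb-Edu schema);
    a record missing a field is treated as failing that check.
    """
    return (
        doc.get("token_count", 0) >= MIN_TOKENS
        and doc.get("score", 0.0) >= MIN_EDU_SCORE
        and doc.get("language_score", 0.0) >= MIN_LANGUAGE_SCORE
    )


def filter_docs(docs: Iterable[Mapping]) -> Iterator[Mapping]:
    """Lazily yield only the records that pass, so it also works on streams."""
    return (d for d in docs if meets_criteria(d))


# Hypothetical records, for illustration only.
sample = [
    {"text": "a", "token_count": 1500, "score": 3.8, "language_score": 0.97},
    {"text": "b", "token_count": 800, "score": 4.2, "language_score": 0.99},   # too short
    {"text": "c", "token_count": 2000, "score": 3.1, "language_score": 0.98},  # score too low
    {"text": "d", "token_count": 1000, "score": 3.5, "language_score": 0.95},  # exactly at the thresholds
]
kept = [d["text"] for d in filter_docs(sample)]
print(kept)
```

Because `filter_docs` is a generator, it could in principle be applied to records read in streaming mode (for example from the Hugging Face `datasets` library with `streaming=True`) to re-check the stated criteria without downloading every Parquet file first.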
Provider:
Yxanul