OpenWeb888K
Source: ModelScope community | Updated: 2025-11-12 | Indexed: 2025-02-08
Download link:
https://modelscope.cn/datasets/prithivMLmods/OpenWeb888K
Description:
# **OpenWeb Datasets Web Collection**
The OpenWeb Datasets Web Collection is derived from the FineWeb dataset, which consists of more than 15 trillion tokens (counted with the GPT-2 tokenizer) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance, and the subsets needed for this collection were extracted from Hugging Face's FineWeb collections. FineWeb was created by processing 96 CommonCrawl dumps, covering web data crawled from the summer of 2013 to April 2024. It spans a variety of domains and topics in English and is primarily intended as a research artifact for public data in the context of pretraining datasets for large language models. The CommonCrawl data was processed, filtered, and deduplicated with the Datatrove library, resulting in the largest publicly available clean LLM pretraining dataset.
## FineWeb Dataset Overview
| **Dataset Name** | **Total Entries** | **Dataset Link** |
|-----------------|-----------------|-----------------|
| FineWeb | 25B | [FineWeb Dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb) |
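Because the linked FineWeb dataset is far too large to download in full for a quick look, streaming a few records is usually the practical way to inspect it. The snippet below is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` library is installed, that FineWeb exposes a `sample-10BT` configuration, and that each record carries its content in a `text` field. The helper names `preview_records` and `stream_fineweb_sample` are hypothetical and introduced only for illustration.

```python
from itertools import islice


def preview_records(records, n=3, max_chars=200):
    """Return the first `n` text snippets from an iterable of dict records.

    Assumes each record has a "text" key (an assumption about the schema).
    """
    return [rec["text"][:max_chars] for rec in islice(records, n)]


def stream_fineweb_sample(config="sample-10BT"):
    """Open FineWeb in streaming mode so nothing is downloaded up front.

    The config name is an assumption; check the dataset card for the
    configurations that are actually available.
    """
    # Imported lazily so preview_records works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb",
        name=config,
        split="train",
        streaming=True,
    )


# Example usage (requires network access and the `datasets` package):
#   for snippet in preview_records(stream_fineweb_sample(), n=3):
#       print(snippet)
```

Keeping the preview logic separate from the loader means the same helper can be pointed at any iterable of records, including a local copy of this collection once its schema has been confirmed.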
Provider:
maas
Created:
2025-02-07



