WebOrganizer/Corpus-200B
Source: Hugging Face | Updated: 2025-02-19 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/WebOrganizer/Corpus-200B
Summary:
WebOrganizer/Corpus-200B is a pre-processed subset of CommonCrawl containing 200 billion tokens. The data has been cleaned with RefinedWeb filtering and BFF deduplication, and each document is annotated with two quality scores, WebOrganizer domains, and k-means scores. The dataset is distributed as several types of files: documents, token counts, quality scores, topic classifications, and format classifications. It also includes statistics on how often each domain occurs and how often domains co-occur.
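The domain presence and co-occurrence statistics mentioned in the summary can be pictured as simple counts over per-document labels. The following is a minimal sketch, assuming each document carries exactly one topic label and one format label; the label strings and the `domain_statistics` helper are hypothetical and are not taken from the dataset's actual files or taxonomy.

```python
from collections import Counter

# Toy annotations: one (topic, format) pair per document.
# These label strings are placeholders, not the dataset's real domain names.
annotations = [
    ("science", "tutorial"),
    ("science", "news"),
    ("health", "news"),
    ("science", "tutorial"),
]


def domain_statistics(pairs):
    """Count occurrences of each topic, each format, and each (topic, format) pair."""
    pairs = list(pairs)
    topic_counts = Counter(topic for topic, _ in pairs)    # topic presence
    format_counts = Counter(fmt for _, fmt in pairs)       # format presence
    pair_counts = Counter(pairs)                           # topic/format co-occurrence
    return topic_counts, format_counts, pair_counts


topic_counts, format_counts, pair_counts = domain_statistics(annotations)
```

Dividing each count by the number of documents would turn these into presence and co-occurrence frequencies; whether the published statistics are stored as raw counts or as frequencies is not stated in the listing.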
Provider:
WebOrganizer



