SPHERE
Source: arXiv. Indexed 2025-09-30.
Download link:
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
Description:
Sphere is a web corpus built from the first layer of a single CCNet snapshot. It contains 134 million web articles, split into 906.3 million passages of 100 tokens each. Sphere is intended to advance research in knowledge-intensive natural language processing by serving as a knowledge source with which state-of-the-art systems can match or outperform Wikipedia-based models.
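As a rough illustration of what "100-token passages" means, the sketch below splits one article into consecutive fixed-length passages. It is not the Sphere preprocessing pipeline: the function name `split_into_passages` is hypothetical, and whitespace tokenization is an assumption made only to keep the example self-contained.

```python
def split_into_passages(text, passage_len=100):
    """Split text into consecutive passages of at most passage_len tokens.

    Assumption: tokens are whitespace-separated words. The tokenizer
    actually used to build Sphere is not stated in this listing.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + passage_len])
        for i in range(0, len(tokens), passage_len)
    ]


# Hypothetical 250-word article: yields two full passages and one shorter tail.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])
```

At the reported scale, 906.3 million passages over 134 million articles works out to roughly 6.8 passages per article on average.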
Provider:
CCNet



