
SPHERE

Source: arXiv (indexed 2025-09-30)
Download link:
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
Resource overview:
Sphere is a web corpus built from the first layer of a single CCNet snapshot. It contains 134 million web articles, split into 906.3 million passages of 100 tokens each. Sphere is intended to support research in knowledge-intensive natural language processing by serving as a knowledge source with which state-of-the-art systems can match or outperform Wikipedia-based models.
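To make the passage figures above concrete, here is a minimal sketch of cutting an article into 100-token passages. It rests on assumptions: the listing does not say which tokenizer Sphere uses or whether passages overlap, so this sketch uses whitespace tokens, non-overlapping windows, and a shorter final passage. The function name `split_into_passages` is hypothetical and is not part of any Sphere or CCNet tooling.

```python
from typing import Iterator, List

PASSAGE_TOKENS = 100  # passage length stated in the description


def split_into_passages(text: str, size: int = PASSAGE_TOKENS) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping windows of at most `size` tokens."""
    # Assumption: whitespace tokens stand in for the real (unspecified) tokenizer.
    tokens = text.split()
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


# A toy 250-token "article" produces two full passages and one partial one.
article = " ".join(f"w{i}" for i in range(250))
passages = list(split_into_passages(article))
print([len(p) for p in passages])  # [100, 100, 50]

# Average number of passages per article implied by the reported totals.
avg_passages = 906.3e6 / 134e6
print(round(avg_passages, 2))  # 6.76
```

At the reported totals this works out to roughly 6.8 passages per article, which suggests that a typical article in the corpus runs to several hundred tokens.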
Provider:
CCNet