SPHERE

Name: SPHERE
Creator: CCNet
License: not specified

Source: arXiv (indexed 2025-09-30)

Download link:

https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/

Description:

Sphere is a web corpus built from the first layer of a single CCNet snapshot. It contains 134 million web articles, split into 906.3 million passages of 100 tokens each. Sphere is intended to advance research in knowledge-intensive natural language processing (NLP) and to serve as a knowledge source that lets state-of-the-art systems match or outperform Wikipedia-based models.
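To make the passage structure concrete, the following is a minimal sketch of how an article can be cut into fixed-length 100-token passages. It is an illustration only, not the procedure used to build Sphere: the function name `split_into_passages` is hypothetical, and whitespace splitting stands in for whatever tokenizer the creators used, which this listing does not specify. The last line derives the average number of passages per article from the two figures quoted above.

```python
from typing import List


def split_into_passages(text: str, passage_len: int = 100) -> List[str]:
    """Split text into consecutive passages of at most `passage_len` tokens.

    Assumption: tokens are whitespace-separated words. The real corpus may
    use a different tokenizer.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + passage_len])
        for i in range(0, len(tokens), passage_len)
    ]


# A synthetic 250-token "article" yields two full passages and one of 50 tokens.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print(len(passages))

# Average passages per article implied by the reported corpus statistics.
avg_passages_per_article = 906.3e6 / 134e6
print(round(avg_passages_per_article, 2))
```

Under these assumptions an average article contributes roughly seven passages, which corresponds to several hundred tokens of text per article.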

Provider:

CCNet
