SPHERE

Name: SPHERE
Creator: Facebook AI Research
Published: 2022-05-25 02:16:24
License: No description available

Source: arXiv. Updated: 2022-05-25. Indexed: 2024-06-21.

Download link:

https://github.com/facebookresearch/Sphere

Overview:

SPHERE is a large-scale web corpus created by Facebook AI Research to support knowledge-intensive natural language processing (KI-NLP) tasks. It contains more than 900 million entries and aims to provide broader and deeper knowledge coverage than traditional resources such as Wikipedia. The corpus was built by processing Common Crawl data with deduplication, language identification, and quality filtering, yielding a collection of 134 million web articles. It is particularly well suited to tasks with diverse information needs, such as fact-checking, open-domain question answering, and entity linking. With SPHERE, researchers can explore and develop more advanced KI-NLP systems, especially for open-domain settings where structured data is scarce.
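
The three processing steps named above (deduplication, language identification, quality filtering) can be illustrated with a minimal Python sketch. This is not the pipeline used to build SPHERE, whose details are not given here. The function names, the hash-based exact deduplication, the `min_words` threshold, and the toy language detector are all assumptions made for illustration; a real pipeline would plug in a proper language-identification model and a stronger quality signal.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def filter_corpus(
    docs: Iterable[str],
    detect_lang: Callable[[str], str],
    target_lang: str = "en",
    min_words: int = 5,  # hypothetical quality threshold, not taken from SPHERE
) -> Iterator[str]:
    """Yield documents that survive dedup, language, and length filters."""
    seen = set()
    for doc in docs:
        # Step 1: exact deduplication on a hash of the normalized text.
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Step 2: language identification via the caller-supplied detector.
        if detect_lang(doc) != target_lang:
            continue
        # Step 3: crude quality filter that drops very short documents.
        if len(doc.split()) < min_words:
            continue
        yield doc


def toy_detect_lang(text: str) -> str:
    """Hypothetical stand-in detector: treats mostly-ASCII text as English."""
    if not text:
        return "unknown"
    ascii_chars = sum(ch.isascii() for ch in text)
    return "en" if ascii_chars / len(text) > 0.9 else "other"
```

Passing the detector in as a callable keeps the sketch independent of any particular language-identification library, so `toy_detect_lang` can be swapped for a real model without changing `filter_corpus`.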

Provider:

Facebook AI Research

Created:

2021-12-18
