Common Crawl Question Answering (CCQA) dataset

Source: arXiv | Updated: 2022-05-03 | Indexed: 2024-06-21
Download link:
https://github.com/facebookresearch/CCQA
Description:

The CCQA dataset is a large-scale open-domain question answering dataset created jointly by Meta AI and the University of British Columbia. Built from the Common Crawl project, it contains approximately 130 million multilingual question-answer pairs, about 60 million of which are in English. Its purpose is to provide a large-scale, natural, diverse, and high-quality corpus for pre-training popular language models on question answering tasks. Question-answer pairs are generated from explicit schema.org annotations, which helps ensure data quality, and the dataset supports a range of question answering tasks such as answer selection, scoring, and ranking. It also includes additional attributes such as votes, multiple competing answers, question summaries, and HTML markup, which make it suitable for a variety of future research applications.
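To make the schema.org-based extraction concrete, here is a minimal sketch of how question-answer pairs might be pulled from microdata-annotated HTML and ranked by votes. It is not the CCQA authors' pipeline: the class `QAExtractor`, the function `extract_qa`, the output dictionary layout, and the sample page are all hypothetical, and it assumes the page uses the schema.org `Question` and `Answer` types with the `name`, `text`, and `upvoteCount` properties.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class QAExtractor(HTMLParser):
    """Hypothetical helper: collects schema.org Question/Answer microdata."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, role) for every currently open element
        self.questions = []  # raw questions in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        a = dict(attrs)
        itemtype = a.get("itemtype") or ""
        itemprop = a.get("itemprop") or ""
        roles = [r for _, r in self.stack if r]
        current = roles[-1] if roles else None  # innermost annotated ancestor
        role = None
        if itemtype.endswith("schema.org/Question"):
            self.questions.append({"question": "", "answers": []})
            role = "question"
        elif itemtype.endswith("schema.org/Answer") and current == "question":
            self.questions[-1]["answers"].append({"text": "", "votes": ""})
            role = "answer"
        elif current == "question" and itemprop == "name":
            role = "name"
        elif current == "answer" and itemprop in ("text", "upvoteCount"):
            role = itemprop
        self.stack.append((tag, role))

    def handle_endtag(self, tag):
        # Ignore stray or void end tags, otherwise pop back to the match.
        if not any(t == tag for t, _ in self.stack):
            return
        while self.stack:
            open_tag, _ = self.stack.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text belongs to the innermost annotated ancestor, if any.
        role = next((r for _, r in reversed(self.stack) if r), None)
        if role == "name":
            self.questions[-1]["question"] += data
        elif role == "text":
            self.questions[-1]["answers"][-1]["text"] += data
        elif role == "upvoteCount":
            self.questions[-1]["answers"][-1]["votes"] += data


def extract_qa(html):
    """Return questions with their answers sorted by vote count, highest first."""
    parser = QAExtractor()
    parser.feed(html)
    parser.close()
    result = []
    for q in parser.questions:
        answers = []
        for ans in q["answers"]:
            votes = ans["votes"].strip()
            answers.append({
                "text": " ".join(ans["text"].split()),
                "votes": int(votes) if votes.isdigit() else 0,
            })
        answers.sort(key=lambda x: x["votes"], reverse=True)
        result.append({"question": " ".join(q["question"].split()),
                       "answers": answers})
    return result


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/QAPage">
  <div itemprop="mainEntity" itemscope itemtype="https://schema.org/Question">
    <h1 itemprop="name">How do I reverse a list in Python?</h1>
    <div itemprop="suggestedAnswer" itemscope itemtype="https://schema.org/Answer">
      <span itemprop="upvoteCount">3</span>
      <div itemprop="text">Use slicing: <code>xs[::-1]</code>.</div>
    </div>
    <div itemprop="acceptedAnswer" itemscope itemtype="https://schema.org/Answer">
      <span itemprop="upvoteCount">12</span>
      <div itemprop="text">Call <code>xs.reverse()</code> to reverse in place.</div>
    </div>
  </div>
</div>
"""
```

Calling `extract_qa(SAMPLE_HTML)` yields one question whose competing answers are ordered by votes, which is the kind of signal the answer selection and ranking tasks mentioned above could use. The sketch discards the HTML markup inside answers, whereas the dataset description states that CCQA retains it.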
Provided by:
University of British Columbia; Meta AI
Created:
2021-10-15