WebBrain-Raw

arXiv, indexed 2025-09-30
Download link:
https://github.com/qhjqhj00/webbrain-data
Overview:
WebBrain-Raw is a large-scale dataset built by extracting English Wikipedia articles together with the crawlable references those articles cite, so that researchers can run experiments on generating factual articles. Two task-specific datasets are derived from WebBrain-Raw: WebBrain-R, for training in-domain retrievers, and WebBrain-G, for training in-domain generators. The dataset is ten times larger than the previous largest comparable dataset, and its target task is to generate a short factual article with citations in response to a query.