WebBrain-Raw
Source: arXiv. Indexed: 2025-09-30.
Download link:
https://github.com/qhjqhj00/webbrain-data
Description:
WebBrain-Raw is a large-scale dataset built by extracting English Wikipedia articles together with those of their cited references that could be crawled, so that researchers can run experiments on generating factual articles. Two task-specific datasets are derived from WebBrain-Raw: WebBrain-R, for training in-domain retrievers, and WebBrain-G, for training in-domain generators. The dataset is ten times larger than the largest previous comparable dataset. The target task is to generate, for a given query, a short factual article with citations.



