WebBrain-Raw
Source: arXiv · Updated: 2023-04-10 · Indexed: 2024-06-21
Download link:
https://github.com/qhjqhj00/WebBrain
Description:
The WebBrain-Raw dataset was built by a research team at the Gaoling School of Artificial Intelligence to support the WebBrain task: generating factually correct articles for a query by mining supporting evidence from the web. It contains every article in English Wikipedia together with its accessible references, 14.86 million articles in total, which is ten times the size of the largest previous dataset of its kind. Construction consisted of extracting the articles and their references from Wikipedia and then cleaning the data to ensure quality. Intended applications include automatic Wikipedia page generation, intelligent writing assistance, and knowledge-intensive question answering. More broadly, the dataset targets the problem of automatically acquiring knowledge from the web to serve people's wider fact-oriented information needs.
Provided by:
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Created:
2023-04-10



