ChroniclingAmericaQA
Source: arXiv. Updated: 2024-05-11. Indexed: 2024-06-21.
Download link:
https://github.com/DataScienceUIBK/ChroniclingAmericaQA
Description:
ChroniclingAmericaQA is a large-scale question answering (QA) dataset created by researchers at the University of Innsbruck, Austria. It contains 487,000 question-answer pairs built from historical American newspaper pages in the Chronicling America collection, covering the 120 years from 1800 to 1920. The dataset is intended to advance research on QA and machine reading comprehension, particularly for historical documents. Alongside the raw scanned images and their noisy text, it provides corrected text, so models can be tested on text of varying quality. Its intended uses include educational resources, public engagement with historical documents, and a benchmark for training and evaluating how well models handle historical text.
Provided by:
University of Innsbruck
Created:
2024-03-27



