Reubencf/Adaption-low-resource-doc-qa
Source: Hugging Face. Updated 2026-04-24; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Reubencf/Adaption-low-resource-doc-qa
Resource description:
This is a multilingual document question-answering dataset with a deliberate focus on low-resource languages. It contains 10,200 rows derived from public-domain magazine and newspaper pages. Each row includes verbatim OCR text in the page's native language, an English description, and a generated Q/A pair in one of 10 target languages: Arabic, German, English, Spanish, French, Hindi, Italian, Japanese, Portuguese, and Chinese. The dataset covers 45+ distinct source languages and is heavily tilted toward low-resource ones (~90% of rows are in languages outside the "big 10"). It is intended for evaluating and training document-understanding and OCR-VQA models, and for benchmarking cross-lingual question answering where the document and the query are in different languages.
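The "~90% of rows outside the big 10" figure could be checked with a small helper like the one below. This is only a sketch under stated assumptions: the field name `source_lang`, the use of ISO 639-1 codes, and the reading of the "big 10" as the ten listed target languages are all hypothetical and are not confirmed by the description.

```python
from collections import Counter

# ASSUMPTION: the "big 10" are the ten target languages named in the description,
# written here as ISO 639-1 codes. The listing does not confirm either point.
BIG_10 = {"ar", "de", "en", "es", "fr", "hi", "it", "ja", "pt", "zh"}


def low_resource_share(rows, lang_field="source_lang", high_resource=BIG_10):
    """Return the fraction of rows whose source language is outside `high_resource`.

    `lang_field` is a hypothetical column name; adjust it to the real schema.
    """
    langs = [row[lang_field] for row in rows]
    if not langs:
        return 0.0
    outside = sum(1 for lang in langs if lang not in high_resource)
    return outside / len(langs)


def language_counts(rows, lang_field="source_lang"):
    """Count rows per source language (same hypothetical column name)."""
    return Counter(row[lang_field] for row in rows)
```

Here `rows` is any iterable of dict-like records, for example rows obtained by iterating over the dataset after loading it with the Hugging Face `datasets` library; the actual column names should be read from the dataset itself before relying on this.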
Provider:
Reubencf



