Reubencf/low-resource-multilingual-doc-qa

Name: Reubencf/low-resource-multilingual-doc-qa
Creator: Reubencf
Published: 2026-04-23 16:34:32
License: No description available

Hugging Face | Updated 2026-04-23 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/Reubencf/low-resource-multilingual-doc-qa

Description:

This dataset contains over 10,000 question-answer pairs derived from multilingual document pages in languages including Italian, German, Chinese, Portuguese, and Japanese. Each sample includes the original OCR text, page metadata, and a specific query about dates, titles, entities, or content details found in the document. The data is structured to support training models for document understanding and cross-lingual information extraction. The dataset has 9,724 data points and is an instruction-tuning dataset. The final quality grade is B, with a relative quality improvement of 112.5%. Domain distribution: Other (46%), Religion (10%), Animal-nature (8%). Language distribution: Italian (12%), Arabic (12%), Japanese (12%). Tone distribution: Informative (70%), Clear (12%), Helpful (10%).
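
As a rough illustration of how a sample of this kind (OCR text, page metadata, a query and its answer) might be turned into an instruction-tuning pair, here is a minimal sketch. The field names `ocr_text`, `question`, `answer`, and `page_metadata`, the prompt wording, and the example record are all hypothetical; the dataset's actual schema is not given in this listing and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class DocQASample:
    # Hypothetical field names; the real dataset schema may differ.
    ocr_text: str
    question: str
    answer: str
    page_metadata: dict = field(default_factory=dict)


def to_instruction_pair(sample: DocQASample, max_chars: int = 2000) -> dict:
    """Format one sample as a prompt/response pair for instruction tuning."""
    # Truncate long OCR text so the prompt stays within a fixed character budget.
    context = sample.ocr_text[:max_chars]
    # Render metadata as sorted "key: value" lines for a deterministic prompt.
    meta = "\n".join(f"{k}: {v}" for k, v in sorted(sample.page_metadata.items()))
    prompt = (
        "Answer the question using only the document page below.\n\n"
        f"Page metadata:\n{meta}\n\n"
        f"OCR text:\n{context}\n\n"
        f"Question: {sample.question}"
    )
    return {"prompt": prompt, "response": sample.answer}


# Invented example record, for illustration only (not taken from the dataset).
example = DocQASample(
    ocr_text="Roma, 12 marzo 1921. Relazione annuale del consiglio.",
    question="What date appears on the page?",
    answer="12 marzo 1921",
    page_metadata={"language": "it", "page": 1},
)
pair = to_instruction_pair(example)
```

Keeping the metadata in sorted order makes the prompt deterministic, which helps when deduplicating or caching formatted samples.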

Provider:

Reubencf
