Reubencf/adaption-multilingual-doc-qa

Name: Reubencf/adaption-multilingual-doc-qa
Creator: Reubencf
Published: 2026-04-24 06:37:32
License: 暂无描述

Hugging Face2026-04-24 更新2026-04-26 收录

下载链接：

https://hf-mirror.com/datasets/Reubencf/adaption-multilingual-doc-qa

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集是一个多语言文档问答数据集，专注于从文档中提取特定的事实细节（如页码、姓名、年龄、日期、计数、标题等）。每个条目包含一个询问特定细节的“提示”和一个基于源页面文本提供精确答案的“完成”。数据集的特点是跨语言，源文档和问题通常使用不同的语言，因此也可作为跨语言信息检索的基准。数据集包含8,801行数据，采用指令调优格式，具有增强的提示/完成/推理列。数据集涵盖了10种问答目标语言和48种页面/OCR源语言，特别关注低资源语言。领域分布包括其他（50%）、历史（10%）和产品建议（6%）。语气分布为信息性（68%）、清晰（18%）和帮助性（10%）。

This dataset is a multilingual document question-answering dataset focused on extracting specific factual details from documents (page numbers, names, ages, dates, counts, titles, etc.). Each entry consists of a `prompt` asking for a specific detail and a `completion` providing the precise answer grounded in the source page text. The dataset is cross-lingual, with the source document and the question often in different languages, making it also a benchmark for cross-lingual information retrieval. The dataset contains 8,801 rows in an instruction-tuning format with `enhanced_prompt`, `enhanced_completion`, and reasoning columns. It covers 10 Q/A target languages and 48 page/OCR source languages, with a special focus on low-resource languages. The domain distribution includes Other (50%), History (10%), and Product-advice (6%). The tone distribution is Informative (68%), Clear (18%), and Helpful (10%).

提供机构：

Reubencf

5,000+

优质数据集

54 个

任务类型

进入经典数据集