five

Reubencf/adaption-multilingual-doc-qa

收藏
Hugging Face2026-04-24 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/Reubencf/adaption-multilingual-doc-qa
下载链接
链接失效反馈
官方服务:
资源简介:
该数据集是一个多语言文档问答数据集,专注于从文档中提取特定的事实细节(如页码、姓名、年龄、日期、计数、标题等)。每个条目包含一个询问特定细节的“提示”和一个基于源页面文本提供精确答案的“完成”。数据集的特点是跨语言,源文档和问题通常使用不同的语言,因此也可作为跨语言信息检索的基准。数据集包含8,801行数据,采用指令调优格式,具有增强的提示/完成/推理列。数据集涵盖了10种问答目标语言和48种页面/OCR源语言,特别关注低资源语言。领域分布包括其他(50%)、历史(10%)和产品建议(6%)。语气分布为信息性(68%)、清晰(18%)和帮助性(10%)。

This dataset is a multilingual document question-answering dataset focused on extracting specific factual details from documents (page numbers, names, ages, dates, counts, titles, etc.). Each entry consists of a `prompt` asking for a specific detail and a `completion` providing the precise answer grounded in the source page text. The dataset is cross-lingual, with the source document and the question often in different languages, making it also a benchmark for cross-lingual information retrieval. The dataset contains 8,801 rows in an instruction-tuning format with `enhanced_prompt`, `enhanced_completion`, and reasoning columns. It covers 10 Q/A target languages and 48 page/OCR source languages, with a special focus on low-resource languages. The domain distribution includes Other (50%), History (10%), and Product-advice (6%). The tone distribution is Informative (68%), Clear (18%), and Helpful (10%).
提供机构:
Reubencf
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作