
InstructDoc

Source: arXiv. Updated 2024-01-24. Indexed 2024-06-21.
Download link:
https://github.com/nttmdlab-nlp/InstructDoc
Overview:
InstructDoc is a large-scale visual document understanding dataset created jointly by NTT Corporation and Tohoku University. It brings together 30 publicly available datasets covering 12 distinct tasks, such as question answering and information extraction. Each document is paired with diverse, expert-annotated instructions that follow a unified format specifying the user's intent and the expected answer style. Through these handcrafted instructions, InstructDoc aims to improve understanding of open-ended document types and formats, including document layout, visual representations of text, and relation extraction for objects such as charts. The dataset is designed to address zero-shot generalization in visual document understanding.
Provided by:
NTT Corporation and Tohoku University
Created:
2024-01-24