下载链接：

https://modelscope.cn/datasets/HuggingFaceM4/Docmatix

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for Docmatix ![image/webp](https://cdn-uploads.huggingface.co/production/uploads/65d66b494bbd0d92b641cdbb/P7rIELr2eom_IorBY5DZu.webp) ## Dataset description Docmatix is part of the Idefics3 release (stay tuned). It is a massive dataset for Document Visual Question Answering that was used for the fine-tuning of the vision-language model Idefics3. ## Load the dataset To load the dataset, install the library `datasets` with `pip install datasets`. Then, ``` from datasets import load_dataset ds = load_dataset("HuggingFaceM4/Docmatix") ``` If you want the dataset to link to the pdf files as binaries instead of the images, do: ``` from datasets import load_dataset ds = load_dataset("HuggingFaceM4/Docmatix", "pdf") ``` ## Data fields An example of a sample looks as follows: ``` { "images" = [PIL.Image] "texts" = [ { "user": "What is the purpose of the Confirmation Statement mentioned in the document?", "assistant": "The purpose of the Confirmation Statement is to confirm that all information required to be delivered by the company to the registrar in relation to the confirmation period concerned has been delivered or is being delivered at the same time as the confirmation statement.", "source": "PDFA key: 244" }, { "user": "When was the filing received as per the document?", "assistant": "The filing was received for filing in Electronic Format on the 23/03/2021.", "source": "PDFA key: 244" }, ] } ``` In `images`, there is a list of up to 4 images, to be placed before the text. In `texts`, there is a conversation between a user and an assistant about the images that is represented by a list of turns. ## Comparison to other DocVQA datasets | Dataset | # images | # Q/A pairs | # tokens | |----------------------|----------|-------------|------------| | *Document visual question answering* | | **Docmatix** | **2,444,750**| **9,500,000** | **390,000,000**| | DocVQA | 10,189 | 39,463 | 337,829 | | TextCaps | 21,953 | 21,953 | 389,658 | | TextVQA | 21,953 | 34,602 | 181,918 | | ST-VQA | 17,247 | 23,121 | 127,846 | | OCR-VQA | 165,746 | 801,579 | 6,073,824 | | VisualMRC | 3,027 | 11,988 | 168,828 | | IAM | 5,663 | 5,663 | 144,216 | | InfoVQA | 2,118 | 10,074 | 61,048 | | Diagram image-to-text| 300 | 300 | 22,196 | # Citation **BibTeX:** ```bibtex @misc{laurençon2024building, title={Building and better understanding vision-language models: insights and future directions.}, author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon}, year={2024}, eprint={2408.12637}, archivePrefix={arXiv}, primaryClass={cs.CV} } ```

# 数据集卡片：Docmatix ![image/webp](https://cdn-uploads.huggingface.co/production/uploads/65d66b494bbd0d92b641cdbb/P7rIELr2eom_IorBY5DZu.webp) ## 数据集概述 Docmatix 是 Idefics3 发布计划的一部分（敬请期待）。该数据集为大规模文档视觉问答（Document Visual Question Answering, DocVQA）数据集，曾用于视觉语言模型（vision-language model）Idefics3 的微调训练。 ## 数据集加载如需加载该数据集，请先通过 `pip install datasets` 安装 `datasets` 库，随后执行以下代码： from datasets import load_dataset ds = load_dataset("HuggingFaceM4/Docmatix") 若需要将数据集关联为PDF二进制文件而非图像文件，请执行以下代码： from datasets import load_dataset ds = load_dataset("HuggingFaceM4/Docmatix", "pdf") ## 数据字段单条样本的示例格式如下： { "images" = [PIL.Image] "texts" = [ { "user": "文档中提及的确认声明的用途是什么？", "assistant": "确认声明的用途为确认：公司已就相关确认期间向注册机构提交的所有法定要求信息，已与该确认声明同时完成提交，或将同步完成提交。", "source": "PDFA key: 244" }, { "user": "根据文档，该归档申请何时被收到？", "assistant": "该归档申请已于2021年3月23日以电子格式完成接收。", "source": "PDFA key: 244" }, ] } 在`images`字段中，存储最多4张图像的列表，需将其置于文本之前。在`texts`字段中，以多轮对话列表的形式存储用户与助手针对该图像的交互内容。 ## 与其他DocVQA数据集的对比 | 数据集名称 | 图像数量 | 问答对数量 | Token（Token）数量 | |--------------------------|----------|-------------|----------------------| | *文档视觉问答* | | | | | **Docmatix** | **2,444,750** | **9,500,000** | **390,000,000** | | DocVQA | 10,189 | 39,463 | 337,829 | | TextCaps | 21,953 | 21,953 | 389,658 | | TextVQA | 21,953 | 34,602 | 181,918 | | ST-VQA | 17,247 | 23,121 | 127,846 | | OCR-VQA | 165,746 | 801,579 | 6,073,824 | | VisualMRC | 3,027 | 11,988 | 168,828 | | IAM | 5,663 | 5,663 | 144,216 | | InfoVQA | 2,118 | 10,074 | 61,048 | | Diagram image-to-text | 300 | 300 | 22,196 | ## 引用 **BibTeX 引用格式：** bibtex @misc{laurençon2024building, title={构建并优化视觉语言模型：洞察与未来方向}, author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon}, year={2024}, eprint={2408.12637}, archivePrefix={arXiv}, primaryClass={cs.CV} }

应用场景：