DoclingMatix
ModelScope community. Updated 2025-12-05; indexed 2025-08-02.
Download link:
https://modelscope.cn/datasets/HuggingFaceM4/DoclingMatix
# DoclingMatix
DoclingMatix is a large-scale, multimodal dataset designed for training vision-language models in the domain of document intelligence. It was created specifically for training the SmolDocling model, an ultra-compact model for end-to-end document conversion.
The dataset is constructed by augmenting Hugging Face's [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix). Each Docmatix sample, which consists of a document image and a few questions and answers about it, has been transformed: the text field is now prepended with an instructional prompt that guides a model to convert the document image into our structured DocTag format. This instruction-style format makes DoclingMatix well suited to training instruction-following models on document-related tasks.
* **Document Conversion**: The primary intended use is to train models that take a document image as input and generate a structured text representation as output.
* **Document Visual Question Answering (VQA)**: The dataset can be adapted for VQA tasks by creating question-answer pairs based on the document's content and structure.
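
The prompt-prepending transformation described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: the field names `image` and `text`, the prompt wording in `CONVERSION_PROMPT`, and the helper `prepend_prompt` are all assumptions made for the example, and the sample values are placeholders.

```python
from typing import Any

# Hypothetical prompt wording: the dataset card does not quote the exact instruction.
CONVERSION_PROMPT = "Convert this document image to DocTag format."


def prepend_prompt(sample: dict[str, Any], prompt: str = CONVERSION_PROMPT) -> dict[str, Any]:
    """Return a copy of `sample` whose text field starts with `prompt`.

    The `text` key is an assumed field name; the real schema may differ.
    """
    updated = dict(sample)  # shallow copy so the input sample is left unchanged
    updated["text"] = f"{prompt}\n{sample['text']}"
    return updated


if __name__ == "__main__":
    # Placeholder sample; both values are stand-ins, not real dataset content.
    sample = {"image": "page_0001.png", "text": "<structured DocTag output>"}
    print(prepend_prompt(sample)["text"])
```

Returning a copy rather than editing the sample in place keeps the original Docmatix-style record available, which is convenient when the same source sample feeds both conversion and VQA training.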
---
## Dataset Statistics
* **Total samples**: 1,270,911
* **Training set**: 1,270,911
* **Modalities**: Images, Text
---
## Intended Use
* Training multimodal models for **document conversion** and **document visual question answering**.
---
## Citation
If you use DoclingMatix, please cite:
```bibtex
@article{nassar2025smoldocling,
title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
author={Nassar, Ahmed and Marafioti, Andres and Omenetti, Matteo and Lysak, Maksym and Livathinos, Nikolaos and Auer, Christoph and Morin, Lucas and de Lima, Rafael Teixeira and Kim, Yusik and Gurbuz, A Said and others},
journal={arXiv preprint arXiv:2503.11576},
year={2025}
}
```
Provider: maas
Created: 2025-08-01



