five

HuggingFaceM4/DoclingMatix

收藏
Hugging Face2025-07-31 更新2025-08-09 收录
下载链接:
https://hf-mirror.com/datasets/HuggingFaceM4/DoclingMatix
下载链接
链接失效反馈
官方服务:
资源简介:
DoclingMatix是一个大规模多模态数据集,旨在训练文档智能领域的视觉-语言模型。该数据集通过增加指导模型将文档图像转换为结构化DocTag格式的指导性提示,对Docmatix数据集进行了转换,特别适合于训练遵循指令的模型处理与文档相关的任务。数据集的主要用途是训练能够将文档图像输入转换为结构化文本输出的模型,并且可以适应于基于文档内容和结构的视觉问题回答任务。

DoclingMatix is a large-scale, multimodal dataset designed for training vision-language models in the domain of document intelligence. It is specifically tailored for training the SmolDocling model, an ultra-compact model for end-to-end document conversion. The dataset is constructed by augmenting the Docmatix dataset with instructional prompts guiding the model to convert document images into a structured DocTag format, making it ideal for training instruction-following models on document-related tasks. The primary use is for training models capable of converting document images into structured text representations, and it can also be adapted for document visual question answering tasks based on the documents content and structure.
提供机构:
HuggingFaceM4
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作