XFUND_DoclingDocument
ModelScope community · Updated 2025-10-01 · Indexed 2025-03-15
Download link:
https://modelscope.cn/datasets/ds4sd/XFUND_DoclingDocument
Description:
# XFUND DoclingDocument Dataset
This repository contains a private version of the XFUND dataset processed into the [**DoclingDocument**](https://ds4sd.github.io/docling/concepts/docling_document/) format using [Docling](https://github.com/DS4SD/docling).
## Dataset Overview
- **Total Samples:** 1393 annotated forms
- **Splits:**
  - **train:** 1043 samples
  - **validation:** 350 samples
- **Features:**
  - `docling_version`: Version of the Docling document format with which the document is processed.
  - `document_id`: Unique identifier for each document.
  - `GroundTruthDocument`: The **DoclingDocument** object containing the annotations.
  - `GroundTruthPageImages`: Images corresponding to each page.
  - `GroundTruthPictures`: Pictures present in the document.
  - `BinaryDocument`: Image of the form in binary format.
  - `mimetype`: MIME type of the binary document.
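As a quick sanity check on the overview above, the following minimal sketch encodes the listed split sizes and feature names and verifies a record against them. The helper names (`total_samples`, `missing_fields`) and the sample record are hypothetical illustrations, not part of the dataset or of Docling; the actual loading step is deliberately left out because it depends on how you access the data.

```python
# Split sizes and feature names exactly as listed in the dataset overview.
SPLIT_SIZES = {"train": 1043, "validation": 350}
EXPECTED_FIELDS = {
    "docling_version",
    "document_id",
    "GroundTruthDocument",
    "GroundTruthPageImages",
    "GroundTruthPictures",
    "BinaryDocument",
    "mimetype",
}


def total_samples(split_sizes):
    """Sum the per-split sample counts."""
    return sum(split_sizes.values())


def missing_fields(record):
    """Return the expected feature names absent from a record (hypothetical helper)."""
    return sorted(EXPECTED_FIELDS - set(record))


# Hypothetical, incomplete record used only to exercise the check.
example_record = {"document_id": "example-0001", "mimetype": "image/png"}

print(total_samples(SPLIT_SIZES))
print(missing_fields(example_record))
```

A check like this is useful right after loading a split, since a missing column usually points to a version mismatch between the stored data and the code reading it.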
## Source
The original XFUND dataset was introduced in the paper: [XFUND](https://aclanthology.org/2022.findings-acl.253.pdf). It consists of annotated forms in 7 languages: `DE, FR, IT, JA, ZH, PT, ES`.
## Docling Framework
The conversion to the DoclingDocument format was performed using [Docling](https://github.com/DS4SD/docling), which provides a standardized way to represent a document.
Provider:
maas
Created:
2025-03-13



