OCRFlux-pubtabnet-single
Source: ModelScope community. Updated: 2026-01-06. Indexed: 2025-07-05.
Download link:
https://modelscope.cn/datasets/ChatDOC/OCRFlux-pubtabnet-single
Resource overview:
# OCRFlux-pubtabnet-single
OCRFlux-pubtabnet-single is a benchmark of 9064 table images paired with their ground-truth HTML. It is derived from the public [PubTabNet](https://github.com/ibm-aur-nlp/PubTabNet) benchmark, with some format transformations applied.
This dataset can be used to measure the performance of OCR systems in single-page table parsing.
Quick links:
- 🤗 [Model](https://huggingface.co/ChatDOC/OCRFlux-3B)
- 🛠️ [Code](https://github.com/chatdoc-com/OCRFlux)
## Data Mix
## Table 1: Tables breakdown by complexity (whether they contain rowspan or colspan cells)
| Complexity | Number |
|--------|-------------|
| Simple | 4623 |
| Complex | 4441 |
| **Total** | **9064** |
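
The split above depends only on whether a table contains spanning cells, so it can be reproduced from the HTML alone. The following is a minimal sketch, not the authors' labelling code: the helper name `classify_table` is hypothetical, and treating `rowspan="1"` or `colspan="1"` as "no span" is an assumption.

```python
from html.parser import HTMLParser


class _SpanDetector(HTMLParser):
    """Flags any table cell whose rowspan or colspan differs from 1."""

    def __init__(self):
        super().__init__()
        self.has_span = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            # Assumption: an explicit span of "1" does not make a table complex.
            if name in ("rowspan", "colspan") and (value or "1").strip() != "1":
                self.has_span = True


def classify_table(html: str) -> str:
    """Return "complex" if any cell spans rows or columns, else "simple"."""
    detector = _SpanDetector()
    detector.feed(html)
    detector.close()
    return "complex" if detector.has_span else "simple"


# The two categories in Table 1 should add up to the reported total.
assert 4623 + 4441 == 9064
```

Running `classify_table` over every `gt_table` string and comparing the result with the `type` field is one way to sanity-check a local copy of the data.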
## Data Format
Each row in the dataset pairs one table image with its ground-truth HTML.
Unlike the original PubTabNet dataset, this dataset does not distinguish between cells in the table header and cells in the table body: there are no `<thead>` or `<tbody>` tags, and every `<th>` tag is replaced by a `<td>` tag.
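
To compare a model's output against this ground truth, predictions that still contain header markup may need the same flattening. The sketch below illustrates that transformation with regular expressions. It is an assumption about how the conversion could be done, not the authors' actual script, and `flatten_header_body` is a hypothetical name.

```python
import re


def flatten_header_body(html: str) -> str:
    """Remove <thead>/<tbody> wrappers and turn every <th> into <td>."""
    # Drop opening and closing <thead>/<tbody> tags but keep the rows inside.
    html = re.sub(r"</?(?:thead|tbody)\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Rewrite <th ...> and </th> as <td ...> and </td>, preserving attributes.
    html = re.sub(r"<(/?)th\b", r"<\1td", html, flags=re.IGNORECASE)
    return html


pubtabnet_style = (
    "<table><thead><tr><th colspan=\"2\">H</th></tr></thead>"
    "<tbody><tr><td>a</td><td>b</td></tr></tbody></table>"
)
print(flatten_header_body(pubtabnet_style))
```

A regex is enough here because only tag names change. A full HTML parser would be a safer choice for malformed or deeply nested input.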
### Features:
```python
{
'image_name': string, # Name of the table image
'type': string, # "simple" or "complex"
'gt_table': string, # Ground-truth HTML of the table
}
```
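
The schema above can be checked on each row before evaluation. This is a minimal sketch under stated assumptions: `TableRecord` and `validate_record` are hypothetical names, the sample row is invented for illustration, and the header-tag check simply encodes the format rule described in this section.

```python
import re
from typing import TypedDict


class TableRecord(TypedDict):
    image_name: str  # Name of the table image
    type: str        # "simple" or "complex"
    gt_table: str    # Ground-truth HTML of the table


def validate_record(record: dict) -> TableRecord:
    """Check field types, the allowed `type` values, and the absence of header tags."""
    for key in ("image_name", "type", "gt_table"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"field {key!r} must be a string")
    if record["type"] not in ("simple", "complex"):
        raise ValueError(f"unexpected type: {record['type']!r}")
    # The dataset states that <thead>, <tbody> and <th> never appear.
    if re.search(r"<(?:thead|tbody|th)\b", record["gt_table"], flags=re.IGNORECASE):
        raise ValueError("gt_table contains <thead>, <tbody> or <th>")
    return TableRecord(
        image_name=record["image_name"],
        type=record["type"],
        gt_table=record["gt_table"],
    )


# Invented sample row, for illustration only.
sample = {
    "image_name": "example.png",
    "type": "simple",
    "gt_table": "<table><tr><td>a</td><td>b</td></tr></table>",
}
record = validate_record(sample)
```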
## License
This dataset is licensed under Apache-2.0.
Provider:
maas
Created:
2025-07-01



