long_sparse_unstructured_table

Name: long_sparse_unstructured_table
Creator: maas
Published: 2026-01-06 16:35:40
License: No description available

ModelScope community. Updated: 2026-01-06. Indexed: 2025-06-14.

Download link:

https://modelscope.cn/datasets/nanonets/long_sparse_unstructured_table


Description:

This dataset is generated synthetically to create tables with the following characteristics:

1. The percentage of empty cells lies in the range [40, 70] (sparse).
2. There is no separator between rows and columns (unstructured).
3. 15 <= number of rows <= 30 and 7 <= number of columns <= 15 (long).

### Load the dataset

```python
import io
from io import StringIO

import pandas as pd
from PIL import Image

# Assumption: the original snippet does not show where load_dataset comes
# from; the Hugging Face `datasets` library is assumed here.
from datasets import load_dataset


def bytes_to_image(image_bytes: bytes) -> Image.Image:
    # Decode the raw image bytes stored in the dataset into a PIL image.
    return Image.open(io.BytesIO(image_bytes))


def parse_annotations(annotations: str) -> pd.DataFrame:
    # The ground-truth table is a JSON string in "records" orientation.
    return pd.read_json(StringIO(annotations), orient="records")


test_data = load_dataset("nanonets/long_sparse_unstructured_table", split="test")
data_point = test_data[0]
image, gt_table = (
    bytes_to_image(data_point["images"]),
    parse_annotations(data_point["annotation"]),
)
```
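The three characteristics listed above can be checked on a single sample. The following is a minimal sketch, not part of the original dataset card: `table_stats` is a hypothetical helper that works directly on the raw `annotation` string, assuming it is a records-oriented JSON list as the loading snippet suggests. How empty cells are encoded is also an assumption; here a cell counts as empty if it is `null`, NaN, or a blank string.

```python
import json
import math


def is_empty(value) -> bool:
    # Assumption: empty cells appear as null, NaN, or blank strings.
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def table_stats(annotations: str) -> dict:
    """Summarize a records-oriented JSON table (hypothetical helper)."""
    records = json.loads(annotations)
    # Collect column names in first-seen order across all rows.
    columns = list(dict.fromkeys(key for row in records for key in row))
    total = len(records) * len(columns)
    empty = sum(
        1 for row in records for col in columns if is_empty(row.get(col))
    )
    empty_pct = 100.0 * empty / total if total else 0.0
    return {
        "rows": len(records),
        "columns": len(columns),
        "empty_pct": empty_pct,
        # Ranges taken from the dataset description.
        "within_stated_ranges": (
            15 <= len(records) <= 30
            and 7 <= len(columns) <= 15
            and 40.0 <= empty_pct <= 70.0
        ),
    }
```

With the loading code from the card, this could be called as `table_stats(data_point["annotation"])`. If the header row is counted as a table row in the description, the row bound may need to be shifted by one.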


Provider:

maas

Created:

2025-06-13
