
long_sparse_structured_table

ModelScope community; updated 2026-01-02; indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/nanonets/long_sparse_structured_table
Description:
This dataset is generated synthetically to create tables with the following characteristics:

1. The percentage of empty cells is in the range [40, 70] (Sparse).
2. There is a clear separator between rows and columns (Structured).
3. 15 <= num rows <= 30, 7 <= num columns <= 15 (Long).

### Load the dataset

```python
from io import BytesIO, StringIO

import pandas as pd
from datasets import load_dataset  # assumed: the Hugging Face `datasets` library
from PIL import Image


def bytes_to_image(image_bytes: bytes) -> Image.Image:
    # Decode the raw image bytes into a PIL image.
    return Image.open(BytesIO(image_bytes))


def parse_annotations(annotations: str) -> pd.DataFrame:
    # The annotation is a JSON string in "records" orientation.
    return pd.read_json(StringIO(annotations), orient="records")


test_data = load_dataset('nanonets/long_sparse_structured_table', split='test')
data_point = test_data[0]
image, gt_table = (
    bytes_to_image(data_point["images"]),
    parse_annotations(data_point["annotation"]),
)
```
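The three characteristics listed in the description can be checked against a single annotation string. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `is_empty`, `table_stats`, and `matches_profile` are hypothetical, and it assumes that an empty cell is stored as `null` or as a blank string and that the row count excludes any header row. Neither assumption is confirmed by the dataset card.

```python
import json


def is_empty(value) -> bool:
    # Assumption: empty cells appear as null or as whitespace-only strings.
    return value is None or (isinstance(value, str) and value.strip() == "")


def table_stats(annotation: str) -> dict:
    """Return row count, column count, and empty-cell percentage
    for a table stored as records-oriented JSON."""
    records = json.loads(annotation)

    # Collect column names in first-seen order across all records.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    total_cells = len(records) * len(columns)
    empty_cells = sum(
        1 for record in records for col in columns if is_empty(record.get(col))
    )
    empty_pct = 100.0 * empty_cells / total_cells if total_cells else 0.0
    return {"rows": len(records), "columns": len(columns), "empty_pct": empty_pct}


def matches_profile(stats: dict) -> bool:
    # Ranges taken from the dataset description (Sparse, Long).
    return (
        40 <= stats["empty_pct"] <= 70
        and 15 <= stats["rows"] <= 30
        and 7 <= stats["columns"] <= 15
    )


if __name__ == "__main__":
    # Tiny hand-made table, far smaller than the tables in the dataset.
    sample = json.dumps([{"a": "1", "b": None}, {"a": "", "b": "4"}])
    stats = table_stats(sample)
    print(stats, matches_profile(stats))
```

To apply it to a loaded sample, pass the raw `data_point["annotation"]` string to `table_stats`. The "Structured" property concerns the rendered image and cannot be verified from the annotation alone.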

Provider:
maas
Created:
2025-06-13