long_sparse_unstructured_table
Source: ModelScope community. Updated: 2026-01-06. Indexed: 2025-06-14.
Download link:
https://modelscope.cn/datasets/nanonets/long_sparse_unstructured_table
Description:
This dataset is synthetically generated to produce tables with the following characteristics:
1. The percentage of empty cells falls in the range [40, 70] (sparse).
2. There are no separators between rows or columns (unstructured).
3. 15 <= number of rows <= 30 and 7 <= number of columns <= 15 (long).
### Load the dataset
```python
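import io
from io import StringIO

import pandas as pd
from datasets import load_dataset  # assumption: Hugging Face `datasets`; the original snippet omits this import
from PIL import Image

def bytes_to_image(image_bytes: bytes) -> Image.Image:
    # Decode the raw image bytes stored in the dataset into a PIL image.
    return Image.open(io.BytesIO(image_bytes))

def parse_annotations(annotations: str) -> pd.DataFrame:
    # The annotation is a JSON string of row records; pandas expects a file-like object.
    return pd.read_json(StringIO(annotations), orient="records")

test_data = load_dataset('nanonets/long_sparse_unstructured_table', split='test')
data_point = test_data[0]
image, gt_table = (
    bytes_to_image(data_point["images"]),
    parse_annotations(data_point["annotation"]),
)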
```
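
The three characteristics listed in the description can be checked on a single annotation string. The following is a minimal sketch, not part of the original card: `table_stats`, `is_empty`, and `matches_card` are hypothetical helper names, and the rule for what counts as an "empty" cell (missing, `null`, NaN, or blank string) is an assumption, since the card does not define it. It uses only the standard library and assumes the annotation is a JSON list of row records, as the `orient="records"` argument suggests.

```python
import json
import math


def is_empty(value) -> bool:
    # Assumption: a cell is empty if it is missing/None, NaN, or a blank string.
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def table_stats(annotation: str) -> dict:
    """Return row count, column count, and empty-cell percentage of a records-style JSON table."""
    records = json.loads(annotation)  # list of {column: value} dicts, one per row
    # Collect column names in first-seen order; individual rows may omit keys.
    columns = []
    for row in records:
        for key in row:
            if key not in columns:
                columns.append(key)
    n_rows, n_cols = len(records), len(columns)
    total = n_rows * n_cols
    empty = sum(1 for row in records for col in columns if is_empty(row.get(col)))
    empty_pct = 100.0 * empty / total if total else 0.0
    return {"rows": n_rows, "cols": n_cols, "empty_pct": empty_pct}


def matches_card(stats: dict) -> bool:
    # Ranges taken from the dataset description.
    return (
        15 <= stats["rows"] <= 30
        and 7 <= stats["cols"] <= 15
        and 40 <= stats["empty_pct"] <= 70
    )


# Toy table (made up for illustration): 20 rows x 8 columns, every other cell blank.
toy = json.dumps(
    [{f"c{j}": ("" if (i + j) % 2 else f"v{i}_{j}") for j in range(8)} for i in range(20)]
)
stats = table_stats(toy)
print(stats, matches_card(stats))
```

With a real sample, `table_stats(data_point["annotation"])` would take the place of the toy table. Whether the dataset's own sparsity figure counts header cells or treats other placeholder values as empty is not stated, so small deviations from the [40, 70] range are possible under this definition.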
Provider:
maas
Created:
2025-06-13



