omoured/line-graphics-dataset

Name: omoured/line-graphics-dataset
Creator: omoured
Published: 2023-11-03 09:50:24
License: CC-BY 4.0

Hugging Face | Updated 2023-11-03 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/omoured/line-graphics-dataset


Resource description:

---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: image_name
    dtype: string
  - name: width
    dtype: int64
  - name: height
    dtype: int64
  - name: instances
    list:
    - name: category_id
      dtype: int64
    - name: mask
      sequence:
        sequence: float64
  splits:
  - name: train
    num_bytes: 8927542.0
    num_examples: 200
  - name: validation
    num_bytes: 4722935.0
    num_examples: 100
  - name: test
    num_bytes: 3984722.0
    num_examples: 100
  download_size: 16709320
  dataset_size: 17635199.0
---

# Line Graphics (LG) dataset

This is the official page for the LG dataset, as featured in our paper [Line Graphics Digitization: A Step Towards Full Automation](https://link.springer.com/chapter/10.1007/978-3-031-41734-4_27).

By [Omar Moured](https://www.linkedin.com/in/omar-moured/) et al.

## Dataset Summary

The dataset includes instance segmentation masks for **400 real line chart images, manually labeled into 11 categories** by professionals. These images were collected from 5 different professions to enhance diversity. In our paper, we studied two levels of segmentation: **coarse-level**, where we segmented (spines, axis-labels, legend, lines, titles), and **fine-level**, where we further segmented each category into x and y subclasses (except for legend and lines), and individually segmented each line.

## Category ID Reference

```python
class_id_mapping = {
    "Label": 0,
    "Legend": 1,
    "Line": 2,
    "Spine": 3,
    "Title": 4,
    "ptitle": 5,
    "xlabel": 6,
    "xspine": 7,
    "xtitle": 8,
    "ylabel": 9,
    "yspine": 10,
    "ytitle": 11
}
```

## Dataset structure (train, validation, test)

- **image** - contains the PIL image of the chart
- **image_name** - image name with PNG extension
- **width** - original image width
- **height** - original image height
- **instances** - contains **n** labeled instances; each instance dictionary has {category_id, mask}. **The mask annotations are in COCO format**.

## Sample Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("omoured/line-graphics-dataset")

# Access the training split
train_dataset = dataset["train"]

# Print sample data
print(train_dataset[0])
```

You can render the masks using the `pycocotools` library as follows:

```python
from pycocotools import mask

# Read the polygons and the image size from the same sample
sample = dataset["train"][0]
polygon_coords = sample["instances"][1]["mask"]
image_width = sample["width"]
image_height = sample["height"]

# Convert the COCO polygons to RLE, then decode them into a binary array
mask_binary = mask.frPyObjects(polygon_coords, image_height, image_width)
segmentation_mask = mask.decode(mask_binary)
```

## Copyrights

This dataset is published under the CC-BY 4.0 license, which permits use, sharing, and adaptation provided the dataset is cited.

## Citation

```bibtex
@inproceedings{moured2023line,
  title={Line Graphics Digitization: A Step Towards Full Automation},
  author={Moured, Omar and Zhang, Jiaming and Roitberg, Alina and Schwarz, Thorsten and Stiefelhagen, Rainer},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={438--453},
  year={2023},
  organization={Springer}
}
```

## Contact

If you have any questions or need further assistance with this dataset, please feel free to contact us:

- **Omar Moured**, omar.moured@kit.edu

Provider:

omoured

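Summary of original information

Line Graphics (LG) dataset

Dataset overview

The dataset contains instance segmentation masks for 400 real line chart images, manually labeled into 11 categories by professionals. The images were collected from 5 different professions to enhance diversity. The paper studies two levels of segmentation: a coarse level (spines, axis labels, legend, lines, titles) and a fine level, in which each category is further split into x and y subclasses (except for legend and lines) and each line is segmented individually.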

Category ID reference

```python
class_id_mapping = {
    "Label": 0,
    "Legend": 1,
    "Line": 2,
    "Spine": 3,
    "Title": 4,
    "ptitle": 5,
    "xlabel": 6,
    "xspine": 7,
    "xtitle": 8,
    "ylabel": 9,
    "yspine": 10,
    "ytitle": 11
}
```
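The mapping goes from name to ID, while each instance stores only the numeric `category_id`, so a reverse lookup is often handy. The following is a minimal sketch of one. The split into coarse and fine names by capitalization is an assumption read off the mapping itself, not something the dataset card states, and `id_to_name`, `coarse_names`, and `fine_names` are illustrative names.

```python
# Name -> ID mapping, copied from the category reference.
class_id_mapping = {
    "Label": 0, "Legend": 1, "Line": 2, "Spine": 3, "Title": 4,
    "ptitle": 5, "xlabel": 6, "xspine": 7, "xtitle": 8,
    "ylabel": 9, "yspine": 10, "ytitle": 11,
}

# Reverse lookup: ID -> name, for decoding an instance's category_id.
id_to_name = {class_id: name for name, class_id in class_id_mapping.items()}

# Assumption: capitalized names are the coarse-level classes and
# lowercase names are the fine-level x/y subclasses.
coarse_names = sorted(n for n in class_id_mapping if n[0].isupper())
fine_names = sorted(n for n in class_id_mapping if n[0].islower())

print(id_to_name[2])
print(coarse_names)
print(fine_names)
```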

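Dataset structure (train, validation, test)

image - contains the PIL image of the chart
image_name - image name with PNG extension
width - original image width
height - original image height
instances - contains n labeled instances; each instance dictionary has {category_id, mask}. The mask annotations are in COCO format.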
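To illustrate how the fields above fit together, here is a minimal sketch that counts the labeled instances per category in one sample. `fake_sample` is a hypothetical, hand-written record that only mimics the schema (the per-instance keys `category_id` and `mask` follow the dataset's feature list); its values are made up and are not real dataset content. `count_categories` is likewise an illustrative helper.

```python
from collections import Counter

# Hypothetical record mimicking the schema; the values are made up.
fake_sample = {
    "image_name": "example.png",
    "width": 640,
    "height": 480,
    "instances": [
        {"category_id": 2, "mask": [[10.0, 10.0, 200.0, 10.0, 200.0, 20.0]]},
        {"category_id": 2, "mask": [[10.0, 50.0, 200.0, 50.0, 200.0, 60.0]]},
        {"category_id": 1, "mask": [[500.0, 20.0, 600.0, 20.0, 600.0, 80.0, 500.0, 80.0]]},
    ],
}


def count_categories(sample):
    """Count how many labeled instances each category_id has in one sample."""
    return Counter(inst["category_id"] for inst in sample["instances"])


counts = count_categories(fake_sample)
print(dict(counts))
```

With the real dataset loaded, the same function should accept a sample such as `dataset["train"][0]` in place of `fake_sample`.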

Sample usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("omoured/line-graphics-dataset")

# Access the training split
train_dataset = dataset["train"]

# Print sample data
print(train_dataset[0])
```

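The masks can be rendered with the pycocotools library as follows:

```python
from pycocotools import mask

# Read the polygons and the image size from the same sample
sample = dataset["train"][0]
polygon_coords = sample["instances"][1]["mask"]
image_width = sample["width"]
image_height = sample["height"]

# Convert the COCO polygons to RLE, then decode them into a binary array
mask_binary = mask.frPyObjects(polygon_coords, image_height, image_width)
segmentation_mask = mask.decode(mask_binary)
```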
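For quick sanity checks that do not need pycocotools, the polygon coordinates can also be inspected directly. The sketch below assumes each inner list of a `mask` is a flat COCO-style polygon `[x1, y1, x2, y2, ...]`. The helper names are hypothetical and the rectangle is made-up test data, not a dataset sample.

```python
def polygon_points(flat_coords):
    """Turn a flat [x1, y1, x2, y2, ...] list into [(x1, y1), (x2, y2), ...]."""
    if len(flat_coords) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    return list(zip(flat_coords[0::2], flat_coords[1::2]))


def polygon_bbox(flat_coords):
    """Bounding box in COCO order: (x_min, y_min, width, height)."""
    xs, ys = zip(*polygon_points(flat_coords))
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def polygon_area(flat_coords):
    """Polygon area via the shoelace formula."""
    pts = polygon_points(flat_coords)
    twice_area = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Made-up axis-aligned rectangle, 100 wide and 50 tall.
rect = [10.0, 20.0, 110.0, 20.0, 110.0, 70.0, 10.0, 70.0]
print(polygon_bbox(rect))
print(polygon_area(rect))
```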

Copyright

This dataset is published under the CC-BY 4.0 license, which permits use, sharing, and adaptation provided the dataset is cited.

Citation

```bibtex
@inproceedings{moured2023line,
  title={Line Graphics Digitization: A Step Towards Full Automation},
  author={Moured, Omar and Zhang, Jiaming and Roitberg, Alina and Schwarz, Thorsten and Stiefelhagen, Rainer},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={438--453},
  year={2023},
  organization={Springer}
}
```
