Dataset for Single Character Detection in Dongba Manuscripts

Name: Dataset for Single Character Detection in Dongba Manuscripts
Creator: figshare
Published: 2025-07-02 07:14:11
License: No description available

DataCite Commons. Updated: 2025-07-02. Indexed: 2025-09-07.

Download link:

https://springernature.figshare.com/articles/dataset/Dataset_for_Single_Character_Detection_in_Dongba_Manuscripts/26969755

Description:

Dataset for Single Character Detection in Dongba Manuscripts (Dongba1800). It includes 1,800 curated JPEG image files and 1,800 text annotation files in TXT format. Files follow a consistent naming scheme so that each image can be matched to its annotation: JPEG images are named 'image_*.jpg' (e.g., 'image_1.jpg'), and TXT files are named 'gt_image_*.txt' (e.g., 'gt_image_1.txt'). The TXT files annotate a verified total of 111,702 Dongba characters. Each character's position is given by a series of coordinate pairs that define the polygonal boundary of its text box. For example, the sequence "161, 59, 202, 57, 256, 85, 239, 154, 182, 147, 163, 107" lists the vertices of a six-sided polygon, with each pair such as "161, 59" giving the x and y pixel coordinates of one vertex. Coordinates are typically listed in clockwise order so that they trace the full contour of the polygon. The annotation files use "###" as a delimiter marking the end of each record. All data are stored in standard formats so that researchers can use them directly to train and test machine learning models. With these data records and format specifications, the Dongba1800 dataset supports the preservation and study of Dongba script and related cultural heritage, and offers a resource for technical work in related fields.
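The annotation layout described above can be read with a few lines of Python. The following is only a minimal sketch, not the dataset authors' loader. It assumes that coordinates are integers, that each record is a comma-separated run of x, y values, and that "###" follows each record; the exact line layout of the TXT files is not stated in the description. The function names (parse_annotation, signed_area, annotation_name) are hypothetical and introduced here for illustration.

```python
import re

RECORD_DELIMITER = "###"  # end-of-record marker named in the description


def parse_annotation(text):
    """Return a list of polygons, each a list of (x, y) integer vertices.

    Assumption: every record is a run of integer coordinates that ends
    with the "###" delimiter; anything else in a record is ignored.
    """
    polygons = []
    for chunk in text.split(RECORD_DELIMITER):
        values = [int(v) for v in re.findall(r"-?\d+", chunk)]
        # Skip empty chunks and records that cannot form a polygon
        # (fewer than three vertices, or an odd number of values).
        if len(values) < 6 or len(values) % 2:
            continue
        polygons.append(list(zip(values[0::2], values[1::2])))
    return polygons


def signed_area(polygon):
    """Shoelace formula. In image coordinates (y grows downward) a
    positive result means the vertices run clockwise on screen."""
    total = 0
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2


def annotation_name(image_name):
    """Map 'image_1.jpg' to 'gt_image_1.txt' per the naming scheme."""
    return "gt_" + image_name.rsplit(".", 1)[0] + ".txt"


# The example sequence from the description, with an assumed trailing "###".
sample = "161, 59, 202, 57, 256, 85, 239, 154, 182, 147, 163, 107, ###\n"
polygons = parse_annotation(sample)
```

Checking the sign of signed_area on each parsed polygon is one way to confirm the "typically clockwise" ordering before feeding the polygons to a detection pipeline that expects a fixed vertex order.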

Provider:

figshare

Created:

2024-09-09
