Jonnob/the-spiritualist-enriched
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Jonnob/the-spiritualist-enriched
---
language:
- en
license: cc-by-4.0
task_categories:
- image-to-text
tags:
- ocr
- historical-newspapers
- layout-analysis
- spiritualism
- 19th-century
pretty_name: The Spiritualist Enriched
---
# The Spiritualist Enriched
Ground truth OCR annotations for *The Spiritualist*, a 19th-century British newspaper. This is an enriched version of the original ground truth derived from Transkribus.
It is additionally manually labelled with classes, column types, reading order and [Semantic Structural Units (SSUs)](https://arxiv.org/abs/2603.12718).
The page images and original ground truth are available from the companion dataset:
[NationalLibraryOfScotland/Spiritualist_Newspaper](https://huggingface.co/datasets/NationalLibraryOfScotland/Spiritualist_Newspaper)
## Contents
| File | Description |
|------|-------------|
| `annotations/gt_ssu_bboxes.parquet` | 425 SSU regions across 49 pages: bounding boxes, polygon points, SSU label, ground truth text |
| `annotations/characters_inferred.parquet` | ~827k character-level annotations with per-character bounding boxes and SSU label |
| `alto_xml/ocr_gt_labelled.zip` | Raw ALTO XML files (ALTO v4, as exported from Transkribus) |
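As a minimal sketch of how the files listed above might be fetched, the snippet below builds a download URL from the repository ID and a file path, then reads a Parquet file with pandas. The `resolve/main` URL pattern, the default `main` revision, and the helper names `file_url` and `load_table` are assumptions made for illustration; they are not part of the dataset. It also assumes pandas and a Parquet engine such as pyarrow are installed.

```python
REPO_ID = "Jonnob/the-spiritualist-enriched"

# File paths as listed in the Contents table; the short keys are hypothetical.
FILES = {
    "ssu": "annotations/gt_ssu_bboxes.parquet",
    "chars": "annotations/characters_inferred.parquet",
    "alto": "alto_xml/ocr_gt_labelled.zip",
}


def file_url(key, revision="main"):
    """Build a Hub download URL for one file (assumes the 'main' branch by default)."""
    return f"https://huggingface.co/datasets/{REPO_ID}/resolve/{revision}/{FILES[key]}"


def load_table(key):
    """Read one of the Parquet files into a DataFrame (needs network access)."""
    if not FILES[key].endswith(".parquet"):
        raise ValueError(f"{FILES[key]} is not a Parquet file")
    import pandas as pd  # imported lazily so file_url works without pandas

    return pd.read_parquet(file_url(key))
```

Under these assumptions, `load_table("ssu")` would return the SSU region table and `load_table("chars")` the character-level table; the ALTO XML archive has to be downloaded and unzipped separately.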
## Schema — `gt_ssu_bboxes.parquet`
| Column | Type | Description |
|--------|------|-------------|
| `filename` | string | Image filename (matches the companion image dataset) |
| `page_id` | string | Page identifier |
| `image_width` | int | Page image width in pixels |
| `image_height` | int | Page image height in pixels |
| `x` | int | Bounding box left edge |
| `y` | int | Bounding box top edge |
| `width` | int | Bounding box width |
| `height` | int | Bounding box height |
| `polygon_points` | string | Space-separated polygon vertices (`x,y` pairs) |
| `ssu_id` | string | Semantic Structural Unit label (e.g. `ssu_masthead`, `ssu_1_col_1`) |
| `gt_text` | string | Ground truth transcription for the region |
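The `polygon_points` column stores vertices as a single string, so it usually needs parsing before use. The sketch below splits the string into `(x, y)` tuples and derives the enclosing box in the same `x`, `y`, `width`, `height` form as the table. The function names and the sample row are hypothetical, and whether the stored bounding box always equals the polygon's envelope is an assumption that should be checked against the data.

```python
def parse_polygon(points):
    """Parse 'x1,y1 x2,y2 ...' into a list of (x, y) float tuples."""
    vertices = []
    for pair in points.split():
        x_str, y_str = pair.split(",")
        vertices.append((float(x_str), float(y_str)))
    return vertices


def polygon_bbox(vertices):
    """Return (x, y, width, height) of the axis-aligned box enclosing the vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


# Synthetic row for illustration only; these values are not taken from the dataset.
row = {
    "x": 10, "y": 20, "width": 30, "height": 40,
    "polygon_points": "10,20 40,20 40,60 10,60",
}
vertices = parse_polygon(row["polygon_points"])
bbox = polygon_bbox(vertices)

# Compare the derived box with the stored columns.
matches = bbox == (row["x"], row["y"], row["width"], row["height"])
```

Parsing to floats is a defensive choice: the box columns are integers, but the card does not state whether polygon coordinates are always whole numbers.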
## Schema — `characters_inferred.parquet`
| Column | Type | Description |
|--------|------|-------------|
| `char_id` | string | Unique character identifier |
| `page_id` | string | Page identifier |
| `char_text` | string | Character |
| `x` | float | Character bounding box left edge |
| `y` | float | Character bounding box top edge |
| `w` | float | Character bounding box width |
| `h` | float | Character bounding box height |
| `ssu_id` | string | SSU label for the containing region |
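Because each character row carries both `page_id` and `ssu_id`, character boxes can be aggregated back to the region level. The sketch below computes the extent covered by the characters of each SSU on each page. It assumes `x`, `y` mark the top-left corner and that `w`, `h` extend right and down in page-image pixels; the function name and the sample rows are hypothetical.

```python
def ssu_extents(char_rows):
    """Return {(page_id, ssu_id): (x_min, y_min, x_max, y_max)} from character rows."""
    extents = {}
    for row in char_rows:
        key = (row["page_id"], row["ssu_id"])
        x0, y0 = row["x"], row["y"]
        x1, y1 = x0 + row["w"], y0 + row["h"]
        if key in extents:
            # Grow the existing extent to include this character's box.
            ex0, ey0, ex1, ey1 = extents[key]
            extents[key] = (min(ex0, x0), min(ey0, y0), max(ex1, x1), max(ey1, y1))
        else:
            extents[key] = (x0, y0, x1, y1)
    return extents


# Synthetic rows for illustration only; these values are not taken from the dataset.
rows = [
    {"char_id": "c1", "page_id": "p1", "char_text": "T",
     "x": 10.0, "y": 5.0, "w": 4.0, "h": 8.0, "ssu_id": "ssu_masthead"},
    {"char_id": "c2", "page_id": "p1", "char_text": "h",
     "x": 14.0, "y": 6.0, "w": 3.0, "h": 7.0, "ssu_id": "ssu_masthead"},
    {"char_id": "c3", "page_id": "p1", "char_text": "A",
     "x": 2.0, "y": 30.0, "w": 5.0, "h": 6.0, "ssu_id": "ssu_1_col_1"},
]
extents = ssu_extents(rows)
```

With a pandas DataFrame, `df.to_dict("records")` yields rows in this shape, although a grouped aggregation is likely to be faster at the scale of roughly 827k characters.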
## Citation
If you use this dataset, please cite:
```
@misc{bourne2026charactererrorvector,
title={The Character Error Vector: Decomposable errors for page-level OCR evaluation},
author={Jonathan Bourne and Mwiza Simbeye and Joseph Nockels},
year={2026},
eprint={2604.06160},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://doi.org/10.48550/arXiv.2604.06160}
}
```
Provided by:
Jonnob



