
Jonnob/the-spiritualist-enriched

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Jonnob/the-spiritualist-enriched
Description:
---
language:
- en
license: cc-by-4.0
task_categories:
- image-to-text
tags:
- ocr
- historical-newspapers
- layout-analysis
- spiritualism
- 19th-century
pretty_name: The Spiritualist Enriched
---

# The Spiritualist Enriched

Ground truth OCR annotations for *The Spiritualist*, a 19th-century British newspaper. This is an enriched version of the original ground truth derived from Transkribus, additionally labelled by hand with classes, column types, reading order and [Semantic Structural Units (SSUs)](https://arxiv.org/abs/2603.12718).

The page images and original ground truth are available from the companion dataset: [NationalLibraryOfScotland/Spiritualist_Newspaper](https://huggingface.co/datasets/NationalLibraryOfScotland/Spiritualist_Newspaper)

## Contents

| File | Description |
|------|-------------|
| `annotations/gt_ssu_bboxes.parquet` | 425 SSU regions across 49 pages: bounding boxes, polygon points, SSU label, ground truth text |
| `annotations/characters_inferred.parquet` | ~827k character-level annotations with per-character bounding boxes and SSU label |
| `alto_xml/ocr_gt_labelled.zip` | Raw ALTO XML files (ALTO v4, as exported from Transkribus) |

## Schema: `gt_ssu_bboxes.parquet`

| Column | Type | Description |
|--------|------|-------------|
| `filename` | string | Image filename (matches the companion image dataset) |
| `page_id` | string | Page identifier |
| `image_width` | int | Page image width in pixels |
| `image_height` | int | Page image height in pixels |
| `x` | int | Bounding box left edge |
| `y` | int | Bounding box top edge |
| `width` | int | Bounding box width |
| `height` | int | Bounding box height |
| `polygon_points` | string | Space-separated polygon vertices (`x,y` pairs) |
| `ssu_id` | string | Semantic Structural Unit label (e.g. `ssu_masthead`, `ssu_1_col_1`) |
| `gt_text` | string | Ground truth transcription for the region |

## Schema: `characters_inferred.parquet`

| Column | Type | Description |
|--------|------|-------------|
| `char_id` | string | Unique character identifier |
| `page_id` | string | Page identifier |
| `char_text` | string | Character |
| `x` | float | Character bounding box left edge |
| `y` | float | Character bounding box top edge |
| `w` | float | Character bounding box width |
| `h` | float | Character bounding box height |
| `ssu_id` | string | SSU label for the containing region |

## Citation

If you use this dataset, please cite:

```
@misc{bourne2026charactererrorvector,
  title={The Character Error Vector: Decomposable errors for page-level OCR evaluation},
  author={Jonathan Bourne and Mwiza Simbeye and Joseph Nockels},
  year={2026},
  eprint={2604.06160},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://doi.org/10.48550/arXiv.2604.06160}
}
```
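## Example: reading region geometry

As a rough illustration of how the `x`, `y`, `width`, `height` and `polygon_points` columns of `gt_ssu_bboxes.parquet` might be consumed, the sketch below parses a polygon string into vertex pairs and converts a bounding box to corner form. The helper names (`parse_polygon`, `bbox_corners`, `polygon_bounds`) are hypothetical, the example row is invented rather than taken from the dataset, and parsing coordinates as floats is an assumption, since the card only states that vertices are `x,y` pairs separated by spaces.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def parse_polygon(polygon_points: str) -> List[Point]:
    """Turn 'x1,y1 x2,y2 ...' into [(x1, y1), (x2, y2), ...]."""
    points = []
    for pair in polygon_points.split():
        x_str, y_str = pair.split(",")
        # Assumption: coordinates are numeric; floats cover int values too.
        points.append((float(x_str), float(y_str)))
    return points


def bbox_corners(x: float, y: float, width: float, height: float) -> Box:
    """Convert (left, top, width, height) to (left, top, right, bottom)."""
    return (x, y, x + width, y + height)


def polygon_bounds(points: List[Point]) -> Box:
    """Axis-aligned bounds (left, top, right, bottom) of a vertex list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical row: the values are invented for illustration only.
example_row = {
    "ssu_id": "ssu_masthead",
    "x": 100,
    "y": 50,
    "width": 400,
    "height": 120,
    "polygon_points": "100,50 500,50 500,170 100,170",
}

polygon = parse_polygon(example_row["polygon_points"])
box = bbox_corners(
    example_row["x"],
    example_row["y"],
    example_row["width"],
    example_row["height"],
)

# For this invented rectangle the polygon bounds coincide with the box;
# whether that holds for real rows is not stated in the dataset card.
print(polygon_bounds(polygon) == box)
```

With real data, rows like `example_row` could be obtained by reading the file with `pandas.read_parquet("annotations/gt_ssu_bboxes.parquet")`, which needs a Parquet engine such as `pyarrow` installed, and iterating over the resulting records.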

Provided by:
Jonnob