NealCaren/newspaper-ocr-gold
Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NealCaren/newspaper-ocr-gold
---
license: cc-by-4.0
task_categories:
- image-to-text
tags:
- ocr
- historical-newspapers
- fine-tuning
language:
- en
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: transcription
    dtype: string
  - name: resolution
    dtype: string
  - name: scale
    dtype: float64
  - name: width
    dtype: int64
  - name: height
    dtype: int64
  - name: page_id
    dtype: string
  - name: line_id
    dtype: string
  - name: confidence
    dtype: float64
  - name: flag
    dtype: string
  splits:
  - name: train
    num_bytes: 717769970
    num_examples: 51572
  - name: val
    num_bytes: 62524729
    num_examples: 4487
  - name: test
    num_bytes: 98808848
    num_examples: 6554
  download_size: 871517347
  dataset_size: 879103547
---
# newspaper-ocr-gold
Gold-standard OCR training data for historical newspaper scans.
## Contents
- 13,371 line-level transcriptions verified by Qwen3-VL 235B
- Line crop PNG images from 100 newspaper pages
- 73 unique titles spanning 1840s-2010s
- Train/val/test split by page (80/10/10)
## Splits
| Split | Pages | Lines |
|-------|-------|-------|
| train | 80 | 11,044 |
| val | 10 | 1,111 |
| test | 10 | 1,216 |
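Splitting by page means every line from a given page lands in exactly one split, so layout and typography from a single scan cannot leak between train and test. The following is a minimal sketch of how such a split could be produced. The function name `split_by_page`, the fixed seed, and the shuffle-then-slice strategy are illustrative assumptions, not the procedure used to build this dataset.

```python
import random
from collections import defaultdict


def split_by_page(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole pages to train/val/test so no page spans two splits.

    `records` is a list of dicts with a "page_id" key. The ratios, seed,
    and shuffle are assumptions chosen for illustration only.
    """
    # Deduplicate and sort page IDs so the shuffle is reproducible.
    pages = sorted({r["page_id"] for r in records})
    random.Random(seed).shuffle(pages)

    n_train = round(len(pages) * ratios[0])
    n_val = round(len(pages) * ratios[1])

    # Map each page to a split by its position in the shuffled order.
    assignment = {}
    for i, page in enumerate(pages):
        if i < n_train:
            assignment[page] = "train"
        elif i < n_train + n_val:
            assignment[page] = "val"
        else:
            assignment[page] = "test"

    # Route every line record to the split of its page.
    splits = defaultdict(list)
    for r in records:
        splits[assignment[r["page_id"]]].append(r)
    return dict(splits)
```

Because pages differ in length, the line counts per split will only approximate 80/10/10, which is consistent with the table above.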
## Files
- `verified_lines.jsonl` — full metadata (split, page_id, line_id, crop_path, transcription, confidence, flag)
- `sample_metadata.json` — page sampling details (73 titles, decade distribution)
- `train_images.tar.gz`, `val_images.tar.gz`, `test_images.tar.gz` — line crop PNGs organized as `{split}/{page_id}/lines/line_NNNN.png`
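As a quick sanity check on the metadata file, the records can be grouped by split to count pages and lines. This sketch assumes each JSONL record carries the `split` and `page_id` fields listed above. The helper names `read_jsonl` and `summarize_splits` are hypothetical and not part of the dataset.

```python
import json
from collections import defaultdict


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize_splits(records):
    """Return {split: (page_count, line_count)} for metadata records."""
    pages = defaultdict(set)
    counts = defaultdict(int)
    for rec in records:
        pages[rec["split"]].add(rec["page_id"])  # unique pages per split
        counts[rec["split"]] += 1                # one record per line
    return {split: (len(pages[split]), counts[split]) for split in counts}
```

An open file handle can be passed straight to `read_jsonl`, since iterating over a text file yields one line at a time. If the file matches the description, the result should agree with the Splits table.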
## Usage
```python
import json
import tarfile

from huggingface_hub import hf_hub_download

REPO_ID = "NealCaren/newspaper-ocr-gold"

# Download the verified labels (one JSON object per line)
path = hf_hub_download(REPO_ID, "verified_lines.jsonl", repo_type="dataset")
with open(path, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]

# Download and extract the line-crop images for each split
for split in ["train", "val", "test"]:
    tar = hf_hub_download(REPO_ID, f"{split}_images.tar.gz", repo_type="dataset")
    with tarfile.open(tar) as t:
        t.extractall("./gold_data/")
```
## Quality
- 49% clean, 47% partial (line cut off at a word boundary), 3% degraded
- Mean confidence: 0.95
- Verified by Qwen3-VL 235B via OpenRouter (blind transcription, no OCR input)
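For fine-tuning, it may be useful to keep only the most reliable lines. The sketch below filters records on the `flag` and `confidence` fields. The flag strings `"clean"`, `"partial"`, and `"degraded"` are inferred from the categories above and are an assumption, as are the 0.9 threshold and the helper names `select_lines` and `flag_shares`. Inspect the actual values in the metadata before relying on them.

```python
def select_lines(records, allowed_flags=("clean",), min_confidence=0.9):
    """Keep records whose flag is allowed and whose confidence is high enough.

    The default flag value and threshold are illustrative assumptions.
    """
    return [
        r for r in records
        if r["flag"] in allowed_flags and r["confidence"] >= min_confidence
    ]


def flag_shares(records):
    """Return the fraction of records carrying each flag value."""
    total = len(records)
    counts = {}
    for r in records:
        counts[r["flag"]] = counts.get(r["flag"], 0) + 1
    return {flag: n / total for flag, n in counts.items()}
```

Running `flag_shares` over the loaded metadata gives a direct way to compare the observed flag distribution with the percentages reported here.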
Provider: NealCaren



