biglam/bnl_ground_truth_newspapers_before_1878
Source: Hugging Face. Updated 2022-08-07; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/biglam/bnl_ground_truth_newspapers_before_1878
Resource description:
---
license: cc0-1.0
---
### Dataset description
- 33,000 transcribed text lines from historical newspapers (before 1878), along with the cropped images of the original scans
- Text-line-based OCR
- 19,000 text lines in Antiqua
- 14,000 text lines in Fraktur
- Transcribed using double-keying (99.95% accuracy)
- Public domain, CC0 (see copyright notice)
- Best suited for training an OCR engine
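When an OCR engine is trained or evaluated against ground truth like this, a common measure is the character error rate (CER): the edit distance between the engine output and the reference transcription, divided by the reference length. The sketch below is only an illustration of that metric; the function names are hypothetical, and the dataset card does not state how its 99.95% accuracy figure was computed.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER of an OCR hypothesis measured against a reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Here `reference` would be the `text` value of a record and `hypothesis` the output of the OCR engine for the matching line image.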
The newspapers used are:
- Le Gratis luxembourgeois (1857-1858)
- Luxemburger Volks-Freund (1869-1876)
- L'Arlequin (1848)
- Courrier du Grand-Duché de Luxembourg (1844-1868)
- L'Avenir (1868-1871)
- Der Wächter an der Sauer (1849-1869)
- Luxemburger Zeitung (1844-1845)
- Luxemburger Zeitung = Journal de Luxembourg (1858-1859)
- Der Volksfreund (1848-1849)
- Cäcilia (1862-1871)
- Kirchlicher Anzeiger für die Diözese Luxemburg (1871-1878)
- L'Indépendance luxembourgeoise (1871-1878)
- Luxemburger Anzeiger (1856)
- L'Union (1860-1871)
- Diekircher Wochenblatt (1837-1848)
- Das Vaterland (1869-1870)
- D'Wäschfra (1868-1878)
- Luxemburger Bauernzeitung (1857)
- Luxemburger Wort (1848-1878)
### URL for this dataset
https://data.bnl.lu/data/historical-newspapers/
### Dataset format
Two JSONL files (`antiqua.jsonl.gz` and `fraktur.jsonl.gz`) with the following fields:
- `font` is either antiqua or fraktur
- `img` is the filename of the associated image for the text
- `text` is the hand-corrected, double-keyed text transcribed from the image
Sample:
```json
{
"font": "fraktur",
"img": "fraktur-000011.png",
"text": "Vidal die Vollmacht für Paris an. Auch"
}
```
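A minimal sketch of reading these files with the Python standard library follows. It assumes the usual JSONL layout of one JSON object per line (the sample above is pretty-printed for readability) and UTF-8 encoding; the function names `read_lines` and `count_fonts` are hypothetical helpers, not part of the dataset.

```python
from __future__ import annotations

import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def read_lines(path: str | Path) -> Iterator[dict]:
    """Yield one record (with `font`, `img`, `text`) per non-empty line."""
    # Text mode decompresses on the fly; UTF-8 is an assumption.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:
                yield json.loads(raw)


def count_fonts(paths: list[str | Path]) -> Counter:
    """Count records per value of the `font` field across several files."""
    counts: Counter = Counter()
    for path in paths:
        for record in read_lines(path):
            counts[record["font"]] += 1
    return counts
```

Calling `count_fonts(["antiqua.jsonl.gz", "fraktur.jsonl.gz"])` on local copies of the files would give a quick check against the line counts stated in the dataset description.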
In addition, there are two `.zip` files containing the associated images.
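The `img` field of a record can be used to pull the matching image out of an archive. The sketch below makes two assumptions that the card does not confirm: that image filenames are unique within an archive, and that they may sit inside subfolders (hence the lookup by base filename). The helper names are hypothetical.

```python
from __future__ import annotations

import zipfile
from pathlib import PurePosixPath


def index_images(archive: zipfile.ZipFile) -> dict[str, str]:
    """Map each member's base filename to its full path inside the archive."""
    index: dict[str, str] = {}
    for name in archive.namelist():
        if not name.endswith("/"):  # skip directory entries
            index[PurePosixPath(name).name] = name
    return index


def read_image_bytes(
    archive: zipfile.ZipFile, index: dict[str, str], img: str
) -> bytes:
    """Return the raw bytes of the image named by a record's `img` field."""
    # Raises KeyError if the archive has no member with that filename.
    return archive.read(index[img])
```

Building the index once and reusing it avoids rescanning the archive listing for every record; the returned bytes can then be handed to whatever image library the OCR pipeline uses.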
### Dataset modality
Text and associated images from scans
### Dataset licence
Creative Commons Public Domain Dedication and Certification
### Size of dataset
500MB-2GB
### Contact details for data custodian
opendata@bnl.etat.lu
Provider:
biglam



