Teklia/NewsEye-Austrian-line
Source: Hugging Face. Updated 2024-03-14; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/Teklia/NewsEye-Austrian-line
Resource description:
---
license: mit
language:
- de
task_categories:
- image-to-text
pretty_name: NewsEye-Austrian-line
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 51588
  - name: validation
    num_examples: 4379
  dataset_size: 55967
tags:
- atr
- htr
- ocr
- historical
- printed
---
# NewsEye Austrian - line level
## Table of Contents
- [NewsEye Austrian - line level](#newseye-austrian-line-level)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
## Dataset Description
- **Homepage:** [NewsEye project](https://www.newseye.eu/)
- **Source:** [Zenodo](https://zenodo.org/records/3387369)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
The dataset comprises Austrian newspaper pages from the 19th and early 20th centuries. The images were provided by the Austrian National Library.
### Languages
The documents are in Austrian German and are printed in the Fraktur typeface.
Note that all images are resized to a fixed height of 128 pixels.
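To make the fixed-height note concrete, here is a minimal sketch of how the size of a line image could be computed when it is scaled to a height of 128 pixels. It assumes the aspect ratio is preserved, which this card does not state explicitly, and `scaled_size` is a hypothetical helper name used only for illustration.

```python
from typing import Tuple


def scaled_size(width: int, height: int, target_height: int = 128) -> Tuple[int, int]:
    """Return (width, height) after scaling to target_height.

    Assumption: the aspect ratio is kept, so the width scales by the
    same factor as the height. The card only states the final height.
    """
    if width <= 0 or height <= 0 or target_height <= 0:
        raise ValueError("all dimensions must be positive")
    # Scale the width by the same ratio as the height, never below 1 pixel.
    new_width = max(1, round(width * target_height / height))
    return new_width, target_height


# A line crop of 8600x256 pixels would become 4300x128 under this assumption.
print(scaled_size(8600, 256))
```

Because only the height is fixed, image widths vary from line to line, so batching for training typically requires padding to a common width.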
## Dataset Structure
### Data Instances
```
{
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4300x128 at 0x1A800E8E190>,
'text': 'Mann; und als wir uns zum Angriff stark genug'
}
```
### Data Fields
- `image`: a `PIL.Image.Image` object containing the image. When the image column is accessed (for example with `dataset[0]["image"]`), the image file is decoded automatically. Decoding a large number of image files can take a significant amount of time, so index the sample first and the `"image"` column second: `dataset[0]["image"]` should always be preferred over `dataset["image"][0]`.
- `text`: the label transcription of the image.
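The access pattern recommended for the two fields can be sketched as follows. This is an illustration rather than an official loading script: it assumes the Hugging Face `datasets` library is installed, that the repository id is `Teklia/NewsEye-Austrian-line`, and that the split names match the card metadata (`train`, `validation`). `load_split` and `iter_samples` are hypothetical helper names.

```python
from typing import Any, Iterator, Mapping, Sequence, Tuple


def load_split(split: str = "validation"):
    """Download one split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that iter_samples can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset("Teklia/NewsEye-Austrian-line", split=split)


def iter_samples(
    dataset: Sequence[Mapping[str, Any]], limit: int
) -> Iterator[Tuple[Any, str]]:
    """Yield (image, text) pairs, indexing the row before the column."""
    for index in range(min(limit, len(dataset))):
        sample = dataset[index]  # row first: only this sample's image is decoded
        yield sample["image"], sample["text"]


# Usage (requires network access):
# validation = load_split("validation")
# for image, text in iter_samples(validation, limit=3):
#     print(image.size, text)
```

Iterating row by row keeps decoding cost proportional to the number of samples actually visited.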
Provider: Teklia



