Teklia/NorHand-v1-line
Hugging Face | Updated 2024-03-14 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/Teklia/NorHand-v1-line
Resource description:
---
license: mit
language:
- nb
task_categories:
- image-to-text
pretty_name: NorHand-v1-line
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 19653
  - name: validation
    num_examples: 2286
  - name: test
    num_examples: 1793
  dataset_size: 23732
tags:
- atr
- htr
- ocr
- historical
- handwritten
---
# NorHand v1 - line level
## Table of Contents
- [NorHand v1 - line level](#norhand-v1-line-level)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
## Dataset Description
- **Homepage:** [Hugin-Munin project](https://hugin-munin-project.github.io/)
- **Source:** [Zenodo](https://zenodo.org/records/6542056)
- **Paper:** [A Comprehensive Comparison of Open-Source Libraries for Handwritten Text Recognition in Norwegian](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_27)
- **Point of Contact:** [TEKLIA](https://teklia.com)
### Dataset Summary
The NorHand v1 dataset comprises line images and transcriptions of Norwegian letters and diaries from the 19th and early 20th centuries.
Note that all images are resized to a fixed height of 128 pixels.
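As an illustration of the fixed-height convention, the sketch below computes the size an image would have after scaling to a height of 128 pixels. It assumes the aspect ratio is preserved, which the card does not state explicitly, and `target_size` is a hypothetical helper rather than part of the dataset tooling.

```python
TARGET_HEIGHT = 128  # fixed line height stated in the dataset card


def target_size(width: int, height: int, target_height: int = TARGET_HEIGHT) -> tuple[int, int]:
    """Return (new_width, new_height) after scaling to target_height.

    Assumption: the aspect ratio is preserved (not confirmed by the card).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target_height / height
    # Keep at least one pixel of width for extremely narrow inputs.
    new_width = max(1, round(width * scale))
    return new_width, target_height


if __name__ == "__main__":
    # A hypothetical 8600x256 line scan scales down to 4300x128.
    print(target_size(8600, 256))
```

With Pillow installed, the same helper could feed `image.resize(target_size(*image.size))` to bring new line images to the height used in this dataset.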
### Languages
All the documents in the dataset are written in Norwegian Bokmål.
## Dataset Structure
### Data Instances
```
{
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4300x128 at 0x1A800E8E190>,
'text': 'fredag 1923'
}
```
### Data Fields
- `image`: a `PIL.Image.Image` object containing the image. Accessing the image column (for example `dataset[0]["image"]`) automatically decodes the image file, and decoding a large number of files can take a significant amount of time. Index the sample first and the `"image"` column second: prefer `dataset[0]["image"]` over `dataset["image"][0]`.
- `text`: the transcription of the text line shown in the image.
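The access pattern described for the `image` field can be sketched as follows. This is a minimal example that assumes the third-party `datasets` package is installed and the Hub is reachable; `load_norhand` and `get_sample` are hypothetical helper names introduced here for illustration.

```python
def load_norhand(split: str = "train"):
    """Load one split of the dataset from the Hugging Face Hub.

    Requires the third-party `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily; third-party dependency

    return load_dataset("Teklia/NorHand-v1-line", split=split)


def get_sample(dataset, index: int):
    """Return (image, text) for one row.

    The row is indexed first, so only that row's image is decoded,
    following the recommendation in the field description.
    """
    row = dataset[index]
    return row["image"], row["text"]


# Example usage (downloads data, so it is left as a comment):
#   ds = load_norhand("validation")
#   image, text = get_sample(ds, 0)
#   print(image.size, text)
```

Keeping the row lookup in one helper makes it harder to write `dataset["image"][index]` by accident when iterating over many samples.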
Provider:
Teklia
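Summary of original information
NorHand v1 - line level dataset overview
Dataset description
The NorHand v1 dataset contains line images and transcriptions of Norwegian letters and diaries from the 19th and early 20th centuries. All images are resized to a fixed height of 128 pixels.
Languages
All documents in the dataset are written in Norwegian Bokmål.
Dataset structure
Data instances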
{ "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4300x128 at 0x1A800E8E190>, "text": "fredag 1923" }
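Data fields
- image: a PIL.Image.Image object containing the image. Accessing the image column (using dataset[0]["image"]) automatically decodes the image file. Decoding a large number of image files can take a significant amount of time, so index the sample before the "image" column: dataset[0]["image"] should always be preferred over dataset["image"][0].
- text: the transcription of the text line shown in the image.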
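Dataset information
- Features:
  - image: the line image, dtype image.
  - text: the transcription, dtype string.
- Splits:
  - train: training set, 19653 samples.
  - validation: validation set, 2286 samples.
  - test: test set, 1793 samples.
- Dataset size: 23732 samples.
- Tags:
  - atr
  - htr
  - ocr
  - historical
  - handwritten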
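
The split sizes listed in the metadata can be cross-checked with a few lines of arithmetic. The sketch below confirms that the three splits add up to the stated dataset size and prints each split's share. The numbers are copied from the card, and `split_fractions` is a hypothetical helper.

```python
# Split sizes and total as stated in the dataset card metadata.
SPLITS = {"train": 19653, "validation": 2286, "test": 1793}
DATASET_SIZE = 23732


def split_fractions(splits: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of examples."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


if __name__ == "__main__":
    # The three splits should account for every example in the dataset.
    assert sum(SPLITS.values()) == DATASET_SIZE
    for name, fraction in split_fractions(SPLITS).items():
        print(f"{name}: {fraction:.1%}")
```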



