Teklia/NorHand-v3-line
Hugging Face, updated 2024-03-14, indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/Teklia/NorHand-v3-line
Resource description:
---
license: mit
language:
- nb
task_categories:
- image-to-text
pretty_name: NorHand-v3-line
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 222381
  - name: validation
    num_examples: 22679
  - name: test
    num_examples: 1562
  dataset_size: 246622
tags:
- atr
- htr
- ocr
- historical
- handwritten
---
# NorHand v3 - line level
## Table of Contents
- [NorHand v3 - line level](#norhand-v3-line-level)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
## Dataset Description
- **Homepage:** [Hugin-Munin project](https://hugin-munin-project.github.io/)
- **Source:** [Zenodo](https://zenodo.org/records/10255840)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
The NorHand v3 dataset comprises line images and their transcriptions from Norwegian letters and diaries of the 19th and early 20th centuries.
Note that all images are resized to a fixed height of 128 pixels.
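To illustrate what a fixed height of 128 pixels implies for image width, here is a minimal sketch. It assumes the resize preserves the aspect ratio, which the dataset card does not state. The function name `resized_size` is hypothetical and not part of any dataset tooling.

```python
def resized_size(width: int, height: int, target_height: int = 128) -> tuple[int, int]:
    """Return (new_width, target_height) for a line image scaled to a fixed height.

    Assumption: the width is scaled by the same factor as the height
    (aspect ratio preserved). The dataset card only states the target height.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target_height / height
    # Round to the nearest pixel and never return a zero-width image.
    return max(1, round(width * scale)), target_height
```

Under this assumption, a hypothetical 8600x256 line crop would map to 4300x128.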
### Languages
All the documents in the dataset are written in Norwegian Bokmål.
## Dataset Structure
### Data Instances
```
{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4300x128 at 0x1A800E8E190>,
  'text': 'Til Bestyrelsen af'
}
```
### Data Fields
- `image`: a `PIL.Image.Image` object containing the image. When the image column is accessed (for example with `dataset[0]["image"]`), the image file is decoded automatically. Decoding a large number of image files can take a significant amount of time, so index the sample first and the `"image"` column second: always prefer `dataset[0]["image"]` over `dataset["image"][0]`.
- `text`: the transcription of the text line shown in the image.
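The access pattern recommended for the `image` field can be sketched as follows. This is a minimal example, assuming the `datasets` library is installed and the repository id `Teklia/NorHand-v3-line` is reachable. The helper names `peek` and `load_test_split` are hypothetical and chosen only for illustration.

```python
def peek(dataset, index=0):
    """Return (image, text) for one sample, indexing the row before the column."""
    # Row first: only the image of this one sample needs to be decoded.
    sample = dataset[index]
    return sample["image"], sample["text"]


def load_test_split():
    """Load the test split from the Hugging Face Hub (requires network access)."""
    # Imported lazily so that `peek` can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("Teklia/NorHand-v3-line", split="test")
```

Calling `peek(load_test_split())` would then return the first test image together with its transcription. The test split is used here only because it is the smallest of the three.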
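Provider:
Teklia
Summary of original information
Dataset overview
Dataset name
NorHand-v3-line
Dataset description
The NorHand v3 dataset contains line images and their transcriptions from Norwegian letters and diaries of the 19th and early 20th centuries. All images are resized to a fixed height of 128 pixels.
Language
All documents in the dataset are written in Norwegian Bokmål.
Dataset structure
Data instance
`{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4300x128 at 0x1A800E8E190>, 'text': 'Til Bestyrelsen af'}`
Data fields
- `image`: a `PIL.Image.Image` object containing the image. When the image column is accessed (using `dataset[0]["image"]`), the image file is decoded automatically. Decoding a large number of image files can take a long time, so query the sample index before the `"image"` column: `dataset[0]["image"]` should always be preferred over `dataset["image"][0]`.
- `text`: the transcription of the text line shown in the image.
Dataset splits
- train: 222381 examples
- validation: 22679 examples
- test: 1562 examples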
Dataset size
246622 examples
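The split counts can be cross-checked against the stated dataset size with a few lines of arithmetic. This is a small sketch using only the numbers given in this card. The names `SPLITS`, `DATASET_SIZE`, and `split_shares` are introduced here for illustration.

```python
# Example counts copied from the dataset card metadata.
SPLITS = {"train": 222381, "validation": 22679, "test": 1562}
DATASET_SIZE = 246622


def split_shares(splits):
    """Return each split's fraction of the total number of examples."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


# The three splits should add up to the stated dataset size.
assert sum(SPLITS.values()) == DATASET_SIZE

for name, share in split_shares(SPLITS).items():
    print(f"{name}: {share:.1%}")
```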
Tags
- atr
- htr
- ocr
- historical
- handwritten



