johnlockejrr/samaritan_v1

Name: johnlockejrr/samaritan_v1
Creator: johnlockejrr
Published: 2024-07-01 10:12:48
License: No description available

Hugging Face: updated 2024-07-01, indexed 2024-07-06

Download link:

https://hf-mirror.com/datasets/johnlockejrr/samaritan_v1


Resource overview:

The Samaritan_v1 dataset comprises line images and text from Samaritan Biblical manuscripts of the 14th and early 17th centuries. All images are resized to a fixed height of 128 pixels. The documents are written in Hebrew, Samaritan Hebrew, and Samaritan Aramaic. Each data instance pairs an image with its text transcription: the image is a PIL.Image.Image object and the text is the label transcription of that image. The data fields section describes how the images and text are processed. The dataset is divided into train, validation, and test sets containing 2500, 1500, and 965 samples respectively.

Provided by:

johnlockejrr

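Summary of original information

Samaritan v1 - line level: dataset overview

Dataset description

The Samaritan_v1 dataset contains line images and text from Samaritan Biblical manuscripts of the 14th and early 17th centuries. All images are resized to a fixed height of 128 pixels.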
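The fixed 128-pixel height can be illustrated with a little arithmetic. The sketch below assumes the resizing preserves the aspect ratio, which the dataset description does not state; the function name `target_size` is hypothetical and is not part of the dataset's tooling.

```python
def target_size(width: int, height: int, fixed_height: int = 128) -> tuple[int, int]:
    """Compute (new_width, new_height) for a line image scaled to a fixed height.

    Assumption: the aspect ratio is preserved. The source only says that all
    images end up 128 pixels high.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale the width by the same factor applied to the height.
    new_width = max(1, round(width * fixed_height / height))
    return new_width, fixed_height


# A hypothetical 8600x256 scan of a line would become 4300x128.
print(target_size(8600, 256))
```

With an imaging library such as Pillow, the resulting pair could be passed to the image's resize method; that step is omitted here to keep the sketch dependency-free.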

Languages

All documents in the dataset are written in Hebrew, Samaritan Hebrew, and Samaritan Aramaic.

Dataset structure

Data instances

{
  "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4300x128 at 0x1A800E8E190>,
  "text": "ינבו הנש תאמו םישלשו"
}

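Data fields

image: A PIL.Image.Image object containing the image. When the image column is accessed (for example, dataset[0]["image"]), the image file is decoded automatically. Decoding a large number of image files can take a long time, so query the sample index before the "image" column: dataset[0]["image"] should be preferred over dataset["image"][0].
text: The label transcription of the image. Because of a limitation of the PyLaia library, the text is flipped from RTL to LTR.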
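The two field notes above can be combined in a small sketch: index the row first so that only one image is decoded, then reverse the flipped transcription to recover the logical right-to-left order. This assumes the flip is a plain character-order reversal with no combining marks, which the card does not spell out. The helper names (`restore_reading_order`, `read_sample`, `load_split`) are hypothetical, and loading by the repository ID `johnlockejrr/samaritan_v1` through the `datasets` library is an assumption based on the download link.

```python
def restore_reading_order(flipped: str) -> str:
    """Reverse an LTR-flipped line back to logical RTL character order.

    Assumption: the flip was a simple character reversal.
    """
    return flipped[::-1]


def read_sample(dataset, index: int):
    """Return (image, logical_text) for one row, indexing the row first."""
    row = dataset[index]  # row-first access: only this row's image is decoded
    return row["image"], restore_reading_order(row["text"])


def load_split(split: str = "train"):
    """Load one split from the Hugging Face Hub (needs `datasets` and network)."""
    from datasets import load_dataset  # third-party; imported lazily on purpose

    return load_dataset("johnlockejrr/samaritan_v1", split=split)
```

A typical call would be `image, text = read_sample(load_split("train"), 0)`. Because `read_sample` only relies on `dataset[index]` returning a mapping, it also works on a plain list of dictionaries, which is convenient for offline checks.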

Dataset information

Features:
- image: image type, dtype image.
- text: text type, dtype string.
Splits:
- train: training set, 2500 samples.
- validation: validation set, 1500 samples.
- test: test set, 965 samples.
Dataset size: 415M.
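As a quick sanity check on the split sizes listed above, the following sketch totals the samples and prints each split's share. The counts come from the listing; the variable names are illustrative only.

```python
# Split sizes as reported in the dataset listing.
SPLITS = {"train": 2500, "validation": 1500, "test": 965}

total = sum(SPLITS.values())
shares = {name: count / total for name, count in SPLITS.items()}

for name, count in SPLITS.items():
    # Print the split name, its sample count, and its share of the total.
    print(f"{name:<10} {count:>5} {shares[name]:6.1%}")
print(f"{'total':<10} {total:>5}")
```

The total is 4965 line images, split roughly 50/30/20 between training, validation, and test.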