snskrt/Shiv_puran_OCR

Name: snskrt/Shiv_puran_OCR
Creator: snskrt
Published: 2025-08-24 22:02:55
License: No description available

Source: Hugging Face. Updated 2025-08-24. Indexed 2025-07-05.

Download link:

https://hf-mirror.com/datasets/snskrt/Shiv_puran_OCR

Description:

The Shlok vs Non Shlok in Shiv puran dataset is designed to help object detection models distinguish Shlok (verse) regions from non-Shlok content on book pages. Cropping out the Shlok regions makes it possible to build a parallel image-text corpus. The dataset is curated by 13Aluminium and released under the Apache 2.0 license. The source data are PDF files from the Internet Archive, and the dataset is intended to support further development of Sanskrit OCR. The dataset includes annotation files containing all coordinate information, divided into three parts: Vidhyeshwar Samhita (manually annotated and used to train a model), Rudra Samhita (annotated by model inference, with outliers corrected manually), and Shat Rudra Samhita (annotated by model inference, with outliers corrected manually).

Provider:

snskrt
