snskrt/Shiv_puran_OCR
Hugging Face. Updated 2025-08-24. Indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/snskrt/Shiv_puran_OCR
Resource overview:
The Shlok vs Non Shlok in Shiv puran dataset is designed to help object detection models distinguish Shlok (verse) regions from non-Shlok content on book pages. Cropping out the Shlok regions makes it possible to build a parallel image-text corpus. The dataset is curated by 13Aluminium and released under the Apache 2.0 license. The source material is PDF files from the Internet Archive, and the dataset is intended to support further development of Sanskrit OCR. The dataset structure includes annotation files containing all coordinate information, divided into three parts: Vidhyeshwar Samhita (manually annotated, then used to train a model), Rudra Samhita (annotated by model inference, with outliers corrected manually), and Shat Rudra Samhita (annotated by model inference, with outliers corrected manually).
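The cropping step mentioned in the description could be sketched roughly as follows. This listing does not document the annotation file format, so everything here is an assumption for illustration: the `Box` record, the `"Shlok"` label string, pixel-coordinate corners, and the helper name `shlok_crop_boxes` are all hypothetical and may not match the dataset's actual files.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical annotation record; the real annotation schema may differ."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def shlok_crop_boxes(
    boxes: List[Box],
    image_width: int,
    image_height: int,
    target_label: str = "Shlok",  # assumed label name, not confirmed by the source
) -> List[Tuple[int, int, int, int]]:
    """Return integer (left, top, right, bottom) crop boxes for the target label.

    Boxes are clamped to the image bounds; boxes that end up empty are skipped.
    """
    crops = []
    for box in boxes:
        if box.label != target_label:
            continue
        left = max(0, int(box.x_min))
        top = max(0, int(box.y_min))
        right = min(image_width, int(box.x_max))
        bottom = min(image_height, int(box.y_max))
        if right > left and bottom > top:
            crops.append((left, top, right, bottom))
    return crops
```

Each returned tuple is in (left, upper, right, lower) order, which is the box order that Pillow's `Image.crop` accepts, so the crops could be paired with their transcriptions to form image-text pairs.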
Provided by:
snskrt



