
Shahmukhi Database SMDB- SMHaroof V1

Source: ieee-dataport.org (indexed 2025-03-25)
Download link:
https://ieee-dataport.org/documents/shahmukhi-database-smdb-smharoof-v1
Description:
A central challenge in machine learning is selecting suitable techniques and resources, such as tools and datasets. Punjabi has millions of speakers around the globe and a rich literary history of more than a thousand years, yet computational linguistic work on its Shahmukhi script is hard to find. Shahmukhi belongs to the Perso-Arabic family of context-specific scripts, and Punjabi written in it remains a low-resource language. The choice of the best algorithm for a machine learning problem depends heavily on the availability of a dataset for that specific task. We present a novel, custom-built, first-of-its-kind dataset for Punjabi in Shahmukhi script, together with its design, development, and validation using artificial neural networks. The dataset covers up to 40 classes in multiple fonts, including Nasta’leeq, Naskh, and Arabic Type, at many font sizes, and is provided in many sub-sizes. It is built with a dedicated construction process that lets researchers modify the dataset to suit their requirements. The dataset construction program can also perform data augmentation, generating millions of images for a machine learning algorithm by varying parameters including font type, size, orientation, and translation. Using this process, a dataset for any language can be constructed. CNNs with different architectures have been implemented, and validation accuracy of up to 99% has been achieved.
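The way a parameterised construction program can reach very large image counts may be easier to see with a small sketch. The following is not the SMDB construction program. It is a minimal illustration, assuming hypothetical values for font sizes, rotation angles, and pixel offsets, and using hypothetical helper names (`augmentation_grid`, `translate`). Only the class count and the three font names come from the description above. Glyph rendering itself is left out, since it would need a font-rendering library and the actual font files.

```python
from itertools import product

# Every value below is an illustrative assumption, not the SMDB configuration.
FONTS = ["Nasta'leeq", "Naskh", "Arabic Type"]  # font families named in the description
SIZES = [24, 32, 48]                            # hypothetical font sizes
ANGLES = [-10, -5, 0, 5, 10]                    # hypothetical rotations in degrees
SHIFTS = [(dx, dy) for dx in (-2, 0, 2) for dy in (-2, 0, 2)]  # hypothetical pixel offsets


def augmentation_grid(num_classes=40):
    """Yield one parameter record for each image that would be rendered."""
    for label, font, size, angle, (dx, dy) in product(
        range(num_classes), FONTS, SIZES, ANGLES, SHIFTS
    ):
        yield {"label": label, "font": font, "size": size,
               "angle": angle, "dx": dx, "dy": dy}


def translate(bitmap, dx, dy, fill=0):
    """Shift a 2D list of pixels by (dx, dy); cells left uncovered get `fill`."""
    height, width = len(bitmap), len(bitmap[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                shifted[ny][nx] = bitmap[y][x]
    return shifted


records = list(augmentation_grid())
print(len(records))  # 40 * 3 * 3 * 5 * 9 = 16200
```

Even this small grid yields 16,200 parameter combinations, so finer steps for angle and offset, or more fonts and sizes, quickly push the total into the millions. Each record would be passed to a renderer and then to transforms such as `translate`.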

Provider:
ieee-dataport.org