SLPG/Punjabi_Transliteration_Corpus
Source: Hugging Face. Updated 2024-07-20; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/SLPG/Punjabi_Transliteration_Corpus
Resource summary:
The Punjabi Transliteration Corpus (PTC) is a dataset of 6.3 million parallel sentences in the Gurmukhi and Shahmukhi scripts. It was compiled to support the development and evaluation of neural machine transliteration (NMT) models for Punjabi text. The data is drawn from several sources, including CCAligned, CCMatrix, TED, QED, OPUS, TIco, Wikimedia, MultiCCAligned, EMILLE, IJCNLP, XLEnt, and ParaCrawl. The test corpus is FLORES-101. On it, the Gurmukhi-to-Shahmukhi model has a BLEU score of 98.1, a word-level accuracy of 99.5%, and a reported character error rate of 99.1%; the Shahmukhi-to-Gurmukhi model has a BLEU score of 87.7.
Provider:
SLPG
Summary of original information
Punjabi Transliteration Corpus (PTC)
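Overview
The Punjabi Transliteration Corpus (PTC) is a comprehensive dataset of 6.3 million parallel sentence pairs covering the Gurmukhi and Shahmukhi scripts. It is intended to support the development and evaluation of neural machine transliteration (NMT) models for Punjabi text.
Dataset details
- Total sentences: 6.3 million
- Sources covered: CCAligned, CCMatrix, TED, QED, OPUS, TIco, Wikimedia, MultiCCAligned, EMILLE, IJCNLP, XLEnt, and ParaCrawl, among others.
- Test corpus: FLORES-101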
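Models
Gurmukhi-to-Shahmukhi model
- BLEU score: 98.1
- Word-level accuracy: 99.5%
- Character error rate (CER): 99.1% (as reported in the source; because a lower CER is better, this figure more plausibly denotes character-level accuracy)
Shahmukhi-to-Gurmukhi model
- BLEU score: 87.7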
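For readers unfamiliar with the metrics listed above, the following is a minimal sketch of how character error rate and word-level accuracy are commonly computed. It is not the evaluation code behind the reported scores: the function names are hypothetical, CER is taken here as Levenshtein edit distance over Unicode code points divided by the reference length, and word-level accuracy is assumed to be a position-by-position comparison of whitespace-separated tokens.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total character edits divided by total reference characters.

    Assumption: characters are Unicode code points, so combining
    vowel signs and diacritics are counted individually.
    """
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


def word_accuracy(references, hypotheses):
    """Fraction of reference words reproduced exactly at the same position.

    Assumption: words are whitespace-separated and compared position by position.
    """
    correct = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total += len(ref_words)
        correct += sum(r == h for r, h in zip(ref_words, hyp_words))
    return correct / total if total else 0.0


# Toy example in Latin script: one substituted character in the second word.
refs = ["abc def"]
hyps = ["abc dxf"]
cer = character_error_rate(refs, hyps)   # 1 edit over 7 reference characters
acc = word_accuracy(refs, hyps)          # 1 of 2 words matches
```

Under these definitions a good system has a CER close to 0 and a word-level accuracy close to 1, which is why a CER near 99% sits oddly beside a word-level accuracy of 99.5%.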
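Usage
This resource is intended to advance research and development in Punjabi transliteration. It can be used to train new models or improve existing ones for high-quality transliteration between the Gurmukhi and Shahmukhi scripts.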
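As a starting point for such work, the sketch below shows one way to fetch the repository and iterate over sentence pairs. It assumes the `huggingface_hub` package is installed and that the corpus is stored as two line-aligned UTF-8 text files, one per script. The source does not describe the file layout, so the file paths passed to `read_parallel` are hypothetical and need to be replaced after inspecting the downloaded directory.

```python
from pathlib import Path

REPO_ID = "SLPG/Punjabi_Transliteration_Corpus"


def download_corpus(repo_id=REPO_ID):
    """Download the dataset repository and return its local directory.

    Requires the third-party package huggingface_hub and network access.
    """
    from huggingface_hub import snapshot_download
    return Path(snapshot_download(repo_id=repo_id, repo_type="dataset"))


def read_parallel(gurmukhi_path, shahmukhi_path):
    """Yield (gurmukhi, shahmukhi) sentence pairs from two line-aligned files.

    Assumption: line i of one file corresponds to line i of the other.
    Pairs where either side is empty are skipped.
    """
    with open(gurmukhi_path, encoding="utf-8") as g_file, \
         open(shahmukhi_path, encoding="utf-8") as s_file:
        for g_line, s_line in zip(g_file, s_file):
            g_line, s_line = g_line.strip(), s_line.strip()
            if g_line and s_line:
                yield g_line, s_line


if __name__ == "__main__":
    corpus_dir = download_corpus()
    # List the files first; the actual names are not given in the source.
    for path in sorted(corpus_dir.rglob("*")):
        print(path.relative_to(corpus_dir))
```

Reading the files lazily with a generator keeps memory use flat, which matters for a corpus of several million lines.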



