SLPG/Punjabi_Transliteration_Corpus

Name: SLPG/Punjabi_Transliteration_Corpus
Creator: SLPG
Published: 2024-07-20 09:01:17
License: No description available

Source: Hugging Face. Updated 2024-07-20. Indexed 2024-07-22.

Download link:

https://hf-mirror.com/datasets/SLPG/Punjabi_Transliteration_Corpus

Resource description:

The Punjabi Transliteration Corpus (PTC) is a dataset of 6.3 million parallel sentence pairs in the Gurmukhi and Shahmukhi scripts. It was compiled to support the development and evaluation of neural machine transliteration (NMT) models for Punjabi text. The sentences are drawn from multiple sources, including CCAligned, CCMatrix, TED, QED, OPUS, TIco, Wikimedia, MultiCCAligned, EMILLE, IJCNLP, XLEnt, and ParaCrawl. The test corpus is FLORES-101. The Gurmukhi-to-Shahmukhi model is reported to reach a BLEU score of 98.1, a word-level accuracy of 99.5%, and a character error rate (CER) of 99.1%; the Shahmukhi-to-Gurmukhi model is reported to reach a BLEU score of 87.7.

Provider:

SLPG

Summary of original information

Punjabi Transliteration Corpus (PTC)

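Overview

The Punjabi Transliteration Corpus (PTC) is a dataset of 6.3 million parallel sentence pairs covering the Gurmukhi and Shahmukhi scripts. It is intended to support the development and evaluation of neural machine transliteration (NMT) models for Punjabi text.

Dataset details

Total sentences: 6.3 million
Sources covered: CCAligned, CCMatrix, TED, QED, OPUS, TIco, Wikimedia, MultiCCAligned, EMILLE, IJCNLP, XLEnt, and ParaCrawl, among others.
Test corpus: FLORES-101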

Models

Gurmukhi-to-Shahmukhi model

BLEU score: 98.1
Word-level accuracy: 99.5%
Character error rate (CER): 99.1%

Shahmukhi-to-Gurmukhi model

BLEU score: 87.7
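
The listing does not say how word-level accuracy and CER were computed. As a rough illustration only, the sketch below uses common definitions: CER as total character edit distance divided by total reference length (an error rate, so lower is better), and word-level accuracy as the share of reference words matched at the same position. The function names and the position-wise matching are assumptions, not the authors' evaluation script.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypotheses, references):
    """Corpus-level character error rate: total edits / total reference characters."""
    edits = sum(levenshtein(h, r) for h, r in zip(hypotheses, references))
    total = sum(len(r) for r in references)
    return edits / total if total else 0.0


def word_accuracy(hypotheses, references):
    """Share of reference words reproduced exactly at the same position (assumed definition)."""
    correct = total = 0
    for h, r in zip(hypotheses, references):
        h_words, r_words = h.split(), r.split()
        total += len(r_words)
        correct += sum(hw == rw for hw, rw in zip(h_words, r_words))
    return correct / total if total else 0.0


# Toy Latin-script strings stand in for Gurmukhi/Shahmukhi sentences.
refs = ["abc def", "ghi"]
hyps = ["abc dxf", "ghi"]
print(cer(hyps, refs))            # 1 edit over 10 reference characters
print(word_accuracy(hyps, refs))  # 2 of 3 reference words match
```

Python iterates strings by Unicode code point, so vowel signs and other combining marks in Gurmukhi or Shahmukhi text count as separate characters here; a grapheme-based count would give somewhat different CER values.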

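Usage

This resource is intended to advance research and development in Punjabi transliteration. It can be used to train new models or improve existing ones for high-quality transliteration between the Gurmukhi and Shahmukhi scripts.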
