Qasim522/Roman-Urdu-Parl-split
Source: Hugging Face. Updated 2025-12-13; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Qasim522/Roman-Urdu-Parl-split
Description:
Roman Urdu Parallel Dataset - Split is a restructured version of the original Roman-Urdu-Parl dataset, intended for machine transliteration between Urdu and Roman Urdu. The original dataset contains 6,365,808 parallel sentence pairs, in which a single Urdu sentence can appear with several Roman Urdu transliteration variants. A naive random split can therefore place variants of the same Urdu sentence in both the training and evaluation data. This version addresses that data leakage by ensuring that the training, validation, and test sets do not overlap, so that evaluation better reflects how well a transliteration model generalizes. The validation and test sets are drawn from unique sentences (those with only one variant) and from sentences with 2-10 variants; the remaining sentences, together with their variants, go into the training set. Smaller validation and test subsets are also provided for faster evaluation during model development.
Provider:
Qasim522
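
The splitting strategy in the description can be illustrated with a minimal sketch. It is not the dataset author's actual procedure: the function name `split_by_source`, the parameters `n_val`, `n_test`, `max_variants`, and `seed`, the random sampling, and the choice to keep every variant of a held-out sentence in the same split are all assumptions made for illustration. The idea shown is only that pairs are grouped by their Urdu sentence before splitting, so that no Urdu sentence ends up in more than one split.

```python
import random
from collections import defaultdict


def split_by_source(pairs, n_val, n_test, max_variants=10, seed=0):
    """Split (urdu, roman) pairs so no Urdu sentence spans two splits.

    Hypothetical helper: names, sizes, and sampling are assumptions.
    """
    # Group every Roman Urdu variant under its Urdu source sentence.
    groups = defaultdict(list)
    for urdu, roman in pairs:
        groups[urdu].append(roman)

    # Held-out candidates: sentences with 1 to max_variants variants.
    eligible = sorted(u for u, v in groups.items() if len(v) <= max_variants)
    if n_val + n_test > len(eligible):
        raise ValueError("not enough eligible sentences for held-out sets")

    # Assumption: held-out sentences are sampled uniformly at random.
    rng = random.Random(seed)
    rng.shuffle(eligible)
    val_keys = set(eligible[:n_val])
    test_keys = set(eligible[n_val:n_val + n_test])

    # Assign whole groups, so all variants of a sentence stay together.
    splits = {"train": [], "validation": [], "test": []}
    for urdu, variants in groups.items():
        if urdu in val_keys:
            name = "validation"
        elif urdu in test_keys:
            name = "test"
        else:
            name = "train"
        splits[name].extend((urdu, roman) for roman in variants)
    return splits


# Toy data: placeholder strings stand in for real sentence pairs.
pairs = [
    ("u1", "r1a"),
    ("u2", "r2a"), ("u2", "r2b"),
    ("u3", "r3a"),
    ("u4", "r4a"), ("u4", "r4b"), ("u4", "r4c"),
]
splits = split_by_source(pairs, n_val=1, n_test=1)
```

Because assignment happens per Urdu sentence rather than per pair, the sets of Urdu sentences in `splits["train"]`, `splits["validation"]`, and `splits["test"]` are disjoint by construction, which is the property the description relies on to avoid leakage.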



