Urdu-Nepali Parallel Corpus

Source: SSH Open MarketPlace. Updated 2025-01-30. Indexed 2025-02-01.

Download link:

https://marketplace.sshopencloud.eu/dataset/Q170XP

Resource description:

Pakistan has a rich multilingual and multicultural heritage, with about 70 spoken languages drawn from the Indo-Aryan, Indo-Iranian, Sino-Tibetan, and Dravidian language families. More than half of these languages also have a written form, predominantly in the Perso-Arabic Nastalique and Arabic Naskh writing styles. Gujarati, Gurmukhi, and Tibetan scripts are also used by some communities, while others are still in the process of defining their writing systems. These languages exhibit a diverse set of sounds and underlying linguistic structures that are both linguistically and computationally exciting and challenging. Most of them are not well studied or well modeled, and they present a vast training ground for researchers in linguistics and computer science.

This dataset provides resources for two languages spoken in Pakistan: Nepali and Urdu. Urdu is the national language of Pakistan, while Nepali is mainly spoken in a small immigrant community. The corpus consists of two documents, one in Nepali and one in Urdu. Each document is available with and without part-of-speech tags. Both are parallel to a common 100,000-word English source text from the Penn Treebank corpus, which is available through the Linguistic Data Consortium (LDC). The part-of-speech tags are those of the Penn Treebank, and additional information can be found in the included .csv file.
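The description does not specify how the tagged documents are laid out on disk. As a minimal sketch only, the following Python assumes whitespace-separated tokens written as `word/TAG`, a convention common in Penn Treebank style tagged text; the function names `parse_tagged_line` and `tag_counts`, the separator, and the sample line are all hypothetical and may need adjusting to the actual files.

```python
from collections import Counter


def parse_tagged_line(line, sep="/"):
    """Split a line of `word<sep>TAG` tokens into (word, tag) pairs.

    ASSUMPTION: the word/TAG layout is a guess, not taken from the dataset.
    The split is on the LAST separator, so a token such as "1/2/CD"
    yields ("1/2", "CD"). Tokens without a separator get a tag of None.
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator present: keep the token as an untagged word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


def tag_counts(lines, sep="/"):
    """Count how often each part-of-speech tag occurs across lines."""
    counts = Counter()
    for line in lines:
        for _, tag in parse_tagged_line(line, sep):
            if tag is not None:
                counts[tag] += 1
    return counts


# Hypothetical English sample line, used only to exercise the functions.
sample = ["The/DT market/NN fell/VBD ./."]
print(parse_tagged_line(sample[0]))
print(tag_counts(sample).most_common())
```

Because Urdu is written right to left, it is worth opening the files explicitly as UTF-8 (for example `open(path, encoding="utf-8")`) and checking a few tokens by hand before relying on tag counts.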

Created:

2025-01-30
