Urdu-Nepali Parallel Corpus

SSH Open MarketPlace | Updated 2024-09-30 | Indexed 2024-10-05
Download link:
https://marketplace.sshopencloud.eu/dataset/Q170XP
Description:
Pakistan has a rich multilingual and multicultural heritage, with about 70 spoken languages drawn from the Indo-Aryan, Indo-Iranian, Sino-Tibetan and Dravidian language families. More than half of these languages also have a written form, predominantly in the Perso-Arabic Nastalique and Arabic Naskh writing styles. Gujarati, Gurmukhi and Tibetan scripts are also used by some communities, while others are still in the process of defining their writing systems. These languages exhibit a diverse set of sounds and underlying linguistic structures that are both exciting and challenging, linguistically and computationally. Most of them are not well studied or well modeled, and they present a vast training ground for researchers in linguistics and computer science. This dataset provides resources for two languages spoken in Pakistan: Nepali and Urdu. Urdu is the national language of Pakistan, while Nepali is mainly spoken in a small immigrant community. The corpus consists of two documents, one in Nepali and one in Urdu. Each document is available with and without part-of-speech tags. Both are parallel to a common 100,000-word English source text from the Penn Treebank corpus, available through the Linguistic Data Consortium (LDC). The part-of-speech tags are those of the Penn Treebank, and additional information can be found in the included .csv file.
Created:
2024-09-30