PersoArabicLID
Source: arXiv | Updated: 2023-04-04 | Indexed: 2024-06-21
Download link:
https://github.com/sinaahmadi/PersoArabicLID
Description:
The PersoArabicLID dataset was developed by a research team in the Department of Computer Science at George Mason University for identifying languages written in the Perso-Arabic script. It contains 10,000 sentences covering a range of languages, including Urdu, Kurdish, Pashto, Azerbaijani Turkish, Sindhi, and Uyghur. The researchers collected the data from multiple online sources and used supervised techniques for classification. The dataset is primarily intended to address the challenge of language identification in low-resource settings, particularly where bilingual communities write their languages in unconventional ways.
Provider:
Department of Computer Science, George Mason University
Created:
2023-04-04



