Aksharantar

Source: arXiv · Updated: 2023-10-26 · Indexed: 2024-06-21
Download link:
https://github.com/AI4Bharat/IndicXlit
Overview:
Aksharantar is a publicly available transliteration dataset for Indian languages, created at the Indian Institute of Technology Madras. It contains 26 million transliteration pairs covering 21 Indian languages written in 12 scripts. The dataset was built by mining monolingual and parallel corpora and by collecting human-annotated data. It is 21 times larger than existing datasets and is the first publicly available transliteration dataset for 7 languages and 1 language family. By providing a large-scale, open resource, Aksharantar aims to address the challenges of Indian-language transliteration, support downstream applications such as input tools, and encourage further work in the area.
Provider:
Indian Institute of Technology Madras
Created:
2022-05-06