IndicCorp
Source: arXiv | Updated: 2023-05-25 | Indexed: 2024-06-21
Download link:
https://ai4bharat.iitm.ac.in/language-understanding
Resource overview:
IndicCorp is the largest monolingual corpus for Indian languages, created jointly by the Indian Institute of Technology Madras and AI4Bharat. The dataset contains 20.9 billion tokens across 24 languages, including support for 12 additional languages, a 2.3-fold increase over prior work. To ensure data quality and relevance, its content is crawled from human-validated URLs. It is primarily used to improve natural language understanding for Indian languages, especially when pre-training multilingual language models, with the aim of improving performance on low-resource languages.
Provided by:
Indian Institute of Technology Madras
Created:
2022-12-11



