IndicCorp

Name: IndicCorp
Creator: Indian Institute of Technology Madras
Published: 2023-05-25 01:05:16
License: Not specified

arXiv; updated 2023-05-25; indexed 2024-06-21

Download link:

https://ai4bharat.iitm.ac.in/language-understanding

Description:

IndicCorp is the largest monolingual corpus for Indian languages, created jointly by the Indian Institute of Technology Madras and AI4Bharat. The dataset contains 20.9 billion tokens across 24 languages, a 2.3-fold increase over prior work, and adds support for 12 additional languages. To keep the data high in quality and relevance, its content is crawled from human-validated URLs. The corpus is intended mainly for improving natural language understanding in Indian languages, particularly through multilingual pre-trained language models, with the aim of improving performance on low-resource languages.

Provided by:

Indian Institute of Technology Madras

Created:

2022-12-11
