IndicNLP Corpus

Name: IndicNLP Corpus
Creator: Microsoft India
Published: 2020-05-01 04:21:02
License: No description available

Source: arXiv. Updated 2020-05-01; indexed 2024-06-21.

Download link:

https://github.com/ai4bharatindicnlp/indicnlp_corpus

Resource description:

The IndicNLP Corpus is a large general-domain corpus created jointly by Microsoft India, IIT Madras, and AI4Bharat. It contains 2.7 billion words across 10 Indian languages. The text comes mainly from news websites and Wikipedia; it was collected by web crawling, then cleaned and deduplicated. The corpus is used to train word embedding models and to evaluate NLP tasks such as text classification and sentiment analysis, with the goal of advancing NLP research on Indian languages.
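The description mentions deduplication but does not say how it was done. As a rough illustration only, the sketch below removes exact duplicate sentences after Unicode NFC normalization and whitespace collapsing. The function names `normalize` and `deduplicate`, the choice of NFC, and the SHA-1 hashing are all assumptions made for this example, not the corpus authors' pipeline.

```python
import hashlib
import unicodedata


def normalize(sentence: str) -> str:
    """NFC-normalize and collapse whitespace so trivially different copies match.

    Assumption: this normalization is illustrative, not the corpus's actual step.
    """
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def deduplicate(sentences):
    """Yield each distinct normalized sentence once, in first-seen order."""
    seen = set()
    for sentence in sentences:
        cleaned = normalize(sentence)
        if not cleaned:
            continue  # skip lines that are empty after cleaning
        # Store a fixed-size digest rather than the full sentence to bound memory.
        digest = hashlib.sha1(cleaned.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield cleaned


# Hypothetical crawled lines: two differ only in spacing, one is empty.
crawled = ["नमस्ते  दुनिया", "नमस्ते दुनिया", "", "வணக்கம் உலகம்"]
unique = list(deduplicate(crawled))
```

Exact-match hashing only catches identical sentences; near-duplicates such as boilerplate with small edits would need a fuzzier method.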

Provider:

Microsoft India

Created:

2020-05-01
