IndicNLP Corpus
Source: arXiv. Updated: 2020-05-01. Indexed: 2024-06-21.
Download link:
https://github.com/ai4bharatindicnlp/indicnlp_corpus
Resource overview:
The IndicNLP Corpus is a large general-domain corpus created jointly by Microsoft India, IIT Madras, and AI4Bharat. It contains 2.7 billion words across 10 Indian languages. The text comes mainly from news websites and Wikipedia; it was collected by web crawling, then cleaned and deduplicated. The corpus is used to train word embedding models and to evaluate NLP tasks such as text classification and sentiment analysis, with the aim of advancing NLP research on Indian languages.
Provided by:
Microsoft India
Created:
2020-05-01



