rajeshradhakrishnan/malayalam_news
Source: Hugging Face. Updated 2022-07-04; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/rajeshradhakrishnan/malayalam_news
Description:
The IndicNLP News Article Classification Dataset is built on the IndicNLP text corpus and contains news articles with category labels in 9 languages. Classes are balanced within each language. The categories and the number of articles per category are: Bengali (Entertainment, Sports; 7,000 each), Gujarati (Business, Entertainment, Sports; 680 each), Kannada (Entertainment, Lifestyle, Sports; 10,000 each), Malayalam (Business, Entertainment, Sports, Technology; 1,500 each), Marathi (Entertainment, Lifestyle, Sports; 1,500 each), Odia (Business, Crime, Entertainment, Sports; 7,500 each), Punjabi (Business, Entertainment, Sports, Politics; 780 each), Tamil (Entertainment, Politics, Sports; 3,900 each), and Telugu (Entertainment, Business, Sports; 8,000 each).
Provided by:
rajeshradhakrishnan
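Summary of original information
IndicNLP News Article Classification Dataset: Overview
Dataset description
- Number of languages: 9
- Class balance: categories are evenly distributed within each language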
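The balance claim above can be checked on any split once its labels are available. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `is_balanced`, the `tolerance` parameter, and the toy label list are all assumptions made for illustration.

```python
from collections import Counter


def is_balanced(labels, tolerance=0.0):
    """Return True if the gap between the largest and smallest class
    is at most `tolerance` times the largest class count."""
    counts = Counter(labels)
    if not counts:
        return True  # an empty label list is trivially balanced
    largest = max(counts.values())
    smallest = min(counts.values())
    return (largest - smallest) <= tolerance * largest


# Hypothetical toy labels using the four Malayalam categories.
labels = ["business", "entertainment", "sports", "technology"] * 3
print(is_balanced(labels))  # True: every class appears 3 times
```

A non-zero `tolerance` (for example `0.05`) allows for small differences between classes, which can appear after a train/test split even when the full dataset is balanced.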
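Dataset statistics
| Language | Categories | Articles per category |
|---|---|---|
| Bengali | Entertainment, Sports | 7,000 |
| Gujarati | Business, Entertainment, Sports | 680 |
| Kannada | Entertainment, Lifestyle, Sports | 10,000 |
| Malayalam | Business, Entertainment, Sports, Technology | 1,500 |
| Marathi | Entertainment, Lifestyle, Sports | 1,500 |
| Odia | Business, Crime, Entertainment, Sports | 7,500 |
| Punjabi | Business, Entertainment, Sports, Politics | 780 |
| Tamil | Entertainment, Politics, Sports | 3,900 |
| Telugu | Entertainment, Business, Sports | 8,000 |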
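Because the classes are balanced, the nominal size of each language subset is the number of categories multiplied by the per-category count. The sketch below works this out from the table's figures; the names `STATS` and `total_articles` are illustrative, and the per-category counts look like rounded values, so the totals should be read as approximate.

```python
# Per-language (categories, articles per category), copied from the table.
STATS = {
    "Bengali": (["entertainment", "sports"], 7000),
    "Gujarati": (["business", "entertainment", "sports"], 680),
    "Kannada": (["entertainment", "lifestyle", "sports"], 10000),
    "Malayalam": (["business", "entertainment", "sports", "technology"], 1500),
    "Marathi": (["entertainment", "lifestyle", "sports"], 1500),
    "Odia": (["business", "crime", "entertainment", "sports"], 7500),
    "Punjabi": (["business", "entertainment", "sports", "politics"], 780),
    "Tamil": (["entertainment", "politics", "sports"], 3900),
    "Telugu": (["entertainment", "business", "sports"], 8000),
}


def total_articles(stats):
    """Nominal articles per language: number of categories x per-category count."""
    return {lang: len(cats) * per_class for lang, (cats, per_class) in stats.items()}


totals = total_articles(STATS)
print(totals["Malayalam"])   # 6000 (4 categories x 1,500)
print(sum(totals.values()))  # 125360 across all 9 languages
```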
Citation
- Title: AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages
- Authors: Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
- Year: 2020
- Venue: arXiv preprint arXiv:2005.00085



