rajeshradhakrishnan/malayalam_news

Name: rajeshradhakrishnan/malayalam_news
Creator: rajeshradhakrishnan
Published: 2022-07-04 05:57:19
License: No description available

Source: Hugging Face. Updated 2022-07-04; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/rajeshradhakrishnan/malayalam_news

Overview:

The IndicNLP News Article Classification Dataset is built from the IndicNLP text corpus and contains news articles with category labels in 9 languages. Classes are balanced within each language. The categories and the number of articles per category for each language are: Bengali (entertainment, sports; 7K per category), Gujarati (business, entertainment, sports; 680), Kannada (entertainment, lifestyle, sports; 10K), Malayalam (business, entertainment, sports, technology; 1.5K), Marathi (entertainment, lifestyle, sports; 1.5K), Odia (business, crime, entertainment, sports; 7.5K), Punjabi (business, entertainment, sports, politics; 780), Tamil (entertainment, politics, sports; 3.9K), and Telugu (entertainment, business, sports; 8K).

Provided by:

rajeshradhakrishnan

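Summary of original information

IndicNLP News Article Classification Dataset overview

Dataset description

Number of languages: 9
Balance: categories are evenly distributed within each language

Dataset statistics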

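Language	Categories	Articles per category
Bengali	entertainment, sports	7,000
Gujarati	business, entertainment, sports	680
Kannada	entertainment, lifestyle, sports	10,000
Malayalam	business, entertainment, sports, technology	1,500
Marathi	entertainment, lifestyle, sports	1,500
Odia	business, crime, entertainment, sports	7,500
Punjabi	business, entertainment, sports, politics	780
Tamil	entertainment, politics, sports	3,900
Telugu	entertainment, business, sports	8,000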
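
Because each language's categories are balanced, the table above is enough to derive two figures: the total number of articles per language (number of categories times articles per category) and the accuracy a uniform random guesser would reach (one over the number of categories). The sketch below works these out. It assumes the per-category counts are exact, although the source also writes them in rounded form such as 7K; the names `STATS` and `summarize` are illustrative and not part of the dataset.

```python
# Figures copied from the statistics table above.
# Assumption: per-category counts are treated as exact, although the
# source also gives them in rounded form (e.g. "7K", "1.5K").
STATS = {
    "Bengali": (["entertainment", "sports"], 7000),
    "Gujarati": (["business", "entertainment", "sports"], 680),
    "Kannada": (["entertainment", "lifestyle", "sports"], 10000),
    "Malayalam": (["business", "entertainment", "sports", "technology"], 1500),
    "Marathi": (["entertainment", "lifestyle", "sports"], 1500),
    "Odia": (["business", "crime", "entertainment", "sports"], 7500),
    "Punjabi": (["business", "entertainment", "sports", "politics"], 780),
    "Tamil": (["entertainment", "politics", "sports"], 3900),
    "Telugu": (["entertainment", "business", "sports"], 8000),
}


def summarize(stats):
    """Return {language: (total_articles, chance_accuracy)}."""
    summary = {}
    for language, (categories, per_category) in stats.items():
        total = len(categories) * per_category   # balanced classes
        chance = 1 / len(categories)             # uniform random guess
        summary[language] = (total, chance)
    return summary


if __name__ == "__main__":
    summary = summarize(STATS)
    for language, (total, chance) in summary.items():
        print(f"{language:<10} {total:>6} articles, chance accuracy {chance:.1%}")
    print("All languages:", sum(total for total, _ in summary.values()))
```

For Malayalam, the language this listing is named after, this gives 4 x 1,500 = 6,000 articles and a chance accuracy of 25%.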

Citation

Reference: AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages
Authors: Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
Year: 2020
Venue: arXiv preprint arXiv:2005.00085
