Mukhyansh

Name: Mukhyansh
Creator: Language Technologies Research Centre (LTRC), International Institute of Information Technology, Hyderabad
Published: 2023-11-29 23:49:24
License: No description available

Source: arXiv. Updated: 2023-11-29. Indexed: 2024-06-21.

Download link:

https://github.com/ltrc/Mukhyansh

Resource overview:

Mukhyansh is a headline generation dataset created by the Language Technologies Research Centre (LTRC) at the International Institute of Information Technology, Hyderabad, to address the scarcity of natural language processing (NLP) data for Indian languages. It contains more than 3.39 million pairs of news articles and their headlines across eight major Indian languages: Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. To collect the data, the research team built crawlers tailored to specific news websites, with the aim of keeping the data high in quality and diverse. The dataset is intended mainly for training and evaluating headline generation models for Indian languages, in support of research on low-resource language processing.
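The article and headline pairing described above can be illustrated with a minimal sketch. This is not the dataset's documented layout: the JSON Lines format and the field names "language", "article", and "headline" are assumptions made for illustration only, and the sample records are placeholders rather than real data. Check the repository for the actual file format before adapting it.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield article-headline records from JSON Lines text.

    Assumption: one JSON object per line with the hypothetical fields
    "language", "article", and "headline".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only records where both the article and the headline are non-empty.
        if record.get("article") and record.get("headline"):
            yield record


def count_by_language(records: Iterable[dict]) -> Counter:
    """Count how many usable pairs each language contributes."""
    return Counter(record["language"] for record in records)


# Placeholder records, not taken from the dataset.
sample = [
    json.dumps({"language": "Hindi", "article": "article text 1", "headline": "headline 1"}),
    json.dumps({"language": "Telugu", "article": "article text 2", "headline": "headline 2"}),
    json.dumps({"language": "Hindi", "article": "", "headline": "headline without article"}),
]

pairs = list(read_pairs(sample))
counts = count_by_language(pairs)
print(counts)
```

In a headline generation setup, each record's article would serve as the model input and its headline as the target. The per-language counts are a quick check on how the pairs are distributed across the eight languages.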

Provided by:

Language Technologies Research Centre (LTRC), International Institute of Information Technology, Hyderabad

Created:

2023-11-29
