Kannada Abstractive Text Summarization
Source: Mendeley Data (indexed 2026-05-21)
Download link:
https://data.mendeley.com/datasets/pfx79p84cj
Description:
KannadaSum-10K (full title: "KannadaSum-10K: A 10,000-Sample Dataset for Kannada Abstractive Text Summarization") is a Kannada-language natural language processing dataset developed for research on abstractive text summarization. It contains 10,000 article–summary pairs, each consisting of a Kannada article and its corresponding reference summary. It is designed to support the development, training, fine-tuning, and evaluation of machine learning and deep learning models for Kannada text summarization.
The dataset is organized into two main fields: article and Reference Summary. The article field contains the source Kannada text, typically written in a news-style or informative prose format. The Reference Summary field contains a concise Kannada summary that captures the central idea of the article. This structure makes the dataset suitable for supervised abstractive summarization, where a model learns to generate meaningful summaries rather than simply extracting sentences from the original text.
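As a rough illustration of the two-field structure described above, the following Python sketch reads article and summary pairs from a CSV source. The description does not state the distribution file format, so CSV is an assumption here, as are the helper name `read_pairs` and the in-memory sample rows, which are placeholders rather than dataset content. Only the column names `article` and `Reference Summary` come from the description.

```python
import csv
import io

# Column names as given in the dataset description.
ARTICLE_FIELD = "article"
SUMMARY_FIELD = "Reference Summary"


def read_pairs(handle):
    """Yield (article, summary) tuples from a CSV file-like object.

    Hypothetical helper: assumes a CSV with a header row containing the
    two field names above. Rows missing either field are skipped.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        article = (row.get(ARTICLE_FIELD) or "").strip()
        summary = (row.get(SUMMARY_FIELD) or "").strip()
        if article and summary:
            yield article, summary


# Tiny in-memory stand-in for the real file (placeholder text only).
sample = io.StringIO(
    "article,Reference Summary\n"
    "ಲೇಖನ ಒಂದು,ಸಾರಾಂಶ ಒಂದು\n"
    "ಲೇಖನ ಎರಡು,\n"  # empty summary, so this row is dropped
)
pairs = list(read_pairs(sample))
```

With a real file, the handle would typically be opened with `encoding="utf-8"` and `newline=""` so that Kannada text and embedded line breaks are read correctly.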
This dataset can be used for multiple research purposes, including Kannada abstractive summarization, low-resource language modeling, Indic NLP research, text generation, sequence-to-sequence learning, transformer-based model fine-tuning, and comparative evaluation of multilingual summarization models. It may be particularly useful for training models such as mT5, IndicBART, mBART, ByT5, MuRIL-based encoder-decoder systems, and other transformer architectures adapted for Indian languages.
The primary objective of KannadaSum-10K is to contribute a reusable Kannada summarization resource to the NLP research community. By providing article and reference-summary pairs in Kannada, the dataset aims to support improved summarization systems for regional-language digital content, news articles, educational material, and information-access applications. The dataset may also help researchers study the challenges of Kannada text generation, including morphology, sentence structure, semantic compression, and content selection.
Before using the dataset for benchmarking, users should perform appropriate preprocessing, quality checking, train–validation–test splitting, and duplicate removal if required. Proper citation of the dataset is requested when it is used in academic publications, experiments, or software systems.
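The duplicate removal and train–validation–test splitting mentioned above could be sketched as follows. This is a minimal example under stated assumptions, not a prescribed procedure: the function names, the 80/10/10 ratio, the seed, and the choice to treat articles as duplicates when they match after Unicode NFC normalization and whitespace collapsing are all illustrative. The demo records are synthetic placeholders.

```python
import random
import unicodedata


def normalize(text):
    # NFC normalization plus whitespace collapsing, so that strings which
    # differ only in code-point composition or spacing compare equal.
    return " ".join(unicodedata.normalize("NFC", text).split())


def deduplicate(pairs):
    """Keep the first occurrence of each article (compared after normalize)."""
    seen = set()
    unique = []
    for article, summary in pairs:
        key = normalize(article)
        if key not in seen:
            seen.add(key)
            unique.append((article, summary))
    return unique


def split(pairs, val_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle with a fixed seed and cut into train, validation, and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Synthetic placeholder records; the last one repeats "article 0" with extra spacing.
demo = [(f"article {i}", f"summary {i}") for i in range(20)]
demo.append(("article  0", "summary 0 again"))

unique = deduplicate(demo)
train, val, test = split(unique)
```

Deduplicating before splitting matters for benchmarking: if the same article lands in both the training and test sets, evaluation scores will be inflated. A fixed seed keeps the split reproducible across experiments.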
Created:
2026-05-15



