Dhivehi News Categories Dataset

Source: Mendeley Data (indexed 2026-04-09)

Download link:

https://data.mendeley.com/datasets/m397m9n99v

Description:

The Dhivehi News Categories Dataset addresses the lack of publicly available resources for Dhivehi, a low-resource language spoken in the Maldives. It makes it possible to train and evaluate machine learning (ML) algorithms such as k-Nearest Neighbors, Decision Trees, XGBoost, SVM, Naïve Bayes, Random Forest, and Artificial Neural Networks on Dhivehi text for tasks such as text classification and language modeling.

The dataset comprises 6,000 curated news articles from reputable sources (e.g., Sunmv, Haveeru, Raajje.mv) and is balanced across four categories: Business, Sports, Entertainment, and World News, with 1,500 articles each. Articles were collected using Python-based web scraping tools, cleaned to remove duplicates and irrelevant content, and manually categorized to produce high-quality structured data. Users who want more clearly separated topics can optionally exclude the "World News" category.

The dataset supports NLP tasks such as text classification, sentiment analysis, and topic modeling, and it is stored in UTF-8 for compatibility. Beyond NLP, it contributes to linguistic, cultural, and media studies. As a pioneering Dhivehi resource, it enables comparative cross-linguistic studies and helps ensure that underrepresented languages like Dhivehi are included in multilingual NLP and AI research.
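The balance check and the optional exclusion of "World News" described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the listing does not specify the dataset's file layout, so the sketch works on in-memory `(text, label)` pairs, and the helper names (`category_counts`, `is_balanced`, `exclude_category`) and the placeholder sample are hypothetical. Files from the dataset would typically be opened with `encoding="utf-8"`, since the description states the data is stored in UTF-8.

```python
from collections import Counter

# The four category names come from the dataset description.
CATEGORIES = ("Business", "Sports", "Entertainment", "World News")


def category_counts(records):
    """Count how many (text, label) records fall into each category."""
    return Counter(label for _text, label in records)


def is_balanced(records):
    """Return True when every category present has the same record count."""
    counts = category_counts(records)
    return len(set(counts.values())) <= 1


def exclude_category(records, excluded="World News"):
    """Drop one category, e.g. to keep only the three narrower topics."""
    return [(text, label) for text, label in records if label != excluded]


# Placeholder records standing in for real articles (NOT actual dataset content).
sample = [
    (f"placeholder article {i} for {label}", label)
    for label in CATEGORIES
    for i in range(2)
]

three_way = exclude_category(sample)
```

On the full dataset, `category_counts` would be expected to report 1,500 records per category if the files match the description, and `exclude_category` would leave a balanced three-class subset of 4,500 articles.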
