Dhivehi News Categories Dataset
Source: Mendeley Data. Indexed: 2026-04-09
Download link:
https://data.mendeley.com/datasets/m397m9n99v
Description:
The Dhivehi News Categories Dataset addresses the lack of publicly available resources for Dhivehi, a low-resource language spoken in the Maldives. It allows machine learning (ML) algorithms such as k-Nearest Neighbors, Decision Trees, XGBoost, SVM, Naïve Bayes, Random Forest, and Artificial Neural Networks to be applied to Dhivehi text for tasks such as text classification and language modeling. The dataset comprises 6,000 curated news articles from reputable sources (e.g., Sunmv, Haveeru, Raajje.mv) and is balanced across four categories: Business, Sports, Entertainment, and World News, with 1,500 articles each. Articles were collected with Python-based web scraping tools, cleaned to remove duplicates and irrelevant content, and manually categorized to produce high-quality structured data. The dataset supports NLP tasks such as text classification, sentiment analysis, and topic modeling. Its classes are evenly represented, and "World News" can optionally be excluded when sharper thematic separation between categories is wanted. The text is stored in UTF-8 for compatibility. Beyond NLP, the dataset contributes to linguistic, cultural, and media studies. As a pioneering Dhivehi resource, it enables comparative cross-linguistic studies and research on low-resource languages, and helps ensure that underrepresented languages like Dhivehi are included in AI and multilingual NLP work.
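The preparation steps described above (UTF-8 storage, duplicate removal, balanced categories, optional exclusion of "World News") can be sketched as follows. This is a minimal illustration only: the listing does not describe the file layout, so the CSV format, the column names `category` and `text`, and all helper function names are assumptions, and the sample rows are placeholders rather than real articles.

```python
import csv
import io
from collections import Counter

# The four categories named in the dataset description.
CATEGORIES = ["Business", "Sports", "Entertainment", "World News"]


def load_articles(handle):
    """Read (category, text) pairs from a CSV file object with a header row.

    The column names "category" and "text" are hypothetical.
    """
    reader = csv.DictReader(handle)
    return [(row["category"].strip(), row["text"].strip()) for row in reader]


def drop_duplicates(articles):
    """Keep the first occurrence of each article text, preserving order."""
    seen = set()
    unique = []
    for category, text in articles:
        if text not in seen:
            seen.add(text)
            unique.append((category, text))
    return unique


def exclude_category(articles, excluded):
    """Return the articles whose category differs from `excluded`."""
    return [(c, t) for c, t in articles if c != excluded]


def category_counts(articles):
    """Count articles per category, e.g. to verify that classes are balanced."""
    return Counter(category for category, _ in articles)


# Placeholder rows standing in for the real file. With an actual file, open it
# with open(path, encoding="utf-8", newline="") so Dhivehi text is decoded correctly.
SAMPLE_CSV = (
    "category,text\n"
    "Business,placeholder business article\n"
    "Sports,placeholder sports article\n"
    "Sports,placeholder sports article\n"
    "Entertainment,placeholder entertainment article\n"
    "World News,placeholder world news article\n"
)

articles = load_articles(io.StringIO(SAMPLE_CSV))
unique = drop_duplicates(articles)
three_class = exclude_category(unique, "World News")

print(category_counts(unique))
print(category_counts(three_class))
```

On the full dataset, `category_counts` would be expected to report 1,500 articles for each of the four categories, and dropping "World News" would leave a three-class problem with 4,500 articles.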



