Dhivehi News Categories Dataset

Source: Mendeley Data (indexed 2026-04-09)
Download link:
https://data.mendeley.com/datasets/m397m9n99v
Description:
The Dhivehi News Categories Dataset addresses the lack of publicly available resources for Dhivehi, a low-resource language spoken in the Maldives. It lets machine learning (ML) algorithms such as k-Nearest Neighbors, Decision Trees, XGBoost, SVM, Naïve Bayes, Random Forest, and Artificial Neural Networks process Dhivehi text for tasks such as text classification and language modeling. The dataset comprises 6,000 curated news articles from reputable sources (e.g., Sunmv, Haveeru, Raajje.mv) and is balanced across four categories: Business, Sports, Entertainment, and World News, with 1,500 articles each. Articles were collected with Python-based web scraping tools, cleaned to remove duplicates and irrelevant content, and manually categorized to produce high-quality structured data. The dataset supports NLP tasks such as text classification, sentiment analysis, and topic modeling, and "World News" can optionally be excluded when sharper thematic separation between categories is wanted. The text is stored in UTF-8 for compatibility. Beyond NLP, the dataset contributes to linguistic, cultural, and media studies. As a pioneering Dhivehi resource, it enables comparative cross-linguistic studies and research in computational linguistics, and it helps ensure that underrepresented languages like Dhivehi are included in global AI advancements.
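The cleaning and balance properties described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the (category, text) record layout, the helper names `deduplicate`, `category_counts`, and `is_balanced`, and the English placeholder texts are all assumptions made for illustration, since the description does not state the dataset's file layout.

```python
from collections import Counter

# Assumption: records are (category, text) pairs held in memory. The real
# dataset's file layout and column names are not given in the description.


def deduplicate(records):
    """Drop records whose whitespace-normalized text was already seen."""
    seen = set()
    unique = []
    for category, text in records:
        key = " ".join(text.split())  # collapse runs of whitespace
        if key not in seen:
            seen.add(key)
            unique.append((category, text))
    return unique


def category_counts(records, exclude=()):
    """Count articles per category, skipping categories listed in `exclude`."""
    return Counter(cat for cat, _ in records if cat not in exclude)


def is_balanced(counts):
    """Return True when every category has the same number of articles."""
    return len(set(counts.values())) <= 1


# Placeholder records standing in for real Dhivehi articles.
sample = [
    ("Business", "article one"),
    ("Sports", "article two"),
    ("Sports", "article  two"),  # duplicate once whitespace is normalized
    ("Entertainment", "article three"),
    ("World News", "article four"),
]

cleaned = deduplicate(sample)
counts = category_counts(cleaned)
# Optional three-category variant that leaves out "World News".
three_way = category_counts(cleaned, exclude={"World News"})
```

Run on the full dataset, a check like `is_balanced` would be expected to pass with 1,500 articles per category, assuming the published files match the description.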