SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization

Name: SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization
Creator: Mendeley
Published: 2025-04-01 06:40:01
License: No description available

Source: DataCite Commons. Updated: 2025-04-01. Indexed: 2025-04-16.

Download link:

https://data.mendeley.com/datasets/57zpx667y9

Description:

SANAD Dataset is a large collection of Arabic news articles that can be used in different Arabic NLP tasks such as Text Classification and Word Embedding. The articles were collected using Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona. All datasets have seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya which doesn’t have [Religion]. SANAD contains a total number of 190k+ articles.

How to use it:

1. Unzip compressed resources.
2. Each folder contains 6-7 sub-folders which are labeled by the category's name.
3. Each sub-folder contains a set of article files corresponding to its category.

SANAD_SUBSET is a balanced benchmark dataset (from SANAD) that is used in our research work. It contains the training (90%) and testing (10%) sets.

How to use it:

1. Unzip the compressed file.
2. There are 3 main folders containing the 3 datasets: Akhbarona, Khaleej, and Arabiya.
3. Each dataset-folder contains 2 sub-folders: training and testing.
4. The training and testing folders include the balanced categories sub-folders.
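
The folder layout described above (one sub-folder per category, one file per article) could be read into labeled samples roughly as follows. This is a minimal sketch, not the authors' loader: the function names `load_labeled_articles` and `count_by_label` are hypothetical, and the UTF-8 encoding of the article files is an assumption that should be checked against the unzipped data.

```python
from collections import Counter
from pathlib import Path


def load_labeled_articles(dataset_dir, encoding="utf-8"):
    """Return (text, label) pairs from a folder whose sub-folders are categories.

    Assumption: every file inside a category sub-folder is one article,
    and the files are UTF-8 encoded (undecodable bytes are replaced).
    """
    samples = []
    for category_dir in sorted(Path(dataset_dir).iterdir()):
        if not category_dir.is_dir():
            continue  # skip stray files next to the category folders
        for article in sorted(category_dir.iterdir()):
            if article.is_file():
                text = article.read_text(encoding=encoding, errors="replace")
                samples.append((text, category_dir.name))
    return samples


def count_by_label(samples):
    """Count articles per category, e.g. to confirm a split is balanced."""
    return Counter(label for _, label in samples)
```

For SANAD_SUBSET, the same function would be called once per split, for example on a path such as `SANAD_SUBSET/Khaleej/training` (the exact folder names and casing are assumed here; use whatever the unzipped archive actually contains). Calling `count_by_label` on the result is a quick way to check that the categories are balanced as described.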

Provider:

Mendeley

Created:

2019-03-22
