A Large-scale Arabic Articles Dataset for Multi-Group Text Classification
Source: Mendeley Data. Indexed: 2026-04-18
Download link:
https://data.mendeley.com/datasets/8f5rjtt4w7
Resource description:
Dataset Title: A Large-scale Hierarchical Arabic Articles Dataset for Multi-Group Text Classification
1. General Information
• Dataset Name: A Large-scale Hierarchical Arabic Articles Dataset for Multi-Group Text Classification.
• Principal Investigator: Suliman Al-Shossi.
• Co-Investigator: Akram Alsubari.
• Institution: Faculty of Science, University of Ibb, Yemen.
• Associated Publication: Al-Shossi, S., & Alsubari, A. (2026). Multi Group classification based on Arabic. International Journal on Advanced Electrical and Computer Engineering.
2. Dataset Overview
This dataset contains 75,282 Arabic articles, curated and preprocessed for Natural Language Processing (NLP) tasks. The articles are organized into a two-level hierarchy of 29 consolidated main categories and 600 distinct subcategories.
3. Data Source and Collection
The data was collected using automated web scraping techniques from four major Arabic content platforms:
• Mqall
• Mawdoo3
• Mhtwyat
• Mqalaty
The initial raw collection of 94,685 articles was refined through manual label normalization and down-sampling to ensure class balance.
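The down-sampling step can be illustrated with a minimal sketch. It assumes each article is a dict with the Main_Category and Sub_Category columns described in Section 4, and it uses the cap of 400 articles per subcategory stated in Section 5. The function name cap_per_subcategory, the use of random sampling, and the fixed seed are assumptions made for illustration; the description does not say how the retained articles were chosen.

```python
import random
from collections import defaultdict


def cap_per_subcategory(rows, cap=400, seed=0):
    """Keep at most `cap` rows per (Main_Category, Sub_Category) pair.

    Random selection is an assumption; the dataset description only
    states that each subcategory is capped, not how rows were picked.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for row in rows:
        key = (row["Main_Category"], row["Sub_Category"])
        groups[key].append(row)

    balanced = []
    for items in groups.values():
        if len(items) > cap:
            items = rng.sample(items, cap)  # down-sample oversized groups
        balanced.extend(items)  # smaller groups are kept whole
    return balanced
```

Groups are keyed by the (main category, subcategory) pair rather than by subcategory name alone, so that a subcategory name reused under two main categories would not be merged.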
4. File Description
• Arabic_Articles_Dataset.csv: The main dataset file. It includes the following columns:
o Title: The original title of the article.
o Content: The cleaned and normalized textual content of the article.
o Main_Category: The top-level category label.
o Sub_Category: The specific sub-level category label.
• Network_Graph.png: A visualization (Fig. 2) showing the hierarchical relationships and semantic clusters between categories.
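As a rough illustration of how the CSV file might be read, the sketch below uses only the Python standard library. The column names come from the list above; the helper names load_articles and build_hierarchy are hypothetical, and the UTF-8 encoding shown in the usage comment is an assumption that should be checked against the actual file.

```python
import csv
from collections import defaultdict

# Column names as listed in the file description.
EXPECTED_COLUMNS = ["Title", "Content", "Main_Category", "Sub_Category"]


def load_articles(lines):
    """Parse CSV text (an open file or any iterable of lines) into dicts."""
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


def build_hierarchy(rows):
    """Map each main category to the set of subcategories seen under it."""
    hierarchy = defaultdict(set)
    for row in rows:
        hierarchy[row["Main_Category"]].add(row["Sub_Category"])
    return dict(hierarchy)


# Usage (assumes the file is UTF-8 encoded and in the working directory):
# with open("Arabic_Articles_Dataset.csv", encoding="utf-8", newline="") as f:
#     rows = load_articles(f)
# hierarchy = build_hierarchy(rows)
# print(len(rows), len(hierarchy))
```

If the file matches the overview, the row count and the number of main categories should come out as 75,282 and 29. Very long Content fields may require raising the parser limit with csv.field_size_limit.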
5. Preprocessing & Normalization
The textual content has been standardized using the following steps:
• Text Cleaning: Removal of Arabic stopwords, diacritics, punctuation, and non-Arabic characters.
• Normalization: Standardizing Alif forms (أ, إ, آ to ا) and converting Ta Marbuta (ة) to Ha (ه).
• Class Balancing: Each subcategory is capped at 400 articles to minimize model bias.
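The cleaning and normalization steps listed above can be sketched as follows, for anyone who needs to apply comparable preprocessing to new text before inference. This is an approximation under stated assumptions, not the authors' pipeline: the order of operations, the exact Unicode ranges, and the four-word stopword set are illustrative choices, and a real stopword list would be far longer.

```python
import re

# Tashkeel marks U+064B..U+0652, superscript Alif U+0670, and tatweel U+0640.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alif with hamza above, hamza below, and madda (the three forms named above).
ALIF_FORMS = re.compile(r"[\u0623\u0625\u0622]")
# Anything that is not a basic Arabic letter or whitespace. This drops
# punctuation, digits, and Latin text in a single pass.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Illustrative stopwords only (an assumption), stored in normalized form.
STOPWORDS = frozenset({
    "\u0641\u064A",        # fi ("in")
    "\u0645\u0646",        # min ("from")
    "\u0639\u0644\u0649",  # ala ("on")
    "\u0627\u0644\u0649",  # ila ("to"), already Alif-normalized
})


def normalize_arabic(text, stopwords=STOPWORDS):
    """Apply the cleaning and normalization steps to one string."""
    text = DIACRITICS.sub("", text)            # strip diacritics
    text = ALIF_FORMS.sub("\u0627", text)      # unify Alif forms to bare Alif
    text = text.replace("\u0629", "\u0647")    # Ta Marbuta -> Ha
    text = NON_ARABIC.sub(" ", text)           # drop non-Arabic characters
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)
```

Stopwords are removed after normalization here, so the stopword set must itself be stored in normalized form. Note that the published Content column is already normalized, so this function is only relevant for text from outside the dataset.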
6. Usage and Reuse Potential
This dataset is designed for:
• Training and evaluating Hierarchical Text Classification models.
• Benchmarking Generative (e.g., AraT5) and Direct (e.g., AraGPT-2) classification approaches.
• Developing Arabic-specific topic modeling and semantic analysis tools.
7. Ethical Considerations
The data consists of publicly available content collected in accordance with the terms of use of the source websites. All participant-related data has been fully anonymized.
Created:
2026-02-10



