ARPD: The Academic Arabic Research Papers Dataset (corpus).
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/10659156
Resource description:
ARPD: The Academic Arabic Research Papers Dataset.
This corpus contributes to Arabic language research by providing a novel academic dataset that can be used for purposes such as training NLP models and conducting text analysis. It contains papers written in Arabic from seven fields: Arabic, religion, art, law, business, education, and agriculture. The dataset therefore has seven classes, one per field, and 2,133 documents in total. It is published in several formats (PDF files, text files, and CSV files) to benefit the Arabic research community.
Arabic article class    # papers
Art                     303
Law                     301
Business                313
Religion                304
Agricultural            308
Arabic                  303
Education               301
Total                   2133
The dataset contains:
1. All the PDF files, classified into their appropriate classes. (PDF.zip)
2. The text files obtained after converting the PDFs and processing them. (Text file.zip)
3. A new version of the text files after additional preprocessing. The preprocessing includes Arabic normalization for alef, teh, and ligatures; removal of tashkeel, harakat, tatweel, and shadda; and stop-word removal. (TextWithPreprocess.zip)
4. CSV files with two columns: the first contains the paper and the second contains its class. (ARPD.csv, ARPD-prepocess.csv)
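The preprocessing described in item 3 could look roughly like the Python sketch below. It is an illustration only, not the dataset authors' actual pipeline: the function name `preprocess`, the order of the steps, the mapping of teh marbuta to heh, the set of lam-alef ligatures handled, and the tiny stop-word list are all assumptions made for the example.

```python
import re

# Tashkeel/harakat U+064B..U+0652; this range includes shadda (U+0651).
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character
# Alef with madda, hamza above, hamza below -> bare alef (assumed mapping).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALEF = "\u0627"
TEH_MARBUTA = "\u0629"
HEH = "\u0647"  # assumed target for teh normalization

# Lam-alef presentation-form ligatures -> lam + bare alef (assumed set).
LAM_ALEF = "\u0644\u0627"
LIGATURES = {code: LAM_ALEF for code in range(0xFEF5, 0xFEFD)}

# Hypothetical, deliberately tiny stop-word list: "fi", "min", "ala".
STOP_WORDS = frozenset({"\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649"})


def preprocess(text: str, stop_words: frozenset = STOP_WORDS) -> str:
    """Normalize Arabic text and drop stop words (illustrative only)."""
    text = text.translate(LIGATURES)          # expand ligatures
    text = DIACRITICS.sub("", text)           # strip tashkeel and shadda
    text = text.replace(TATWEEL, "")          # strip tatweel
    text = ALEF_VARIANTS.sub(BARE_ALEF, text) # unify alef forms
    text = text.replace(TEH_MARBUTA, HEH)     # unify teh marbuta
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)
```

Calling `preprocess` on each paper's text would yield a normalized, whitespace-joined string. A real stop-word list would be far larger, and the authors' exact normalization rules may differ from the ones assumed here.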
Created:
2025-03-24



