NADiA: News Articles Dataset in Arabic for Multi-Label Text Categorization

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/hhrb7phdyx
Description:
The NADiA dataset is, to the best of our knowledge, the largest source of Arabic textual data that can be used in NLP tasks such as text classification. We chose the abbreviation NADiA because it is a common Arabic name.

The data was collected by scraping the ‘SkyNewsArabia’ and ‘Masrawy’ news websites with Python scripts tailored to each website. The SkyNewsArabia data is referred to as NADiA1 and the Masrawy data as NADiA2. As collected, NADiA1 contained 37,445 files and NADiA2 contained 678,563 files. After filtering and cleaning, these numbers were reduced to 35,416 and 451,230 files, respectively.

NADiA1 consists of the following 24 categories (displayed in English for easy referencing): News, North Africa, Levant, Middle East, The Americas, Research, Finance & Economy, War & Terrorism, Gulf, Europe, Political Figures, Iran, Technology, Russia, Sports, Tennis, Football, English League, Arabian Sports, Spanish League, Health, East Asia, Environment, Other Countries.

NADiA2 consists of the following 28 categories (displayed in English for easy referencing): Politics, Middle East, Asia, Africa, United States, Europe, Other Countries, Leaders, Sports, Arabian Sports, Football Clubs, Spanish League, Egyptian League, Finance, Arts, Cinema & TV, Fashion, Health, Pregnancy & Delivery, Cancer, Obesity, Social Media, Technology, Religion, Islamic, Fatawa, Worship, Prophet Biography.
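Because the dataset targets multi-label categorization, each article may carry several of the categories listed above at once. The following is a minimal sketch, not the authors' pipeline, of how such label sets could be turned into multi-hot vectors using the NADiA1 category names. The helper `multi_hot` and the sample label set are hypothetical, since the on-disk file and label format is not described here. The sketch also works out the share of files kept after filtering from the counts quoted above.

```python
# NADiA1 category names, copied from the description (24 in total).
NADIA1_CATEGORIES = [
    "News", "North Africa", "Levant", "Middle East", "The Americas",
    "Research", "Finance & Economy", "War & Terrorism", "Gulf", "Europe",
    "Political Figures", "Iran", "Technology", "Russia", "Sports", "Tennis",
    "Football", "English League", "Arabian Sports", "Spanish League",
    "Health", "East Asia", "Environment", "Other Countries",
]

# Map each category name to a fixed position in the label vector.
INDEX = {name: i for i, name in enumerate(NADIA1_CATEGORIES)}


def multi_hot(labels):
    """Return a 0/1 list with a 1 at the position of every given label.

    Hypothetical helper: raises KeyError for a label not in NADIA1_CATEGORIES.
    """
    vector = [0] * len(NADIA1_CATEGORIES)
    for label in labels:
        vector[INDEX[label]] = 1
    return vector


# Assumed example: one article tagged with three categories.
example_labels = ["Sports", "Football", "Spanish League"]
vector = multi_hot(example_labels)

# File counts (before, after) filtering, as quoted in the description.
counts = {"NADiA1": (37445, 35416), "NADiA2": (678563, 451230)}
retention = {name: kept / raw for name, (raw, kept) in counts.items()}

for name, share in retention.items():
    print(f"{name}: {share:.1%} of files kept after cleaning")
```

The same approach would apply to the 28 NADiA2 categories by swapping in that list. A fixed category order matters here, because the position of each 1 in the vector is what identifies the label.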
Created:
2019-09-02