NADCG: New Arabic dataset for text classification and generation
Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://data.mendeley.com/datasets/mrh6fy2dkj
Resource description:
- NADCG
New Arabic dataset for text classification and generation.
- NADCG
2,136,311 rows.
- NADCG is a large collection of Arabic news headlines, categories, and articles that can be used in several NLP tasks.
- NADCG tasks
Text generation, text classification, summarization, and producing word embeddings.
- NADCG fields
Headline, summary, article, and category.
- NADCG is larger than other datasets: it contains 2,136,311 classified news items in UTF-8 encoding and CSV format.
- NADCG contains a vast number of Arabic news articles in eight categories (Politics, Economics, Sports, Health, Technology, Culture, Arts, Accidents). In general, the corpus adopts the label each article carried in its source news portal.
In summary, NADCG's large size and variety of fields set it apart: it can be used for many tasks, including training large transformer models, and it is freely available.
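Since the corpus is described as a UTF-8 CSV file with four fields, a row-by-row pass over it might look like the minimal sketch below. The lowercase column names (`headline`, `summary`, `article`, `category`) are assumptions inferred from the field list and may not match the real header, and the sample rows are made up for illustration, not taken from NADCG.

```python
import csv
import io
from collections import Counter

# Assumed column names, inferred from the field list; the real header may differ.
FIELDS = ("headline", "summary", "article", "category")

def count_categories(lines):
    """Stream CSV rows from an iterable of lines and tally the category column."""
    reader = csv.DictReader(lines)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    counts = Counter()
    for row in reader:
        counts[row["category"]] += 1
    return counts

# Tiny in-memory stand-in for the real file (made-up rows, not from NADCG).
sample = io.StringIO(
    "headline,summary,article,category\n"
    "عنوان أول,ملخص,نص المقال,Sports\n"
    "عنوان ثان,ملخص,نص المقال,Politics\n"
    "عنوان ثالث,ملخص,نص المقال,Sports\n"
)
sample_counts = count_categories(sample)
print(sample_counts)
```

For the downloaded file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the CSV was saved. Reading one row at a time keeps memory use flat, which matters for a file with over two million rows.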
- NADCG_SUBSET is a balanced benchmark dataset of 80K rows, drawn from NADCG and used in our research work. It is divided into training (90%), validation (5%), and testing (5%) sets.
Training set size: 72,000 rows; validation set size: 4,000 rows; testing set size: 4,000 rows.
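The split sizes follow directly from the percentages: 90% of 80,000 is 72,000, and 5% is 4,000. The sketch below checks that arithmetic and shows one way a 90/5/5 split could be drawn so that each category keeps the same proportions. The description does not say how the published split was produced, so the per-category shuffling, the seed, and the function names here are assumptions for illustration only.

```python
import random
from collections import defaultdict

CATEGORIES = ["Politics", "Economics", "Sports", "Health",
              "Technology", "Culture", "Arts", "Accidents"]

def split_sizes(total, train_pct=90, val_pct=5):
    """Integer split sizes; the test set takes whatever remains."""
    train = total * train_pct // 100
    val = total * val_pct // 100
    return train, val, total - train - val

def stratified_split(labels, seed=0):
    """Return (train, val, test) index lists, split 90/5/5 inside each category."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train, n_val, _ = split_sizes(len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test

print(split_sizes(80_000))  # (72000, 4000, 4000)

# Synthetic balanced labels: 100 per category, 800 in total.
labels = [c for c in CATEGORIES for _ in range(100)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 720 40 40
```

Splitting within each category, rather than over the whole pool, keeps the validation and testing sets as balanced as the training set.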
Created:
2024-09-05



