
Ultimate Arabic News Dataset

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/jz56k5wxz7
Description:
The Ultimate Arabic News Dataset is a collection of single-label Modern Arabic texts of the kind used on news websites and in press articles. The data was collected by web scraping from well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), from news surfaced by the Google search engine, and from various other sources.

- The data consists of two primary files:
UltimateArabic: A file containing more than 193,000 original Arabic news texts without pre-processing. The texts contain words, numbers, and symbols; removing the numbers and symbols during pre-processing can improve accuracy when the dataset is used in Arabic natural language processing tasks such as text classification.
UltimateArabicPrePros: A file containing the data from the first file after pre-processing, which leaves about 188,000 text documents. Stop words, non-Arabic words, symbols, and numbers have been removed, so this file is ready for direct use in Arabic natural language processing tasks such as text classification.

- Two folders contain additional detailed datasets:
1- Sample: This folder contains samples of the web-scraping results for two popular Arabic websites in two different news categories, Sports and Politics. It contains two datasets:
Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.
Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.
2- Dataset Versions: This folder contains four versions of the original dataset, from which the appropriate version can be selected for text classification.
Original: The raw data without any pre-processing, so the number of tokens is very high.
Original_without_Stop: The data was cleaned by removing symbols, numbers, non-Arabic words, and stop words, so the number of tokens is greatly reduced.
Original_with_Stem: The data was cleaned and stemming was applied to strip the affixes that might affect the accuracy of results, reducing words to their roots.
Original_Without_Stop_Stem: All pre-processing techniques (data cleaning, stop-word removal, and stemming) were applied, so this version has the lowest number of tokens of the four.

- The data is divided into 10 categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical, and Religion.
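
The cleaning steps named in the description (removing symbols, numbers, non-Arabic words, and stop words) can be illustrated with a minimal Python sketch. This is not the dataset authors' pipeline: the function name clean_text, the tiny STOP_WORDS set, and the choice to strip diacritics and tatweel before filtering are all assumptions made for illustration, and the description does not say which stop-word list was actually used.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption):
# the description does not state which list the authors used.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "أن", "هذا"}

# Diacritics (U+064B to U+0652) and tatweel (U+0640) are deleted outright
# so that a vocalised word is not split into fragments (assumption).
MARKS = re.compile(r"[\u064B-\u0652\u0640]")

# Any run of characters outside the basic Arabic letter range
# (U+0621 to U+064A): Latin words, digits, punctuation, symbols.
NON_ARABIC = re.compile(r"[^\u0621-\u064A]+")


def clean_text(text: str) -> str:
    """Keep Arabic letters only, then drop stop words."""
    without_marks = MARKS.sub("", text)
    letters_only = NON_ARABIC.sub(" ", without_marks)
    tokens = letters_only.split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```

Applied to a raw document from UltimateArabic, a function like this would return a space-separated string of Arabic tokens, which is roughly the shape the description gives for UltimateArabicPrePros and Original_without_Stop. Stemming, as used in the two stemmed versions, is a separate step that this sketch does not cover.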

Created:
2022-07-04