
KDC-4007 text dataset (Sports, Religion, Art, Economics, Education, Social, Style, and Health)

Indexed by Payititi (帕依提提) on 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-26123.html
Resource description:
Data Set Information:
The most important features of this dataset are that it is simple to use and well documented, so it can be widely used for text analysis research on Kurdish Sorani news and articles. The documents fall into eight categories: Sports, Religion, Art, Economics, Education, Social, Style, and Health. Each category consists of 500 text documents, and the total size of the corpus is 4,007 text files. The dataset and its documentation are freely accessible so that experimental evaluation results can be reproduced.

Attribute Information:
There are four collections:
- ST-Ds: only stop-word elimination is performed, using the Kurdish preprocessing-step approach.
- Pre-Ds: the full Kurdish preprocessing-step approach is applied.
- Pre+TW-Ds: TF×IDF term weighting is applied to the Pre-Ds dataset.
- Orig-Ds: the original dataset, with no processing applied.

Relevant Papers:
[1] Arazo M. Mustafa and Tarik A. Rashid, "Kurdish Stemmer Pre-processing Steps for Improving Information Retrieval", Journal of Information Science, first published January 1, 2017, doi: 10.1177/0165551516683617.
[2] Tarik A. Rashid, Arazo M. Mustafa and Ari M. Saeed, 2017. "A Robust Categorization System for Kurdish Sorani Text Documents". Information Technology Journal, 16: 27-34.
[3] Tarik A. Rashid, Arazo M. Mustafa and Ari M. Saeed, "Automatic Kurdish Text Classification Using KDC 4007 Dataset", accepted in Springer book. Series title: Lecture Notes on Data Engineering and Communications Technologies. Book title: Advances in Internetworking, Data & Web Technologies. Indexing: the books of this series are submitted to ISI Proceedings, EI, Scopus, metaPress, Springerlink, 2017.

Contact:
Arazo M. Mustafa (arazo.2007 '@' yahoo.com), School of Computer Science, University of Sulaimania, Kurdistan, Iraq
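The TF×IDF term weighting mentioned for the Pre+TW-Ds collection can be illustrated with a minimal sketch. It assumes the common formulation (relative term frequency multiplied by the natural log of the inverse document frequency); the exact variant used to build Pre+TW-Ds is not stated here, and the function name `tf_idf` and the sample tokens are hypothetical placeholders, not taken from the dataset.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation: weight = (count / doc_length) * ln(N / df),
    where N is the number of documents and df is the number of
    documents containing the term. This is one common variant only.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical placeholder tokens standing in for preprocessed documents.
docs = [
    ["sport", "match", "team"],
    ["health", "doctor", "team"],
    ["sport", "goal"],
]
weights = tf_idf(docs)
```

Under this formulation, a term that occurs in every document receives a weight of zero, while terms concentrated in a few documents are weighted more heavily.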
Provider:
Payititi (帕依提提)