TTC-3600: Benchmark Dataset for Turkish Text Categorization

Name: TTC-3600: Benchmark Dataset for Turkish Text Categorization
Creator: 帕依提提
License: No description available

Indexed by 帕依提提 on 2024-03-04

Download link:

https://www.payititi.com/opendatasets/show-26291.html

Resource description:

Assist. Prof. Dr. Deniz KILINÇ, Faculty of Technology, Celal Bayar University, Turkey, drdenizkilinc'@'gmail.com

Data Set Information:

The dataset consists of 3,600 documents in total: 600 news texts from each of six categories (economy, culture-arts, health, politics, sports and technology), obtained from six well-known news portals and agencies (Hurriyet, Posta, Iha, HaberTurk, Radikal and Zaman). The documents were collected between May and July 2015 via Rich Site Summary (RSS) feeds from the six categories of the respective portals. All JavaScript, HTML tags (<img>, <a>, <p>, <strong>, etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising were removed.

Three additional versions of TTC-3600 were created by applying different stemming methods. In every version, removal-based pre-processing (explained in detail in Section 3.2 of the paper cited below) is applied first. Turkish stop-words that have no discriminatory power for text categorization (pronouns, prepositions, conjunctions, etc.) are then removed from every version except the original one. The study uses a semi-automatically constructed stop-word list of 147 words.

Attribute Information:

ARFF (Attribute-Relation File Format), the Weka format

Relevant Papers:

[Web link]

Citation Request:

Kılınç, Deniz, et al. 'TTC-3600: A new benchmark dataset for Turkish text categorization.' Journal of Information Science (2015): 0165551515620551.
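The removal-based pre-processing and stop-word filtering described in the data set information can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the regular expressions, the Turkish-aware lowercasing step and the six-word SAMPLE_STOP_WORDS set are all assumptions made for illustration; the actual 147-word stop-word list and the stemming methods are not reproduced here.

```python
import html
import re

# Hypothetical sample only; the real 147-word list from the paper is not reproduced.
SAMPLE_STOP_WORDS = {"ve", "ile", "bu", "için", "de", "da"}

# Script blocks are removed together with their content, other tags without it.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is neither a word character nor whitespace (punctuation,
# operators, most control characters), plus the underscore.
NON_WORD_RE = re.compile(r"[^\w\s]|_")


def turkish_lower(text):
    """Lowercase with Turkish casing rules: "I" -> "ı" and "İ" -> "i"."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def clean_document(raw, stop_words=None):
    """Return the cleaned tokens of one raw news item.

    With stop_words=None only removal-based cleaning is applied, which
    mirrors the "original" version; passing a set also drops stop-words.
    Lowercasing is an assumption of this sketch, not stated in the source.
    """
    text = SCRIPT_RE.sub(" ", raw)       # drop JavaScript blocks
    text = TAG_RE.sub(" ", text)         # drop remaining HTML tags
    text = html.unescape(text)           # decode entities such as &amp;
    text = NON_WORD_RE.sub(" ", text)    # drop punctuation and operators
    tokens = turkish_lower(text).split()
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    return tokens


if __name__ == "__main__":
    raw = "<p>Ekonomi <strong>ve</strong> SPOR haberleri!</p><script>var x = 1;</script>"
    print(clean_document(raw))
    print(clean_document(raw, SAMPLE_STOP_WORDS))
```

Keeping the stop-word set as an optional argument lets the same function produce both the original version (no stop-word removal) and the input to the stemmed versions.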

Provided by:

帕依提提