TF-IDF weighted bag-of-words preprocessed text documents from Simple English Wikipedia

Name: TF-IDF weighted bag-of-words preprocessed text documents from Simple English Wikipedia
Creator: Gdańsk University of Technology
Published: 2024-06-10 16:10:45
License: No description available

Source: DataCite Commons. Updated: 2024-06-10. Indexed: 2024-07-13.

Download link:

https://mostwiedzy.pl/en/open-research-data/tf-idf-weighted-bag-of-words-preprocessed-text-documents-from-simple-english-wikipedia,42511260848405-0

Description:

The SimpleWiki2K-scores dataset contains TF-IDF weighted bag-of-words representations of preprocessed text documents (the feature matrix; raw strings are not available) and their multi-label assignments (the label matrix). Label scores for each document are also provided for two classifiers: an enhanced multi-label KNN [1] and LEML [2]. The aim of the dataset is to establish a benchmark for the score thresholding methods that are needed to turn scores into multi-label predictions.

Original source of data and preprocessing:

Simple English Wikipedia (dump from 2012-05-07) is the source of the text documents and category assignments. All articles from the main categories were taken, up to level 5 of the category hierarchy. Then, all categories with fewer than 10 articles were removed from the dataset, and articles left without any assignment were removed as well. This category/article removal process was repeated until there were no categories with fewer than 10 documents and no documents without at least one category assignment. A bag-of-words representation of the documents was used with the TF-IDF weighting scheme.

The dataset is split into train and test parts. Additionally, the train part is subdivided into 10 validation folds. All these partitions were obtained using the iterative multi-label stratification algorithm [3].

For each validation fold, the scores from the KNN and LEML classifiers are provided for both the train data part and the validation data part, computed after the classifiers were trained on that fold. Scores for the test part are also provided, computed after the classifiers were trained on the whole training split.

The feature and label matrices, as well as all the scores provided, are Python scipy.sparse.csr_matrix matrices saved in the npz format. All these objects can be loaded using the scipy.sparse.load_npz(fp) function.

References:

[1] Han X, Li S, Shen Z. A k-NN method for large scale hierarchical text classification at LSHTC3. In: Proceedings of the 2012 ECML/PKDD Discovery Challenge Workshop on Large-Scale Hierarchical Text Classification, Bristol 2012.
[2] Yu HF, Jain P, Kar P, Dhillon I. Large-scale multi-label learning with missing labels. In: International conference on machine learning 2014 Jan 27 (pp. 593-601). PMLR.
[3] Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III 22 2011 (pp. 145-158). Springer Berlin Heidelberg.

Whole dataset (train and test parts jointly) summary:

nDocs = 67505, nLabels = 1849, nFeatures = 97179

Label_mtx: type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 1849), nnz=108076, dtype=int32

Documents per label (min=10, mean=58.45105462412115, max=10307):
  in [0.0-1.0): 0 items
  in [1.0-3.0): 0 items
  in [3.0-10.0): 0 items
  in [10.0-30.0): 1363 items
  in [30.0-100.0): 403 items
  in [100.0-300.0): 48 items
  in [300.0-1000.0): 21 items
  in [1000.0-3000.0): 8 items
  in [3000.0-inf): 6 items

Labels per document (min=1, mean=1.6010073327901637, max=14):
  in [0.0-1.0): 0 items
  in [1.0-3.0): 59639 items
  in [3.0-10.0): 7853 items
  in [10.0-30.0): 13 items
  in [30.0-100.0): 0 items
  in [100.0-300.0): 0 items
  in [300.0-1000.0): 0 items
  in [1000.0-3000.0): 0 items
  in [3000.0-inf): 0 items

Features_mtx: type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 97179), nnz=4158000, dtype=float32

Documents per feature (min=2, mean=42.787021887444816, max=23616):
  in [0.0-1.0): 0 items
  in [1.0-3.0): 32381 items
  in [3.0-10.0): 39198 items
  in [10.0-30.0): 13271 items
  in [30.0-100.0): 7070 items
  in [100.0-300.0): 2945 items
  in [300.0-1000.0): 1560 items
  in [1000.0-3000.0): 568 items
  in [3000.0-inf): 186 items

Features per document (min=1, mean=61.59543737500926, max=2735):
  in [0.0-1.0): 0 items
  in [1.0-3.0): 47 items
  in [3.0-10.0): 10085 items
  in [10.0-30.0): 18785 items
  in [30.0-100.0): 28828 items
  in [100.0-300.0): 7976 items
  in [300.0-1000.0): 1731 items
  in [1000.0-3000.0): 53 items
  in [3000.0-inf): 0 items
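
Usage sketch (not part of the original dataset description): the description says that every matrix is a scipy.sparse.csr_matrix stored in npz format and that the scores are meant for benchmarking thresholding methods. The Python sketch below illustrates that workflow on tiny synthetic matrices. The file names, the helper functions threshold_scores and micro_f1, and the single global threshold of 0.5 are assumptions made for illustration only; the description does not list the dataset's actual file names or prescribe any thresholding method.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Tiny synthetic stand-ins for one label matrix and one score matrix.
# These values are made up; they are NOT taken from the dataset.
labels = sparse.csr_matrix(np.array([[1, 0, 1],
                                     [0, 1, 0]], dtype=np.int32))
scores = sparse.csr_matrix(np.array([[0.9, 0.2, 0.6],
                                     [0.1, 0.7, 0.4]], dtype=np.float32))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; substitute the dataset's own .npz files.
    labels_fp = os.path.join(tmp, "labels_example.npz")
    scores_fp = os.path.join(tmp, "scores_example.npz")
    sparse.save_npz(labels_fp, labels)
    sparse.save_npz(scores_fp, scores)

    # The loading call named in the description.
    y_true = sparse.load_npz(labels_fp)
    y_score = sparse.load_npz(scores_fp)


def threshold_scores(score_mtx, t):
    """Return a binary CSR matrix with 1 wherever a stored score is >= t.

    t must be positive so that implicit zeros are never predicted.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    pred = score_mtx.tocsr().copy()
    pred.data = (pred.data >= t).astype(np.int32)
    pred.eliminate_zeros()  # drop entries that fell below the threshold
    return pred


def micro_f1(true_mtx, pred_mtx):
    """Micro-averaged F1 for binary sparse matrices: 2*TP / (|true| + |pred|)."""
    tp = true_mtx.multiply(pred_mtx).sum()
    denom = true_mtx.sum() + pred_mtx.sum()
    return 2.0 * tp / denom if denom else 0.0


# One global threshold, the simplest possible baseline.
y_pred = threshold_scores(y_score, 0.5)
print(y_pred.toarray())
print(micro_f1(y_true, y_pred))
```

With the real data, the same two calls to scipy.sparse.load_npz would presumably be pointed at a fold's label and score files, and the threshold would be tuned on the validation part before being applied to the test scores.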

Provided by:

Gdańsk University of Technology

Created:

2023-04-25
