NYtimes_train_test_set.hdf5
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12760692
Description:
The NYtimes dataset, part of the Bags of Words dataset from the UCI repository, is a collection of New York Times news articles represented as bags of words. The dataset is organised as a document–word matrix: each row corresponds to a document, each column to a unique word extracted from the articles, and each value gives the number of times that word occurs in that document. Preprocessing includes tokenization, stopword removal, and vocabulary truncation, retaining only words that occur more than ten times.
Created:
2024-07-17
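
The document–word matrix in the description can be illustrated with a minimal Python sketch. It is not the pipeline used to produce this dataset: the tokenizer, the stopword list, and the function name `build_doc_word_matrix` are hypothetical, and the sketch assumes the "more than ten times" threshold refers to total occurrences across the whole corpus, which the description does not specify.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in"}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_doc_word_matrix(docs, stopwords=STOPWORDS, min_count=10):
    """Return (vocab, matrix); matrix[i][j] counts vocab[j] in docs[i].

    Assumption: a word is kept only if its total count across all
    documents is strictly greater than min_count.
    """
    # Tokenize each document and drop stopwords.
    tokenized = [[t for t in tokenize(d) if t not in stopwords] for d in docs]

    # Vocabulary truncation based on corpus-wide counts.
    totals = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(w for w, c in totals.items() if c > min_count)
    index = {w: j for j, w in enumerate(vocab)}

    # One row per document, one column per retained word.
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for t in doc:
            j = index.get(t)
            if j is not None:
                row[j] += 1
        matrix.append(row)
    return vocab, matrix
```

For a toy corpus such as `["The market rose in the city", "City market and the mayor", "The mayor of the city"]` with `min_count=1`, the retained vocabulary is `["city", "market", "mayor"]` and the rows are `[1, 1, 0]`, `[1, 1, 1]`, and `[1, 0, 1]`. A real corpus of this size would normally be stored as a sparse matrix rather than nested lists.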



