Datasets of word network topic model

Name: Datasets of word network topic model
Creator: figshare
Published: 2025-05-01 06:55:56
License: No description available

DataCite Commons | Updated: 2025-05-01 | Indexed: 2024-07-25

Download link:

https://figshare.com/articles/dataset/Datasets_of_word_network_topic_model/5572588/1

Description:

Abstract: This dataset holds the content of one day's micro-blogs sampled from Weibo (http://weibo.com) in the form of bags-of-words.

Data Set Characteristics: Text
Number of Micro-blogs: 189,223
Total Number of Words: 3,252,492
Size of the Vocabulary: 20,942
Associated Tasks: short text topic modeling, etc.

About Preprocessing: For tokenization, we use NLPIR. Stop words and words with a term frequency of less than 20 were removed. In addition, words consisting of only one Chinese character were removed.

Data Format: The released data is formatted as follows:

[document_1]
[document_2]
...
[document_M]

in which each line is one document. [document_i] is the i-th document of the dataset and consists of a list of Ni words/terms:

[document_i] = [word_i1] [word_i2] ... [word_iNi]

in which all [word_ij] (i=1..M, j=1..Ni) are text strings separated by the blank character.

If you have any questions about the data set, please contact: jichang@buaa.edu.cn.
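
The data format described in the abstract (one document per line, words separated by blanks) can be read with a few lines of Python. The following is a minimal sketch, not the authors' own tooling. The function names `parse_documents` and `corpus_stats` are hypothetical, as are the file name `weibo.txt` and the UTF-8 encoding mentioned in the comments. Skipping empty lines is likewise an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_documents(lines: Iterable[str]) -> List[List[str]]:
    """Turn each line into a list of words (one document per line)."""
    docs = []
    for line in lines:
        words = line.split()  # split on whitespace, dropping the newline
        if words:  # assumption: empty lines are not documents
            docs.append(words)
    return docs


def corpus_stats(docs: List[List[str]]) -> Tuple[int, int, int]:
    """Return (number of documents, total words, vocabulary size)."""
    counts = Counter(word for doc in docs for word in doc)
    return len(docs), sum(counts.values()), len(counts)


# Tiny in-memory example standing in for the real file.
sample = ["微博 话题 模型\n", "话题 网络\n"]
print(corpus_stats(parse_documents(sample)))  # (2, 5, 4)

# With the downloaded file (name and encoding are assumptions):
# with open("weibo.txt", encoding="utf-8") as f:
#     print(corpus_stats(parse_documents(f)))
```

If the downloaded file matches the description, the three numbers should be close to the stated 189,223 micro-blogs, 3,252,492 words, and a vocabulary of 20,942.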

Provider:

figshare

Created:

2017-11-05
