Datasets of word network topic model
Source: DataCite Commons. Updated 2020-09-01; indexed 2024-07-25.
Download link:
https://figshare.com/articles/dataset/Datasets_of_word_network_topic_model/5572588
Description:
Abstract: This dataset holds the content of one day's micro-blogs sampled from Weibo (http://weibo.com) in the form of bags-of-words.
-----------------------------------------------------
Data Set Characteristics: Text
Number of Micro-blogs: 189,223
Total Number of Words: 3,252,492
Size of the Vocabulary: 20,942
Associated Tasks: short text topic modeling, etc.
-----------------------------------------------------
About Preprocessing
For tokenization, we used NLPIR. Stop words and words with a term frequency below 20 were removed. Words consisting of a single Chinese character were also removed.
-----------------------------------------------------
Data Format
The released data is formatted as follows:
[document_1]
[document_2]
...
[document_M]
in which each line is one document. [document_i] is the i-th document of the dataset and consists of a list of Ni words/terms:
[document_i] = [word_i1] [word_i2] ... [word_iNi]
in which all [word_ij] (i=1..M, j=1..Ni) are text strings separated by a blank character.
-----------------------------------------------------
If you have any questions about the data set, please contact: jichang@buaa.edu.cn.
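The data format described above (one document per line, words separated by blanks) can be read with a few lines of Python. The following is a minimal sketch, not a loader shipped with the dataset: the function names are hypothetical, blank lines are assumed to carry no document, and the file encoding is not stated in the description, so UTF-8 in the comment below is only a guess. It runs on a tiny stand-in sample rather than on the real file.

```python
from collections import Counter
from typing import Dict, Iterable, List


def parse_documents(lines: Iterable[str]) -> List[List[str]]:
    """Split each line into its whitespace-separated words.

    Assumption: lines with no words are skipped rather than
    counted as empty documents.
    """
    docs = []
    for line in lines:
        words = line.split()
        if words:
            docs.append(words)
    return docs


def corpus_stats(docs: List[List[str]]) -> Dict[str, int]:
    """Count documents, word tokens, and distinct words."""
    counts = Counter(word for doc in docs for word in doc)
    return {
        "documents": len(docs),
        "total_words": sum(counts.values()),
        "vocabulary_size": len(counts),
    }


# Stand-in sample in the described format. For the real file one might
# pass open(path, encoding="utf-8") instead; both the path and the
# encoding are assumptions.
sample = ["微博 数据 主题", "主题 模型 数据 数据"]
stats = corpus_stats(parse_documents(sample))
print(stats)
```

Run on the full dataset, the three counts could be compared with the figures listed under Data Set Characteristics (189,223 micro-blogs, 3,252,492 words, vocabulary of 20,942) as a sanity check on the download.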
Provider: figshare
Created: 2017-11-05



