Thematic Modelling of News Archive from University Website
Indexed from doi.org on 2025-03-26
Download link:
http://doi.org/10.17632/ckwcdwz6bg.3
Description:
This dataset contains a set of files that support and illustrate the successive steps of thematic modeling for the text documents of a newsline, along with data for further investigation.
The file "etalon export_file.csv" presents 2000 Russian-language news records, which are part of the archive of the university website sstu.ru. Each record has a numerical record identifier, the news headline, and the URL code of the news item.
The Excel file "News_tokens.xlsx" contains the tokens extracted from the news records after text processing. Text processing followed the usual steps: tag removal, text cleaning, and word stemming. The file columns are: "Record number", "Identifier of news", "Head of news", "Number of words in the news", "Number of unique tokens in news", and "List of news tokens".
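The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the regular expressions, and the sample sentence are assumptions, and stemming is left as a pluggable placeholder because the description does not name the stemmer that was used.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML tag matcher (assumption)
WORD_RE = re.compile(r"[^\W\d_]+")   # runs of letters, including Cyrillic


def identity_stem(word):
    # Placeholder only: swap in a real Russian stemmer here.
    return word


def extract_tokens(raw_html, stem=identity_stem):
    """Strip tags, clean the text, and return word and token statistics."""
    text = unescape(TAG_RE.sub(" ", raw_html))            # tag removal
    words = [w.lower() for w in WORD_RE.findall(text)]    # text cleaning
    tokens = [stem(w) for w in words]                     # stemming hook
    return {
        "n_words": len(words),
        "n_unique_tokens": len(set(tokens)),
        "tokens": tokens,
    }


# Hypothetical news fragment, not taken from the dataset.
record = extract_tokens("<p>Студенты СГТУ победили в олимпиаде. Студенты рады!</p>")
```

The three keys of the returned dictionary loosely mirror the columns "Number of words in the news", "Number of unique tokens in news", and "List of news tokens".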
Thematic modeling was then performed on the news records, converted to Vowpal Wabbit format, based on the probabilistic distribution of keywords in the text. The results of thematic modeling with the BigARTM platform (https://github.com/bigartm/bigartm, doi:10.5281/zenodo.288960) are presented in the files "phi_matrix.xlsx" and "theta_matrix.xlsx". After a series of experiments, six topics were identified. Each of the 2000 rows in "Theta_matrix" gives the probabilities that the corresponding news document belongs to each of the six topics. "Phi_matrix" gives the per-topic probabilities of the tokens. Based on the most frequent tokens (keywords) of each topic, the topics were named as follows: "Events", "Holidays", "Science and Innovations", "Educational and Scientific Activities", "Student Competitions", and "Admission Campaign".
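As an illustration of the input step, the sketch below turns a token list into one bag-of-words line in the Vowpal Wabbit style that BigARTM reads (document id, a modality label after "|", then tokens with optional ":count"). The function name, the modality label "text", and the document id are assumptions; the dataset description does not state which were actually used.

```python
from collections import Counter


def sanitize(token):
    # ':' and '|' are separators in this format, so keep them out of tokens.
    return token.replace(":", "_").replace("|", "_").replace(" ", "_")


def to_vw_line(doc_id, tokens, modality="text"):
    """Build one bag-of-words line: '<doc_id> |<modality> tok tok:count ...'."""
    counts = Counter(sanitize(t) for t in tokens)
    parts = []
    for token, n in sorted(counts.items()):
        # A count of 1 is implied, so it is only written when n > 1.
        parts.append(token if n == 1 else f"{token}:{n}")
    return f"{doc_id} |{modality} " + " ".join(parts)


# Hypothetical document id and tokens.
line = to_vw_line("doc_17", ["конкурс", "студент", "конкурс"])
```

One such line per news record would make up the collection file passed to the topic model.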
To determine the most significant topics for each news item, the probability values in "Theta_matrix" greater than a specially calculated bound were rounded to 1, i.e., the news distribution was defuzzified. The Venn diagram in the file "Venn_diagram.png" shows which topics the news documents belong to. The multiple-set visualization was produced with the "supervenn" package (https://github.com/gecko984/supervenn/tree/v0.3.1, doi:10.5281/zenodo.4016732).
Thus, the newsline document dataset can be used for educational purposes and for scientific research in text processing and machine learning.
The full news archive for 2009-2021, scraped from the site www.sstu.ru, is in the file "News_Articles.csv".
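A minimal sketch of the defuzzification step, under stated assumptions: the bound of 0.3 and the three-topic toy matrix are made up for illustration (the description does not say how the bound was calculated), and values at or below the bound are assumed to be set to 0.

```python
def defuzzify(theta_rows, bound):
    """Map each probability to 1 if it exceeds the bound, otherwise 0."""
    return [[1 if p > bound else 0 for p in row] for row in theta_rows]


def topic_sets(membership, topic_names):
    """Collect, for every topic, the set of document indices assigned to it."""
    sets = {name: set() for name in topic_names}
    for doc_index, row in enumerate(membership):
        for name, flag in zip(topic_names, row):
            if flag:
                sets[name].add(doc_index)
    return sets


# Toy theta matrix: 3 documents x 3 topics (the real one is 2000 x 6).
theta = [
    [0.70, 0.20, 0.10],
    [0.45, 0.40, 0.15],
    [0.05, 0.15, 0.80],
]
names = ["Events", "Holidays", "Science and Innovations"]

membership = defuzzify(theta, bound=0.3)   # hypothetical bound
sets_by_topic = topic_sets(membership, names)
```

A document may end up in several topic sets at once (document 1 here), which is what makes a multiple-set diagram of the kind produced by supervenn informative.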
The source of the new data was the archive of news articles of Saratov State University (SSU) from April 2007 to May 2022. The primary data file was produced by web scraping the newsline of the SSU website, starting from the page www.sgu.ru/news/all. Duplicate records and records with empty fields were then removed, and the text was processed. The result is the Excel file "News_SGU_31077_Processed_1.xlsx" with 31077 records.
The presented dataset can be used for data modeling as well as for sociological research.
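The cleaning step described above (dropping duplicates and records with empty fields) might look like the following sketch. The field names "title", "url", and "text" and the sample records are hypothetical; the description does not list the columns of the scraped file or say which fields define a duplicate.

```python
def clean_records(records, required_fields):
    """Drop records with an empty required field, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        values = tuple((rec.get(f) or "").strip() for f in required_fields)
        if not all(values):
            continue          # at least one required field is empty
        if values in seen:
            continue          # same field values already kept once
        seen.add(values)
        cleaned.append(rec)
    return cleaned


# Hypothetical scraped records.
raw = [
    {"title": "Open day", "url": "/news/1", "text": "Welcome"},
    {"title": "Open day", "url": "/news/1", "text": "Welcome"},   # duplicate
    {"title": "Seminar", "url": "/news/2", "text": "   "},        # empty text
    {"title": "Olympiad", "url": "/news/3", "text": "Results"},
]
cleaned = clean_records(raw, ["title", "url", "text"])
```

The first occurrence of each record is kept, so the original order of the newsline is preserved.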
Provider:
Mendeley Data



