300 topics in "Exploring the Subject Heterogeneity of Scientific Research Project Funding"
Source: Mendeley Data, indexed 2026-04-18
Download link:
https://data.mendeley.com/datasets/dyy2wxf3kk
Description:
We analyze the topic distribution extracted with the LDA topic model. We first build a text corpus from the titles, abstracts, and keywords of the 115,813 papers, applying natural language processing normalization techniques: word segmentation, stop word removal, and stemming. We then use the LDA topic model to extract topics and associate each topic with several high-frequency words. Perplexity is used to determine the number of topics: the lower the perplexity, the better the model's overall performance. The graph shows that perplexity reaches its lowest value when the number of topics is 300, so we choose 300 topics as the reference for the subsequent analysis of topic attributes. The table lists all the topic words in the 300 topics.
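The perplexity criterion described above can be illustrated with a minimal Python sketch. It assumes the standard definition of perplexity as the exponential of the negative average per-token log-likelihood; the dataset description does not say which implementation the authors used. The function names `perplexity` and `select_num_topics`, the parameter names `theta` and `phi`, and all numbers in the usage lines are hypothetical and are not taken from the dataset.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of a fitted topic model on tokenized documents.

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = P(topic k | document d)   (assumed already fitted)
    phi   : phi[k][w]   = P(word w | topic k)       (assumed already fitted)
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginal word probability: sum over topics of P(k|d) * P(w|k)
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    # Lower values mean the model assigns higher probability to the corpus.
    return math.exp(-log_likelihood / n_tokens)


def select_num_topics(perplexity_by_k):
    """Return the candidate topic count with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)


# Toy check: a model that is uniform over a 4-word vocabulary
# has a perplexity of approximately 4.
toy_docs = [[0, 1, 2, 3]]
toy_theta = [[0.5, 0.5]]
toy_phi = [[0.25] * 4, [0.25] * 4]
toy_value = perplexity(toy_docs, toy_theta, toy_phi)

# Made-up perplexity values for illustration only, not results from the dataset.
example_scores = {100: 950.0, 200: 820.0, 300: 790.0, 400: 805.0}
best_k = select_num_topics(example_scores)
```

In practice the `theta` and `phi` matrices would come from an LDA model fitted once per candidate topic count, and the selection step would compare the resulting perplexities across those candidates.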
Created:
2022-01-18



