
Automated text clustering of newspaper and scientific texts in Brazilian Portuguese: analysis and comparison of methods

Source: DataCite Commons. Updated 2021-03-24; indexed 2024-07-28.
Download link:
https://scielo.figshare.com/articles/dataset/Automated_text_clustering_of_newspaper_and_scientific_texts_in_brazilian_portuguese_analysis_and_comparison_of_methods/14287844
Resource description:
This article reports the findings of an empirical study of Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study covered four experiments, each with four procedures: 1. Corpus Selections (a set of texts is selected for clustering); 2. Word Class Selections (nouns, verbs and adjectives are chosen from each text by specific algorithms); 3. Filtering Algorithms (a set of terms is selected from the results of the previous stage, a semantic weight is assigned to each term, and an index is generated for each text); 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After these procedures, statistics on clustering correctness and clustering time were collected. The sIB clustering algorithm was the best choice for both the scientific and the newspaper corpora, with the caveat that it requires the number of clusters as input before running (68.9% correctness in 1 minute for the newspaper corpus and 77.8% correctness in 1 minute for the scientific corpus). The EM clustering algorithm can additionally estimate the number of clusters without user intervention, but its best case is below 53% correctness. In the experiments carried out, the results of automated clustering remained far from those of human text classification; clustering correctness was also observed to vary with the number of input texts and their topics.
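The abstract does not define how "clustering correctness" was computed. As a minimal sketch, assuming a majority-label (purity-style) score, the comparison between the original groups and the clusters produced by an algorithm could look like the following. The function name `clustering_correctness` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def clustering_correctness(true_groups, cluster_ids):
    """Purity-style score (an assumption, not the study's confirmed metric).

    Each cluster is credited with the texts belonging to its most common
    original group; the score is the credited share of all texts.
    """
    if len(true_groups) != len(cluster_ids):
        raise ValueError("true_groups and cluster_ids must have equal length")
    if not true_groups:
        raise ValueError("at least one text is required")

    # Count how many texts of each original group land in each cluster.
    groups_per_cluster = {}
    for group, cluster in zip(true_groups, cluster_ids):
        groups_per_cluster.setdefault(cluster, Counter())[group] += 1

    # Credit each cluster with the size of its majority group.
    credited = sum(
        counts.most_common(1)[0][1] for counts in groups_per_cluster.values()
    )
    return credited / len(true_groups)


# Toy data: six texts from two original groups, split into two clusters.
toy_groups = ["a", "a", "a", "b", "b", "b"]
toy_clusters = [0, 0, 1, 1, 1, 1]
toy_score = clustering_correctness(toy_groups, toy_clusters)
print(f"correctness: {toy_score:.3f}")
```

A score of this kind needs the original group of every text, which is why it suits an evaluation like the one described, where the input texts come from known groups. It also tends to rise as the number of clusters grows, so it is most meaningful when the cluster count is fixed in advance.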

Provider:
SciELO journals
Created:
2021-03-24