Tourist attraction description text data.
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://figshare.com/articles/dataset/Tourist_attraction_description_text_data_/27259831
Description:
Text classification, an important research area within text mining, can quickly and effectively extract valuable information, helping to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, and few studies apply text classification methods to tourist attractions. In light of this, this paper constructs a corpus of tourist attraction description texts using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, which optimizes traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed algorithm is then combined in turn with seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to perform multi-class text classification over six subcategories of national A-level tourist attractions. The effectiveness and superiority of the algorithm are validated by comparing overall performance, per-category performance, and model stability against several commonly used text representation methods. The results show that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset and even outperforms the high-performance BERT classification model currently favored by the industry: its Acc, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
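The description reports Acc, macro-F1, and micro-F1 on an imbalanced dataset. As a reading aid, the following minimal sketch shows how these three metrics are commonly computed for single-label multi-class predictions. The function name `f1_scores`, the label names, and the toy predictions are hypothetical and are not taken from the dataset or the paper; the authors' own evaluation code may differ.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, micro_f1) for single-label multi-class data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p gains a false positive
            fn[t] += 1  # true class t gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro-F1: F1 computed once over counts pooled across all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, macro, micro


# Hypothetical, imbalanced toy labels (not the dataset's real categories).
y_true = ["park"] * 8 + ["museum"] * 2
y_pred = ["park"] * 8 + ["park", "museum"]
acc, macro, micro = f1_scores(y_true, y_pred)
print(f"Acc={acc:.3f} macro-F1={macro:.3f} micro-F1={micro:.3f}")
```

In this toy case one error on the rare class lowers macro-F1 more than the pooled metrics, which is why macro-F1 is often the more sensitive indicator when categories are imbalanced.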
In addition, the conclusions drawn from the predicted values are consistent with those drawn from the true values, indicating that the algorithm is practical. The professional-domain text dataset used in this paper is more challenging because of its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify the multiple subcategories of this type of text set. This is a useful exploration of applied research on complex Chinese text datasets in specific fields and provides a useful reference for the vector representation and classification of text datasets with similar content.
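The text representation described above weights word embeddings by a term-importance score before combining them into a document vector. The sketch below illustrates only that general idea, using plain smoothed TF-IDF as a stand-in: the TF-IDF-CRF-POS weighting itself is not specified in this description and is not reproduced here. The tokens, the two-dimensional toy embeddings (standing in for trained Word2Vec vectors), and the function names `idf_table` and `weighted_doc_vector` are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_doc_vector(doc, embeddings, idf):
    """Average the word vectors of `doc`, weighting each token by TF * IDF."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(doc).items():
        if tok not in embeddings or tok not in idf:
            continue  # skip out-of-vocabulary tokens
        w = (count / len(doc)) * idf[tok]  # plain TF-IDF, NOT TF-IDF-CRF-POS
        total = [acc + w * x for acc, x in zip(total, embeddings[tok])]
        weight_sum += w
    # Normalize by the total weight; an all-unknown document stays at zero.
    return [x / weight_sum for x in total] if weight_sum else total


# Hypothetical tokenized descriptions and toy embeddings.
docs = [["lake", "mountain", "scenic"], ["museum", "history", "scenic"]]
embeddings = {
    "lake": [1.0, 0.0], "mountain": [0.8, 0.2], "scenic": [0.5, 0.5],
    "museum": [0.0, 1.0], "history": [0.1, 0.9],
}
idf = idf_table(docs)
doc_vectors = [weighted_doc_vector(d, embeddings, idf) for d in docs]
print(doc_vectors)
```

The resulting fixed-length vectors could then be passed to any of the classifiers named in the description. A token such as "scenic" that appears in every document receives a lower weight than tokens specific to one document, so it contributes less to the document vector.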
Created:
2024-10-18



