Applying Topic Modelling to the Folk Song Corpus of the Podillia Region

Source: NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14281802

Description:

This dataset is based on Ukrainian folk songs from collections of the Podillia region (Dei 1965; Yefremova & Dmytrenko 2014; Myshanych 1976). The text corpus comprises 2,762 songs, with a total of 52,004 lines and 209,075 tokens. The corpus of Podillia folk songs is in Ukrainian. It was analysed using the R programming language (version 4.4.1) together with RStudio (version 2024.04.2+764). The text analysis code was developed at the Estonian Literary Museum.

This dataset includes the following files:

1. TM_Ukr.folk_songs.R: R script for the topic modelling analysis of the corpus. It covers text preprocessing, stopword removal, tokenization, lemmatization, part-of-speech (PoS) tagging, document-term matrix (DTM) construction, latent Dirichlet allocation (LDA) analysis, coherence evaluation metrics, word embeddings using the GloVe algorithm, and K-means clustering.
2. corpus_Podillia_folk_songs.csv: A file containing the text data of the Podillia region folk songs.
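To make the early pipeline steps concrete, here is a minimal sketch of tokenization, stopword removal, and document-term matrix (DTM) construction. It is written in Python using only the standard library and is not the dataset's R script. The toy song lines, the stopword list, and the function names `tokenize` and `build_dtm` are assumptions made for illustration; none of them come from the corpus or from TM_Ukr.folk_songs.R.

```python
import re
from collections import Counter

# Invented toy "songs" for illustration only; NOT taken from the corpus.
songs = [
    "Ой у полі калина",
    "У полі вітер віє",
    "Калина червона у лузі",
]

# Hypothetical, tiny stopword list; the list used by the R script is not shown here.
STOPWORDS = {"ой", "у"}


def tokenize(text):
    """Lowercase the text and return its runs of Unicode letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_dtm(docs, stopwords):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [
        Counter(tok for tok in tokenize(doc) if tok not in stopwords)
        for doc in docs
    ]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


vocab, dtm = build_dtm(songs, STOPWORDS)
```

Each row of `dtm` corresponds to one song and each column to one vocabulary term; a matrix of this shape is the usual input to an LDA topic model. This sketch omits lemmatization and PoS tagging, and its letter-only tokenizer would split words at an apostrophe, so a real Ukrainian pipeline would need a more careful tokenizer.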

Created:

2024-12-06
