
Applying Topic Modelling to the Folk Song Corpus of the Podillia Region

Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/14281802
Resource description:
This dataset is based on Ukrainian folk songs from the Podillia region collections (Dei 1965; Yefremova & Dmytrenko 2014; Myshanych 1976). The text corpus comprises 2,762 songs, containing a total of 52,004 lines and 209,075 tokens, all in Ukrainian. It was analysed using the R programming language (version 4.4.1) along with RStudio (version 2024.04.2+764). The text analysis code was developed at the Estonian Literary Museum.

This dataset includes the following files:

1. TM_Ukr.folk_songs.R: R script for the topic modelling analysis of the corpus. It covers text preprocessing, stopword removal, tokenization, lemmatization, part-of-speech (PoS) tagging, document-term matrix (DTM) construction, latent Dirichlet allocation (LDA) analysis, coherence evaluation metrics, word embeddings using the GloVe algorithm, and K-means clustering.
2. corpus_Podillia_folk_songs.csv: A file containing the text data of Podillia region folk songs.
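To make the early steps listed for the R script more concrete, the following is a minimal Python sketch of tokenization, stopword removal, and document-term matrix construction. It is not the authors' code (their analysis is written in R), and it omits lemmatization, PoS tagging, LDA, GloVe, and clustering. The two song lines, the stopword list, and the function names `tokenize` and `build_dtm` are illustrative assumptions, not taken from the dataset.

```python
import re
from collections import Counter

# Invented placeholder lines; NOT taken from corpus_Podillia_folk_songs.csv.
songs = [
    "Ой у полі жито, ой у полі овес",
    "У полі вітер, а в саду вишня",
]

# Tiny illustrative stopword list (an assumption, not the list used in the R script).
STOPWORDS = {"ой", "у", "в", "а"}


def tokenize(text):
    """Lowercase the text and split it into Unicode-aware word tokens."""
    return re.findall(r"\w+", text.lower())


def build_dtm(docs, stopwords):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(t for t in tokenize(d) if t not in stopwords) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


vocab, dtm = build_dtm(songs, STOPWORDS)

# Print each document's non-zero term counts.
for i, row in enumerate(dtm):
    print(i, {term: n for term, n in zip(vocab, row) if n})
```

Each row of `dtm` corresponds to one song and each column to one vocabulary term; a matrix of this shape is the usual input to a topic model such as LDA.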
Created:
2024-12-06