DGurgurov/romanian_sa
Dataset Overview
Dataset Name
Sentiment Analysis Data for the Romanian Language
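Dataset Description
This dataset contains the sentiment analysis data from Tache et al. (2021), provided to support research on the Romanian language.
Data Structure
The data was used in a project on improving word embeddings for low-resource languages with graph knowledge.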
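As a minimal sketch of how the data might be loaded, assuming the repository ID at the top of this card is hosted on the Hugging Face Hub and the `datasets` library is installed. The helper name `load_romanian_sa` is hypothetical, and no split or column names are assumed, since this card does not list them.

```python
# Repository ID taken from the first line of this card.
DATASET_ID = "DGurgurov/romanian_sa"


def load_romanian_sa(split=None):
    """Load the dataset from the Hugging Face Hub (requires network access).

    With split=None, load_dataset returns every available split;
    pass a split name to get a single one.
    """
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


# Example usage (downloads the data on first call):
#   ds = load_romanian_sa()
#   print(ds)  # inspect the available splits and columns before relying on them
```

Printing the returned object first is the safest way to discover the actual splits and column names.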
Language
Romanian (ro)
Task Categories
- Text classification
License
CC-BY-SA-4.0
Citation Information
@inproceedings{tache-etal-2021-clustering,
    title = "Clustering Word Embeddings with Self-Organizing Maps. Application on {L}a{R}o{S}e{D}a - A Large {R}omanian Sentiment Data Set",
    author = "Tache, Anca and Mihaela, Gaman and Ionescu, Radu Tudor",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.81",
    pages = "949--956",
    abstract = "Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf{'}s law distribution, which is known to govern natural language. We also demonstrate the generalization capacity of using SOMs for the clustering of word embeddings on another recently-introduced Romanian data set, for text categorization by topic.",
}



