DGurgurov/romanian_sa

Name: DGurgurov/romanian_sa
Creator: DGurgurov
Published: 2024-05-30 12:48:49
License: CC-BY-SA-4.0

Source: Hugging Face. Updated 2024-05-30; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/DGurgurov/romanian_sa
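As an alternative to downloading by hand, the dataset can probably be fetched programmatically. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. The dataset ID is the one shown on this page; the helper name `load_romanian_sa` is hypothetical, and the split and column names are not listed here, so the sketch does not assume any.

```python
DATASET_ID = "DGurgurov/romanian_sa"  # dataset ID as shown on this page


def load_romanian_sa():
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so this file can be read without `datasets` installed.
    from datasets import load_dataset

    # Returns whatever splits the repository defines; none are assumed here.
    return load_dataset(DATASET_ID)
```

Calling `print(load_romanian_sa())` should list the available splits and column names, which is the safest way to discover the schema before writing any training code.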

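Summary:

This is a sentiment analysis dataset for Romanian, drawn from the work of Tache et al. (2021). It contains 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. The original study used two sentiment classification methods as baselines: one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). The study also replaced the k-means clustering algorithm with self-organizing maps (SOMs) and obtained better results, because the resulting clusters of word embeddings are closer to the Zipf's law distribution found in natural language.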
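To make the two baseline feature types concrete, here is a minimal sketch of the feature-extraction step only. It is not the authors' implementation: the function names, the toy 2-d embeddings, and the centroids are all made up for illustration, and the clustering itself (k-means or a SOM) is assumed to have been run already, so only the nearest-centroid assignment is shown.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Count overlapping character n-grams for every n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda k: sq_dist(vector, centroids[k]))


def bag_of_word_embeddings(tokens, embeddings, centroids):
    """Histogram of cluster indices over the tokens that have an embedding."""
    histogram = [0] * len(centroids)
    for token in tokens:
        if token in embeddings:  # out-of-vocabulary tokens are skipped
            histogram[nearest_centroid(embeddings[token], centroids)] += 1
    return histogram


# Toy, made-up 2-d embeddings and centroids, for illustration only.
toy_embeddings = {
    "bun": (0.9, 0.1),
    "excelent": (0.8, 0.2),
    "slab": (0.1, 0.9),
}
toy_centroids = [(1.0, 0.0), (0.0, 1.0)]

ngram_counts = char_ngrams("bun bun", n_min=2, n_max=2)
histogram = bag_of_word_embeddings(
    ["bun", "excelent", "slab", "telefon"], toy_embeddings, toy_centroids
)
```

Either feature vector (the n-gram counts or the cluster histogram) would then be passed to a classifier. Swapping k-means for a SOM changes only how the centroids are obtained, not this histogram step.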

Provider:

DGurgurov

Original information summary

Dataset overview

Dataset name

Sentiment Analysis Data for the Romanian Language

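Dataset description

This dataset contains the sentiment analysis data from Tache et al. (2021), provided to support research on Romanian.

Data structure

The data is used in a project on improving word embeddings for low-resource languages with graph knowledge.

Language

Romanian (ro)

Task category

Text classification

License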

CC-BY-SA-4.0

Citation

@inproceedings{tache-etal-2021-clustering,
    title = "Clustering Word Embeddings with Self-Organizing Maps. Application on {L}a{R}o{S}e{D}a - A Large {R}omanian Sentiment Data Set",
    author = "Tache, Anca and Mihaela, Gaman and Ionescu, Radu Tudor",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.81",
    pages = "949--956",
    abstract = "Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf{'}s law distribution, which is known to govern natural language. We also demonstrate the generalization capacity of using SOMs for the clustering of word embeddings on another recently-introduced Romanian data set, for text categorization by topic.",
}
