
DGurgurov/romanian_sa

Hugging Face | Updated 2024-05-30 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/DGurgurov/romanian_sa
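Resource summary:
This dataset contains sentiment analysis data for Romanian, taken from the study by Tache et al. (2021). It comprises 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. The study uses two sentiment classification methods as baselines: one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings). The study also replaces the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf's law distribution that governs natural language.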
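
The two baseline feature types described above can be illustrated with a minimal sketch. It is not the implementation from Tache et al. (2021): the function names, the toy 2-D embeddings, and the two hand-picked centroids are all assumptions made for illustration. In the study the clusters are learned with k-means or a self-organizing map. Here they are fixed so the example stays self-contained.

```python
from collections import Counter
import math


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of length n_min..n_max (low-level features)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(vector, centroids[k]))


def bag_of_word_embeddings(tokens, embeddings, centroids):
    """Histogram of cluster IDs over the tokens that have an embedding."""
    histogram = [0] * len(centroids)
    for token in tokens:
        if token in embeddings:
            histogram[nearest_centroid(embeddings[token], centroids)] += 1
    return histogram


# Hypothetical toy embeddings and centroids, invented for this example.
toy_embeddings = {
    "bun": (0.9, 0.1), "excelent": (1.0, 0.0),
    "prost": (0.1, 0.9), "slab": (0.0, 1.0),
}
toy_centroids = [(1.0, 0.0), (0.0, 1.0)]

ngram_counts = char_ngrams("bun", n_min=2, n_max=3)
review_histogram = bag_of_word_embeddings(
    ["produs", "bun", "excelent", "slab"], toy_embeddings, toy_centroids
)
```

Either feature vector could then be passed to any standard classifier. Tokens without an embedding ("produs" here) are simply skipped, which is one possible design choice among several.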

Provided by:
DGurgurov
Summary of original information

Dataset overview

Dataset name

Sentiment Analysis Data for the Romanian Language

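Dataset description

This dataset contains the sentiment analysis data provided by Tache et al. (2021), made available to support research on the Romanian language.

Data structure

The data was used in a project on improving word embeddings for low-resource languages with graph knowledge.

Language

Romanian (ro)

Task category

  • Text classification

License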

CC-BY-SA-4.0

Citation information

@inproceedings{tache-etal-2021-clustering,
    title = "Clustering Word Embeddings with Self-Organizing Maps. Application on {L}a{R}o{S}e{D}a - A Large {R}omanian Sentiment Data Set",
    author = "Tache, Anca and Mihaela, Gaman and Ionescu, Radu Tudor",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.81",
    pages = "949--956",
    abstract = "Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf{'}s law distribution, which is known to govern natural language. We also demonstrate the generalization capacity of using SOMs for the clustering of word embeddings on another recently-introduced Romanian data set, for text categorization by topic.",
}
