classla/ParlaSent

Name: classla/ParlaSent
Creator: classla
Published: 2023-09-28 13:52:55
License: not specified

Source: Hugging Face. Updated: 2023-09-28. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/classla/ParlaSent

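Resource summary:

ParlaSent 1.0 is a multilingual sentiment analysis dataset of parliamentary debates, comprising five training datasets and two test sets. Each test set contains 2,600 sentences annotated by a single highly trained annotator. The training datasets are internally split into "train", "dev", and "test" portions for language-specific experiments. The 6-level annotation schema used by the annotators consists of Positive, Negative, M_Positive, M_Negative, P_Neutral, and N_Neutral. The dataset attributes are the sentence, country, annotator annotations, final label, document ID, sentence ID, parliamentary term, date, speaker name, party, gender, birth year, dataset split, and ruling status. The test data has similar attributes but carries the annotation of only one annotator.

Provider: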

classla

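Summary of original information

ParlaSent multilingual parliamentary debate sentiment dataset 1.0

Dataset description

Dataset overview

The dataset is intended for sentiment analysis experiments and comprises five training datasets and two test sets. Test set file names end in _test.jsonl and appear as _additional_test in the dataset viewer. Each test set contains 2,600 sentences annotated by a single highly trained annotator. The training datasets are internally split into "train", "dev", and "test" portions for running language-specific experiments.

The annotators used the following 6-level annotation schema:

Positive: a sentence that is entirely or predominantly positive
Negative: a sentence that is entirely or predominantly negative
M_Positive: a sentence that conveys ambiguous or mixed sentiment but leans towards the positive
M_Negative: a sentence that conveys ambiguous or mixed sentiment but leans towards the negative
P_Neutral: a sentence that contains only non-sentiment-related statements but still leans towards the positive
N_Neutral: a sentence that contains only non-sentiment-related statements but still leans towards the negative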
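
The six levels can be collapsed into three classes, which is useful when comparing against the three-level label attribute of the dataset. The sketch below is a minimal illustration, assuming that each mixed label joins its polarity and that both neutral labels become neutral; the page does not spell out the actual mapping, and the names SIX_TO_THREE and collapse_label are hypothetical.

```python
# Assumed grouping of the 6-level schema into 3 classes (not confirmed by the source).
SIX_TO_THREE = {
    "Positive": "Positive",
    "M_Positive": "Positive",
    "P_Neutral": "Neutral",
    "N_Neutral": "Neutral",
    "M_Negative": "Negative",
    "Negative": "Negative",
}


def collapse_label(six_level: str) -> str:
    """Return the assumed three-level class for a six-level annotation."""
    try:
        return SIX_TO_THREE[six_level]
    except KeyError:
        # Fail loudly on labels outside the documented schema.
        raise ValueError(f"unknown annotation label: {six_level!r}") from None
```

Keeping the mapping in a single dictionary makes it easy to swap in a different grouping if the assumed one turns out not to match the published labels.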

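Data attributes

The attributes of the training data are:

sentence: the sentence annotated for sentiment
country: the country of the parliament the sentence comes from
annotator1: the annotation by the first annotator
annotator2: the annotation by the second annotator
reconciliation: the final label after reconciliation
label: the three-level (positive, negative, neutral) label based on the reconciliation label
document_id: the internal identifier of the document the sentence comes from
sentence_id: the internal identifier of the sentence inside the document
term: the parliamentary term in which the sentence was delivered
date: the date on which the sentence was delivered in parliament as part of a speech
name: the name of the MP who gave the speech
party: the party of the MP
gender: the binary gender of the MP
birth year: the year of birth of the MP
split: whether the sentence serves as a training, development, or test instance within the training dataset
ruling: whether the MP belonged to the ruling coalition or the opposition at the time of the speech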
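
Because the training files are JSONL and each record carries a split attribute, the three portions can be separated in one pass. The following is a minimal sketch using only the standard library; the function name group_by_split, the literal split values "train" and "dev", and the two sample records are made up for illustration and are not taken from the dataset.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_by_split(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Parse JSONL lines and bucket the records by their `split` field."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        buckets[record["split"]].append(record)
    return dict(buckets)


# Invented records; the split values are assumed, not confirmed by the source.
sample = [
    '{"sentence": "Example one.", "label": "Positive", "split": "train"}',
    '{"sentence": "Example two.", "label": "Negative", "split": "dev"}',
]
splits = group_by_split(sample)
print({name: len(rows) for name, rows in splits.items()})
```

For a real file, an open file handle can be passed directly, since iterating over it yields one line at a time.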

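The attributes of the test data (_test.jsonl files) are:

sentence: the sentence annotated for sentiment
country: the country of the parliament the sentence comes from
annotator1: the annotation by the first (and only) annotator, used as the final annotation
label: the three-level (positive, negative, neutral) label based on the annotator1 label
document_id: the internal identifier of the document the sentence comes from
sentence_id: the internal identifier of the sentence inside the document
term: the parliamentary term in which the sentence was delivered
date: the date on which the sentence was delivered in parliament as part of a speech
name: the name of the MP who gave the speech
party: the party of the MP
gender: the binary gender of the MP
birth year: the year of birth of the MP
ruling: whether the MP belonged to the ruling coalition or the opposition at the time of the speech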
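
The two attribute lists differ only in that test files lack annotator2, reconciliation, and split. A small check along these lines can catch records that do not match the expected list for their file type. This is a sketch under stated assumptions: the attribute names are spelled exactly as listed on this page (including "birth year" with a space), and the helper names expected_fields and missing_fields are hypothetical.

```python
# Attribute names as listed on this page; the real files may spell them differently.
TRAIN_FIELDS = {
    "sentence", "country", "annotator1", "annotator2", "reconciliation",
    "label", "document_id", "sentence_id", "term", "date", "name",
    "party", "gender", "birth year", "split", "ruling",
}
# Test files carry a single annotation and no internal split.
TEST_FIELDS = TRAIN_FIELDS - {"annotator2", "reconciliation", "split"}


def expected_fields(filename: str) -> set[str]:
    """Choose the expected attribute set from the file name suffix."""
    return TEST_FIELDS if filename.endswith("_test.jsonl") else TRAIN_FIELDS


def missing_fields(record: dict, filename: str) -> set[str]:
    """Return the listed attributes that the record does not contain."""
    return expected_fields(filename) - set(record)
```

Calling missing_fields on each parsed record and collecting the non-empty results gives a quick report of incomplete rows without needing any third-party validation library.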

Citation information

Please cite the following paper:

@article{Mochtak_Rupnik_Ljubešić_2023,
  title={The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings},
  rights={All rights reserved},
  url={http://arxiv.org/abs/2309.09783},
  abstractNote={Sentiments inherently drive politics. How we receive and process information plays an essential role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses. If sentiment plays such an important role in politics, how can we study and measure it systematically? The paper presents a new dataset of sentiment-annotated sentences, which are used in a series of experiments focused on training a robust sentiment classifier for parliamentary proceedings. The paper also introduces the first domain-specific LLM for political science applications additionally pre-trained on 1.72 billion domain-specific words from proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training of LLM on parliamentary data can significantly improve the model downstream performance on the domain-specific tasks, in our case, sentiment detection in parliamentary proceedings. We further show that multilingual models perform very well on unseen languages and that additional data from other languages significantly improves the target parliament’s results. The paper makes an important contribution to multiple domains of social sciences and bridges them with computer science and computational linguistics. Lastly, it sets up a more robust approach to sentiment analysis of political texts in general, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.},
  note={arXiv:2309.09783 [cs]},
  number={arXiv:2309.09783},
  publisher={arXiv},
  author={Mochtak, Michal and Rupnik, Peter and Ljubešić, Nikola},
  year={2023},
  month={Sep},
  language={en}
}

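Collected summary

Dataset introduction

Background and challenges

Background overview

ParlaSent is a multilingual sentiment analysis dataset of parliamentary debates, containing about 18,200 sentences in seven languages, including Slovenian, English, and Czech. Each sentence is annotated with a sentiment label (such as positive, negative, or neutral) and with information about the speaker (such as party, gender, and political position). The dataset is designed for training and evaluating sentiment classification models in the parliamentary domain. It is based on a 6-level annotation schema, supports multilingual comparative research, and suits sentiment analysis tasks on political text.

The content above was collected and summarized by 遇见数据集.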
