classla/ParlaSent
ParlaSent 1.0: a multilingual sentiment dataset of parliamentary debates
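Dataset Description
Dataset Overview
This dataset supports sentiment analysis experiments and comprises five training datasets and two test sets. Test-set file names end in _test.jsonl and appear as _additional_test in the dataset viewer. Each test set contains 2,600 sentences annotated by a single highly trained annotator. Each training dataset is internally split into "train", "dev", and "test" portions so that language-specific experiments can be run.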
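Because the test sets ship as JSON Lines files, they can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the helper name `read_jsonl`, the sample sentences, and the `country` value are hypothetical and only stand in for a real `_test.jsonl` file.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample standing in for a *_test.jsonl file.
# The sentences and the country value are made up for illustration.
sample = io.StringIO(
    '{"sentence": "Thank you for your support.", "country": "GB", "label": "Positive"}\n'
    '{"sentence": "This bill is a disgrace.", "country": "GB", "label": "Negative"}\n'
)

records = list(read_jsonl(sample))
print(len(records), records[0]["label"])
```

To read a downloaded file instead of the in-memory sample, open it with `open(path, encoding="utf-8")` and pass the file object to `read_jsonl`, where `path` is wherever the file was saved locally.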
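The annotators used the following six-level annotation schema:
- Positive: sentences that are entirely or predominantly positive
- Negative: sentences that are entirely or predominantly negative
- M_Positive: sentences that convey an ambiguous or mixed sentiment but lean more towards the positive
- M_Negative: sentences that convey an ambiguous or mixed sentiment but lean more towards the negative
- P_Neutral: sentences that contain only non-sentiment-related statements but still lean more towards the positive
- N_Neutral: sentences that contain only non-sentiment-related statements but still lean more towards the negative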
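The six labels can be collapsed into three coarse classes. The card does not spell out the exact mapping, so the sketch below assumes the most natural grouping (the two positive-leaning sentiment labels together, the two negative-leaning ones together, and both neutral labels together). The class names `"Positive"`, `"Negative"`, and `"Neutral"` and the helper `collapse` are likewise assumptions made for illustration.

```python
# Assumed grouping of the six-level schema into three classes.
# This mapping is an illustration, not taken from the dataset card.
SIX_TO_THREE = {
    "Positive": "Positive",
    "M_Positive": "Positive",
    "Negative": "Negative",
    "M_Negative": "Negative",
    "P_Neutral": "Neutral",
    "N_Neutral": "Neutral",
}


def collapse(label6):
    """Map a six-level label to its assumed three-level class."""
    try:
        return SIX_TO_THREE[label6]
    except KeyError:
        raise ValueError(f"unknown six-level label: {label6!r}") from None


coarse = [collapse(x) for x in ["M_Positive", "N_Neutral", "Negative"]]
print(coarse)
```

Raising on an unknown label, rather than returning a default, makes misspelled or unexpected label values visible early.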
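Data Attributes
The attributes of the training data are:
- sentence: the sentence annotated for sentiment
- country: the country of the parliament the sentence comes from
- annotator1: the first annotator's annotation
- annotator2: the second annotator's annotation
- reconciliation: the final label agreed upon after reconciliation
- label: the three-level (positive, negative, neutral) label based on the reconciliation label
- document_id: internal identifier of the document the sentence comes from
- sentence_id: internal identifier of the sentence within the document
- term: the parliamentary term in which the sentence was delivered
- date: the date on which the sentence was delivered in parliament as part of a speech
- name: the name of the member of parliament who gave the speech
- party: the party of the member of parliament
- gender: the binary gender of the member of parliament
- birth year: the year of birth of the member of parliament
- split: whether the sentence serves as a training, development, or test instance within the training dataset
- ruling: whether the member of parliament belonged to the ruling coalition or the opposition at the time of the speech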
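As a rough illustration of how the `split`, `annotator1`, and `annotator2` attributes might be used together, the sketch below partitions records by their `split` value and computes raw agreement between the two annotators. The split values `"train"`, `"dev"`, and `"test"`, the helper names, and the sample rows are assumptions, not values confirmed by the card.

```python
from collections import defaultdict


def partition_by_split(records):
    """Group records by the value of their 'split' attribute."""
    parts = defaultdict(list)
    for rec in records:
        parts[rec["split"]].append(rec)
    return dict(parts)


def raw_agreement(records):
    """Share of records on which annotator1 and annotator2 chose the same label."""
    if not records:
        return 0.0
    same = sum(1 for r in records if r["annotator1"] == r["annotator2"])
    return same / len(records)


# Hypothetical rows; real records carry all attributes listed above.
rows = [
    {"split": "train", "annotator1": "Positive", "annotator2": "Positive"},
    {"split": "train", "annotator1": "M_Negative", "annotator2": "Negative"},
    {"split": "dev", "annotator1": "N_Neutral", "annotator2": "N_Neutral"},
    {"split": "test", "annotator1": "Negative", "annotator2": "Negative"},
]

parts = partition_by_split(rows)
print({name: len(items) for name, items in parts.items()})
print(raw_agreement(rows))
```

Raw agreement is the simplest possible measure and does not correct for chance; it is used here only to show how the two annotator columns relate to each other.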
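The attributes of the test data (the _test.jsonl files) are:
- sentence: the sentence annotated for sentiment
- country: the country of the parliament the sentence comes from
- annotator1: the annotation of the first (and only) annotator, used as the final annotation
- label: the three-level (positive, negative, neutral) label based on the annotator1 label
- document_id: internal identifier of the document the sentence comes from
- sentence_id: internal identifier of the sentence within the document
- term: the parliamentary term in which the sentence was delivered
- date: the date on which the sentence was delivered in parliament as part of a speech
- name: the name of the member of parliament who gave the speech
- party: the party of the member of parliament
- gender: the binary gender of the member of parliament
- birth year: the year of birth of the member of parliament
- ruling: whether the member of parliament belonged to the ruling coalition or the opposition at the time of the speech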
Citation Information
Please cite the following paper:
@article{Mochtak_Rupnik_Ljubešić_2023,
  title={The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings},
  rights={All rights reserved},
  url={http://arxiv.org/abs/2309.09783},
  abstractNote={Sentiments inherently drive politics. How we receive and process information plays an essential role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses. If sentiment plays such an important role in politics, how can we study and measure it systematically? The paper presents a new dataset of sentiment-annotated sentences, which are used in a series of experiments focused on training a robust sentiment classifier for parliamentary proceedings. The paper also introduces the first domain-specific LLM for political science applications additionally pre-trained on 1.72 billion domain-specific words from proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training of LLM on parliamentary data can significantly improve the model downstream performance on the domain-specific tasks, in our case, sentiment detection in parliamentary proceedings. We further show that multilingual models perform very well on unseen languages and that additional data from other languages significantly improves the target parliament’s results. The paper makes an important contribution to multiple domains of social sciences and bridges them with computer science and computational linguistics. Lastly, it sets up a more robust approach to sentiment analysis of political texts in general, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.},
  note={arXiv:2309.09783 [cs]},
  number={arXiv:2309.09783},
  publisher={arXiv},
  author={Mochtak, Michal and Rupnik, Peter and Ljubešić, Nikola},
  year={2023},
  month={Sep},
  language={en}
}




