sagteam/cedr_v1

Name: sagteam/cedr_v1
Creator: sagteam
Published: 2024-01-18 14:11:21
License: Apache-2.0

Source: Hugging Face. Updated: 2024-01-18. Indexed: 2024-05-25.

Download link:

https://hf-mirror.com/datasets/sagteam/cedr_v1


Resource description:

---
annotations_creators:
- crowdsourced
language_creators:
- found
language:
- ru
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- sentiment-classification
- multi-label-classification
pretty_name: The Corpus for Emotions Detecting in Russian-language text sentences (CEDR)
tags:
- emotion-classification
dataset_info:
- config_name: enriched
  features:
  - name: text
    dtype: string
  - name: labels
    sequence:
      class_label:
        names:
          '0': joy
          '1': sadness
          '2': surprise
          '3': fear
          '4': anger
  - name: source
    dtype: string
  - name: sentences
    list:
      list:
      - name: forma
        dtype: string
      - name: lemma
        dtype: string
  splits:
  - name: train
    num_bytes: 4792338
    num_examples: 7528
  - name: test
    num_bytes: 1182315
    num_examples: 1882
  download_size: 2571516
  dataset_size: 5974653
- config_name: main
  features:
  - name: text
    dtype: string
  - name: labels
    sequence:
      class_label:
        names:
          '0': joy
          '1': sadness
          '2': surprise
          '3': fear
          '4': anger
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 1418343
    num_examples: 7528
  - name: test
    num_bytes: 350263
    num_examples: 1882
  download_size: 945328
  dataset_size: 1768606
configs:
- config_name: enriched
  data_files:
  - split: train
    path: enriched/train-*
  - split: test
    path: enriched/test-*
- config_name: main
  data_files:
  - split: train
    path: main/train-*
  - split: test
    path: main/test-*
  default: true
---

# Dataset Card for [cedr]

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [GitHub](https://github.com/sag111/CEDR)
- **Repository:** [GitHub](https://github.com/sag111/CEDR)
- **Paper:** [ScienceDirect](https://www.sciencedirect.com/science/article/pii/S1877050921013247)
- **Leaderboard:**
- **Point of Contact:** [@sag111](mailto:sag111@mail.ru)

### Dataset Summary

The Corpus for Emotions Detecting in Russian-language text sentences of different social sources (CEDR) contains 9410 comments labeled for 5 emotion categories (joy, sadness, surprise, fear, and anger).

There are two dataset configurations:

- "main" - contains the "text", "labels", and "source" features;
- "enriched" - includes all "main" features plus "sentences".

The dataset comes with predefined train/test splits.

### Supported Tasks and Leaderboards

This dataset is intended for multi-label emotion classification.

### Languages

The data is in Russian.

## Dataset Structure

### Data Instances

Each instance is a text sentence in Russian from one of several sources, with one or more emotion annotations (or no emotion at all).
An example of an instance from the dataset is shown below:

```
{
  'text': 'Забавно как люди в возрасте удивляются входящим звонкам на мобильник)',
  'labels': [0],
  'source': 'twitter',
  'sentences': [
    [
      {'forma': 'Забавно', 'lemma': 'Забавно'},
      {'forma': 'как', 'lemma': 'как'},
      {'forma': 'люди', 'lemma': 'человек'},
      {'forma': 'в', 'lemma': 'в'},
      {'forma': 'возрасте', 'lemma': 'возраст'},
      {'forma': 'удивляются', 'lemma': 'удивляться'},
      {'forma': 'входящим', 'lemma': 'входить'},
      {'forma': 'звонкам', 'lemma': 'звонок'},
      {'forma': 'на', 'lemma': 'на'},
      {'forma': 'мобильник', 'lemma': 'мобильник'},
      {'forma': ')', 'lemma': ')'}
    ]
  ]
}
```

Emotion label codes: {0: "joy", 1: "sadness", 2: "surprise", 3: "fear", 4: "anger"}

### Data Fields

The main configuration includes:

- text: the text of the sentence;
- labels: the emotion annotations;
- source: the tag name of the corresponding source.

In addition to the above, the enriched configuration includes:

- sentences: the text tokenized and lemmatized with [udpipe](https://ufal.mff.cuni.cz/udpipe)
  - 'forma': the original word form;
  - 'lemma': the lemma of this word.

### Data Splits

The dataset comes with predefined train and test splits of 7528 and 1882 examples, respectively.

## Dataset Creation

### Curation Rationale

The dataset consists of sentences in Russian from several sources (blogs, microblogs, news), which allows methods to be developed for analysing various types of texts. The methodology for building the dataset, based on a crowdsourcing service, can be reused to expand the number of examples and so improve the accuracy of supervised classifiers.

### Source Data

#### Initial Data Collection and Normalization

Data was collected from several sources: posts of the LiveJournal social network, texts of the online news agency Lenta.ru, and Twitter microblog posts.

Only sentences containing marker words from the dictionary of [the emotive vocabulary of the Russian language](http://lexrus.ru/default.aspx?p=2876) were selected. The authors manually formed a list of marker words for each emotion by choosing words from different categories of the dictionary. In total, 3069 sentences were selected from LiveJournal posts, 2851 sentences from Lenta.ru, and 3490 sentences from Twitter. After selection, the sentences were offered to annotators for labeling.

#### Who are the source language producers?

Russian-speaking LiveJournal and Twitter users, and authors of news articles on the site lenta.ru.

### Annotations

#### Annotation process

Sentences were annotated with emotion labels with the help of [a crowdsourcing platform](https://yandex.ru/support/toloka/index.html?lang=en).

The annotators were asked: "What emotions did the author express in the sentence?" They were allowed to assign any number of the following emotion labels: "joy", "sadness", "anger", "fear", and "surprise".

If the accuracy of an annotator on the control sentences (including the trial run) fell below 70%, or if the accuracy was less than 66% over the last six control samples, the annotator was dismissed.

Sentences were split into tasks and assigned to annotators so that each sentence was annotated at least three times. A label of a specific emotion was assigned to a sentence if it was chosen by more than half of the annotators.

#### Who are the annotators?

Only users in the top 30% of active users (by the platform's internal rating) who spoke Russian and were over 18 years old were admitted to the annotation process. In addition, before a platform user could be employed as an annotator, they completed a training task, after which they had to mark 25 trial samples with more than 80% agreement with the annotation that the authors had performed themselves.
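The aggregation rule described under "Annotation process" (a label is kept when more than half of the annotators chose it) can be illustrated with a minimal sketch. The function name `aggregate_labels` and the input format (one set of emotion names per annotator) are assumptions made for illustration; this is not the authors' code.

```python
from collections import Counter

# Emotion names in the order of the dataset's label codes 0..4.
EMOTIONS = ("joy", "sadness", "surprise", "fear", "anger")


def aggregate_labels(annotations):
    """Return the emotions chosen by more than half of the annotators.

    `annotations` is a list with one collection of emotion names per
    annotator (hypothetical input format). An empty result means that no
    emotion reached a majority.
    """
    n_annotators = len(annotations)
    # Count each emotion at most once per annotator.
    counts = Counter(e for labels in annotations for e in set(labels))
    return [e for e in EMOTIONS if counts[e] > n_annotators / 2]


# Three annotators: "joy" and "surprise" each get 2 of 3 votes, "sadness" 1.
votes = [{"joy"}, {"joy", "surprise"}, {"surprise", "sadness"}]
print(aggregate_labels(votes))  # ['joy', 'surprise']
```

With three annotators a label needs at least two votes, so a sentence can end up with several labels or with none.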
### Personal and Sensitive Information

The text of the sentences may contain profanity.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Researchers at the AI technology lab at NRC "Kurchatov Institute". See the author [list](https://www.sciencedirect.com/science/article/pii/S1877050921013247).

### Licensing Information

The GitHub repository which houses this dataset has an Apache License 2.0.

### Citation Information

If you have found our results helpful in your work, feel free to cite our publication. This is an updated version of the dataset, the collection and preparation of which is described here:

```
@article{sboev2021data,
  title={Data-Driven Model for Emotion Detection in Russian Texts},
  author={Sboev, Alexander and Naumov, Aleksandr and Rybka, Roman},
  journal={Procedia Computer Science},
  volume={190},
  pages={637--642},
  year={2021},
  publisher={Elsevier}
}
```

### Contributions

Thanks to [@naumov-al](https://github.com/naumov-al) for adding this dataset.

Provider:

sagteam

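Summary of Original Information

Dataset Overview

Dataset Name

Name: The Corpus for Emotions Detecting in Russian-language text sentences (CEDR)

Language

Language: Russian (ru)

License

License: Apache-2.0

Multilinguality

Multilinguality: monolingual

Size Category

Size category: 1K<n<10K

Source Datasets

Source datasets: original

Task Categories

Task categories: text classification

Task IDs

Task IDs:
- sentiment classification
- multi-label classification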

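Dataset Structure

Data Fields

Text field:
- Name: text
- Data type: string
Labels field:
- Name: labels
- Data type: sequence
- Class label names:
  - 0: joy
  - 1: sadness
  - 2: surprise
  - 3: fear
  - 4: anger
Source field:
- Name: source
- Data type: string
Sentences field (enriched configuration only):
- Name: sentences
- Data type: list
- List contents:
  - Name: forma; data type: string
  - Name: lemma; data type: string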
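
Because `labels` is a variable-length sequence of class indices (possibly empty), multi-label classifiers usually need it converted to a fixed-length 0/1 vector. The following is a minimal sketch of that conversion; the helper names `to_multi_hot` and `to_names` are hypothetical and not part of the dataset or of any library.

```python
# Label names in the order of the class codes listed above.
LABEL_NAMES = ["joy", "sadness", "surprise", "fear", "anger"]


def to_multi_hot(labels, num_classes=len(LABEL_NAMES)):
    """Convert a list of class indices, e.g. [0, 2], to a 0/1 vector."""
    vector = [0] * num_classes
    for index in labels:
        vector[index] = 1
    return vector


def to_names(labels):
    """Map class indices to their emotion names."""
    return [LABEL_NAMES[index] for index in labels]


print(to_multi_hot([0]))     # [1, 0, 0, 0, 0]
print(to_multi_hot([]))      # [0, 0, 0, 0, 0], a sentence with no emotion
print(to_names([1, 4]))      # ['sadness', 'anger']
```

An all-zero vector is a valid target here, since a sentence may carry no emotion label at all.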

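Data Splits

The byte counts below are those of the enriched configuration.

Training set:
- Name: train
- Bytes: 4792338
- Examples: 7528
Test set:
- Name: test
- Bytes: 1182315
- Examples: 1882

Download Size and Dataset Size

Download size: 2571516
Dataset size: 5974653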

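Dataset Creation

Annotation Creators

Annotation creators: crowdsourced

Language Creators

Language creators: found

Source Data

Source data collection:
- Sources:
  - posts of the LiveJournal social network
  - texts of the online news agency Lenta.ru
  - Twitter microblog posts
Data selection:
- Selection criterion: sentences containing marker words chosen from a dictionary of the emotive vocabulary of the Russian language

Annotations

Annotation process:
- Platform: a crowdsourcing platform
- Task: label the emotions expressed in each sentence
- Emotion labels: joy, sadness, anger, fear, surprise
- Quality control: annotators were checked on control sentences and had to keep their accuracy at 70% or higher
Annotators:
- Requirements: Russian-speaking, over 18 years old, in the top 30% of users by the platform's internal rating
- Training: complete a training task, then mark 25 trial samples with more than 80% agreement with the authors' annotation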
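
The dataset card states the quality-control rule in more detail: an annotator was dismissed if accuracy on control sentences fell below 70%, or below 66% over the last six control samples. A rough sketch of such a check follows. The function `should_dismiss` and its input (a list of booleans, one per control sentence, in chronological order) are assumptions for illustration, as is the choice to apply the six-sample rule only once six results exist; the platform's actual logic is not described in the source.

```python
def should_dismiss(control_results, overall_min=0.70, recent_min=0.66, window=6):
    """Decide whether an annotator fails the control-sentence checks.

    `control_results` holds True for each correctly labeled control
    sentence and False otherwise, oldest first (hypothetical format).
    """
    if not control_results:
        return False  # nothing to judge yet

    overall = sum(control_results) / len(control_results)
    if overall < overall_min:
        return True

    # Assumption: the recent-window rule applies only with a full window.
    if len(control_results) >= window:
        recent = control_results[-window:]
        if sum(recent) / window < recent_min:
            return True
    return False


# Good overall accuracy (13/16) but only 3 of the last 6 correct.
history = [True] * 10 + [False] * 3 + [True] * 3
print(should_dismiss(history))  # True
```

Note that 4 correct out of the last 6 (about 0.667) passes the 66% threshold, while 3 out of 6 does not.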

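Collected Summary

Dataset Introduction

Construction

Resources for emotion analysis of Russian text are relatively scarce in affective computing, and CEDR helps fill this gap. The dataset draws its raw text from several online sources: LiveJournal blogs, Lenta.ru news, and Twitter posts. Sentences were first filtered by marker words from an emotion dictionary, so that the selected sentences were likely to carry emotional content. Crowdsourced annotation followed: screened Russian-speaking annotators assigned any number of emotion labels, each sentence was annotated at least three times, and a label was kept only if more than half of the annotators chose it.

Usage

The dataset is suited to training and evaluating emotion classification models for Russian text. It can be loaded directly from the Hugging Face Hub with either the main or the enriched configuration. The predefined train and test splits support supervised learning. The main configuration suits end-to-end deep learning models, while the word forms and lemmas in the enriched configuration can support feature engineering or rule-based models. The crowdsourced annotation methodology can also be reused to enlarge the dataset or to adapt it to other languages.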
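
A minimal sketch of loading a configuration and reading the lemmas from an enriched record is given below. It assumes the Hugging Face `datasets` library is installed and that the repository id `sagteam/cedr_v1` from this page is reachable; `load_cedr` and `lemmas` are hypothetical helper names, and the sample record is a shortened version of the instance shown in the dataset card.

```python
def load_cedr(config="main", split="train"):
    """Load one CEDR configuration ("main" or "enriched") from the Hub.

    The import is done lazily so that the pure-Python helper below can be
    used without the `datasets` library installed.
    """
    from datasets import load_dataset

    return load_dataset("sagteam/cedr_v1", config, split=split)


def lemmas(record):
    """Flatten the nested `sentences` field of an enriched record into lemmas."""
    return [token["lemma"] for sentence in record["sentences"] for token in sentence]


# Shortened example record in the structure of the enriched configuration.
sample = {
    "text": "люди в возрасте",
    "labels": [0],
    "source": "twitter",
    "sentences": [[
        {"forma": "люди", "lemma": "человек"},
        {"forma": "в", "lemma": "в"},
        {"forma": "возрасте", "lemma": "возраст"},
    ]],
}
print(lemmas(sample))  # ['человек', 'в', 'возраст']

# Typical use, which requires network access:
# train = load_cedr("enriched", "train")
# print(lemmas(train[0]))
```

The lemma list is one example of the kind of feature the enriched configuration makes available for feature engineering.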

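Background and Challenges

Background

In natural language processing, sentiment analysis has long focused on identifying the polarity of a text. As research has progressed, fine-grained emotion analysis, and multi-label emotion classification in particular, has drawn growing attention. CEDR (the Corpus for Emotions Detecting in Russian-language text sentences) was created in 2021 by researchers at the AI technology lab of NRC "Kurchatov Institute" in this context. The dataset targets the automatic recognition of emotional states in Russian text and covers five basic emotion categories: joy, sadness, surprise, fear, and anger. Its sources are varied, combining blog, news, and microblog texts, which makes it a resource for developing methods that can handle different types of text.

Current Challenges

The core problem CEDR addresses is multi-label emotion classification in Russian. Unlike single-label classification, the multi-label setting requires a model to recognize several emotions that may co-occur in one text, which places higher demands on its representations and decision boundaries. Building the dataset involved its own challenges. The first was obtaining high-quality annotations: the team used a crowdsourcing platform with a strict annotator screening and quality-control procedure, including accuracy thresholds on control sentences and majority voting. The second was data collection itself: sentences containing specific emotion marker words had to be selected from heterogeneous social media and news sources using an emotion dictionary, which required both knowledge of the Russian language and suitable data processing.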

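Common Scenarios

Typical Use

In Russian natural language processing, CEDR is a resource for multi-label emotion classification. It gathers Russian sentences from social media, blogs, and news, labeled with five basic emotions: joy, sadness, surprise, fear, and anger. Researchers typically use it to train and evaluate deep learning models, such as Transformer-based architectures, that recognize complex and co-occurring emotions in text. With its predefined train and test splits, the dataset can be used to check how well a model generalizes across text sources and can serve as a benchmark for Russian emotion analysis.

Academic Problems Addressed

CEDR addresses two problems in Russian emotion analysis: the scarcity of data and the lack of a standardized annotation scheme. Its crowdsourced annotation procedure mitigates the inconsistency that the subjectivity of emotion introduces and provides an example of annotated data for multi-label classification. The dataset supports research on cross-domain emotion transfer, fine-grained emotion detection, and the modeling of language-specific features. Because it is publicly available, it lowers the barrier to entry and makes it easier to compare emotion models across languages.

Practical Applications

In practice, CEDR provides a data basis for Russian social media monitoring, customer feedback analysis, and the development of mental health support tools. Companies could use models trained on it to identify the emotions expressed in user comments, improve products and services, or flag shifts in public opinion. In clinical psychology, such models could assist in analysing emotional signals in text as one input for screening mood disorders. Its multi-source composition also supports the development of emotion analysis systems that adapt to different writing styles, which may improve robustness in deployment.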

Recent Research on the Dataset