google-research-datasets/go_emotions | Emotion analysis dataset | Text classification dataset

Source: hugging_face, updated 2024-01-04, indexed 2024-06-15

Emotion analysis

Text classification

Download link:

https://hf-mirror.com/datasets/google-research-datasets/go_emotions

Resource description:

---
annotations_creators:
- crowdsourced
language_creators:
- found
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
- multi-label-classification
paperswithcode_id: goemotions
pretty_name: GoEmotions
config_names:
- raw
- simplified
tags:
- emotion
dataset_info:
- config_name: raw
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: author
    dtype: string
  - name: subreddit
    dtype: string
  - name: link_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: created_utc
    dtype: float32
  - name: rater_id
    dtype: int32
  - name: example_very_unclear
    dtype: bool
  - name: admiration
    dtype: int32
  - name: amusement
    dtype: int32
  - name: anger
    dtype: int32
  - name: annoyance
    dtype: int32
  - name: approval
    dtype: int32
  - name: caring
    dtype: int32
  - name: confusion
    dtype: int32
  - name: curiosity
    dtype: int32
  - name: desire
    dtype: int32
  - name: disappointment
    dtype: int32
  - name: disapproval
    dtype: int32
  - name: disgust
    dtype: int32
  - name: embarrassment
    dtype: int32
  - name: excitement
    dtype: int32
  - name: fear
    dtype: int32
  - name: gratitude
    dtype: int32
  - name: grief
    dtype: int32
  - name: joy
    dtype: int32
  - name: love
    dtype: int32
  - name: nervousness
    dtype: int32
  - name: optimism
    dtype: int32
  - name: pride
    dtype: int32
  - name: realization
    dtype: int32
  - name: relief
    dtype: int32
  - name: remorse
    dtype: int32
  - name: sadness
    dtype: int32
  - name: surprise
    dtype: int32
  - name: neutral
    dtype: int32
  splits:
  - name: train
    num_bytes: 55343102
    num_examples: 211225
  download_size: 24828322
  dataset_size: 55343102
- config_name: simplified
  features:
  - name: text
    dtype: string
  - name: labels
    sequence:
      class_label:
        names:
          '0': admiration
          '1': amusement
          '2': anger
          '3': annoyance
          '4': approval
          '5': caring
          '6': confusion
          '7': curiosity
          '8': desire
          '9': disappointment
          '10': disapproval
          '11': disgust
          '12': embarrassment
          '13': excitement
          '14': fear
          '15': gratitude
          '16': grief
          '17': joy
          '18': love
          '19': nervousness
          '20': optimism
          '21': pride
          '22': realization
          '23': relief
          '24': remorse
          '25': sadness
          '26': surprise
          '27': neutral
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 4224138
    num_examples: 43410
  - name: validation
    num_bytes: 527119
    num_examples: 5426
  - name: test
    num_bytes: 524443
    num_examples: 5427
  download_size: 3464371
  dataset_size: 5275700
configs:
- config_name: raw
  data_files:
  - split: train
    path: raw/train-*
- config_name: simplified
  data_files:
  - split: train
    path: simplified/train-*
  - split: validation
    path: simplified/validation-*
  - split: test
    path: simplified/test-*
  default: true
---

# Dataset Card for GoEmotions

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/google-research/google-research/tree/master/goemotions
- **Repository:** https://github.com/google-research/google-research/tree/master/goemotions
- **Paper:** https://arxiv.org/abs/2005.00547
- **Leaderboard:**
- **Point of Contact:** [Dora Demszky](https://nlp.stanford.edu/~ddemszky/index.html)

### Dataset Summary

The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. The raw data is included as well as the smaller, simplified version of the dataset with predefined train/val/test splits.

### Supported Tasks and Leaderboards

This dataset is intended for multi-class, multi-label emotion classification.

### Languages

The data is in English.

## Dataset Structure

### Data Instances

Each instance is a Reddit comment with a corresponding ID and one or more emotion annotations (or neutral).

### Data Fields

The simplified configuration includes:
- `text`: the Reddit comment
- `labels`: the emotion annotations
- `comment_id`: unique identifier of the comment, stored as `id` in the dataset features (can be used to look up the entry in the raw dataset)

In addition to the above, the raw data includes:
* `author`: The Reddit username of the comment's author.
* `subreddit`: The subreddit that the comment belongs to.
* `link_id`: The link id of the comment.
* `parent_id`: The parent id of the comment.
* `created_utc`: The timestamp of the comment.
* `rater_id`: The unique id of the annotator.
* `example_very_unclear`: Whether the annotator marked the example as being very unclear or difficult to label (in this case they did not choose any emotion labels).

In the raw data, labels are listed as their own columns with binary 0/1 entries rather than a list of ids as in the simplified data.

### Data Splits

The simplified data includes a set of train/val/test splits with 43,410, 5,426, and 5,427 examples respectively.

## Dataset Creation

### Curation Rationale

From the paper abstract:

> Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks.

### Source Data

#### Initial Data Collection and Normalization

Data was collected from Reddit comments via a variety of automated methods discussed in 3.1 of the paper.

#### Who are the source language producers?

English-speaking Reddit users.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

Annotations were produced by 3 English-speaking crowdworkers in India.

### Personal and Sensitive Information

This dataset includes the original usernames of the Reddit users who posted each comment. Although Reddit usernames are typically dissociated from personal real-world identities, this is not always the case. It may therefore be possible to discover the identities of the individuals who created this content in some cases.

## Considerations for Using the Data

### Social Impact of Dataset

Emotion detection is a worthwhile problem which can potentially lead to improvements such as better human/computer interaction. However, emotion detection algorithms (particularly in computer vision) have been abused in some cases to make erroneous inferences in human monitoring and assessment applications such as hiring decisions, insurance pricing, and student attentiveness (see [this article](https://www.unite.ai/ai-now-institute-warns-about-misuse-of-emotion-detection-software-and-other-ethical-issues/)).

### Discussion of Biases

From the authors' GitHub page:

> Potential biases in the data include: Inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model. Anyone using this dataset should be aware of these limitations of the dataset.
### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Researchers at Amazon Alexa, Google Research, and Stanford. See the [author list](https://arxiv.org/abs/2005.00547).

### Licensing Information

The GitHub repository which houses this dataset has an [Apache License 2.0](https://github.com/google-research/google-research/blob/master/LICENSE).

### Citation Information

@inproceedings{demszky2020goemotions,
 author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
 booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
 title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
 year = {2020}
}

### Contributions

Thanks to [@joeddav](https://github.com/joeddav) for adding this dataset.

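Provider:

google-research-datasets

Summary of original information

GoEmotions dataset overview

Dataset description

Dataset summary

The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. It includes the raw data as well as a smaller, simplified version with predefined train/validation/test splits.

Supported tasks and leaderboards

This dataset is intended for multi-class, multi-label emotion classification.

Languages

The data is in English.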

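Dataset structure

Data instances

Each instance is a Reddit comment with a corresponding ID and one or more emotion annotations (or neutral).

Data fields

The simplified configuration includes:

text: the Reddit comment
labels: the emotion annotations
comment_id: unique identifier of the comment, stored as id in the dataset features (can be used to look up the entry in the raw dataset)

In addition to the above, the raw data includes:

author: the Reddit username of the comment's author
subreddit: the subreddit that the comment belongs to
link_id: the link id of the comment
parent_id: the parent id of the comment
created_utc: the timestamp of the comment
rater_id: the unique id of the annotator
example_very_unclear: whether the annotator marked the example as very unclear or difficult to label (in this case they did not choose any emotion labels)

In the raw data, each label is its own column with binary 0/1 entries, rather than a list of ids as in the simplified data.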
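The two label layouts described above can be converted into each other. The following is a minimal sketch, assuming the id order given by the class_label mapping in the dataset metadata (0 = admiration through 27 = neutral); the function names and the example row are hypothetical and not part of the dataset or any library.

```python
# Emotion names in the id order of the simplified configuration (ids 0-27).
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def raw_row_to_label_ids(row):
    """Return the ids of every emotion column set to 1 in a raw-style row."""
    return [i for i, name in enumerate(EMOTIONS) if row.get(name, 0) == 1]


def label_ids_to_columns(label_ids):
    """Expand a list of label ids into the 0/1 column layout of the raw data."""
    chosen = set(label_ids)
    return {name: int(i in chosen) for i, name in enumerate(EMOTIONS)}


# Hypothetical raw-style row for illustration, not a real record.
example_row = {"text": "Thanks, that made my day!", "gratitude": 1, "joy": 1}
print(raw_row_to_label_ids(example_row))  # [15, 17]
```

Note that the raw configuration holds one row per annotator judgment (it has a rater_id column), so a conversion like this applies to a single judgment, not to an aggregated label for the comment.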

Data splits

The simplified data includes train/validation/test splits with 43,410, 5,426, and 5,427 examples respectively.
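As a quick arithmetic check on these figures, the short sketch below sums the split sizes and prints each split's share. It only uses the numbers stated above; the variable names are illustrative.

```python
# Split sizes as stated for the simplified configuration.
SPLITS = {"train": 43410, "validation": 5426, "test": 5427}

total = sum(SPLITS.values())
shares = {name: count / total for name, count in SPLITS.items()}

print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The splits add up to 54,263 examples, which works out to roughly an 80/10/10 division.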

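Dataset creation

Curation rationale

From the paper abstract:

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks.

Source data

Initial data collection and normalization

Data was collected from Reddit comments via a variety of automated methods, discussed in Section 3.1 of the paper.

Who are the source language producers?

English-speaking Reddit users.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

Annotations were produced by 3 English-speaking crowdworkers in India.

Personal and sensitive information

This dataset includes the original usernames of the Reddit users who posted each comment. Although Reddit usernames are typically dissociated from personal real-world identities, this is not always the case. It may therefore be possible, in some cases, to discover the identities of the individuals who created this content.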

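Considerations for using the data

Social impact of dataset

Emotion detection is a worthwhile problem that can potentially lead to improvements such as better human/computer interaction. However, emotion detection algorithms (particularly in computer vision) have been abused in some cases to make erroneous inferences in human monitoring and assessment applications such as hiring decisions, insurance pricing, and student attentiveness (see the article at https://www.unite.ai/ai-now-institute-warns-about-misuse-of-emotion-detection-software-and-other-ethical-issues/).

Discussion of biases

From the authors' GitHub page:

Potential biases in the data include: Inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model. Anyone using this dataset should be aware of these limitations of the dataset.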
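Because the note above says these biases likely affect precision and recall, one way to look for uneven behavior is to compute both metrics separately for each emotion label. The sketch below is a plain-Python illustration for multi-label predictions; the function name and the toy gold/predicted lists are hypothetical, and this is not the evaluation procedure used by the dataset authors.

```python
def per_label_precision_recall(gold, predicted, num_labels):
    """Compute (precision, recall) for each label id on multi-label data.

    gold and predicted are equally long lists of label-id collections,
    one entry per example. A ratio with a zero denominator is None.
    """
    stats = []
    for label in range(num_labels):
        tp = fp = fn = 0
        for g, p in zip(gold, predicted):
            in_gold, in_pred = label in g, label in p
            tp += in_gold and in_pred
            fp += in_pred and not in_gold
            fn += in_gold and not in_pred
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        stats.append((precision, recall))
    return stats


# Toy data for illustration only (ids follow the 0-27 label mapping).
gold = [[15, 17], [2], [27]]
predicted = [[15], [2, 3], [27]]
stats = per_label_precision_recall(gold, predicted, num_labels=28)
print(stats[15])  # (1.0, 1.0): gratitude predicted whenever it was annotated
print(stats[17])  # (None, 0.0): joy was annotated once but never predicted
```

Labels with few annotated examples will give noisy estimates, so per-label numbers are best read alongside each label's support.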

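Other known limitations

[More Information Needed]

Additional information

Dataset curators

Researchers at Amazon Alexa, Google Research, and Stanford. See the author list (https://arxiv.org/abs/2005.00547).

Licensing information

The GitHub repository that houses this dataset is under the Apache License 2.0 (https://github.com/google-research/google-research/blob/master/LICENSE).

Citation information

@inproceedings{demszky2020goemotions,
 author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
 booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
 title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
 year = {2020}
}

Contributions

Thanks to @joeddav for adding this dataset.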

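AI-compiled summary

Dataset introduction

Construction

GoEmotions is built on fine-grained annotation of 58k Reddit comments, covering 27 emotion categories plus a neutral category. The data was first collected from Reddit by automated methods and then labeled for emotion by 3 English-speaking crowdworkers in India. The raw data includes details such as usernames and comment timestamps, while the simplified dataset provides predefined training, validation, and test splits for convenient use.

Usage

Users of GoEmotions can choose between the raw and the simplified data according to their needs. The raw data carries more fields and suits analyses that need that detail; the simplified data provides ready-made splits and suits quick model training and evaluation. In either case, keep the dataset's potential biases and limitations in mind when designing experiments, so that results are reliable and valid.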
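Choosing a configuration can be sketched with the Hugging Face datasets library. This is a minimal example, assuming the datasets package is installed and the dataset id google-research-datasets/go_emotions from the download link is reachable; the helper names are hypothetical, and the attribute path used to read the label names is an assumption about how the library exposes a sequence of class labels.

```python
def load_split(config="simplified", split="train"):
    """Load one split of a GoEmotions configuration ("raw" or "simplified").

    Needs the Hugging Face `datasets` package and network access, so the
    import is done lazily inside the function.
    """
    from datasets import load_dataset

    return load_dataset("google-research-datasets/go_emotions", config, split=split)


def decode_labels(label_ids, names):
    """Map integer label ids to their emotion names."""
    return [names[i] for i in label_ids]


def show_first_example():
    """Print the first training comment with its decoded labels (downloads data)."""
    train = load_split("simplified", "train")
    # Assumed path to the class-label names inside the `labels` sequence feature.
    names = train.features["labels"].feature.names
    first = train[0]
    print(first["text"], decode_labels(first["labels"], names))


# The decoding helper works offline with any list of names.
print(decode_labels([0, 2], ["admiration", "amusement", "anger"]))  # ['admiration', 'anger']
```

Calling show_first_example() downloads the simplified configuration; passing "raw" to load_split returns the single train split with the per-emotion columns instead.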

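Background and challenges

Background overview

GoEmotions was created jointly by researchers at Google Research, Amazon Alexa, and Stanford to advance research on, and applications of, emotion understanding. It contains 58,000 carefully curated Reddit comments labeled for 27 emotion categories or neutral, and its fine-grained emotion taxonomy is adaptable to multiple downstream tasks. Since its release in 2020, GoEmotions has become an important resource for emotion analysis in natural language processing, supporting work such as building empathetic chatbots and detecting harmful online behavior.

Current challenges

Building GoEmotions involved several challenges. First, the diversity and complexity of the source data required a variety of automated methods for collection and normalization. Second, emotion annotation is subjective, so labels may be biased, particularly across cultural backgrounds. Third, the user information in the dataset may raise privacy concerns and needs careful handling. On the research side, open challenges include improving the accuracy and robustness of emotion classification and reducing algorithmic bias to ensure fairness and transparency.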

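Common scenarios

Classic use cases

In natural language processing, GoEmotions is noted for its fine-grained emotion taxonomy, and its classic use is in sentiment analysis and emotion recognition. Researchers can use it to train models that recognize specific emotions expressed in text, such as joy, sadness, and anger, with the aim of making human-computer interaction more natural.

Academic problems addressed

GoEmotions addresses the research need for fine-grained emotion annotation. Earlier work often distinguished only positive from negative sentiment, whereas this dataset provides 27 distinct emotion categories, which lets research explore the diversity and complexity of emotion in more depth and supports progress in affective computing and natural language understanding.

Practical applications

In practice, GoEmotions can be used to develop chatbots that respond more appropriately to users' emotions; in social media analysis, to monitor and assess public sentiment in support of market-trend analysis and crisis management; and in education, to build applications that recognize and respond to students' emotions.

Recent research on the dataset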