go_emotions
Download link:
https://modelscope.cn/datasets/google-research-datasets/go_emotions
# Dataset Card for GoEmotions
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/google-research/google-research/tree/master/goemotions
- **Repository:** https://github.com/google-research/google-research/tree/master/goemotions
- **Paper:** https://arxiv.org/abs/2005.00547
- **Leaderboard:**
- **Point of Contact:** [Dora Demszky](https://nlp.stanford.edu/~ddemszky/index.html)
### Dataset Summary
The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral.
The raw data is included as well as the smaller, simplified version of the dataset with predefined train/val/test
splits.
### Supported Tasks and Leaderboards
This dataset is intended for multi-class, multi-label emotion classification.
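
In a multi-label setup, each comment's list of label ids is usually turned into a fixed-length 0/1 target vector. The sketch below shows one way to do that. It assumes label ids are 0-based integers over 28 classes (27 emotions plus Neutral); the helper name `to_multi_hot` and the ids used are illustrative only, so check the real id range against the dataset's own label list.

```python
from typing import List

# Assumption: 27 emotion categories + Neutral = 28 classes, ids 0..27.
NUM_LABELS = 28


def to_multi_hot(label_ids: List[int], num_labels: int = NUM_LABELS) -> List[float]:
    """Turn a list of label ids into a fixed-length 0/1 target vector."""
    target = [0.0] * num_labels
    for label_id in label_ids:
        if not 0 <= label_id < num_labels:
            raise ValueError(f"label id {label_id} is outside [0, {num_labels})")
        target[label_id] = 1.0
    return target


# Hypothetical label ids, for illustration only.
single = to_multi_hot([27])      # one label set
multiple = to_multi_hot([2, 3])  # two labels set on the same comment
print(len(single), sum(single), sum(multiple))
```

A vector like this pairs naturally with a per-class sigmoid output and a binary cross-entropy loss, since more than one position can be active at once.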
### Languages
The data is in English.
## Dataset Structure
### Data Instances
Each instance is a Reddit comment with a corresponding ID and one or more emotion annotations (or neutral).
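
As a rough illustration, a simplified-configuration instance can be modeled as a small record type. This is a sketch, not the dataset's actual schema: the class name `SimplifiedExample`, the 0..27 id range, and the placeholder values are assumptions, and the field names follow the Data Fields section of this card.

```python
from dataclasses import dataclass
from typing import List

NUM_LABELS = 28  # assumption: 27 emotions + Neutral, ids 0..27


@dataclass(frozen=True)
class SimplifiedExample:
    """One instance: a comment, its label ids, and its identifier."""

    text: str
    labels: List[int]
    comment_id: str

    def __post_init__(self) -> None:
        # Every instance carries one or more annotations (or neutral).
        if not self.labels:
            raise ValueError("an instance needs at least one label id")
        if any(not 0 <= label_id < NUM_LABELS for label_id in self.labels):
            raise ValueError("label id out of range")


# Placeholder values, not taken from the dataset.
example = SimplifiedExample(
    text="placeholder comment text",
    labels=[27],
    comment_id="placeholder-id",
)
print(example.labels)
```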
### Data Fields
The simplified configuration includes:
- `text`: the text of the Reddit comment
- `labels`: the emotion annotations, given as a list of label ids
- `comment_id`: unique identifier of the comment (can be used to look up the entry in the raw dataset)
In addition to the above, the raw data includes:
* `author`: The Reddit username of the comment's author.
* `subreddit`: The subreddit that the comment belongs to.
* `link_id`: The link id of the comment.
* `parent_id`: The parent id of the comment.
* `created_utc`: The timestamp of the comment.
* `rater_id`: The unique id of the annotator.
* `example_very_unclear`: Whether the annotator marked the example as being very unclear or difficult to label (in this
case they did not choose any emotion labels).
In the raw data, labels are listed as their own columns with binary 0/1 entries rather than a list of ids as in the
simplified data.
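
The difference between the two label layouts can be sketched as a small conversion from raw-style 0/1 columns to a list of ids. The column list below is a hypothetical four-label subset, and the ids it yields are positions in that subset, not the dataset's real ids; the function name `raw_row_to_label_ids` and the sample rows are likewise made up for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical subset of label columns; the raw data has one column per label.
LABEL_COLUMNS = ["admiration", "amusement", "anger", "neutral"]


def raw_row_to_label_ids(row: Dict[str, object], label_columns: Sequence[str]) -> List[int]:
    """Collect the positions of all label columns set to 1 in a raw-style row."""
    return [index for index, name in enumerate(label_columns) if row[name] == 1]


# Made-up rows in the raw layout.
labeled_row = {
    "example_very_unclear": False,
    "admiration": 0, "amusement": 1, "anger": 0, "neutral": 0,
}
unclear_row = {
    "example_very_unclear": True,
    "admiration": 0, "amusement": 0, "anger": 0, "neutral": 0,
}

print(raw_row_to_label_ids(labeled_row, LABEL_COLUMNS))
print(raw_row_to_label_ids(unclear_row, LABEL_COLUMNS))
```

A row flagged `example_very_unclear` has no label column set, so it converts to an empty list; downstream code may want to drop such rows before training.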
### Data Splits
The simplified data includes a set of train/val/test splits with 43,410, 5,426, and 5,427 examples respectively.
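
The split sizes add up to 54,263 examples, which works out to roughly an 80/10/10 partition. A short check of that arithmetic (the dictionary keys are illustrative names, not necessarily the dataset's split identifiers):

```python
# Split sizes as stated in this card; key names are illustrative.
splits = {"train": 43_410, "validation": 5_426, "test": 5_427}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:<10} {count:>6} {shares[name]:6.1%}")
print(f"{'total':<10} {total:>6}")
```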
## Dataset Creation
### Curation Rationale
From the paper abstract:
> Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to
detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a
fine-grained typology, adaptable to multiple downstream tasks.
### Source Data
#### Initial Data Collection and Normalization
Data was collected from Reddit comments via a variety of automated methods discussed in Section 3.1 of the paper.
#### Who are the source language producers?
English-speaking Reddit users.
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
Each comment was assigned to 3 English-speaking crowdworkers in India for annotation.
### Personal and Sensitive Information
This dataset includes the original usernames of the Reddit users who posted each comment. Although Reddit usernames
are typically dissociated from personal real-world identities, this is not always the case. It may therefore be
possible to discover the identities of the individuals who created this content in some cases.
## Considerations for Using the Data
### Social Impact of Dataset
Emotion detection is a worthwhile problem which can potentially lead to improvements such as better human/computer
interaction. However, emotion detection algorithms (particularly in computer vision) have been abused in some cases
to make erroneous inferences in human monitoring and assessment applications such as hiring decisions, insurance
pricing, and student attentiveness (see
[this article](https://www.unite.ai/ai-now-institute-warns-about-misuse-of-emotion-detection-software-and-other-ethical-issues/)).
### Discussion of Biases
From the authors' GitHub page:
> Potential biases in the data include: Inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model. Anyone using this dataset should be aware of these limitations of the dataset.
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
Researchers at Amazon Alexa, Google Research, and Stanford. See the [author list](https://arxiv.org/abs/2005.00547).
### Licensing Information
The GitHub repository that houses this dataset is released under the
[Apache License 2.0](https://github.com/google-research/google-research/blob/master/LICENSE).
### Citation Information
@inproceedings{demszky2020goemotions,
author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
year = {2020}
}
### Contributions
Thanks to [@joeddav](https://github.com/joeddav) for adding this dataset.



