Thaslima/goemotions

Hugging Face. Updated: 2026-03-16. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Thaslima/goemotions
Description:
# GoEmotions

**GoEmotions** is a corpus of 58k carefully curated comments extracted from Reddit, with human annotations for 27 emotion categories or Neutral.

* Number of examples: 58,009.
* Number of labels: 27 + Neutral.
* Maximum sequence length in training and evaluation datasets: 30.

On top of the raw data, we also include a version filtered based on rater agreement, which contains a train/test/validation split:

* Size of training dataset: 43,410.
* Size of test dataset: 5,427.
* Size of validation dataset: 5,426.

The emotion categories are: _admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise_.

For more details on the design and content of the dataset, please see our [paper](https://arxiv.org/abs/2005.00547).

## Data

Our raw dataset can be retrieved by running:

```
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_3.csv
```

See the `data` folder for more detailed data information.

### Data Format

Our raw dataset, split into three CSV files, includes all annotations as well as metadata on the comments. Each row represents a single rater's annotation for a single example. Each file includes the following columns:

* `text`: The text of the comment (with masked tokens, as described in the paper).
* `id`: The unique id of the comment.
* `author`: The Reddit username of the comment's author.
* `subreddit`: The subreddit that the comment belongs to.
* `link_id`: The link id of the comment.
* `parent_id`: The parent id of the comment.
* `created_utc`: The timestamp of the comment.
* `rater_id`: The unique id of the annotator.
* `example_very_unclear`: Whether the annotator marked the example as being very unclear or difficult to label (in this case they did not choose any emotion labels).
* Separate columns representing each of the emotion categories, with binary labels (0 or 1).

The data we used for training the models includes examples where there is agreement between at least 2 raters. Our data includes 43,410 training examples (`train.tsv`), 5,426 dev examples (`dev.tsv`) and 5,427 test examples (`test.tsv`). These files have _no header row_ and have the following columns:

1. text
2. comma-separated list of emotion ids (the ids are indexed based on the order of emotions in `emotions.txt`)
3. id of the comment

### Visualization

[Here](https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html) you can view a t-SNE projection showing a random sample of the data. The plot is generated using PPCA (see scripts below). Each point in the plot represents a single example, and the text and the labels are shown on mouse-hover. The color of each point is the weighted average of the RGB values of those emotions.

## Data Analysis

See each script for more documentation and descriptive command line flags.

* `python3 -m analyze_data`: get high-level statistics of the data and correlation among emotion ratings.
* `python3 -m extract_words`: get the words that are significantly associated with each emotion, in contrast to the other emotions, based on their log odds ratio.
* `python3 -m ppca`: run PPCA [(Cowen et al., 2019)](https://www.nature.com/articles/s41562-019-0533-6) on the data and generate plots.

### Tutorial

We released a [detailed tutorial](https://github.com/tensorflow/models/blob/master/research/seq_flow_lite/demo/colab/emotion_colab.ipynb) for training a neural emotion prediction model.
In it, we work through training a model architecture available on TensorFlow Model Garden using GoEmotions and applying it for the task of suggesting emojis based on conversational text.

## Citation

If you use this code for your publication, please cite the original paper:

```
@inproceedings{demszky2020goemotions,
 author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
 booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
 title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
 year = {2020}
}
```

## Contact

[Dora Demszky](https://nlp.stanford.edu/~ddemszky/index.html)

## Disclaimer

- We are aware that the dataset contains biases and is not representative of global diversity.
- We are aware that the dataset contains potentially problematic content.
- Potential biases in the data include: inherent biases in Reddit and its user base, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model.
- The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team.
- Anyone using this dataset should be aware of these limitations of the dataset.

## Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as <a href="https://g.co/datasetsearch">Google Dataset Search</a>.

<div itemscope itemtype="http://schema.org/Dataset">
<table>
  <tr>
    <th>property</th>
    <th>value</th>
  </tr>
  <tr>
    <td>name</td>
    <td><code itemprop="name">GoEmotions</code></td>
  </tr>
  <tr>
    <td>description</td>
    <td><code itemprop="description">GoEmotions contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. The emotion categories are _admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise_.</code></td>
  </tr>
  <tr>
    <td>sameAs</td>
    <td><code itemprop="sameAs">https://github.com/google-research/google-research/tree/master/goemotions</code></td>
  </tr>
  <tr>
    <td>citation</td>
    <td><code itemprop="citation">https://identifiers.org/arxiv:2005.00547</code></td>
  </tr>
  <tr>
    <td>provider</td>
    <td>
      <div itemscope="" itemtype="http://schema.org/Organization" itemprop="provider">
        <table>
          <tbody><tr>
            <th>property</th>
            <th>value</th>
          </tr>
          <tr>
            <td>name</td>
            <td><code itemprop="name">Google</code></td>
          </tr>
          <tr>
            <td>sameAs</td>
            <td><code itemprop="sameAs">https://en.wikipedia.org/wiki/Google</code></td>
          </tr>
        </tbody></table>
      </div>
    </td>
  </tr>
</table>
</div>
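
## Example: Reading the Filtered Splits

The header-less, three-column layout of `train.tsv`, `dev.tsv`, and `test.tsv` described under Data Format can be parsed with the Python standard library alone. The following is a minimal sketch, not the authors' loading code: the function names `load_emotion_names` and `read_split` are hypothetical, and the in-memory strings are made-up stand-ins for `emotions.txt` and a split file so the snippet runs without downloading anything. It assumes `emotions.txt` holds one emotion name per line, and that its order defines the label ids, as the README states.

```python
import csv
import io


def load_emotion_names(handle):
    """Return emotion names in file order (assumed: one name per line)."""
    return [line.strip() for line in handle if line.strip()]


def read_split(handle, emotion_names):
    """Parse a header-less split: text, comma-separated emotion ids, comment id."""
    rows = []
    # QUOTE_NONE keeps quotation marks inside Reddit comments untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # skip blank lines
            continue
        text, label_ids, comment_id = row
        labels = [emotion_names[int(i)] for i in label_ids.split(",")]
        rows.append({"text": text, "labels": labels, "id": comment_id})
    return rows


# Made-up stand-ins for emotions.txt and train.tsv (illustration only).
emotions_file = io.StringIO("admiration\namusement\nanger\n")
split_file = io.StringIO(
    "That was great!\t0\tabc123\n"
    "Ha, nice one\t0,1\tdef456\n"
)

names = load_emotion_names(emotions_file)
examples = read_split(split_file, names)
print(examples[1]["labels"])  # ['admiration', 'amusement']
```

To read the real files, replace the `io.StringIO` objects with `open("emotions.txt", encoding="utf-8")` and `open("train.tsv", encoding="utf-8", newline="")`, adjusting the paths to wherever the files are stored locally.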

Provider:
Thaslima