
antoniomenezes/go_emotions_ptbr

Hugging Face | Updated 2022-11-21 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/antoniomenezes/go_emotions_ptbr
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- found
language:
- en
- pt
license:
- apache-2.0
multilinguality:
- 2 languages
size_categories:
- 100K<n<1M
- 10K<n<100K
source_datasets:
- modified
task_categories:
- text-classification
task_ids:
- multi-class-classification
- multi-label-classification
paperswithcode_id: goemotions
pretty_name: GoEmotions
configs:
- raw
- simplified
tags:
- emotion
dataset_info:
- config_name: raw
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: author
    dtype: string
  - name: subreddit
    dtype: string
  - name: link_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: created_utc
    dtype: float32
  - name: rater_id
    dtype: int32
  - name: example_very_unclear
    dtype: bool
  - name: admiration
    dtype: int32
  - name: amusement
    dtype: int32
  - name: anger
    dtype: int32
  - name: annoyance
    dtype: int32
  - name: approval
    dtype: int32
  - name: caring
    dtype: int32
  - name: confusion
    dtype: int32
  - name: curiosity
    dtype: int32
  - name: desire
    dtype: int32
  - name: disappointment
    dtype: int32
  - name: disapproval
    dtype: int32
  - name: disgust
    dtype: int32
  - name: embarrassment
    dtype: int32
  - name: excitement
    dtype: int32
  - name: fear
    dtype: int32
  - name: gratitude
    dtype: int32
  - name: grief
    dtype: int32
  - name: joy
    dtype: int32
  - name: love
    dtype: int32
  - name: nervousness
    dtype: int32
  - name: optimism
    dtype: int32
  - name: pride
    dtype: int32
  - name: realization
    dtype: int32
  - name: relief
    dtype: int32
  - name: remorse
    dtype: int32
  - name: sadness
    dtype: int32
  - name: surprise
    dtype: int32
  - name: neutral
    dtype: int32
  - name: texto
    dtype: string
  splits:
  - name: train
    num_bytes: 55343630
    num_examples: 211225
  download_size: 42742918
  dataset_size: 55343630
- config_name: simplified
  features:
  - name: text
    dtype: string
  - name: labels
    sequence:
      class_label:
        names:
          0: admiration
          1: amusement
          2: anger
          3: annoyance
          4: approval
          5: caring
          6: confusion
          7: curiosity
          8: desire
          9: disappointment
          10: disapproval
          11: disgust
          12: embarrassment
          13: excitement
          14: fear
          15: gratitude
          16: grief
          17: joy
          18: love
          19: nervousness
          20: optimism
          21: pride
          22: realization
          23: relief
          24: remorse
          25: sadness
          26: surprise
          27: neutral
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 4224198
    num_examples: 43410
  - name: validation
    num_bytes: 527131
    num_examples: 5426
  - name: test
    num_bytes: 524455
    num_examples: 5427
  download_size: 4394818
  dataset_size: 5275784
---

# Dataset Card for GoEmotions

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/google-research/google-research/tree/master/goemotions
- **Repository:** https://github.com/google-research/google-research/tree/master/goemotions
- **Paper:** https://arxiv.org/abs/2005.00547
- **Leaderboard:**
- **Point of Contact:** [Dora Demszky](https://nlp.stanford.edu/~ddemszky/index.html)

### Dataset Summary

The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral.
The raw data is included, along with a smaller, simplified version of the dataset with predefined train/validation/test splits.

### Supported Tasks and Leaderboards

This dataset is intended for multi-class, multi-label emotion classification.

### Languages

The data is in English and Brazilian Portuguese (translated with Google Translate).

## Dataset Structure

### Data Instances

Each instance is a Reddit comment with a corresponding ID and one or more emotion annotations (or neutral).

### Data Fields

The simplified configuration includes:
- `text`: the Reddit comment
- `texto`: the Reddit comment in Portuguese
- `labels`: the emotion annotations
- `id`: unique identifier of the comment (can be used to look up the entry in the raw dataset)

In addition to the above, the raw data includes:
* `author`: The Reddit username of the comment's author.
* `subreddit`: The subreddit that the comment belongs to.
* `link_id`: The link id of the comment.
* `parent_id`: The parent id of the comment.
* `created_utc`: The timestamp of the comment.
* `rater_id`: The unique id of the annotator.
* `example_very_unclear`: Whether the annotator marked the example as being very unclear or difficult to label (in this case they did not choose any emotion labels).

In the raw data, each label is its own column with a binary 0/1 entry, rather than appearing in a list of ids as in the simplified data.

### Data Splits

The simplified data includes train/validation/test splits with 43,410, 5,426, and 5,427 examples respectively.

## Dataset Creation

### Curation Rationale

From the paper abstract:

> Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks.
### Source Data

#### Initial Data Collection and Normalization

Data was collected from Reddit comments via a variety of automated methods discussed in Section 3.1 of the paper.

#### Who are the source language producers?

English-speaking Reddit users.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

Annotations were produced by 3 English-speaking crowdworkers in India.

### Personal and Sensitive Information

This dataset includes the original usernames of the Reddit users who posted each comment. Although Reddit usernames are typically dissociated from real-world identities, this is not always the case. In some cases it may therefore be possible to discover the identities of the individuals who created this content.

## Considerations for Using the Data

### Social Impact of Dataset

Emotion detection is a worthwhile problem which can potentially lead to improvements such as better human/computer interaction. However, emotion detection algorithms (particularly in computer vision) have been abused in some cases to make erroneous inferences in human monitoring and assessment applications such as hiring decisions, insurance pricing, and student attentiveness (see [this article](https://www.unite.ai/ai-now-institute-warns-about-misuse-of-emotion-detection-software-and-other-ethical-issues/)).

### Discussion of Biases

From the authors' GitHub page:

> Potential biases in the data include: Inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model. Anyone using this dataset should be aware of these limitations of the dataset.
### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Researchers at Amazon Alexa, Google Research, and Stanford. See the [author list](https://arxiv.org/abs/2005.00547).

### Licensing Information

The GitHub repository which houses this dataset has an [Apache License 2.0](https://github.com/google-research/google-research/blob/master/LICENSE).

### Citation Information

    @inproceedings{demszky2020goemotions,
      author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
      booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
      title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
      year = {2020}
    }

### Contributions

Thanks to [@joeddav](https://github.com/joeddav) for adding this dataset. Thanks to [@antoniomenezes](https://github.com/antoniomenezes) for extending this dataset.
Provider:
antoniomenezes
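Summary of Original Information

Dataset Overview

Basic Information

  • Name: GoEmotions
  • Languages: English (en), Portuguese (pt)
  • License: Apache-2.0
  • Multilinguality: 2 languages
  • Size:
    • 100K<n<1M
    • 10K<n<100K

Dataset Structure

  • Task category: text classification
  • Task IDs:
    • multi-class classification
    • multi-label classification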
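
Because a comment can carry more than one emotion, multi-label training code commonly represents the targets as a fixed-length 0/1 vector. The following is a minimal sketch of that encoding, assuming 28 classes (the 27 emotion categories plus neutral listed in the card); the function name `to_multi_hot` is hypothetical and not part of the dataset or any library.

```python
from typing import List

# 27 emotion categories plus neutral, per the dataset card.
NUM_LABELS = 28


def to_multi_hot(label_ids: List[int], num_labels: int = NUM_LABELS) -> List[int]:
    """Turn a list of label ids into a fixed-length 0/1 target vector."""
    vector = [0] * num_labels
    for label_id in label_ids:
        if not 0 <= label_id < num_labels:
            raise ValueError(f"label id {label_id} is outside 0..{num_labels - 1}")
        vector[label_id] = 1
    return vector


# Hypothetical example: a comment labeled with ids 0 and 15.
example = to_multi_hot([0, 15])
print(len(example), sum(example))
```

A vector like this pairs naturally with a per-class sigmoid output; a single-label setup would instead keep one id per example.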

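Dataset Creation

  • Annotation creators: crowdsourced
  • Language creators: found

Dataset Details

Configurations

  • Configuration names: raw, simplified
raw configuration
  • Features:

    • text: string
    • id: string
    • author: string
    • subreddit: string
    • link_id: string
    • parent_id: string
    • created_utc: float32
    • rater_id: int32
    • example_very_unclear: bool
    • emotion categories (e.g. admiration, amusement): int32
    • texto: string
  • Splits:

    • train: 211225 examples, 55343630 bytes
    • Download size: 42742918 bytes
    • Dataset size: 55343630 bytes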
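
The raw configuration stores each emotion as its own 0/1 column, whereas the simplified configuration stores a list of label ids. The sketch below shows one way to go from the first form to the second, assuming a raw row is available as a plain dictionary; the helper `raw_row_to_label_ids` and the sample row are hypothetical, and the label order follows the `class_label` names in the card.

```python
from typing import Dict, List

# Label names in id order (0..27), as listed in the card's class_label mapping.
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def raw_row_to_label_ids(row: Dict[str, object]) -> List[int]:
    """Collect the ids of every emotion column set to 1 in a raw-style row."""
    return [i for i, name in enumerate(EMOTIONS) if row.get(name, 0) == 1]


# Hypothetical raw-style row: every emotion column is 0 except two.
row = {name: 0 for name in EMOTIONS}
row.update({"text": "Thanks, this is great!", "admiration": 1, "gratitude": 1})

label_ids = raw_row_to_label_ids(row)
print(label_ids, [EMOTIONS[i] for i in label_ids])
```

Rows flagged with `example_very_unclear` carry no emotion labels, so this helper returns an empty list for them.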
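simplified configuration
  • Features:

    • text: string
    • labels: sequence over the 27 emotion categories or Neutral
    • id: string
  • Splits:

    • train: 43410 examples, 4224198 bytes
    • validation: 5426 examples, 527131 bytes
    • test: 5427 examples, 524455 bytes
    • Download size: 4394818 bytes
    • Dataset size: 5275784 bytes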
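
The split statistics above can be sanity-checked with a few lines of arithmetic. This sketch only uses the numbers listed for the simplified configuration; the variable names are illustrative.

```python
# Split statistics for the simplified configuration, copied from the listing above.
SPLITS = {
    "train": {"num_examples": 43410, "num_bytes": 4224198},
    "validation": {"num_examples": 5426, "num_bytes": 527131},
    "test": {"num_examples": 5427, "num_bytes": 524455},
}
DATASET_SIZE_BYTES = 5275784

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# Report each split's share of the examples.
for name, stats in SPLITS.items():
    share = stats["num_examples"] / total_examples
    print(f"{name}: {stats['num_examples']} examples ({share:.1%})")

print("total examples:", total_examples)
print("bytes match dataset size:", total_bytes == DATASET_SIZE_BYTES)
```

The three splits add up to 54,263 examples, divided roughly 80/10/10, and the per-split byte counts sum to the reported dataset size.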