
peixiang/pec

Source: Hugging Face. Updated: 2024-01-18. Indexed: 2024-05-25.
Download link:
https://hf-mirror.com/datasets/peixiang/pec
Description:
---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- gpl-3.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
- text-retrieval
task_ids:
- dialogue-modeling
- utterance-retrieval
paperswithcode_id: pec
pretty_name: Persona-Based Empathetic Conversational
dataset_info:
- config_name: happy
  features:
  - name: personas
    sequence: string
  - name: context
    sequence: string
  - name: context_speakers
    sequence: string
  - name: response
    dtype: string
  - name: response_speaker
    dtype: string
  splits:
  - name: train
    num_bytes: 643196978
    num_examples: 157195
  - name: test
    num_bytes: 92003042
    num_examples: 22730
  - name: validation
    num_bytes: 81132088
    num_examples: 19829
  download_size: 252434681
  dataset_size: 816332108
- config_name: offmychest
  features:
  - name: personas
    sequence: string
  - name: context
    sequence: string
  - name: context_speakers
    sequence: string
  - name: response
    dtype: string
  - name: response_speaker
    dtype: string
  splits:
  - name: train
    num_bytes: 518616402
    num_examples: 123968
  - name: test
    num_bytes: 64173390
    num_examples: 15324
  - name: validation
    num_bytes: 66675909
    num_examples: 16004
  download_size: 252434681
  dataset_size: 649465701
- config_name: all
  features:
  - name: personas
    sequence: string
  - name: context
    sequence: string
  - name: context_speakers
    sequence: string
  - name: response
    dtype: string
  - name: response_speaker
    dtype: string
  splits:
  - name: train
    num_bytes: 1162655628
    num_examples: 281163
  - name: test
    num_bytes: 156310498
    num_examples: 38054
  - name: validation
    num_bytes: 147940164
    num_examples: 35833
  download_size: 252434681
  dataset_size: 1466906290
config_names:
- all
- happy
- offmychest
---

# Dataset Card for PEC

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [PEC repository](https://github.com/zhongpeixiang/PEC)
- **Paper:** [Towards Persona-Based Empathetic Conversational Models](https://www.aclweb.org/anthology/2020.emnlp-main.531/)
- **Point of Contact:** [Peixiang Zhong](mailto:zhongpeixiang@gmail.com)

### Dataset Summary

The PEC dataset is an English-language dataset of open-domain conversations gathered from two subreddits on Reddit, i.e., happy and offmychest. PEC has around 350K persona-based empathetic conversations. Each utterance is associated with a speaker, and each speaker has a persona of multiple persona sentences. The conversations in PEC are more empathetic than casual conversations. The conversations in the happy domain are mostly positive, whereas the conversations in the offmychest domain are mostly negative.

### Supported Tasks and Leaderboards

- `dialogue-modeling`, `utterance-retrieval`: this dataset can be used to train a generative or retrieval-based conversational model.
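The metadata lists three configurations (`happy`, `offmychest`, `all`), each with `train`, `validation` and `test` splits. The following is a minimal sketch of selecting one of them with the Hugging Face `datasets` library. It assumes that the repository id `peixiang/pec` (taken from the download link) resolves on the Hub and that the installed `datasets` version is able to load it, which may require extra options for script-based repositories. `load_pec` is a hypothetical helper, not part of the dataset or the library.

```python
# Configuration and split names as listed in the card's metadata.
CONFIGS = ("happy", "offmychest", "all")
SPLITS = ("train", "validation", "test")


def load_pec(config: str = "all", split: str = "train"):
    """Hypothetical helper: validate the names, then defer to `datasets`."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the name checks work even without the library installed.
    from datasets import load_dataset

    # Assumption: the repository id below is reachable and loadable as-is.
    return load_dataset("peixiang/pec", config, split=split)
```

Calling `load_pec("happy", "validation")` would then request the validation split of the happy domain; the download itself needs network access.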

### Languages

English

## Dataset Structure

### Data Instances

A typical data example comprises a list of context utterances, a list of context speakers, a response to the context, the response speaker and the persona of the response speaker.

An example from PEC looks as follows:

```
{'context': ['found out this morning i got a job promotion ! ! !'],
 'context_speakers': ['HeWentToJared91'],
 'personas': [
    "i ca n't stand working in the ugli .",
    'i ’ve always liked my eyes except for the fact that they ca n’t shoot lasers',
    'i feel really bad about myself as a person right now , and i could really use a hand .',
    'i drank a coffee , and it just made me feel even more exhausted .',
    'i want a natsuki t shirt',
    "i 've dealt with depression in the past .",
    'i love red dead 2'],
 'response': "you look like a nice person ! we 're proud of you , and i bet you earned that promotion !",
 'response_speaker': 'tylock'}
```

### Data Fields

- `context`: a list of strings, each string denotes a context utterance.
- `context_speakers`: a list of strings, each string denotes a speaker.
- `response`: a string denoting the response to the `context`.
- `response_speaker`: a string denoting the speaker of `response`.
- `personas`: a list of strings, each string denotes a persona sentence of `response_speaker`.

### Data Splits

The data is split into a training, validation and test set for each of the three domains. Note that the *all* domain is the concatenation of the *happy* and *offmychest* domains.

| domain     |  train | validation |  test |
|------------|-------:|-----------:|------:|
| happy      | 157195 |      19829 | 22730 |
| offmychest | 123968 |      16004 | 15324 |
| all        | 281163 |      35833 | 38054 |

## Dataset Creation

### Curation Rationale

PEC was built to provide a testbed for machines to learn persona-based empathetic responding. In our empirical analysis, we found that different personas have different styles of empathetic responding.
This dataset can also be used to investigate the link between persona and empathy in human conversations. According to our human assessment, the conversations on the happy and offmychest subreddits are significantly more empathetic than casual conversations.

### Source Data

#### Initial Data Collection and Normalization

The data was obtained from the [pushshift API](https://pushshift.io/using-bigquery-with-reddit-data/) via Google BigQuery.

#### Who are the source language producers?

The language producers are users of the [r/happy](https://www.reddit.com/r/happy/) and [r/offmychest](https://www.reddit.com/r/offmychest/) subreddits between 2012 and 2020. No further demographic information was available from the data source.

### Annotations

#### Annotation process

The dataset does not contain any additional annotations.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset includes the speaker IDs of users on the *happy* and *offmychest* subreddits.

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to help develop more personalised and empathetic conversational systems, which is an important milestone towards truly human-like conversational agents.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

A small portion of the dataset contains sexist, hateful, or harassing content. The persona sentences are noisy.

## Additional Information

### Dataset Curators

The dataset was initially created by Peixiang Zhong, Chen Zhang, Hao Wang, Yong Liu, and Chunyan Miao, in joint work between Nanyang Technological University and Alibaba Group.

### Licensing Information

The licensing status of the dataset hinges on the legal status of the [Pushshift.io](https://files.pushshift.io/reddit/) data, which is unclear.

### Citation Information

```
@inproceedings{zhong-etal-2020-towards,
    title = "Towards Persona-Based Empathetic Conversational Models",
    author = "Zhong, Peixiang and Zhang, Chen and Wang, Hao and Liu, Yong and Miao, Chunyan",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.531",
    pages = "6556--6566"
}
```

### Contributions

Thanks to [@zhongpeixiang](https://github.com/zhongpeixiang) for adding this dataset.
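
The record layout described under Data Fields can be exercised without downloading anything. The sketch below turns one record into a (prompt, target) pair of the kind a generative dialogue model might be trained on. The prompt layout, the `format_dialogue` helper and the `max_personas` parameter are illustrative assumptions rather than the authors' preprocessing, and the check that `context` and `context_speakers` have equal length is likewise an assumption about the data. The record is the card's own example with the persona list shortened.

```python
from typing import Dict, List, Tuple, Union

Record = Dict[str, Union[str, List[str]]]

# Example record from the card's Data Instances section (personas shortened).
record: Record = {
    "context": ["found out this morning i got a job promotion ! ! !"],
    "context_speakers": ["HeWentToJared91"],
    "personas": [
        "i 've dealt with depression in the past .",
        "i love red dead 2",
    ],
    "response": "you look like a nice person ! we 're proud of you , "
                "and i bet you earned that promotion !",
    "response_speaker": "tylock",
}


def format_dialogue(rec: Record, max_personas: int = 3) -> Tuple[str, str]:
    """Hypothetical helper: build a text prompt and its target response."""
    context = rec["context"]
    speakers = rec["context_speakers"]
    # Assumption: one speaker name per context utterance.
    if len(context) != len(speakers):
        raise ValueError("context and context_speakers differ in length")
    # Persona sentences describe the response speaker, so they lead the prompt.
    persona_lines = [f"persona: {p}" for p in rec["personas"][:max_personas]]
    turn_lines = [f"{s}: {u}" for s, u in zip(speakers, context)]
    # The prompt ends with the response speaker's name as a generation cue.
    cue = f"{rec['response_speaker']}:"
    prompt = "\n".join(persona_lines + turn_lines + [cue])
    return prompt, str(rec["response"])


prompt, target = format_dialogue(record, max_personas=1)
print(prompt)
```

For retrieval-style training the same fields could instead be paired as (context plus persona, candidate response); only the final assembly step would change.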
Provider:
peixiang
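Summary of Original Information

Dataset Overview

Dataset Name

  • Name: Persona-Based Empathetic Conversational (PEC)

Language

  • Language: English

License

  • License: GPL-3.0

Multilinguality

  • Multilinguality: monolingual

Size Category

  • Size: 100K<n<1M

Source Datasets

  • Source datasets: original

Task Categories

  • Task categories: text generation, fill-mask, text retrieval

Task IDs

  • Task IDs: dialogue modeling, utterance retrieval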

Dataset Information

  • Config names: happy, offmychest, all
  • Features:
    • personas: sequence of strings
    • context: sequence of strings
    • context_speakers: sequence of strings
    • response: string
    • response_speaker: string
  • Data splits:
    • train:
      • happy: 157195 examples, 643196978 bytes
      • offmychest: 123968 examples, 518616402 bytes
      • all: 281163 examples, 1162655628 bytes
    • test:
      • happy: 22730 examples, 92003042 bytes
      • offmychest: 15324 examples, 64173390 bytes
      • all: 38054 examples, 156310498 bytes
    • validation:
      • happy: 19829 examples, 81132088 bytes
      • offmychest: 16004 examples, 66675909 bytes
      • all: 35833 examples, 147940164 bytes
  • Download size: 252434681 bytes
  • Dataset size:
    • happy: 816332108 bytes
    • offmychest: 649465701 bytes
    • all: 1466906290 bytes
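
The split counts listed above can be cross-checked against two statements in the card: that the *all* configuration is the concatenation of *happy* and *offmychest*, and that PEC holds around 350K conversations. The following short sketch does that arithmetic; the numbers are copied from the listing, and the names `SPLIT_SIZES` and `is_concatenation` are introduced here for illustration only.

```python
# Example counts per configuration and split, copied from the listing.
SPLIT_SIZES = {
    "happy": {"train": 157195, "validation": 19829, "test": 22730},
    "offmychest": {"train": 123968, "validation": 16004, "test": 15324},
    "all": {"train": 281163, "validation": 35833, "test": 38054},
}


def is_concatenation(sizes: dict) -> bool:
    """Check that every 'all' split equals happy + offmychest."""
    return all(
        sizes["happy"][split] + sizes["offmychest"][split] == sizes["all"][split]
        for split in sizes["all"]
    )


total_examples = sum(SPLIT_SIZES["all"].values())
print(is_concatenation(SPLIT_SIZES))  # True
print(total_examples)                 # 355050
```

The total of 355050 examples is consistent with the "around 350K" figure in the dataset summary.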