Non-Residual-Prompting/C2Gen
Source: Hugging Face. Updated 2022-10-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Non-Residual-Prompting/C2Gen
---
language:
- en
license:
- cc-by-sa-4.0
size_categories:
- <100K
task_categories:
- text-generation
---
# Dataset Card for Contextualized CommonGen (C2Gen)
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
- [Licensing Information](#licensing-information)
## Dataset Description
- **Repository:** [Non-Residual Prompting](https://github.com/FreddeFrallan/Non-Residual-Prompting)
- **Paper:** [Fine-Grained Controllable Text Generation Using Non-Residual Prompting](https://aclanthology.org/2022.acl-long.471)
- **Point of Contact:** [Fredrik Carlsson](mailto:Fredrik.Carlsson@ri.se)
### Dataset Summary
CommonGen ([Lin et al., 2020](https://arxiv.org/abs/1911.03705)) is a dataset for the constrained text generation task of word inclusion. However, the task does not allow for the inclusion of context. Therefore, to complement CommonGen, we provide an extended test set, C2Gen ([Carlsson et al., 2022](https://aclanthology.org/2022.acl-long.471)), where an additional context is provided for each set of target words. The task is thus reformulated: the generated text must be commonsensical and include the given words, and it must also adhere to the given context.
### Languages
English
## Dataset Structure
### Data Instances
{"Context": "The show came on the television with people singing. The family all gathered to watch. They all became silent when the show came on.", "Words": ["follow", "series", "voice"]}
### Data Fields
- Context: the text that the model's generated output should adhere to
- Words: the words that should be included in the generated continuation
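To make the two fields concrete, here is a minimal sketch that parses the example instance above and checks whether a candidate continuation covers the target words. The function `missing_words` and its prefix-matching rule are assumptions made for illustration (a crude stand-in for inflection matching such as "follow" and "followed"); this is not the evaluation used in the paper, and it does not measure adherence to the context.

```python
import json
import re
from typing import List

# The example instance from the Data Instances section, as a JSON string.
RAW = (
    '{"Context": "The show came on the television with people singing. '
    'The family all gathered to watch. They all became silent when the show came on.", '
    '"Words": ["follow", "series", "voice"]}'
)


def missing_words(generated: str, words: List[str]) -> List[str]:
    """Return the target words that have no matching token in `generated`.

    Assumption: a token matches when it starts with the target word
    (case-insensitive). This is a rough heuristic, not an official metric.
    """
    tokens = re.findall(r"[a-z]+", generated.lower())
    return [w for w in words if not any(t.startswith(w.lower()) for t in tokens)]


instance = json.loads(RAW)
# A hand-written candidate continuation, used only for illustration.
candidate = "Everyone followed the series closely, humming along with each voice."
print(missing_words(candidate, instance["Words"]))
```

An empty list means every target word was found under this heuristic. Prefix matching can over-match (for example, "series" would also match a longer word beginning with the same letters), so a lemmatizer would be a more careful choice.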
### Data Splits
Test
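
Because only a test split is provided, loading the data might look like the sketch below. It assumes the Hugging Face `datasets` library is installed, that the split is registered under the name `test`, and that records use the `Context` and `Words` keys shown in the example instance. `build_prompt` is a hypothetical helper with an invented prompt format, not something defined by the dataset or the paper.

```python
def load_c2gen_test():
    """Load the C2Gen test split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that build_prompt works without the library installed.
    from datasets import load_dataset

    # Assumption: the single split is named "test".
    return load_dataset("Non-Residual-Prompting/C2Gen", split="test")


def build_prompt(record: dict) -> str:
    """Hypothetical prompt layout: the context, then the words to include."""
    words = ", ".join(record["Words"])
    return f"{record['Context']}\nInclude the words: {words}\n"
```

Calling `build_prompt(load_c2gen_test()[0])` would then produce one prompt string for the first test record, provided the assumptions above hold.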
## Dataset Creation
### Curation Rationale
C2Gen was created because the authors of the paper believed that the task formulation of CommonGen is too narrow and that it needlessly incentivizes researchers to focus on methods that do not support context. This runs counter to their belief that many application areas necessitate the consideration of surrounding context. Therefore, to complement CommonGen, they provide an extended test set where an additional context is provided for each set of target words.
### Initial Data Collection and Normalization
The dataset was constructed with the help of the crowdsourcing platform MechanicalTurk. Each concept set manually received a textual context. To assure the quality of the data generation, only native English speakers with a recorded high acceptance rate were allowed to participate. Finally, all contexts were manually verified, and typos and poor-quality text were fixed. Furthermore, we want to raise awareness that C2Gen can contain personal data or offensive content. If you encounter such a sample, please reach out to us.
## Licensing Information
license: cc-by-sa-4.0
Provider:
Non-Residual-Prompting
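Summary of the original information
Dataset overview
Dataset description
Dataset summary
C2Gen is an extended test set that complements the CommonGen dataset by providing an additional context for each set of target words. The task requires the generated text both to include the given words and to adhere to the provided context.
Languages
The dataset language is English.
Dataset structure
Data instances
A data instance consists of a context and target words, for example:
{"Context": "The show came on the television with people singing. The family all gathered to watch. They all became silent when the show came on.", "Words": ["follow", "series", "voice"]}
Data fields
- Context: the text that the model's generated output should adhere to.
- Words: the words that should be included in the generated text.
Data splits
The dataset contains only a test set.
Dataset creation
Curation rationale
C2Gen was created because the authors believed that the task definition of CommonGen is too narrow and does not support context, whereas many application areas need to consider surrounding context. They therefore provide an extended test set in which an additional context is given for each set of target words.
Initial data collection and normalization
The dataset was built with the MechanicalTurk platform, and a textual context was manually added to each concept set. Only native English speakers with a high acceptance rate took part in the data generation, and all contexts were manually verified and corrected.
Licensing information
The dataset is licensed under cc-by-sa-4.0.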



