
CreativeLang/scope_simile_generation

Source: Hugging Face. Updated 2023-07-06. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/CreativeLang/scope_simile_generation
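Overview:
The SCOPE Simile dataset is intended for generating similes from literal descriptive sentences. It was built with a two-step approach: first, self-labeled similes are converted into literal sentences using structured commonsense knowledge; second, a seq2seq model is fine-tuned on the resulting [literal sentence, simile] pairs to generate similes. The data was collected from the Reddit subreddits WRITINGPROMPTS and FUNNY and contains 87,843 human-written, self-labeled similes, of which 82,697 are used for training and 5,146 for validation. To convert a simile into its literal version, the COMET framework is used to identify the shared property implied by the simile; the top 5 commonsense properties are selected to form candidate literal versions, which are then ranked by perplexity scores from a pre-trained GPT language model. In addition, a grammatical error correction model is used to fix any errors introduced in the process.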
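The candidate-ranking step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the source uses COMET to propose properties and a pre-trained GPT model to score perplexity, whereas this sketch swaps in a tiny add-one-smoothed unigram model so that it runs without any downloads. The helper names (`train_unigram`, `literal_candidates`, `rank_by_perplexity`), the toy corpus, and the property list are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Build an add-one-smoothed unigram model (a stand-in for GPT, by assumption)."""
    counts = Counter(tok for sent in corpus for tok in sent.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens

    def prob(tok):
        return (counts[tok] + 1) / (total + vocab)

    return prob


def perplexity(sentence, prob):
    """Per-token perplexity of a sentence under the given token-probability function."""
    toks = sentence.lower().split()
    log_sum = sum(math.log(prob(t)) for t in toks)
    return math.exp(-log_sum / len(toks))


def literal_candidates(simile, vehicle_phrase, properties, top_k=5):
    """Replace the simile's vehicle phrase with each of the top-k candidate properties."""
    return [simile.replace(vehicle_phrase, p) for p in properties[:top_k]]


def rank_by_perplexity(candidates, prob):
    """Sort candidates so the most fluent (lowest perplexity) comes first."""
    return sorted(candidates, key=lambda s: perplexity(s, prob))


# Hypothetical toy data: in the source, properties would come from COMET.
corpus = ["true love is rare", "love is rare and precious"]
properties = ["magical", "rare", "horned", "mythical", "white", "extra"]

prob = train_unigram(corpus)
candidates = literal_candidates("love is like a unicorn", "like a unicorn", properties)
ranked = rank_by_perplexity(candidates, prob)
best_literal = ranked[0]
```

With a real language model, only `prob` (or the whole `perplexity` function) would need to change; the candidate construction and sorting stay the same.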

Provider:
CreativeLang
Summary of original information

Dataset overview

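  • Name: SCOPE Simile
  • Purpose: Simile generation, i.e. producing a simile from a literal descriptive sentence.
  • Method: A two-step approach. Self-labeled similes are first converted into literal sentences; a seq2seq model is then fine-tuned on the resulting [literal sentence, simile] pairs to generate similes.
  • Data source: Collected from the Reddit subreddits WRITINGPROMPTS and FUNNY; similes are identified by searching for the phrase "like a".
  • Size: 87,843 human-written, self-labeled similes, of which 82,697 are used for training and 5,146 for validation.
  • Conversion method: The COMET framework identifies the property shared within each simile; the top 5 commonsense properties are used to form candidate literal versions, which are then ranked by GPT perplexity scores.
  • Language: English
  • Created: 2020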
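As a rough illustration of the collection heuristic and split sizes listed above, the following sketch filters sentences by the phrase "like a" and checks that the reported split counts add up. The regular expression, the function name `find_simile_candidates`, and the sample sentences are assumptions made for illustration; the source only states that the phrase "like a" was searched for, not how the match was implemented.

```python
import re

# Assumed matching rule: the words "like a" as whole words, case-insensitive.
LIKE_A = re.compile(r"\blike a\b", re.IGNORECASE)


def find_simile_candidates(sentences):
    """Keep only sentences containing the phrase 'like a'."""
    return [s for s in sentences if LIKE_A.search(s)]


# Hypothetical sample sentences, not taken from the dataset.
samples = [
    "He ran like a cheetah.",
    "I like apples.",
    "She looks LIKE A queen.",
]
matches = find_simile_candidates(samples)

# Split sizes as reported in the dataset overview.
TRAIN_SIZE = 82_697
VALID_SIZE = 5_146
TOTAL_SIZE = 87_843

splits_consistent = TRAIN_SIZE + VALID_SIZE == TOTAL_SIZE
valid_fraction = VALID_SIZE / TOTAL_SIZE
```

A phrase match like this is only a first-pass filter: it also catches non-figurative uses such as "I would like a coffee", so anything it returns is a candidate rather than a confirmed simile.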

Dataset details

Citation

If you use this dataset, please cite the following paper:

@inproceedings{chakrabarty-etal-2020-generating,
  title = "Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation",
  author = "Chakrabarty, Tuhin and Muresan, Smaranda and Peng, Nanyun",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.emnlp-main.524",
  pages = "6455--6469",
  abstract = "Literary tropes, from poetry to stories, are at the crux of human imagination and communication. Figurative language such as a simile go beyond plain expressions to give readers new insights and inspirations. In this paper, we tackle the problem of simile generation. Generating a simile requires proper understanding for effective mapping of properties between two concepts. To this end, we first propose a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit to their literal counterpart using structured common sense knowledge. We then propose to fine-tune a pre-trained sequence to sequence model, BART (Lewis et al 2019), on the literal-simile pairs to gain generalizability, so that we can generate novel similes given a literal sentence. Experiments show that our approach generates 88{%} novel similes that do not share properties with the training data. Human evaluation on an independent set of literal statements shows that our model generates similes better than two literary experts 37{%} of the time when compared pairwise. We also show how replacing literal sentences with similes from our best model in machine-generated stories improves evocativeness and leads to better acceptance by human judges.",
}
