allenai/hippocorpus
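Dataset Description
Dataset Summary
Hippocorpus is a dataset of 6,854 English diary-like short stories about recalled and imagined events. Using a crowdsourcing framework, the authors first collected recalled stories and summaries from workers, then gave those summaries to other workers, who wrote imagined stories based on them. Finally, months later, they collected a retold version of the recalled story from a subset of the original recalled authors. The dataset also includes author demographics (age, gender, race), the authors' openness to experience, and several variables describing the author's relationship to the event (for example, how personal the event is and how often they tell its story).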
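Supported Tasks and Leaderboards
[More Information Needed]
Languages
The dataset is in English.
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
The dataset contains the following fields: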
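AssignmentId: Unique ID of the story
WorkTimeInSeconds: Time in seconds the worker took to complete the entire task (reading instructions, writing the story, answering questions)
WorkerId: Unique ID of the worker (random string, not the MTurk worker ID)
annotatorAge: Lower limit of the worker's age bucket (buckets: 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55+)
annotatorGender: Gender of the worker
annotatorRace: Race/ethnicity of the worker
distracted: How distracted the worker was while writing the story (5-point Likert scale)
draining: How emotionally draining writing the story was (5-point Likert scale)
frequency: How often the worker thinks or talks about this event (5-point Likert scale)
importance: How impactful, important, or personal the story/event is to the author (5-point Likert scale)
logTimeSinceEvent: Log of the time (in days) since the recalled event happened
mainEvent: Short phrase describing the main event
memType: Type of story (recalled, imagined, retold)
mostSurprising: Short phrase describing the most surprising aspect of the story
openness: Continuous variable representing the worker's openness to experience
recAgnPairId: ID of the recalled story that corresponds to this retold story (empty for imagined stories). Group on this variable to get recalled-retold pairs.
recImgPairId: ID of the recalled story that corresponds to this imagined story (empty for retold stories). Group on this variable to get recalled-imagined pairs.
similarity: How similar this event/story feels to the author's own life (5-point Likert scale)
similarityReason: Free-text annotation of the similarity
story: Story about the imagined or recalled event (15-25 sentences)
stressful: How stressful this writing task was (5-point Likert scale)
summary: Summary of the events in the story (1-3 sentences)
timeSinceEvent: Time (in days) since the recalled event happened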
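The grouping described for recAgnPairId and recImgPairId can be sketched as follows. This is a minimal illustration, not code from the dataset authors: the rows are made-up toy records, the helper name group_pairs is hypothetical, and it assumes that both stories in a pair carry the same pair ID, that missing IDs appear as empty strings or None, and that memType takes the values "recalled", "imagined", and "retold".

```python
from collections import defaultdict

# Toy records invented for illustration; real rows carry many more fields.
rows = [
    {"AssignmentId": "a1", "memType": "recalled", "recAgnPairId": "p1", "recImgPairId": "q1"},
    {"AssignmentId": "a2", "memType": "retold", "recAgnPairId": "p1", "recImgPairId": ""},
    {"AssignmentId": "a3", "memType": "imagined", "recAgnPairId": "", "recImgPairId": "q1"},
    {"AssignmentId": "a4", "memType": "recalled", "recAgnPairId": "", "recImgPairId": "q2"},
]


def group_pairs(rows, pair_field, other_type):
    """Group rows on pair_field and return {pair_id: (recalled_id, other_id)}.

    Only groups that contain both a recalled story and a story of
    other_type ("retold" or "imagined") are kept.
    """
    groups = defaultdict(dict)
    for row in rows:
        pair_id = row.get(pair_field)
        if not pair_id:  # skip rows with an empty or missing pair ID (assumed encoding)
            continue
        groups[pair_id][row["memType"]] = row["AssignmentId"]
    return {
        pair_id: (members["recalled"], members[other_type])
        for pair_id, members in groups.items()
        if "recalled" in members and other_type in members
    }


recalled_retold = group_pairs(rows, "recAgnPairId", "retold")
recalled_imagined = group_pairs(rows, "recImgPairId", "imagined")
```

With the toy rows, a4 is dropped from the recalled-imagined result because no imagined story shares its pair ID. If the real file encodes missing IDs differently, the emptiness check needs adjusting.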
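Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
[More Information Needed]
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
[More Information Needed]
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators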
The dataset was initially created by Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James W. Pennebaker during work done at Microsoft Research.
Licensing Information
Hippocorpus is distributed under the Open Use of Data Agreement v1.0.
Citation Information
@inproceedings{sap-etal-2020-recollection,
    title = "Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models",
    author = "Sap, Maarten and Horvitz, Eric and Choi, Yejin and Smith, Noah A. and Pennebaker, James",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.178",
    doi = "10.18653/v1/2020.acl-main.178",
    pages = "1970--1978",
    abstract = "We investigate the use of NLP as a measure of the cognitive processes involved in storytelling, contrasting imagination and recollection of events. To facilitate this, we collect and release Hippocorpus, a dataset of 7,000 stories about imagined and recalled events. We introduce a measure of narrative flow and use this to examine the narratives for imagined and recalled events. Additionally, we measure the differential recruitment of knowledge attributed to semantic memory versus episodic memory (Tulving, 1972) for imagined and recalled storytelling by comparing the frequency of descriptions of general commonsense events with more specific realis events. Our analyses show that imagined stories have a substantially more linear narrative flow, compared to recalled stories in which adjacent sentences are more disconnected. In addition, while recalled stories rely more on autobiographical events based on episodic memory, imagined stories express more commonsense knowledge based on semantic memory. Finally, our measures reveal the effect of narrativization of memories in stories (e.g., stories about frequently recalled memories flow more linearly; Bartlett, 1932). Our findings highlight the potential of using NLP tools to study the traces of human cognition in language.",
}
Contributions
Thanks to @manandey for adding this dataset.




