hippocorpus
Source: ModelScope community (updated 2025-07-03, indexed 2025-05-31)
Download link: https://modelscope.cn/datasets/allenai/hippocorpus
# Dataset Card for Hippocorpus
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Hippocorpus](https://msropendata.com/datasets/0a83fb6f-a759-4a17-aaa2-fbac84577318)
- **Repository:** [Hippocorpus](https://msropendata.com/datasets/0a83fb6f-a759-4a17-aaa2-fbac84577318)
- **Paper:** [Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models](http://erichorvitz.com/cognitive_studies_narrative.pdf)
- **Point of Contact:** [Eric Horvitz](mailto:horvitz@microsoft.com)
### Dataset Summary
To examine the cognitive processes of remembering and imagining and their traces in language, we introduce Hippocorpus, a dataset of 6,854 English diary-like short stories about recalled and imagined events. Using a crowdsourcing framework, we first collect recalled stories and summaries from workers, then provide these summaries to other workers who write imagined stories. Finally, months later, we collect a retold version of the recalled stories from a subset of recalled authors. Our dataset comes paired with author demographics (age, gender, race), their openness to experience, as well as some variables regarding the author's relationship to the event (e.g., how personal the event is, how often they tell its story, etc.).
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
All stories in the dataset are written in English.
## Dataset Structure
[More Information Needed]
### Data Instances
[More Information Needed]
### Data Fields
This CSV file contains all the stories in Hippocorpus v2 (6,854 stories).
These are the columns in the file:
- `AssignmentId`: Unique ID of this story
- `WorkTimeInSeconds`: Time in seconds that it took the worker to do the entire HIT (reading instructions, storywriting, questions)
- `WorkerId`: Unique ID of the worker (random string, not MTurk worker ID)
- `annotatorAge`: Lower limit of the age bucket of the worker. Buckets are: 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55+
- `annotatorGender`: Gender of the worker
- `annotatorRace`: Race/ethnicity of the worker
- `distracted`: How distracted were you while writing your story? (5-point Likert)
- `draining`: How taxing/draining was writing for you emotionally? (5-point Likert)
- `frequency`: How often do you think about or talk about this event? (5-point Likert)
- `importance`: How impactful, important, or personal is this story/this event to you? (5-point Likert)
- `logTimeSinceEvent`: Log of time (days) since the recalled event happened
- `mainEvent`: Short phrase describing the main event described
- `memType`: Type of story (recalled, imagined, retold)
- `mostSurprising`: Short phrase describing what the most surprising aspect of the story was
- `openness`: Continuous variable representing the openness to experience of the worker
- `recAgnPairId`: ID of the recalled story that corresponds to this retold story (null for imagined stories). Group on this variable to get the recalled-retold pairs.
- `recImgPairId`: ID of the recalled story that corresponds to this imagined story (null for retold stories). Group on this variable to get the recalled-imagined pairs.
- `similarity`: How similar to your life does this event/story feel to you? (5-point Likert)
- `similarityReason`: Free text annotation of similarity
- `story`: Story about the imagined or recalled event (15-25 sentences)
- `stressful`: How stressful was this writing task? (5-point Likert)
- `summary`: Summary of the events in the story (1-3 sentences)
- `timeSinceEvent`: Time (num. days) since the recalled event happened
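
The grouping described for `recAgnPairId` and `recImgPairId` can be sketched as follows. This is a minimal illustration using only the Python standard library, not an official loader: the sample rows are invented, the helper name `group_pairs` is hypothetical, and it assumes that null pair IDs appear as empty cells in the CSV and that a recalled story carries the same pair ID as its imagined or retold counterpart.

```python
import csv
import io
from collections import defaultdict

# Toy rows standing in for the real CSV; all values are invented for illustration.
SAMPLE_CSV = """AssignmentId,memType,recAgnPairId,recImgPairId,summary
A1,recalled,A1,A1,Went to a wedding
A2,imagined,,A1,Went to a wedding
A3,retold,A1,,Went to a wedding
A4,recalled,,A4,Moved to a new city
A5,imagined,,A4,Moved to a new city
"""


def group_pairs(rows, pair_column):
    """Group (memType, AssignmentId) tuples by the given pair-ID column.

    Rows whose pair ID is empty (assumed to represent null) are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        pair_id = row[pair_column]
        if pair_id:
            groups[pair_id].append((row["memType"], row["AssignmentId"]))
    return dict(groups)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Recalled-imagined pairs: group on recImgPairId.
recalled_imagined = group_pairs(rows, "recImgPairId")

# Recalled-retold pairs: group on recAgnPairId.
recalled_retold = group_pairs(rows, "recAgnPairId")

for pair_id, members in recalled_imagined.items():
    print(pair_id, members)
```

To use this on the real file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle for the downloaded CSV. If nulls are encoded differently there (for example as a literal `NaN` string), adjust the emptiness check accordingly.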
### Data Splits
[More Information Needed]
## Dataset Creation
[More Information Needed]
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
[More Information Needed]
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
[More Information Needed]
### Dataset Curators
The dataset was initially created by Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James W. Pennebaker during work done at Microsoft Research.
### Licensing Information
Hippocorpus is distributed under the [Open Use of Data Agreement v1.0](https://msropendata-web-api.azurewebsites.net/licenses/f1f352a6-243f-4905-8e00-389edbca9e83/view).
### Citation Information
```
@inproceedings{sap-etal-2020-recollection,
    title = "Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models",
    author = "Sap, Maarten and
      Horvitz, Eric and
      Choi, Yejin and
      Smith, Noah A. and
      Pennebaker, James",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.178",
    doi = "10.18653/v1/2020.acl-main.178",
    pages = "1970--1978",
    abstract = "We investigate the use of NLP as a measure of the cognitive processes involved in storytelling, contrasting imagination and recollection of events. To facilitate this, we collect and release Hippocorpus, a dataset of 7,000 stories about imagined and recalled events. We introduce a measure of narrative flow and use this to examine the narratives for imagined and recalled events. Additionally, we measure the differential recruitment of knowledge attributed to semantic memory versus episodic memory (Tulving, 1972) for imagined and recalled storytelling by comparing the frequency of descriptions of general commonsense events with more specific realis events. Our analyses show that imagined stories have a substantially more linear narrative flow, compared to recalled stories in which adjacent sentences are more disconnected. In addition, while recalled stories rely more on autobiographical events based on episodic memory, imagined stories express more commonsense knowledge based on semantic memory. Finally, our measures reveal the effect of narrativization of memories in stories (e.g., stories about frequently recalled memories flow more linearly; Bartlett, 1932). Our findings highlight the potential of using NLP tools to study the traces of human cognition in language.",
}
```
### Contributions
Thanks to [@manandey](https://github.com/manandey) for adding this dataset.
Provider: maas
Created: 2025-05-28



