Shmoop Corpus

Name: Shmoop Corpus
Creator: University of Toronto
Published: 2020-01-02 00:06:48
License: No description available

Source: arXiv; updated 2020-01-02; indexed 2024-06-21

Download link:

http://www.cs.toronto.edu/~makarand/shmoop/

Overview:

The Shmoop Corpus is a dataset of 231 stories created by the University of Toronto and other institutions. Each story is paired with detailed chapter-level summaries (7,234 chapters in total), and each summary is loosely aligned in chronological order with its story chapter. The corpus spans novels, plays, and short stories. It is intended to improve machine understanding of stories by supporting NLP tasks built from it, including cloze-style (fill-in-the-blank) question answering and a simplified form of summary generation. To build the corpus, the researchers collected stories and summaries from the Shmoop website and Project Gutenberg, then manually split the texts into chapters and aligned them with the summaries. Its main application is machine reading comprehension, particularly over long texts and complex story structures.
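As a rough illustration of what a cloze-style (fill-in-the-blank) question built from a summary sentence might look like, here is a minimal sketch. The function name `make_cloze`, the `@placeholder` token, and the example sentence are all hypothetical and are not taken from the corpus; the procedure the dataset authors actually used may differ.

```python
import re


def make_cloze(sentence, answer, blank="@placeholder"):
    """Blank out the first whole-word occurrence of `answer` in `sentence`.

    Returns a dict with the question and the answer, or None when the
    answer does not occur in the sentence. (Hypothetical helper.)
    """
    pattern = re.compile(r"\b" + re.escape(answer) + r"\b")
    if not pattern.search(sentence):
        return None
    # A callable replacement keeps `blank` literal, even if it holds backslashes.
    question = pattern.sub(lambda match: blank, sentence, count=1)
    return {"question": question, "answer": answer}


# Invented summary sentence, used only for illustration.
example = make_cloze("Elizabeth refuses Mr. Collins's proposal.", "Elizabeth")
print(example)
```

In a full setup, a model would read the aligned story chapter and predict the blanked-out word; the sketch covers only the question-construction step.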

Provider:

University of Toronto

Created:

2019-12-31
