community-datasets/gutenberg_time

Source: Hugging Face. Updated 2024-06-25. Indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/community-datasets/gutenberg_time
Overview:
The Gutenberg Time dataset is a clean data resource containing all explicit time references in 52,183 novels whose full text is available via Project Gutenberg. Each record holds the Gutenberg ID, the hour reference, the time phrase, an ambiguity flag, the start and end positions of the time phrase, and the context in which the time phrase appears. The dataset was created to capture the flow of time through novels by recognizing the time of day at which each event in a story takes place. It was created by Allen Kim et al. from the text of Project Gutenberg novels and manually annotated by two of the authors.
Provider:
community-datasets
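Summary of Original Information

Dataset Overview

Dataset Summary

The Gutenberg Time dataset contains all explicit time references in 52,183 novels whose full text is available via Project Gutenberg. It is intended for time classification tasks.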

Supported Tasks and Leaderboards

The dataset supports multi-class classification.
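
As a minimal sketch of what the multi-class framing might look like, the snippet below assumes that the `hour_reference` field (listed under the data fields) serves as the label, giving 24 classes, one per hour. The helper names `hour_to_label` and `label_distribution` are hypothetical, and the sample records are made up for illustration. Because the field list gives `hour_reference` as a string while the sample instance shows an integer, the sketch accepts either.

```python
from collections import Counter

NUM_CLASSES = 24  # assumption: one class per hour of the day, 0 through 23


def hour_to_label(hour_reference):
    """Convert an hour_reference value (string or int) to a class index in 0..23."""
    label = int(hour_reference)
    if not 0 <= label < NUM_CLASSES:
        raise ValueError(f"hour_reference out of range: {hour_reference!r}")
    return label


def label_distribution(records):
    """Count how many records fall into each hour class."""
    return Counter(hour_to_label(r["hour_reference"]) for r in records)


# Hypothetical records for illustration only; not taken from the dataset.
records = [
    {"hour_reference": "12"},
    {"hour_reference": "7"},
    {"hour_reference": 12},
]
dist = label_distribution(records)
```

Checking the class distribution first is a reasonable habit here, since time references in fiction are unlikely to be spread evenly across the 24 hours.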

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

{
  "guten_id": 28999,
  "hour_reference": 12,
  "time_phrase": "midday",
  "is_ambiguous": false,
  "time_pos_start": 133,
  "time_pos_end": 134,
  "tok_context": "Sorrows and trials she had had in plenty in her life , but these the sweetness of her nature had transformed , so that from being things difficult to bear , she had built up with them her own character . Sorrow had increased her own power of sympathy ; out of trials she had learnt patience ; and failure and the gradual sinking of one she had loved into the bottomless slough of evil habit had but left her with an added dower of pity and tolerance . So the past had no sting left , and if iron had ever entered into her soul it now but served to make it strong . She was still young , too ; it was not near sunset with her yet , nor even midday , and the future that , humanly speaking , she counted to be hers was almost dazzling in its brightness . For love had dawned for her again , and no uncertain love , wrapped in the mists of memory , but one that had ripened through liking and friendship and intimacy into the authentic glory . He was in England , too ; she was going back to him . And before very long she would never go away from him again ."
}

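Data Fields

  • guten_id: string; the Gutenberg ID number.
  • hour_reference: string; the hour, from 0 to 23.
  • time_phrase: string; the phrase corresponding to the referenced hour.
  • is_ambiguous: boolean; true when it is unclear whether the time is AM or PM.
  • time_pos_start: integer; token position at which time_phrase starts.
  • time_pos_end: integer; token position at which time_phrase ends (exclusive).
  • tok_context: string; the context in which time_phrase appears, as space-separated tokens.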
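
The relationship between the position fields and the context can be sketched as follows. This assumes, as the field descriptions suggest, that splitting `tok_context` on whitespace yields the token list that `time_pos_start` and `time_pos_end` index into, with the end position exclusive. The function name `extract_time_phrase` is hypothetical, and the record is a shortened, made-up variant of the sample instance with positions recomputed for the shorter context.

```python
def extract_time_phrase(record):
    """Recover the time phrase by slicing the tokenized context.

    Assumes tok_context is whitespace-separated and that
    time_pos_end is exclusive, as the field list states.
    """
    tokens = record["tok_context"].split()
    start, end = record["time_pos_start"], record["time_pos_end"]
    if not 0 <= start < end <= len(tokens):
        raise ValueError(f"invalid token span: [{start}, {end})")
    return " ".join(tokens[start:end])


# Shortened, hypothetical record for illustration; positions refer to this
# short context, not to the full sample instance.
example = {
    "time_phrase": "midday",
    "time_pos_start": 8,
    "time_pos_end": 9,
    "tok_context": "it was not near sunset , nor even midday , and the future was bright",
}

recovered = extract_time_phrase(example)
```

Comparing `recovered` with the stored `time_phrase` is a cheap sanity check that a record's offsets and context agree.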

Data Splits

The dataset is not divided into splits.
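
Since no splits are provided, anyone training a classifier has to create their own. One possible approach, sketched below, assigns each record to a split by hashing its `guten_id`, so that all passages from the same novel land on the same side and the assignment is reproducible. This is an assumption about a sensible workflow, not something the dataset authors prescribe; the function name `assign_split` and the default `test_fraction` are hypothetical.

```python
import hashlib


def assign_split(guten_id, test_fraction=0.1):
    """Deterministically map a Gutenberg ID to "train" or "test".

    Hashing the ID keeps every passage of a given novel in one split.
    """
    digest = hashlib.sha256(str(guten_id).encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# The same ID always maps to the same split, whether given as str or int.
split_for_sample = assign_split("28999")
```

Splitting by novel rather than by record avoids having near-identical context from one book appear in both training and evaluation data.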

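Dataset Creation

Curation Rationale

The flow of time is an indispensable guide for our actions and provides a framework for the logical progression of events. In most works of fiction, the events of the story take place during recognizable periods of the day. Recognizing a story's flow through time is essential to understanding the text.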

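Source Data

Initial Data Collection and Normalization

The data comes from 52,183 novels in Project Gutenberg.

Who Are the Source Language Producers?

The authors of the novels.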

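Annotations

Annotation Process

Manual annotation.

Who Are the Annotators?

Two of the authors.

Personal and Sensitive Information

The dataset contains no personal or sensitive information.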

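Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators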

Allen Kim, Charuta Pethe, Steven Skiena, Stony Brook University

Licensing Information

[More Information Needed]

Citation Information

@misc{kim2020time,
  title={What time is it? Temporal Analysis of Novels},
  author={Allen Kim and Charuta Pethe and Steven Skiena},
  year={2020},
  eprint={2011.04124},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Contributions

Thanks to @TevenLeScao for adding this dataset.