time_dial
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-07-12.
Download link: https://modelscope.cn/datasets/google-research-datasets/time_dial
# Dataset Card for TimeDial: Temporal Commonsense Reasoning in Dialog
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [TimeDial](https://github.com/google-research-datasets/timedial)
- **Paper:** [TimeDial: Temporal Commonsense Reasoning in Dialog](https://arxiv.org/abs/2106.04571)
- **Point of Contact:** [Please create an issue in the official repository](https://github.com/google-research-datasets/timedial)
### Dataset Summary
TimeDial is a crowdsourced English challenge set for temporal commonsense reasoning, formulated as a multiple-choice cloze task with around 1.5k carefully curated dialogs. The dataset is derived from DailyDialog ([Li et al., 2017](https://www.aclweb.org/anthology/I17-1099/)), a multi-turn dialog corpus.
To establish strong baselines and inform future model development, the authors conducted extensive experiments with state-of-the-art LMs. While humans can easily answer these questions (97.8%), the best T5 model variant struggles on this challenge set (73%). Moreover, their qualitative error analyses show that the models often rely on shallow, spurious features (particularly text matching) instead of truly reasoning over the context.
Detailed experiments and analyses can be found in their [paper](https://arxiv.org/pdf/2106.04571.pdf).
### Supported Tasks and Leaderboards
To be updated soon.
### Languages
The dataset is in English only.
## Dataset Structure
### Data Instances
```
{
"id": 1,
"conversation": [
"A: We need to take the accounts system offline to carry out the upgrade . But don't worry , it won't cause too much inconvenience . We're going to do it over the weekend .",
"B: How long will the system be down for ?",
"A: We'll be taking everything offline in about two hours ' time . It'll be down for a minimum of twelve hours . If everything goes according to plan , it should be up again by 6 pm on Saturday .",
"B: That's fine . We've allowed <MASK> to be on the safe side ."
],
"correct1": "forty-eight hours",
"correct2": "50 hours ",
"incorrect1": "two hours ",
"incorrect1_rule": "Rule 1",
"incorrect2": "12 days ",
"incorrect2_rule": "Rule 2"
}
```
### Data Fields
- "id": Unique identifier, as an integer
- "conversation": Dialog context with <MASK> span, as a list of strings (one per turn)
- "correct1": Original <MASK> span, as a string
- "correct2": Additional correct option provided by annotators, as a string
- "incorrect1": Incorrect option #1 provided by annotators, as a string
- "incorrect1_rule": One of phrase matching ("Rule 1"), numeral matching ("Rule 2"), or open ended ("Rule 3"), as a string
- "incorrect2": Incorrect option #2 provided by annotators, as a string
- "incorrect2_rule": One of phrase matching ("Rule 1"), numeral matching ("Rule 2"), or open ended ("Rule 3"), as a string
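The fields above can be turned into four candidate dialogs per instance, one for each option. The following is a minimal sketch of that expansion, using the instance from the Data Instances section; the helper name `expand_options` and the choice to strip trailing whitespace from options are assumptions made for illustration, not part of the dataset's official tooling.

```python
from typing import Dict, List, Tuple

MASK = "<MASK>"

# Instance copied from the "Data Instances" example.
instance = {
    "id": 1,
    "conversation": [
        "A: We need to take the accounts system offline to carry out the upgrade . But don't worry , it won't cause too much inconvenience . We're going to do it over the weekend .",
        "B: How long will the system be down for ?",
        "A: We'll be taking everything offline in about two hours ' time . It'll be down for a minimum of twelve hours . If everything goes according to plan , it should be up again by 6 pm on Saturday .",
        "B: That's fine . We've allowed <MASK> to be on the safe side .",
    ],
    "correct1": "forty-eight hours",
    "correct2": "50 hours ",
    "incorrect1": "two hours ",
    "incorrect1_rule": "Rule 1",
    "incorrect2": "12 days ",
    "incorrect2_rule": "Rule 2",
}


def expand_options(inst: Dict) -> List[Tuple[str, str, bool]]:
    """Hypothetical helper: return (field, filled_dialog, is_correct) per option."""
    dialog = "\n".join(inst["conversation"])
    rows = []
    for field in ("correct1", "correct2", "incorrect1", "incorrect2"):
        option = inst[field].strip()  # some options carry trailing spaces
        filled = dialog.replace(MASK, option)
        rows.append((field, filled, field.startswith("correct")))
    return rows


candidates = expand_options(instance)
for field, filled, is_correct in candidates:
    # Show only the last turn, where the mask was filled in.
    print(field, is_correct, filled.splitlines()[-1])
```

Each filled dialog can then be scored by a model, with the two `correct*` options expected to be preferred over the two `incorrect*` ones.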
### Data Splits
The TimeDial dataset consists only of a test set of 1,104 dialog instances, each with 2 correct and 2 incorrect options. Its statistics are as follows:
| Statistic | Avg. |
|-----|-----|
|Turns per Dialog | 11.7 |
|Words per Turn | 16.5 |
|Time Spans per Dialog | 3 |
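As a rough illustration of how the first two averages in this table could be computed, here is a small sketch over toy data. It assumes whitespace tokenization and drops the leading speaker label (`A:` / `B:`); the authors' exact counting procedure is not described in this card, so the function `dialog_stats` and its conventions are hypothetical.

```python
import re
from statistics import mean
from typing import Dict, List

SPEAKER_PREFIX = re.compile(r"^[A-Z]:\s*")  # assumed "A: ..." / "B: ..." turn format


def dialog_stats(instances: List[Dict]) -> Dict[str, float]:
    """Average turns per dialog and whitespace-separated words per turn."""
    turns_per_dialog = [len(inst["conversation"]) for inst in instances]
    words_per_turn = [
        len(SPEAKER_PREFIX.sub("", turn).split())
        for inst in instances
        for turn in inst["conversation"]
    ]
    return {
        "turns_per_dialog": mean(turns_per_dialog),
        "words_per_turn": mean(words_per_turn),
    }


# Toy data for illustration only, not taken from the dataset.
toy = [
    {"conversation": ["A: How long will it take ?", "B: About two hours ."]},
    {"conversation": ["A: See you at 6 pm .", "B: Fine .", "A: Bye ."]},
]
stats = dialog_stats(toy)
print(stats)
```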
## Dataset Creation
### Curation Rationale
Although previous work has studied temporal reasoning in natural language, it has focused on specific time-related concepts in isolation, such as temporal ordering and relation extraction, and/or dealt with limited context, such as single-sentence question answering and natural language inference.
In this work, the authors present the first systematic study of temporal commonsense reasoning in a multi-turn dialog setting. The task involves complex reasoning that requires operations such as comparison and arithmetic over temporal expressions, as well as commonsense and world knowledge.
### Source Data
#### Initial Data Collection and Normalization
The TimeDial dataset is derived from DailyDialog (Li et al., 2017), a multi-turn dialog corpus containing over 13K English dialogs. Dialogs in this corpus consist of turn-taking between two people on topics spanning 10 broad categories, ranging from daily life to finance.
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
The data collection process involves two steps: (1) identifying dialogs that are rich in temporal expressions, and (2) asking human annotators to provide correct and incorrect options for cloze instances derived from these dialogs. More details about the two steps:
1) Temporal expression identification: The authors select dialogs that are rich in temporal information in order to focus on the complex temporal reasoning that arises in natural dialogs. Temporal expressions are identified automatically with SUTime, an off-the-shelf temporal expression detector. Only dialogs with more than 3 temporal expressions, at least one of which contains numerals (such as “two weeks”, as opposed to non-numeric spans like “summer”, “right now”, and “later”), are kept. In their initial experiments, they observed that language models can often correctly predict such non-numeric temporal phrases.
2) Human-annotated options: Next, they mask spans in the dialogs. For each dialog, they mask out each temporal expression that contains numerals, one at a time, and each masked span yields a cloze question that is then sent for human annotation.
This resulted in 1,526 instances for annotation. For each masked span in each dialog, they obtain human annotation to derive a fixed set of correct and incorrect options given the context. Concretely, given a masked dialog and a seed correct answer (i.e., the original text) for the masked span, the annotators were asked to (1) come up with an alternative correct answer that makes sense in the dialog adhering to commonsense, and (2) formulate two incorrect answers that have no possibility of making sense in the dialog context. They highlight all time expressions in the context to make it easier for annotators to select reasonable time expressions.
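The two automatic parts of this process (filtering dialogs and masking numeric spans) can be sketched as follows. This is only an approximation under stated assumptions: the authors used an off-the-shelf temporal expression detector, whereas the regular expression below is a crude, hypothetical stand-in that recognizes only a handful of patterns, and the names `keep_dialog` and `make_cloze` are invented for illustration.

```python
import re
from typing import Dict, Iterator, List

MASK = "<MASK>"

# Crude stand-in for a real temporal tagger (assumption, far from complete).
W = "one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|twenty|thirty|forty|fifty|sixty"
UNITS = "second|minute|hour|day|week|month|year"
NUMERIC = (
    rf"\b(?:\d+|(?:{W})(?:-(?:{W}))?)\s+(?:{UNITS})s?\b"  # "two hours", "12 days"
    rf"|\b\d{{1,2}}(?::\d{{2}})?\s*(?:am|pm)\b"             # "6 pm", "10:30 am"
)
NON_NUMERIC = (
    r"\b(?:now|later|today|tomorrow|yesterday|tonight|weekend|summer|winter"
    r"|monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"
)
PATTERN = re.compile(rf"(?P<num>{NUMERIC})|(?P<other>{NON_NUMERIC})", re.IGNORECASE)


def keep_dialog(turns: List[str]) -> bool:
    """Keep dialogs with more than 3 temporal expressions, at least one numeric."""
    found = [m for turn in turns for m in PATTERN.finditer(turn)]
    return len(found) > 3 and any(m.group("num") is not None for m in found)


def make_cloze(turns: List[str]) -> Iterator[Dict]:
    """Yield one cloze instance per numeric temporal expression."""
    for i, turn in enumerate(turns):
        for m in PATTERN.finditer(turn):
            if m.group("num") is None:
                continue  # non-numeric spans are not masked
            masked = turn[: m.start()] + MASK + turn[m.end():]
            yield {
                "conversation": turns[:i] + [masked] + turns[i + 1:],
                "correct1": m.group(),  # the original text seeds the correct answer
            }


# Toy dialogs for illustration only.
dialog = [
    "A: The upgrade starts in two hours .",
    "B: How long will it take ?",
    "A: At least twelve hours . It should be back by 6 pm on Saturday .",
    "B: Fine , see you later .",
]
short = ["A: See you tomorrow .", "B: OK , later ."]

cloze = list(make_cloze(dialog)) if keep_dialog(dialog) else []
print(keep_dialog(dialog), keep_dialog(short), len(cloze))
```

The remaining fields ("correct2", "incorrect1", "incorrect2", and the rule labels) come from human annotators and cannot be derived automatically in this way.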
#### Who are the annotators?
The annotators are English linguists.
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
The dataset is provided for research purposes only. Please check the dataset license for additional information.
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The TimeDial dataset is licensed under CC BY-NC-SA 4.0.
### Citation Information
```
@inproceedings{qin-etal-2021-timedial,
title = "{TimeDial: Temporal Commonsense Reasoning in Dialog}",
author = "Qin, Lianhui and Gupta, Aditya and Upadhyay, Shyam and He, Luheng and Choi, Yejin and Faruqui, Manaal",
booktitle = "Proc. of ACL",
year = "2021"
}
```
### Contributions
Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.
Provider: maas
Created: 2025-07-07



