mdd
ModelScope community. Updated 2025-11-27; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/facebook/mdd
# Dataset Card for MDD
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [The bAbI project](https://research.fb.com/downloads/babi/)
- **Repository:**
- **Paper:** [arXiv Paper](https://arxiv.org/pdf/1511.06931.pdf)
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
The Movie Dialog dataset (MDD) is designed to measure how well models perform at goal-oriented and non-goal-oriented dialog centered on movies (question answering, recommendation, and discussion). It is built from movie review sources such as MovieLens and OMDb.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
The data is in English, as written by users of the OMDb and MovieLens websites.
## Dataset Structure
### Data Instances
An instance from the `task3_qarecs` config's `train` split:
```
{'dialogue_turns': {'speaker': [0, 1, 0, 1, 0, 1], 'utterance': ["I really like Jaws, Bottle Rocket, Saving Private Ryan, Tommy Boy, The Muppet Movie, Face/Off, and Cool Hand Luke. I'm looking for a Documentary movie.", 'Beyond the Mat', 'Who is that directed by?', 'Barry W. Blaustein', 'I like Jon Fauer movies more. Do you know anything else?', 'Cinematographer Style']}}
```
An instance from the `task4_reddit` config's `cand_valid` split:
```
{'dialogue_turns': {'speaker': [0], 'utterance': ['MORTAL KOMBAT !']}}
```
### Data Fields
For all configurations:
- `dialogue_turns`: a dictionary feature containing two parallel lists, one entry per turn:
  - `speaker`: an integer with possible values including `0` and `1`, indicating which speaker wrote the utterance.
  - `utterance`: a `string` feature containing the text of the utterance.
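
Because `speaker` and `utterance` are stored as parallel lists, a common first step is to pair them back into turns. The following is a minimal sketch using the `task3_qarecs` instance shown above. The helper names `iter_turns` and `context_response_pairs` are hypothetical, and treating speaker `1` as the responding side is an assumption based on that one example, not something the card states.

```python
from typing import Dict, List, Tuple

# Instance copied from the `task3_qarecs` example in this card.
instance = {
    "dialogue_turns": {
        "speaker": [0, 1, 0, 1, 0, 1],
        "utterance": [
            "I really like Jaws, Bottle Rocket, Saving Private Ryan, Tommy Boy, "
            "The Muppet Movie, Face/Off, and Cool Hand Luke. "
            "I'm looking for a Documentary movie.",
            "Beyond the Mat",
            "Who is that directed by?",
            "Barry W. Blaustein",
            "I like Jon Fauer movies more. Do you know anything else?",
            "Cinematographer Style",
        ],
    }
}


def iter_turns(example: Dict) -> List[Tuple[int, str]]:
    """Pair each speaker id with its utterance (hypothetical helper)."""
    turns = example["dialogue_turns"]
    return list(zip(turns["speaker"], turns["utterance"]))


def context_response_pairs(
    example: Dict, responder: int = 1
) -> List[Tuple[List[str], str]]:
    """Build (history, response) pairs for every turn by `responder`.

    Assumption: speaker 1 is the side whose replies a model should predict.
    """
    pairs = []
    history: List[str] = []
    for speaker, utterance in iter_turns(example):
        if speaker == responder and history:
            # Copy the history so later turns do not mutate earlier pairs.
            pairs.append((list(history), utterance))
        history.append(utterance)
    return pairs


turns = iter_turns(instance)
pairs = context_response_pairs(instance)

for history, response in pairs:
    print(f"{len(history)} context turn(s) -> {response!r}")
```

Each pair keeps the full dialog history up to the response, which suits models that condition on all previous turns; truncating the history is a separate design choice.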
### Data Splits
The splits and corresponding sizes are:
|config |train |test |validation|cand_valid|cand_test|
|:--|------:|----:|---------:|----:|----:|
|task1_qa|96185|9952|9968|-|-|
|task2_recs|1000000|10000|10000|-|-|
|task3_qarecs|952125|4915|5052|-|-|
|task4_reddit|945198|10000|10000|10000|10000|
The `cand_valid` and `cand_test` splits contain negative candidates for the `task4_reddit` configuration. The true response is ranked against these candidates, and hits@k (or another ranking metric) is reported; see the paper for details.
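
To illustrate the ranking evaluation described above, here is a minimal sketch of hits@k for a single example. It is not the paper's evaluation code: the `hits_at_k` and `overlap_score` functions, the word-overlap scorer, the tie-handling rule, and the toy context and negative candidates are all assumptions made for illustration. Only the utterance `'MORTAL KOMBAT !'` comes from this card.

```python
from typing import Callable, Sequence


def hits_at_k(
    score: Callable[[str, str], float],
    context: str,
    true_response: str,
    negatives: Sequence[str],
    k: int,
) -> float:
    """Return 1.0 if the true response ranks in the top k, else 0.0.

    Ties are resolved in favor of the true response (an assumption).
    """
    true_score = score(context, true_response)
    # Number of negative candidates that strictly outrank the true response.
    better = sum(1 for cand in negatives if score(context, cand) > true_score)
    return 1.0 if better < k else 0.0


def overlap_score(context: str, candidate: str) -> float:
    """Toy scorer: count of lowercase tokens shared with the context."""
    ctx_tokens = set(context.lower().split())
    cand_tokens = set(candidate.lower().split())
    return float(len(ctx_tokens & cand_tokens))


# Toy data: the context and negatives are made up for this sketch.
context = "mortal kombat had the best soundtrack"
true_response = "MORTAL KOMBAT !"
negatives = ["I never saw it", "the sequel was worse"]

print(hits_at_k(overlap_score, context, true_response, negatives, k=1))
```

In practice the per-example values would be averaged over the whole test set, with the negatives drawn from the `cand_test` split and `score` supplied by the model under evaluation.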
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
The tasks were constructed from the following existing datasets:
1) MovieLens. The data was downloaded from: http://grouplens.org/datasets/movielens/20m/ on May 27th, 2015.
2) OMDb. The data was downloaded from: http://beforethecode.com/projects/omdb/download.aspx on May 28th, 2015.
3) For `task4_reddit`, the data is a processed subset (movie subreddit only) of the data available at: https://www.reddit.com/r/datasets/comments/3bxlg7
#### Who are the source language producers?
Users of the MovieLens, OMDb, and Reddit websites, among others.
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston (at Facebook Research).
### Licensing Information
```
Creative Commons Attribution 3.0 License
```
### Citation Information
```
@misc{dodge2016evaluating,
title={Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems},
author={Jesse Dodge and Andreea Gane and Xiang Zhang and Antoine Bordes and Sumit Chopra and Alexander Miller and Arthur Szlam and Jason Weston},
year={2016},
eprint={1511.06931},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@gchhablani](https://github.com/gchhablani) for adding this dataset.
Provider: maas
Created: 2025-05-20



