---
language:
- en
paperswithcode_id: cornell-movie-dialogs-corpus
pretty_name: Cornell Movie-Dialogs Corpus
dataset_info:
  features:
  - name: movieID
    dtype: string
  - name: movieTitle
    dtype: string
  - name: movieYear
    dtype: string
  - name: movieIMDBRating
    dtype: string
  - name: movieNoIMDBVotes
    dtype: string
  - name: movieGenres
    sequence: string
  - name: characterID1
    dtype: string
  - name: characterID2
    dtype: string
  - name: characterName1
    dtype: string
  - name: characterName2
    dtype: string
  - name: utterance
    sequence:
    - name: text
      dtype: string
    - name: LineID
      dtype: string
  splits:
  - name: train
    num_bytes: 19548840
    num_examples: 83097
  download_size: 9916637
  dataset_size: 19548840
---
# Dataset Card for "cornell_movie_dialog"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs (Danescu-Niculescu-Mizil and Lee, 2011); see [Citation Information](#citation-information)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 9.92 MB
- **Size of the generated dataset:** 19.55 MB
- **Total amount of disk used:** 29.46 MB
### Dataset Summary
This corpus contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts:
- 220,579 conversational exchanges between 10,292 pairs of movie characters
- 9,035 characters from 617 movies
- 304,713 utterances in total
- movie metadata:
  - genres
  - release year
  - IMDB rating
  - number of IMDB votes
- character metadata:
  - gender (for 3,774 characters)
  - position in movie credits (for 3,321 characters)
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
The dialogues are in English (`en`).
## Dataset Structure
### Data Instances
#### default
- **Size of downloaded dataset files:** 9.92 MB
- **Size of the generated dataset:** 19.55 MB
- **Total amount of disk used:** 29.46 MB
An example from the `train` split looks as follows.
```
{
    "characterID1": "u0 ",
    "characterID2": " u2 ",
    "characterName1": " m0 ",
    "characterName2": " m0 ",
    "movieGenres": ["comedy", "romance"],
    "movieID": " m0 ",
    "movieIMDBRating": " 6.90 ",
    "movieNoIMDBVotes": " 62847 ",
    "movieTitle": " f ",
    "movieYear": " 1999 ",
    "utterance": {
        "LineID": ["L1"],
        "text": ["L1 "]
    }
}
```
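The string values in the sample carry leading and trailing spaces, and `utterance` stores two parallel lists, `LineID` and `text`. The following is a minimal sketch of normalizing one record, assuming every record has the same layout as the sample; `clean_record` is a hypothetical helper and is not part of the dataset or of the `datasets` library.

```python
def clean_record(record):
    """Strip padding from string fields and pair each LineID with its text.

    Hypothetical helper: assumes the record layout shown in the sample.
    """
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            cleaned[key] = value.strip()
        elif isinstance(value, list):
            cleaned[key] = [item.strip() for item in value]
        else:
            cleaned[key] = value

    # Turn the two parallel lists into one list of per-line dictionaries.
    utterance = record["utterance"]
    cleaned["utterance"] = [
        {"LineID": line_id.strip(), "text": text.strip()}
        for line_id, text in zip(utterance["LineID"], utterance["text"])
    ]
    return cleaned


# Record copied from the sample in this section.
sample = {
    "characterID1": "u0 ",
    "characterID2": " u2 ",
    "characterName1": " m0 ",
    "characterName2": " m0 ",
    "movieGenres": ["comedy", "romance"],
    "movieID": " m0 ",
    "movieIMDBRating": " 6.90 ",
    "movieNoIMDBVotes": " 62847 ",
    "movieTitle": " f ",
    "movieYear": " 1999 ",
    "utterance": {"LineID": ["L1"], "text": ["L1 "]},
}

cleaned = clean_record(sample)
print(cleaned["movieID"], cleaned["utterance"])
```

Pairing each `LineID` with its `text` keeps the two values together when utterances are later filtered or reordered.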
### Data Fields
The data fields are the same among all splits.
#### default
- `movieID`: a `string` feature.
- `movieTitle`: a `string` feature.
- `movieYear`: a `string` feature.
- `movieIMDBRating`: a `string` feature.
- `movieNoIMDBVotes`: a `string` feature.
- `movieGenres`: a `list` of `string` features.
- `characterID1`: a `string` feature.
- `characterID2`: a `string` feature.
- `characterName1`: a `string` feature.
- `characterName2`: a `string` feature.
- `utterance`: a dictionary feature containing:
  - `text`: a `list` of `string` features.
  - `LineID`: a `list` of `string` features.
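Because `movieYear`, `movieIMDBRating`, and `movieNoIMDBVotes` are stored as strings, they need converting before any numeric comparison or sorting. The sketch below assumes the values look like the padded strings in the sample record (`" 1999 "`, `" 6.90 "`, `" 62847 "`). `parse_movie_metadata` is a hypothetical helper, and any value that does not parse is returned as `None` rather than guessed.

```python
def _to_number(raw, cast):
    """Convert a padded string with `cast`; return None if it does not parse."""
    try:
        return cast(raw.strip())
    except ValueError:
        return None


def parse_movie_metadata(record):
    """Return the numeric movie fields of a record as int/float values.

    Hypothetical helper: field names come from the list above; the value
    formats are an assumption based on the sample record.
    """
    return {
        "movieYear": _to_number(record["movieYear"], int),
        "movieIMDBRating": _to_number(record["movieIMDBRating"], float),
        "movieNoIMDBVotes": _to_number(record["movieNoIMDBVotes"], int),
    }


parsed = parse_movie_metadata(
    {"movieYear": " 1999 ", "movieIMDBRating": " 6.90 ", "movieNoIMDBVotes": " 62847 "}
)
print(parsed)
```

Returning `None` for unparseable values lets callers decide whether to drop or inspect such records instead of failing partway through the split.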
### Data Splits
| name |train|
|-------|----:|
|default|83097|
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@InProceedings{Danescu-Niculescu-Mizil+Lee:11a,
  author={Cristian Danescu-Niculescu-Mizil and Lillian Lee},
  title={Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.},
  booktitle={Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011},
  year={2011}
}
```
### Contributions
Thanks to [@mariamabarham](https://github.com/mariamabarham), [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.