cornell-movie-dialog/cornell_movie_dialog

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/cornell-movie-dialog/cornell_movie_dialog
Resource overview:
---
language:
- en
paperswithcode_id: cornell-movie-dialogs-corpus
pretty_name: Cornell Movie-Dialogs Corpus
dataset_info:
  features:
  - name: movieID
    dtype: string
  - name: movieTitle
    dtype: string
  - name: movieYear
    dtype: string
  - name: movieIMDBRating
    dtype: string
  - name: movieNoIMDBVotes
    dtype: string
  - name: movieGenres
    sequence: string
  - name: characterID1
    dtype: string
  - name: characterID2
    dtype: string
  - name: characterName1
    dtype: string
  - name: characterName2
    dtype: string
  - name: utterance
    sequence:
    - name: text
      dtype: string
    - name: LineID
      dtype: string
  splits:
  - name: train
    num_bytes: 19548840
    num_examples: 83097
  download_size: 9916637
  dataset_size: 19548840
---

# Dataset Card for "cornell_movie_dialog"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 9.92 MB
- **Size of the generated dataset:** 19.55 MB
- **Total amount of disk used:** 29.46 MB

### Dataset Summary

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances
- movie metadata included:
  - genres
  - release year
  - IMDB rating
  - number of IMDB votes
- character metadata included:
  - gender (for 3,774 characters)
  - position on movie credits (3,321 characters)

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 9.92 MB
- **Size of the generated dataset:** 19.55 MB
- **Total amount of disk used:** 29.46 MB

An example of 'train' looks as follows.

```
{
    "characterID1": "u0 ",
    "characterID2": " u2 ",
    "characterName1": " m0 ",
    "characterName2": " m0 ",
    "movieGenres": ["comedy", "romance"],
    "movieID": " m0 ",
    "movieIMDBRating": " 6.90 ",
    "movieNoIMDBVotes": " 62847 ",
    "movieTitle": " f ",
    "movieYear": " 1999 ",
    "utterance": {
        "LineID": ["L1"],
        "text": ["L1 "]
    }
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `movieID`: a `string` feature.
- `movieTitle`: a `string` feature.
- `movieYear`: a `string` feature.
- `movieIMDBRating`: a `string` feature.
- `movieNoIMDBVotes`: a `string` feature.
- `movieGenres`: a `list` of `string` features.
- `characterID1`: a `string` feature.
- `characterID2`: a `string` feature.
- `characterName1`: a `string` feature.
- `characterName2`: a `string` feature.
- `utterance`: a dictionary feature containing:
  - `text`: a `string` feature.
  - `LineID`: a `string` feature.

### Data Splits

| name  |train|
|-------|----:|
|default|83097|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@InProceedings{Danescu-Niculescu-Mizil+Lee:11a,
  author={Cristian Danescu-Niculescu-Mizil and Lillian Lee},
  title={Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.},
  booktitle={Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011},
  year={2011}
}
```

### Contributions

Thanks to [@mariamabarham](https://github.com/mariamabarham), [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
Provider:
cornell-movie-dialog
Collected summary
Dataset introduction
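Construction
The dataset is built from fictional conversations extracted from movie scripts: 220,579 conversational exchanges among 9,035 characters from 617 movies, 304,713 utterances in total. Alongside the dialog text it carries movie and character metadata: movie genres, release year, IMDB rating and number of IMDB votes, plus character gender and position in the movie credits.
Features
The dataset's distinguishing features are its rich metadata and its large volume of dialog, which make it a useful resource for research on dialog systems and on the coordination of linguistic style. Because each record is structured, the speakers and movie context of a conversation are available alongside its text, which supports a range of natural language processing tasks.
Usage
The dataset suits natural language processing tasks such as dialog generation, sentiment analysis, and character-relationship modeling. Researchers can combine the dialog text with the metadata for semantic and pragmatic analysis. Select the features that match the task at hand, and note that the card lists a single train split, so any validation or test split has to be carved out by the user.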
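Before selecting features, it may help to normalize a record. The example instance on the dataset card shows string fields padded with spaces and numeric metadata stored as strings, so the sketch below strips the padding and parses the rating and vote count. It is a minimal sketch, not the dataset's own tooling: `clean_record` is a hypothetical helper, the record is the placeholder example copied from the card, and it assumes real records share that padding and hold clean numeric strings (parsing would raise `ValueError` otherwise). `movieYear` is left as a string, as in the card's schema.

```python
from typing import Any

# Example record copied from the dataset card; real records may differ.
record = {
    "characterID1": "u0 ",
    "characterID2": " u2 ",
    "characterName1": " m0 ",
    "characterName2": " m0 ",
    "movieGenres": ["comedy", "romance"],
    "movieID": " m0 ",
    "movieIMDBRating": " 6.90 ",
    "movieNoIMDBVotes": " 62847 ",
    "movieTitle": " f ",
    "movieYear": " 1999 ",
    "utterance": {"LineID": ["L1"], "text": ["L1 "]},
}


def clean_record(rec: dict[str, Any]) -> dict[str, Any]:
    """Strip padding from string fields and parse numeric metadata (hypothetical helper)."""
    out: dict[str, Any] = {}
    for key, value in rec.items():
        if isinstance(value, str):
            out[key] = value.strip()
        elif isinstance(value, list):
            # movieGenres: a list of strings
            out[key] = [item.strip() for item in value]
        else:
            # utterance: a dictionary of parallel lists (LineID, text)
            out[key] = {k: [item.strip() for item in items] for k, items in value.items()}
    # The card's schema stores these numeric values as strings.
    out["movieIMDBRating"] = float(out["movieIMDBRating"])
    out["movieNoIMDBVotes"] = int(out["movieNoIMDBVotes"])
    return out


cleaned = clean_record(record)
```

The cleaned record keeps the same keys, so downstream code can filter on `movieGenres` or `movieIMDBRating` without repeating the whitespace handling.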
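Background and challenges
Background
The Cornell Movie-Dialogs Corpus was created in 2011 by Cristian Danescu-Niculescu-Mizil and Lillian Lee of Cornell University to study the coordination of linguistic style in dialogs. It contains 220,579 conversational exchanges between 10,292 pairs of characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances. It also provides movie and character metadata such as genres, release year, and IMDB rating, which makes it a useful resource for research on dialog systems, natural language processing, and linguistics.
Current challenges
The corpus poses several challenges. First, extracting conversations from movie scripts while keeping them accurate and complete is a complex task that involves substantial data cleaning and formatting. Second, although the metadata is rich, making effective use of it in tasks such as dialog generation or sentiment analysis remains an open problem. Third, because the corpus consists of movie dialog, its language style and content may differ from real-life conversation, which can limit how well models trained on it perform in practice.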
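Common scenarios
Classic use cases
With its rich metadata and large volume of dialog, the Cornell Movie-Dialogs Corpus is a widely used resource for dialog-system research in natural language processing. It is commonly used to train and evaluate dialog generation models, particularly for producing coherent, context-relevant responses. It is also applied to sentiment analysis, dialog act recognition, and the study of linguistic style coordination.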
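One common way to prepare such a corpus for response generation is to turn each conversation into (context, response) pairs of adjacent utterances. The sketch below illustrates that idea only; `to_pairs` is a hypothetical helper, the sample utterances are invented for illustration rather than taken from the corpus, and it assumes the `text` list inside a record's `utterance` field is stored in conversational order.

```python
def to_pairs(utterances: list[str]) -> list[tuple[str, str]]:
    """Pair each utterance with the one that follows it (hypothetical helper)."""
    return list(zip(utterances, utterances[1:]))


# Invented utterances for illustration; not taken from the corpus.
texts = ["Where are you going?", "Home.", "Can I come along?"]
pairs = to_pairs(texts)

for context, response in pairs:
    print(f"{context!r} -> {response!r}")
```

A conversation with a single utterance yields no pairs, so such records drop out of the training set under this scheme.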
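Practical applications
In practice, the corpus has been used in developing customer-service systems, virtual assistants, and social chatbots, helping these systems understand and generate natural-language dialog. It also provides data for creative-industry work such as movie script analysis and modeling of character emotion.
Derived and related work
Research building on the corpus spans dialog generation, sentiment analysis, style transfer, and other areas. For example, researchers have used it to develop neural dialog generation models aimed at more coherent and natural responses. It has also prompted research on linguistic style coordination in movie dialog and on style adaptation in dialog systems.
The content above was collected and summarized by 遇见数据集.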