
sedthh/tv_dialogue

Hugging Face · Updated 2023-03-16 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sedthh/tv_dialogue
Resource description:
---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  - name: METADATA
    dtype: string
  - name: SOURCE
    dtype: string
  splits:
  - name: train
    num_bytes: 211728118
    num_examples: 2781
  download_size: 125187885
  dataset_size: 211728118
license: mit
task_categories:
- conversational
- text2text-generation
- text-generation
language:
- en
tags:
- OpenAssistant
- transcripts
- subtitles
- television
pretty_name: TV and Movie dialogue and transcript corpus
size_categories:
- 1K<n<10K
---

# Dataset Card for "tv_dialogue"

This dataset contains transcripts for famous movies and TV shows from multiple sources.

An example dialogue would be:

```
[PERSON 1] Hello
[PERSON 2] Hello Person 2!
How's it going?

(they are both talking)

[PERSON 1] I like being an example
on Huggingface!

They are examples on Huggingface.
CUT OUT TO ANOTHER SCENE

We are somewhere else
[PERSON 1 (v.o)] I wonder where we are?
```

All dialogues were processed to follow this format. Each row is a single episode / movie (**2781** rows total) following the [OpenAssistant](https://open-assistant.io/) format. The METADATA column contains additional information as a JSON string.

## Dialogue only, with some information on the scene

| Show | Number of scripts | Via | Source |
|----|----|---|---|
| Friends | 236 episodes | https://github.com/emorynlp/character-mining | friends/emorynlp |
| The Office | 186 episodes | https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript | office/nasirkhalid24 |
| Marvel Cinematic Universe | 18 movies | https://www.kaggle.com/datasets/pdunton/marvel-cinematic-universe-dialogue | marvel/pdunton |
| Doctor Who | 306 episodes | https://www.kaggle.com/datasets/jeanmidev/doctor-who | drwho/jeanmidev |
| Star Trek | 708 episodes | http://www.chakoteya.net/StarTrek/index.html based on https://github.com/GJBroughton/Star_Trek_Scripts/ | statrek/chakoteya |

## Actual transcripts with detailed information on the scenes

| Show | Number of scripts | Via | Source |
|----|----|---|---|
| Top Movies | 919 movies | https://imsdb.com/ | imsdb |
| Top Movies | 171 movies | https://www.dailyscript.com/ | dailyscript |
| Stargate SG-1 | 18 episodes | https://imsdb.com/ | imsdb |
| South Park | 129 episodes | https://imsdb.com/ | imsdb |
| Knight Rider | 80 episodes | http://www.knightriderarchives.com/ | knightriderarchives |
Provided by:
sedthh
Summary of original information

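Dataset overview

Basic information

  • Name: TV and Movie dialogue and transcript corpus
  • Alias: tv_dialogue
  • Language: English (en)
  • Task categories:
    • Conversational
    • Text-to-text generation
    • Text generation
  • Tags:
    • OpenAssistant
    • transcripts
    • subtitles
    • television
  • License: MIT
  • Size category: 1K<n<10K

Dataset structure

  • Features:
    • TEXT: string
    • METADATA: string
    • SOURCE: string
  • Splits:
    • train: 2781 examples, 211728118 bytes in total
  • Download size: 125187885 bytes
  • Dataset size: 211728118 bytes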
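
The structure above (a single `train` split with three string columns) can be checked after loading. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository id `sedthh/tv_dialogue` is reachable; the helper names `load_tv_dialogue` and `has_expected_columns` are hypothetical and not part of the dataset.

```python
# Column names taken from the dataset structure listed above.
EXPECTED_COLUMNS = ("TEXT", "METADATA", "SOURCE")


def load_tv_dialogue(split="train"):
    """Load the dataset from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the rest of this module works without `datasets`.
    from datasets import load_dataset

    return load_dataset("sedthh/tv_dialogue", split=split)


def has_expected_columns(column_names):
    """Return True if every expected column is present in `column_names`."""
    return set(EXPECTED_COLUMNS) <= set(column_names)


# Example usage (commented out because it downloads the data):
# ds = load_tv_dialogue()
# print(has_expected_columns(ds.column_names), ds.num_rows)
```

The download is kept inside a function so that importing the module does not trigger network traffic.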

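Content description

  • Content type: dialogue and transcripts from movies and TV shows
  • Example format: each dialogue line carries a speaker tag followed by the spoken text, e.g. [PERSON 1] Hello
  • Data volume: 2781 entries in total, each representing one episode or movie
  • Metadata: the METADATA column holds additional information as a JSON string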
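
As an illustration of the format described above, the sketch below pulls speaker turns out of a TEXT value and decodes a METADATA value. It assumes that every spoken line begins with a bracketed speaker tag and treats all other lines as scene description. The function name `parse_turns` and the metadata key `show` are hypothetical; the actual JSON keys are not documented here.

```python
import json
import re

# A speaker tag in square brackets, followed by the spoken text.
TURN_PATTERN = re.compile(r"\[([^\]]+)\]\s*(.*)")


def parse_turns(text):
    """Return (speaker, utterance) pairs for lines that start with a speaker tag."""
    turns = []
    for raw_line in text.splitlines():
        match = TURN_PATTERN.match(raw_line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


# Sample text modelled on the example format; not taken from the dataset.
sample_text = "\n".join([
    "[PERSON 1] Hello",
    "[PERSON 2] Hello Person 2!",
    "(they are both talking)",
    "[PERSON 1 (v.o)] I wonder where we are?",
])
turns = parse_turns(sample_text)

# METADATA is a JSON string; the key used here is a made-up placeholder.
sample_metadata = '{"show": "Example Show"}'
metadata = json.loads(sample_metadata)

print(turns)
print(metadata["show"])
```

Lines without a tag, such as stage directions, are skipped rather than attached to the previous speaker; whether untagged lines continue the preceding utterance is not specified, so a real pipeline may need a different rule.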

Data sources

  • TV series:
    • Friends: 236 episodes
    • The Office: 186 episodes
    • Doctor Who: 306 episodes
    • Star Trek: 708 episodes
  • Movies:
    • Marvel Cinematic Universe: 18 movies
  • Other:
    • Top Movies: 919 movies (imsdb)
    • Top Movies: 171 movies (dailyscript)
    • Stargate SG-1: 18 episodes
    • South Park: 129 episodes
    • Knight Rider: 80 episodes
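
Because each row records its origin in the SOURCE column (values such as `friends/emorynlp` or `imsdb` appear in the dataset card), the per-source breakdown can be recomputed from the rows themselves. This is a minimal sketch over plain dictionaries; the helper name `count_by_source` and the sample rows are illustrative assumptions, not real data.

```python
from collections import Counter


def count_by_source(rows):
    """Count rows per SOURCE value; `rows` is any iterable of dict-like rows."""
    return Counter(row["SOURCE"] for row in rows)


# Made-up rows that only mimic the column layout of the dataset.
sample_rows = [
    {"TEXT": "[PERSON 1] Hello", "METADATA": "{}", "SOURCE": "friends/emorynlp"},
    {"TEXT": "[PERSON 2] Hi", "METADATA": "{}", "SOURCE": "imsdb"},
    {"TEXT": "[PERSON 1] Bye", "METADATA": "{}", "SOURCE": "imsdb"},
]
counts = count_by_source(sample_rows)
print(counts.most_common())
```

The same function should work on a loaded `datasets.Dataset`, since iterating over it yields one dictionary per row.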