sedthh/tv_dialogue
Hugging Face. Updated 2023-03-16; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sedthh/tv_dialogue
Description:
---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  - name: METADATA
    dtype: string
  - name: SOURCE
    dtype: string
  splits:
  - name: train
    num_bytes: 211728118
    num_examples: 2781
  download_size: 125187885
  dataset_size: 211728118
license: mit
task_categories:
- conversational
- text2text-generation
- text-generation
language:
- en
tags:
- OpenAssistant
- transcripts
- subtitles
- television
pretty_name: TV and Movie dialogue and transcript corpus
size_categories:
- 1K<n<10K
---
# Dataset Card for "tv_dialogue"
This dataset contains transcripts for famous movies and TV shows from multiple sources.
An example dialogue would be:
```
[PERSON 1] Hello
[PERSON 2] Hello Person 2!
How's it going?
(they are both talking)
[PERSON 1] I like being an example
on Huggingface!
They are examples on Huggingface.
CUT OUT TO ANOTHER SCENE
We are somewhere else
[PERSON 1 (v.o)] I wonder where we are?
```
All dialogues were processed to follow this format. Each row is a single episode or movie (**2781** rows total)
and follows the [OpenAssistant](https://open-assistant.io/) format. The METADATA column contains additional information as a JSON string.
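
As a rough illustration of how one row might be consumed, the sketch below decodes the METADATA JSON string and separates speaker-tagged lines from untagged ones in TEXT. The helper name `iter_lines`, the regular expression, and the metadata key are assumptions made for this example and are not part of the dataset. The card does not specify how continuation lines and scene descriptions are told apart, so every untagged line is simply returned with no speaker.

```python
import json
import re
from typing import Iterator, Optional, Tuple

# Matches a speaker tag at the start of a line, e.g. "[PERSON 1 (v.o)] text".
SPEAKER_LINE = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def iter_lines(text: str) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (speaker, content) per non-empty line; speaker is None when untagged."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            yield match.group(1), match.group(2)
        else:
            # Continuation of a turn or a scene description; not distinguished here.
            yield None, line


# Hypothetical row: the METADATA key is a placeholder, not a documented field.
row = {
    "TEXT": (
        "[PERSON 1] Hello\n"
        "[PERSON 2] Hello Person 2!\n"
        "How's it going?\n"
        "(they are both talking)\n"
        "[PERSON 1 (v.o)] I wonder where we are?"
    ),
    "METADATA": json.dumps({"placeholder_key": "placeholder_value"}),
    "SOURCE": "friends/emorynlp",
}

metadata = json.loads(row["METADATA"])  # JSON string -> dict
parsed = list(iter_lines(row["TEXT"]))
speakers = [speaker for speaker, _ in parsed if speaker is not None]
```

A bracketed line with no following text would also be reported as a speaker with empty content, so real scripts may need a stricter pattern.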
## Dialogue only, with some information on the scene
| Show | Number of scripts | Via | Source |
|----|----|---|---|
| Friends | 236 episodes | https://github.com/emorynlp/character-mining | friends/emorynlp |
| The Office | 186 episodes | https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript | office/nasirkhalid24 |
| Marvel Cinematic Universe | 18 movies | https://www.kaggle.com/datasets/pdunton/marvel-cinematic-universe-dialogue | marvel/pdunton |
| Doctor Who | 306 episodes | https://www.kaggle.com/datasets/jeanmidev/doctor-who | drwho/jeanmidev |
| Star Trek | 708 episodes | http://www.chakoteya.net/StarTrek/index.html based on https://github.com/GJBroughton/Star_Trek_Scripts/ | statrek/chakoteya |
## Actual transcripts with detailed information on the scenes
| Show | Number of scripts | Via | Source |
|----|----|---|---|
| Top Movies | 919 movies | https://imsdb.com/ | imsdb |
| Top Movies | 171 movies | https://www.dailyscript.com/ | dailyscript |
| Stargate SG-1 | 18 episodes | https://imsdb.com/ | imsdb |
| South Park | 129 episodes | https://imsdb.com/ | imsdb |
| Knight Rider | 80 episodes | http://www.knightriderarchives.com/ | knightriderarchives |
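
The Source column in the tables above can be used to pick out one collection. The following is a minimal sketch, assuming the SOURCE field of each row holds exactly the identifiers listed in that column; `count_by_source`, `select_source`, and the sample rows are hypothetical names and data made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_source(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count how many rows (episodes or movies) each SOURCE value contributes."""
    return Counter(row["SOURCE"] for row in rows)


def select_source(rows: Iterable[Dict[str, str]], source: str) -> List[Dict[str, str]]:
    """Keep only the rows whose SOURCE equals the given identifier."""
    return [row for row in rows if row["SOURCE"] == source]


# Made-up rows that mimic the three columns of the dataset.
sample_rows = [
    {"TEXT": "[PERSON 1] Hello", "METADATA": "{}", "SOURCE": "friends/emorynlp"},
    {"TEXT": "[PERSON 2] Hi", "METADATA": "{}", "SOURCE": "imsdb"},
    {"TEXT": "[PERSON 1] Bye", "METADATA": "{}", "SOURCE": "imsdb"},
]

counts = count_by_source(sample_rows)
imsdb_rows = select_source(sample_rows, "imsdb")
```

With the Hugging Face `datasets` library installed, the rows could presumably come from `load_dataset("sedthh/tv_dialogue", split="train")` instead of the sample list, since iterating that split should yield dictionaries with the same three keys.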
Provider:
sedthh
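Summary of original information
Dataset overview
Basic information
- Name: TV and Movie dialogue and transcript corpus
- Alias: tv_dialogue
- Language: English (en)
- Task categories:
  - Conversational
  - Text-to-text generation
  - Text generation
- Tags:
  - OpenAssistant
  - transcripts
  - subtitles
  - television
- License: MIT
- Size category: 1K<n<10K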
Dataset structure
- Features:
  - TEXT: string
  - METADATA: string
  - SOURCE: string
- Splits:
  - train: 2781 examples, 211728118 bytes in total
- Download size: 125187885 bytes
- Dataset size: 211728118 bytes
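Content description
- Content type: dialogue and transcript text from movies and TV shows
- Example format: each dialogue contains speaker tags and the spoken text, e.g. `[PERSON 1] Hello`
- Data volume: 2781 entries in total, each representing one episode or movie
- Metadata: the METADATA column contains additional information in JSON format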
Data sources
- TV series:
  - Friends: 236 episodes
  - The Office: 186 episodes
  - Doctor Who: 306 episodes
  - Star Trek: 708 episodes
- Movies:
  - Marvel Cinematic Universe: 18 movies
- Other:
  - Top Movies: 919 movies
  - South Park: 129 episodes
  - Knight Rider: 80 episodes



