
Serial Speakers: a Reproducible TV Series Dataset

Source: DataCite Commons. Updated: 2020-09-04. Indexed: 2024-07-27.
Download link:
https://figshare.com/articles/dataset/TV_Series_Corpus/3471839/7
Description:
**Dataset of three TV series** with **manual** annotations:

- *Breaking Bad*: S01--S05 (file 'bb.json')
- *Game of Thrones*: S01--S07 (file 'got.json')
- *House of Cards*: S01--S02 (file 'hoc.json')

All three files are in JSON format and contain annotated TV series data.

Each TV series is defined by its **name**.

A TV series contains **seasons**, defined by their **id**s.

Every season is made of **episodes**, defined by their **id**s, **title**s, **duration** and **fps**.

Each episode contains two basic kinds of **data**: **scenes** and **speech segments**.

Scenes are defined by their **start**ing points and are made of **shots** (Season 1 only).

A shot is defined by:

- **Start**ing and **end**ing positions.
- Recurring shot **id**s.

The speech segments are defined by their:

- **Start**ing and **end**ing points.
- **Text**ual content (here encrypted for copyright reasons).
- **Speaker**.
- Possible **interlocutors** (for the following episodes only: bb: S01E04, S01E06, S02E03, S02E04; got: S01E03, S01E07, S01E08; hoc: S01E01, S01E07, S01E11).

All timestamps are expressed in seconds and are valid for the video files extracted from the commercial DVDs (PAL, 25 FPS), with recaps (unannotated) included at the beginning of the *House of Cards* episodes.

If you are interested in the textual content of the dataset, please consider using our text recovery tool on GitHub:
https://github.com/bostxavier/Serial-Speakers
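
The nesting described above (series, seasons, episodes, data) can be walked with a few loops. The following is a minimal sketch, not the authors' code: the exact JSON key names are not given in the description, so `seasons`, `episodes`, `data`, `speech_segments`, `start`, `end` and `speaker` are assumptions inferred from the field names, and `toy` is a hand-made stand-in rather than real annotation data. It totals speaking time per speaker and converts a timestamp in seconds to a frame index at the stated 25 FPS.

```python
from collections import defaultdict


def speaking_time(series):
    """Sum speech-segment durations (in seconds) per speaker."""
    totals = defaultdict(float)
    for season in series["seasons"]:
        for episode in season["episodes"]:
            # "speech_segments" is an assumed key name, not confirmed by the description.
            for seg in episode["data"]["speech_segments"]:
                totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def to_frame(seconds, fps=25.0):
    """Convert a timestamp in seconds to the nearest frame index."""
    return round(seconds * fps)


# Tiny hypothetical stand-in for one of the files; not real annotation data.
toy = {
    "name": "Toy Series",
    "seasons": [{
        "id": 1,
        "episodes": [{
            "id": 1, "title": "Pilot", "duration": 60.0, "fps": 25.0,
            "data": {
                "scenes": [{"start": 0.0}],
                "speech_segments": [
                    {"start": 1.0, "end": 3.5, "speaker": "A", "text": "..."},
                    {"start": 4.0, "end": 5.0, "speaker": "B", "text": "..."},
                    {"start": 6.0, "end": 6.5, "speaker": "A", "text": "..."},
                ],
            },
        }],
    }],
}

print(speaking_time(toy))
print(to_frame(2.0))
```

With a downloaded file, `toy` would be replaced by the result of `json.load` on the opened file, and the key names should be checked against the actual file before relying on them.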

Provider:
figshare
Created:
2019-11-21