Webis-YouTube8MA-18
Source: webis.de · Indexed: 2025-01-15
Download link:
https://webis.de/data/Webis-YouTube8MA-18
Description:
We used the YouTube Data API to augment the YouTube 8M corpus by crawling a variety of metadata for its videos.
The first point of interest was the "video resource," which comprises data about the video, such as its title, description, uploader name, tags, view count, and more. The metadata also indicates whether comments have been left for the video. If so, we downloaded them as well, including information about their authors, likes, dislikes, and responses.
No property specifies a video's language, since this information is not mandatory when uploading a video. Moreover, the API provides only information about which captions are available, not the captions themselves: only the uploader of a video is given access to its captions via the API. We therefore extracted them using youtube-dl. For each video, we downloaded all manually created captions, as well as the auto-generated captions in the "default" language and in English. The "default" auto-generated caption gives perhaps the only hint at a video's original language.
Finally, we downloaded all thumbnails used to advertise a video, which are not available via the API but only via a canonical URL. Our corpus makes it possible to recreate how a video is presented on YouTube (metadata and thumbnail), what its actual content is ((sub)titles and descriptions), and how its viewers reacted (comments).
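The crawl described above touches three kinds of sources: the Data API for video resources and comments, youtube-dl for captions, and a canonical URL for thumbnails. The following is a minimal Python sketch of how such requests might be assembled, not the authors' actual pipeline. The function names, the chosen `part` values, the `i.ytimg.com` thumbnail pattern, and the caption options are assumptions for illustration; the description does not state which endpoints, URL patterns, or settings were used.

```python
from urllib.parse import urlencode

# Public base URL of the YouTube Data API v3.
API_BASE = "https://www.googleapis.com/youtube/v3"


def video_resource_url(video_id: str, api_key: str) -> str:
    """Build a videos.list request URL for one video (hypothetical helper).

    Assumption: 'snippet' (title, description, tags, channel title) and
    'statistics' (view count, comment count) cover the fields named above.
    """
    query = urlencode({"part": "snippet,statistics", "id": video_id, "key": api_key})
    return f"{API_BASE}/videos?{query}"


def comment_threads_url(video_id: str, api_key: str) -> str:
    """Build a commentThreads.list request URL (hypothetical helper)."""
    query = urlencode({"part": "snippet,replies", "videoId": video_id, "key": api_key})
    return f"{API_BASE}/commentThreads?{query}"


def thumbnail_url(video_id: str, name: str = "hqdefault") -> str:
    """Build a thumbnail URL for a video (hypothetical helper).

    Assumption: this commonly seen pattern stands in for the "canonical URL"
    mentioned in the description, which does not spell out the exact form.
    """
    return f"https://i.ytimg.com/vi/{video_id}/{name}.jpg"


# Assumed youtube-dl options for fetching captions without the video itself.
# These keys would be passed to youtube_dl.YoutubeDL(...); the language list
# is illustrative, since the "default" language differs from video to video.
CAPTION_OPTIONS = {
    "skip_download": True,       # captions only, no video file
    "writesubtitles": True,      # manually created captions
    "writeautomaticsub": True,   # auto-generated captions
    "subtitleslangs": ["en"],
}

if __name__ == "__main__":
    print(video_resource_url("VIDEO_ID", "API_KEY"))
    print(comment_threads_url("VIDEO_ID", "API_KEY"))
    print(thumbnail_url("VIDEO_ID"))
```

Building the URLs separately from issuing the requests keeps the sketch free of network access and API keys; an actual crawl would additionally need paging, quota handling, and error handling.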
Provider:
Webis Group



