YouTube NSI Captioning Dataset
Source: Mendeley Data. Updated 2024-06-25. Indexed 2024-06-28.
Download link:
https://zenodo.org/records/10681804
Resource description:
Version 1.0, March 2024

Created by

Lloyd May (1), Keita Ohshiro (2,3), Khang Dang (2,3), Sripathi Sridhar (2,3), Jhanvi Pai (2,3), Magdalena Fuentes (4), Sooyeon Lee (3), Mark Cartwright (2,3,4)

(1) Center for Computer Research in Music and Acoustics, Stanford University
(2) Sound Interaction and Computing Lab, New Jersey Institute of Technology
(3) Department of Informatics, New Jersey Institute of Technology
(4) Music and Audio Research Lab, New York University

Publication

If using this data in an academic work, please reference the DOI and version, as well as cite the following paper, which presented the data collection procedure and the first version of the dataset:

May, L., Ohshiro, K., Dang, K., Sridhar, S., Pai, J., Fuentes, M., Lee, S., Cartwright, M. Unspoken Sound: Identifying Trends in Non-Speech Audio Captioning on YouTube. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), 2024.

Description

The YouTube NSI Captioning Dataset was developed to analyze the contemporary and historical state of non-speech information (NSI) captioning on YouTube. NSI includes information about non-speech sounds such as environmental sounds, sound effects, incidental sounds, and music, as well as additional narrative information and extra-speech information (ESI), which gives context to spoken or signed language such as manner of speech (e.g., "[Whispering] Oh no") or speaker label (e.g., "[Juan] Oh no").

The dataset contains measures of estimated and annotated NSI in the captions of two different samples of videos: a popular video sample and a studio video sample. The aim of the popular sample is to understand the captioning practices in a broad spectrum of popular, impactful videos on YouTube. In contrast, the aim of the studio sample is to examine captioning practices among the top-tier production houses, often viewed as industry benchmarks due to their influence and vast resources available for accessibility.
Using the YouTube API, we queried for videos in these two samples for each month from 2013 to 2022. We then estimated which captions contain NSI by searching for non-alphanumeric symbols that are indicative of NSI, e.g., "[" and "]" (see Section 3.2 of the paper for a full list). In addition, the research team manually annotated which captions have NSI in a subset of approximately 1800 videos from the years 2013, 2018, and 2022. Please see Section 3.3 of the paper for details of the annotation process.

The resulting YouTube NSI Captioning Dataset consists of NSI information from ~715k videos containing ~273M lines of captions, ~6M of which are estimated instances of NSI. These videos span 10 years and 21 topics. The annotated subset consists of 1799 videos with a total of ~36k annotated caption lines, ~114k of which are instances of NSI annotated on 7 different categories. These videos span 3 years (2013, 2018, and 2022) and 20 YouTube-assigned topics. Each video has annotations from two annotators along with a consensus annotation.

The dataset contains the links to the YouTube videos, video metadata from the YouTube API, and measures of both estimated and annotated NSI. Due to copyright concerns, we are only publicly releasing data consisting of summary NSI measures for each video. If you need access to the raw data used to create these summary NSI measures, contact Mark Cartwright at mark.cartwright@njit.edu.

Files

estimated_full_set_aggregate.csv : Data file containing the full set of video data with measures of estimated NSI.
annotated_subset_aggregate.csv : Data file containing the smaller annotated subset of video data with measures of both annotated and estimated NSI.

Columns

The following columns are present in both data files.

video_id : The YouTube video ID.
year : The year associated with the time period from which the video was sampled.
sample : The sample that the video is from (i.e., popular or studio).
sampling_period_start_date : The start date of the time period from which the video was sampled.
sampling_period_end_date : The end date of the time period from which the video was sampled.
caption_type : This can take one of three values: auto, which indicates a caption was provided by YouTube's automated caption system; manual, which indicates a caption was provided by the uploader; or none, which indicates that no captions are present for the video.
duration_minutes : The duration of the video in minutes.
channel_id : The ID that YouTube uses to uniquely identify the channel.
published_datetime : The date and time at which the video was published on YouTube.
youtube_topics : The YouTube-provided list of Wikipedia URLs that provide a description of the video's content.
category_id : The YouTube video category associated with the video.
view_count : The count of views on YouTube at the time of sampling (Spring 2023).
like_count : The count of likes on YouTube at the time of sampling (Spring 2023).
comment_count : The count of comments on YouTube at the time of sampling (Spring 2023).
high_level_topics : List of topics at a higher semantic level than youtube_topics that provide a description of the video's content. See the paper for details on the mapping between youtube_topics and high_level_topics.
<nsi_type>__<nsi_measure> : The remainder of the columns take this form with the values listed below.

Values for <nsi_type>:

estimated_nsi : This NSI type is an estimation of NSI based on the presence of particular non-alphanumeric characters that are indicative of NSI as described in Section 3.2 of the paper.
general_nsi (only in annotated_subset_aggregate.csv) : The most general of NSI types that is inclusive of music_nsi, environmental_nsi, additionalnarrative_nsi, and quotedspeech_nsi. All of these NSI types are included in the calculation of measures associated with general_nsi.
Note that misc_nsi and nonenglish_captions are not included as those may or may not contain NSI, and thus, we opt for precision over recall. Not present for the unlabeled
music_nsi (only in annotated_subset_aggregate.csv) : Any genre of music, whether diegetic or not.
environmental_nsi (only in annotated_subset_aggregate.csv) : Environmental sounds, sound effects, and incidental sounds, i.e., non-music and non-speech sounds. This includes non-verbal vocalizations like laughter, grunts, and crying, provided they aren't used to modify speech.
extraspeech_nsi (only in annotated_subset_aggregate.csv) : Extra-speech information (ESI), i.e., text that gives added context to spoken or signed language.
additionalnarrative_nsi (only in annotated_subset_aggregate.csv) : Additional narrative information in the form of descriptive text that doesn't pertain directly to sounds.
quotedspeech_nsi (only in annotated_subset_aggregate.csv) : Quoted speech, i.e., captions containing internal quotation marks.
misc_nsi (only in annotated_subset_aggregate.csv) : Unsure, misc, or ambiguous, i.e., instances where the appropriate label is unclear or the caption doesn't fit current categories.
nonenglish_captions (only in annotated_subset_aggregate.csv) : Captions that are not written in English and thus have uncertain NSI status.

Values for <nsi_measure>:

count : The number of captions identified as containing NSI of the specified type in the video.
presence : Indication of whether there is NSI of the specified type present in the video. 1 if present (i.e., count > 0), 0 if not present (i.e., count == 0).
count_per_minute : A measure of the density of NSI captions. count_per_minute = count / duration_minutes
count_per_minute_if_present : If presence == 1, then count_per_minute; otherwise, NaN.
This is used for computing the aggregate CPMIP measure, which, as discussed in the paper, is intended to be a measure of the quality of NSI captions based on the assumption that more frequently captioned NSI within a video is an indicator of better NSI captioning. See Section 5 of the paper for details.

Conditions of use

Dataset created by Lloyd May, Keita Ohshiro, Khang Dang, Sripathi Sridhar, Jhanvi Pai, Magdalena Fuentes, Sooyeon Lee, and Mark Cartwright.

The YouTube NSI Captioning Dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/

Feedback

Please help us improve the YouTube NSI Captioning Dataset by sending your feedback to:

Mark Cartwright: mark.cartwright@njit.edu

In case of a problem, please include as many details as possible.
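
The symbol-based estimation mentioned in the description can be illustrated with a minimal Python sketch. It assumes a symbol set of only "[" and "]", the two characters named above; the full list is in Section 3.2 of the paper and is not reproduced here. The function name and the sample caption lines are hypothetical, and this is not the authors' code.

```python
import re

# Assumed symbol set: only "[" and "]" are named in the description.
# Section 3.2 of the paper gives the full list used for the dataset.
NSI_SYMBOLS = re.compile(r"[\[\]]")


def is_estimated_nsi(caption_line):
    """Return True if the caption line contains an assumed NSI symbol."""
    return NSI_SYMBOLS.search(caption_line) is not None


# Hypothetical caption lines, modeled on the examples in the description.
captions = ["[Whispering] Oh no", "Oh no", "[Juan] Oh no", "[Music]"]

# Number of lines flagged as estimated NSI for this hypothetical video.
estimated_count = sum(is_estimated_nsi(line) for line in captions)
```

A count produced this way would correspond in spirit to a column such as estimated_nsi__count, under the assumptions stated above.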
Created:
2024-03-03
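
The per-video measures listed under "Values for <nsi_measure>" follow directly from a caption count and a duration, as the sketch below shows. It assumes duration_minutes is positive, and it assumes the aggregate CPMIP is a plain mean over videos where NSI is present; Section 5 of the paper defines the actual aggregation. The helper names and the sample values are hypothetical.

```python
import math


def nsi_measures(count, duration_minutes):
    """Derive the per-video NSI measures from a count and a duration."""
    presence = 1 if count > 0 else 0
    count_per_minute = count / duration_minutes
    # NaN marks videos without NSI so they can be skipped when aggregating.
    cpm_if_present = count_per_minute if presence == 1 else math.nan
    return {
        "count": count,
        "presence": presence,
        "count_per_minute": count_per_minute,
        "count_per_minute_if_present": cpm_if_present,
    }


def mean_ignoring_nan(values):
    """Average the non-NaN values; return NaN if there are none."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


# Hypothetical (count, duration_minutes) pairs, not taken from the dataset.
videos = [(12, 4.0), (0, 10.0), (3, 6.0)]
rows = [nsi_measures(count, minutes) for count, minutes in videos]

# Assumed aggregation: mean of count_per_minute_if_present over present videos.
cpmip = mean_ignoring_nan([r["count_per_minute_if_present"] for r in rows])
```

Using NaN rather than 0 for videos without NSI keeps them from pulling the average down, which is why the if-present measure is stored separately from count_per_minute.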



