NonNews-BBC
Source: NIAID Data Ecosystem, indexed 2026-03-14
Download link:
https://zenodo.org/record/7255081
Resource description:
This dataset is released as part of the paper "Exploring Pre-Trained Neural Representations for Audio Topic Segmentation". It contains embeddings extracted from non-overlapping 1-second audio portions of various magazine-style (i.e. non-news) shows from BBC radio channels. Each audio file has been anonymised by assigning it a randomised label. We release 7 types of embeddings from different pre-trained architectures. For 3 of these embedding types (openL3, Wav2Vec2 and CREPE) we further release 7 sub-folders, one per pooling strategy, each containing the actual audio embeddings for every file. The pooling strategies are described in the original paper: because openL3, Wav2Vec2 and CREPE output multiple embeddings for each 1-second frame, pooling reduces those multiple embeddings to one per frame. The choice of pooling strategy can strongly affect final model performance.
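As an illustration of what pooling means here, the following is a minimal sketch that reduces several sub-frame embeddings to a single vector per 1-second frame. It is not the authors' code: the function name `pool_frame`, the array shapes and the "max" option are assumptions made for this example, and the 7 strategies actually used are defined only in the original paper (the example path later in this description suggests that a mean-based one is among them).

```python
import numpy as np

def pool_frame(sub_embeddings: np.ndarray, strategy: str = "mean") -> np.ndarray:
    """Reduce an array of shape (n_sub, dim) to one vector of shape (dim,).

    Hypothetical helper: the strategy names are assumptions for illustration,
    not the list of strategies used in the paper.
    """
    if strategy == "mean":
        return sub_embeddings.mean(axis=0)
    if strategy == "max":
        return sub_embeddings.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy!r}")

# Toy input: 3 one-second frames, each with 5 sub-frame embeddings of size 4.
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 5, 4))

# One pooled embedding per frame, giving an array of shape (3, 4).
pooled = np.stack([pool_frame(frame, "mean") for frame in raw])
```

After pooling, every file has exactly one embedding per 1-second frame, which is what lets the embeddings line up with the per-frame ground truth.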
Finally, we release the ground truth for each audio file as a pickle file containing a Python dictionary, where the keys are the same identifiers used to name the embeddings (without the .npy extension). The ground truth was produced by manual annotators and indicates whether each 1-second frame is a topic boundary (i.e. a topic shift happens in or at the end of the frame) or not, where 1 corresponds to a topic boundary and 0 to an in-topic (i.e. non-boundary) frame.
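To make the label convention concrete, here is a small sketch that turns a per-frame list of 0s and 1s into topic segments. The function name `boundaries_to_segments` and the toy label list are made up for this example. It assumes that a frame labelled 1 is the last frame of its segment, which follows from the statement that the shift happens in or at the end of that frame.

```python
def boundaries_to_segments(labels):
    """Convert per-frame boundary labels into half-open (start, end) spans.

    Frames are 1 second long, so the indices can be read as seconds.
    A frame labelled 1 closes the current segment (assumption stated above).
    """
    segments = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            segments.append((start, i + 1))
            start = i + 1
    # Trailing frames after the last boundary form a final segment.
    if start < len(labels):
        segments.append((start, len(labels)))
    return segments

toy_labels = [0, 0, 1, 0, 0, 0, 1, 0]
print(boundaries_to_segments(toy_labels))  # [(0, 3), (3, 7), (7, 8)]
```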
Below we describe the structure of the dataset in more detail, covering the content of each sub-folder and file:
- NonNewsUniform1: The parent directory containing all the other subdirectories and files. Uniform 1 refers to the initial segmentation method, which divides the original audio file into non-overlapping 1-second chunks.
-- openL3: a folder of folders, one for each pooling strategy, each containing numpy arrays, one for each audio source file, with the corresponding openl3 embeddings.
-- x-vectors: a folder of numpy arrays, one for each audio source file, with the corresponding x-vector embeddings.
-- ecapa: a folder of numpy arrays, one for each audio source file, with the corresponding ECAPA embeddings.
-- Wav2Vec: a folder of folders, one for each pooling strategy, each containing numpy arrays, one for each audio source file, with the corresponding wav2vec2 embeddings.
-- CREPE: a folder of folders, one for each pooling strategy, each containing numpy arrays, one for each audio source file, with the corresponding CREPE embeddings.
-- prosodic: a folder of numpy arrays, one for each audio source file, with the corresponding prosodic embeddings.
-- mfcc: a folder of numpy arrays, one for each audio source file, with the corresponding MFCC embeddings.
-- labs_dict.pkl: a pickled file (Protocol version 5) containing the topic segmentation ground truth. It consists of a dictionary where each key is the identifier assigned to an original audio file and the associated value is a list of 0s and 1s whose length equals that of the corresponding embedding file with the same name. Each element of the list indicates whether the corresponding embedding is a topic boundary (1) or not (0), so the lists can be used to train and test a topic segmentation model. For example, in the Non-news dataset the key "vuci12" holds the ground truth for all the numpy array files named with the identifier "vuci12" in the same dataset (e.g. openl3/_mean_/vuci12.npy, prosodic/vuci12.npy, etc.).
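The layout above could be read along the following lines. This is a sketch rather than an official loader: the helper names `load_labels` and `load_pair` are hypothetical, and the code assumes that the first axis of each .npy array indexes the 1-second frames. Pickle protocol 5 requires Python 3.8 or later, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path

import numpy as np

def load_labels(root):
    """Load the ground-truth dictionary from <root>/labs_dict.pkl."""
    with open(Path(root) / "labs_dict.pkl", "rb") as fh:
        return pickle.load(fh)

def load_pair(root, feature_subdir, file_id, labels):
    """Return (embeddings, boundary labels) for one anonymised audio file.

    feature_subdir is a path relative to root, e.g. "prosodic"; for the
    pooled embedding types it also includes the pooling sub-folder.
    """
    emb = np.load(Path(root) / feature_subdir / f"{file_id}.npy")
    y = np.asarray(labels[file_id], dtype=np.int64)
    # Assumption: one row per 1-second frame, matching one label per frame.
    if emb.shape[0] != len(y):
        raise ValueError(
            f"{file_id}: {emb.shape[0]} embeddings but {len(y)} labels"
        )
    return emb, y

# Example usage, assuming the archive was extracted to ./NonNewsUniform1:
# labels = load_labels("NonNewsUniform1")
# X, y = load_pair("NonNewsUniform1", "prosodic", "vuci12", labels)
```

The length check is there because the ground truth is only meaningful when each label lines up with exactly one embedding.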
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
Created:
2022-11-24



