Google AudioSet Audio Dataset

Name: Google AudioSet Audio Dataset
Creator: 帕依提提
License: No description available

Indexed by 帕依提提 on 2024-03-04

Download link:

https://www.payititi.com/opendatasets/show-26359.html

Resource description:

AudioSet contains 632 audio event categories and 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and music genres, and everyday environmental sounds. By releasing AudioSet, we hope to provide a common, practical evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events.

Large-scale data collection

The dataset collects the sounds that the human annotators we worked with identified in YouTube videos. Clips were selected for annotation using YouTube metadata and content-based search. The resulting dataset has excellent coverage of the audio event classes in our ontology.

Figure: Number of examples per class

The construction of the ontology and the dataset is described in more detail in our ICASSP 2017 paper. You can contribute to the ontology in our GitHub repository (https://github.com/audioset/ontology). The dataset and machine-extracted features are available for download.

This research was published as a paper at IEEE ICASSP 2017:

Paper: Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems, such as object recognition in images, have benefited enormously from comprehensive datasets, principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that aims to bridge the gap between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes, guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10-second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

AudioSet is provided in two formats:

1. CSV files containing, for each clip, the ID of the YouTube video it comes from, the start time, the end time, and the labels (a clip may carry multiple labels).
2. 128-dimensional features at 1 Hz, that is, one 128-dimensional feature vector per second of audio. The features are extracted with the VGGish model, which is available from the TensorFlow models GitHub repository and can be used to extract features from your own data. VGGish was also used to extract the YouTube-8M features. The features are stored in .tfrecord format.

Download links for the 128-dimensional features (by region):

http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz
http://storage.googleapis.com/eu_audioset/youtube_corpus/v1/features/features.tar.gz
http://storage.googleapis.com/asia_audioset/youtube_corpus/v1/features/features.tar.gz

The mapping between labels and classes is given in class_labels_indices.csv.

AudioSet also provides Starter Code for training on AudioSet as a baseline; the same code is used for training on YouTube-8M. The code can be downloaded from the Starter Code link, and further details are available in Google's AudioSet_User forum.
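
The CSV format and the class_labels_indices.csv mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the official loader. The exact column layout (a "#" comment header, a quoted comma-separated label field, and the column names index, mid, display_name) is an assumption, and the video IDs and label identifiers in the sample data are hypothetical.

```python
import csv
import io

# Hypothetical sample data. The column layout is an assumption about the
# CSV files, and the IDs and label identifiers are made up for illustration.
SEGMENTS_CSV = '''# YTID, start_seconds, end_seconds, positive_labels
vid_000000a, 30.000, 40.000, "/x/speech,/x/music"
vid_000000b, 0.000, 10.000, "/x/music"
'''

LABELS_CSV = '''index,mid,display_name
0,/x/speech,"Speech"
1,/x/music,"Music"
'''


def load_label_names(text):
    """Map each label identifier to its human-readable class name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["mid"]: row["display_name"] for row in reader}


def parse_segments(text):
    """Yield (video_id, start, end, [label ids]) for each data row."""
    # Skip blank lines and "#" comment lines before handing rows to csv.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    # skipinitialspace lets the quoted label field follow ", ".
    for video_id, start, end, labels in csv.reader(lines, skipinitialspace=True):
        yield video_id, float(start), float(end), labels.split(",")


def describe(segments_text, labels_text):
    """Return (video_id, clip length in seconds, [class names]) per clip."""
    names = load_label_names(labels_text)
    return [
        (vid, end - start, [names.get(label, label) for label in labels])
        for vid, start, end, labels in parse_segments(segments_text)
    ]


if __name__ == "__main__":
    for row in describe(SEGMENTS_CSV, LABELS_CSV):
        print(row)
```

To use this with the real files, read their contents into strings and pass them to describe in place of the sample constants, after checking that the column layout matches the assumption above.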

Provided by:

帕依提提

Collected summary

Dataset introduction

Background and challenges

Background overview

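Google AudioSet is a large-scale, human-labeled audio dataset containing 632 audio categories and more than 2 million 10-second audio clips. The clips are taken from YouTube videos and cover a wide range of sounds, including human and animal sounds, musical instruments, and environmental sounds. The dataset is intended to provide a common evaluation benchmark for audio event detection and to advance audio event recognition.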

The above content was collected and summarized by 遇见数据集.