
kiahh/NSynth

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/kiahh/NSynth
Resource overview:
---
license: cc-by-4.0
dataset_info:
  features:
  - name: id
    dtype: string
  - name: note
    dtype: int64
  - name: note_str
    dtype: string
  - name: instrument
    dtype: int64
  - name: instrument_str
    dtype: string
  - name: pitch
    dtype: int64
  - name: velocity
    dtype: int64
  - name: sample_rate
    dtype: int64
  - name: qualities
    sequence: int64
  - name: qualities_str
    sequence: string
  - name: instrument_family
    dtype: int64
  - name: instrument_family_str
    dtype: string
  - name: instrument_source
    dtype: int64
  - name: instrument_source_str
    dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 129511245
    num_examples: 289205
  - name: validation
    num_bytes: 5679042
    num_examples: 12678
  - name: test
    num_bytes: 1830670
    num_examples: 4096
  download_size: 25233566634
  dataset_size: 137020957
task_categories:
- audio-to-audio
- audio-classification
tags:
- music
pretty_name: NSynth
size_categories:
- 100K<n<1M
---

# Dataset Card for NSynth

The NSynth dataset is an audio dataset containing over 300,000 musical notes across over 1,000 commercially sampled instruments, distinguished by pitch, timbre, and envelope. Each recording was made by playing and holding a musical note for three seconds and letting it decay for one second. The collection of four-second recordings ranges over every pitch on a standard MIDI piano (or as many as possible for the given instrument), played at five different velocities.

This dataset was created as an attempt to establish a high-quality entry point into audio machine learning, in response to the surge of breakthroughs in generative modeling of images due to the abundance of approachable image datasets (MNIST, CIFAR, ImageNet). NSynth is meant to be both a benchmark for audio ML and a foundation to be expanded on with future datasets.

### Dataset Description

Since some instruments are not capable of producing all 88 pitches in the MIDI piano's range, there is an average of 65.4 pitches per instrument. Furthermore, the commercial sample packs occasionally contain duplicate sounds across multiple velocities, leaving an average of 4.75 unique velocities per pitch.

Each of the notes is annotated with three additional pieces of information based on a combination of human evaluation and heuristic algorithms:

1. Source: The method of sound production for the note’s instrument. This can be one of `acoustic` or `electronic` for instruments that were recorded from acoustic or electronic instruments, respectively, or `synthetic` for synthesized instruments.

|Index|ID|
|:----|:----|
|0|acoustic|
|1|electronic|
|2|synthetic|

2. Family: The high-level family of which the note’s instrument is a member. Each instrument is a member of exactly one family. The first table below lists the family indices; the second gives the number of notes in each family, broken down by source.

|**Index**|**ID**|
|:---|:---|
|0|bass|
|1|brass|
|2|flute|
|3|guitar|
|4|keyboard|
|5|mallet|
|6|organ|
|7|reed|
|8|string|
|9|synth_lead|
|10|vocal|

|**Family**|**Acoustic**|**Electronic**|**Synthetic**|**Total**|
|:----|:----|:----|:----|:----|
|Bass|200|8387|60368|68955|
|Brass|13760|70|0|13830|
|Flute|6572|35|2816|9423|
|Guitar|13343|16805|5275|35423|
|Keyboard|8508|42645|3838|54991|
|Mallet|27722|5581|1763|35066|
|Organ|176|36401|0|36577|
|Reed|14262|76|528|14866|
|String|20510|84|0|20594|
|Synth Lead|0|0|5501|5501|
|Vocal|3925|140|6688|10753|
|**Total**|108978|110224|86777|305979|

3. Qualities: Sonic qualities of the note. See below for descriptions of the qualities, and [here](https://magenta.tensorflow.org/datasets/nsynth#quality-co-occurrences) for information on co-occurrences between qualities.

|**Index**|**ID**|**Description**|
|:----|:----|:----|
|0|`bright`|A large amount of high frequency content and strong upper harmonics.|
|1|`dark`|A distinct lack of high frequency content, giving a muted and bassy sound. Also sometimes described as ‘Warm’.|
|2|`distortion`|Waveshaping that produces a distinctive crunchy sound and presence of many harmonics. Sometimes paired with non-harmonic noise.|
|3|`fast_decay`|Amplitude envelope of all harmonics decays substantially before the ‘note-off’ point at 3 seconds.|
|4|`long_release`|Amplitude envelope decays slowly after the ‘note-off’ point, sometimes still present at the end of the four-second sample.|
|5|`multiphonic`|Presence of overtone frequencies related to more than one fundamental frequency.|
|6|`nonlinear_env`|Modulation of the sound with a distinct envelope behavior different than the monotonic decrease of the note. Can also include filter envelopes as well as dynamic envelopes.|
|7|`percussive`|A loud non-harmonic sound at note onset.|
|8|`reverb`|Room acoustics that were not able to be removed from the original sample.|
|9|`tempo-synced`|Rhythmic modulation of the sound to a fixed tempo.|

### Dataset Sources

- **Homepage:** https://magenta.tensorflow.org/datasets/nsynth
- **Paper:** https://arxiv.org/abs/1704.01279

## Uses

This dataset has seen much use in models for generating audio, and some of these models have even been used by high-profile artists. Another obvious application of the dataset is classification: identifying instruments, or perhaps even sonic qualities, which could be useful in applications such as music recommendation. See [here](https://colab.research.google.com/drive/16u5dvqWxA7o9S0iC6E8B3S77piFZ0BYL#scrollTo=Q5BGqIb87Pek&uniqifier=2) for one such example (which is a work in progress).

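For the classification use mentioned above, the integer annotations have to be mapped back to readable labels. The following is a minimal sketch of that mapping, assuming only the index tables given in this card; the helper name `decode_labels` is hypothetical and not part of the dataset or of any library, and the sample values are the annotation fields of the example instance in the Dataset Structure section.

```python
# Index-to-ID tables copied from this dataset card.
SOURCES = ["acoustic", "electronic", "synthetic"]
FAMILIES = ["bass", "brass", "flute", "guitar", "keyboard", "mallet",
            "organ", "reed", "string", "synth_lead", "vocal"]
QUALITIES = ["bright", "dark", "distortion", "fast_decay", "long_release",
             "multiphonic", "nonlinear_env", "percussive", "reverb",
             "tempo-synced"]


def decode_labels(example):
    """Hypothetical helper: turn integer annotations into readable labels."""
    flags = example["qualities"]
    if len(flags) != len(QUALITIES):
        raise ValueError("expected a 10-element binary qualities vector")
    return {
        "family": FAMILIES[example["instrument_family"]],
        "source": SOURCES[example["instrument_source"]],
        # Keep the name of every quality whose flag is set to 1.
        "qualities": [name for name, flag in zip(QUALITIES, flags) if flag],
    }


# Annotation fields taken from the example instance in this card.
example = {
    "instrument_family": 0,
    "instrument_source": 2,
    "qualities": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}
labels = decode_labels(example)
print(labels)
```

The decoded values should agree with the `instrument_family_str`, `instrument_source_str`, and `qualities_str` fields stored alongside each note, so the same helper can double as a consistency check.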
## Dataset Structure

The dataset has three splits:

* Train: A training set with 289,205 examples. Instruments do not overlap with valid or test.
* Valid: A validation set with 12,678 examples. Instruments do not overlap with train.
* Test: A test set with 4,096 examples. Instruments do not overlap with train.

See below for descriptions of the features.

|Feature|Type|Description|
|:----|:----|:----|
|note|`int64`|A unique integer identifier for the note.|
|note_str|`str`|A unique string identifier for the note in the format `<instrument_str>-<pitch>-<velocity>`.|
|instrument|`int64`|A unique, sequential identifier for the instrument the note was synthesized from.|
|instrument_str|`str`|A unique string identifier for the instrument this note was synthesized from in the format `<instrument_family_str>-<instrument_production_str>-<instrument_name>`. Note that the example instance below separates these parts with underscores rather than hyphens.|
|pitch|`int64`|The 0-based MIDI pitch in the range \[0, 127\].|
|velocity|`int64`|The 0-based MIDI velocity in the range \[0, 127\].|
|sample_rate|`int64`|The samples per second for the audio feature.|
|qualities|`[int64]`|A binary vector representing which sonic qualities are present in this note.|
|qualities_str|`[str]`|A list of the IDs of the qualities present in this note, selected from the sonic qualities list.|
|instrument_family|`int64`|The index of the instrument family this instrument is a member of.|
|instrument_family_str|`str`|The ID of the instrument family this instrument is a member of.|
|instrument_source|`int64`|The index of the sonic source for this instrument.|
|instrument_source_str|`str`|The ID of the sonic source for this instrument.|
|audio|`{'path': str, 'array': [float], 'sampling_rate': int64}`|A dictionary containing a path to the corresponding audio file, a list of audio samples represented as floating point values in the range \[-1,1\], and the sampling rate.|

An example instance generated with the loading script (note that this differs from the example instance on the homepage, as the script integrates the audio into the respective JSON files):

```
{'note': 84147,
 'note_str': 'bass_synthetic_033-035-050',
 'instrument': 417,
 'instrument_str': 'bass_synthetic_033',
 'pitch': 35,
 'velocity': 50,
 'sample_rate': 16000,
 'qualities': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 'qualities_str': ['dark'],
 'instrument_family': 0,
 'instrument_family_str': 'bass',
 'instrument_source': 2,
 'instrument_source_str': 'synthetic',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/335ef507846fb65b0b87154c22cefd1fe87ea83e8253ef1f72648a3fdfac9a5f/nsynth-test/audio/bass_synthetic_033-035-050.wav',
           'array': array([0., 0., 0., ..., 0., 0., 0.]),
           'sampling_rate': 16000}
}
```

## Potential Shortcomings

There are quite a few family-source pairings with little or no representation. While this is understandable in some cases (there is no acoustic Synth Lead, for instance), it may be problematic in others: there are no synthetic brass, string, or organ samples, and fewer than 100 electronic samples each for brass, flute, reed, and string. This can be particularly troublesome in classification problems, as there may not be sufficient data for a model to correctly distinguish between sources for a particular family of instruments. In music generation, on the other hand, these disparities may yield a bias toward the use of one source over others for a given family.

## Citation

```
Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders." 2017.
```

**BibTeX:**

```
@misc{nsynth2017,
  Author = {Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Douglas Eck and Karen Simonyan and Mohammad Norouzi},
  Title = {Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders},
  Year = {2017},
  Eprint = {arXiv:1704.01279},
}
```

## Dataset Card Authors

John Gillen
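
As a supplementary sketch for the Potential Shortcomings section, the sparse family-source pairings can be recovered directly from the family-by-source table in the Dataset Description. The counts below are copied from that table and the threshold of 100 follows the wording of the card; the names `COUNTS` and `sparse_pairs` are hypothetical and chosen only for illustration.

```python
# Note counts per family, copied from the family-by-source table in this card.
# Column order: acoustic, electronic, synthetic.
SOURCES = ("acoustic", "electronic", "synthetic")
COUNTS = {
    "bass": (200, 8387, 60368),
    "brass": (13760, 70, 0),
    "flute": (6572, 35, 2816),
    "guitar": (13343, 16805, 5275),
    "keyboard": (8508, 42645, 3838),
    "mallet": (27722, 5581, 1763),
    "organ": (176, 36401, 0),
    "reed": (14262, 76, 528),
    "string": (20510, 84, 0),
    "synth_lead": (0, 0, 5501),
    "vocal": (3925, 140, 6688),
}


def sparse_pairs(counts, threshold=100):
    """Return (family, source, count) for each pairing below `threshold` notes."""
    return [
        (family, source, n)
        for family, row in counts.items()
        for source, n in zip(SOURCES, row)
        if n < threshold
    ]


total = sum(sum(row) for row in COUNTS.values())
sparse = sparse_pairs(COUNTS)
missing = [(family, source) for family, source, n in sparse if n == 0]

print(f"total notes: {total}")
for family, source, n in sparse:
    print(f"{family:<10} {source:<10} {n}")
```

A check like this could be run before training a source classifier, for instance to decide which pairings to exclude or to reweight.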

Provider:
kiahh