TashaSkyUp/audio-quality-dataset-nfe4-30-step2
Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TashaSkyUp/audio-quality-dataset-nfe4-30-step2
Resource description:
---
pretty_name: Audio Quality Dataset NFE 4-30 Step 2
license: other
size_categories:
- 1K<n<10K
tags:
- synthetic
- audio
- spectrogram
- text-to-speech
- weak-labels
- synthetic-speech
- research
language:
- en
---
# Audio Quality Dataset: NFE 4-30 Step 2
## Overview
This dataset publishes synthetic speech artifacts and derived spectrograms used for repo-local audio-quality experiments.
At a glance:
- `2800` synthetic runs
- `200` short English prompt sentences
- `14` NFE settings: `4, 6, 8, ..., 30`
- fixed seed `1024`
Each row represents one synthetic run and includes:
- prompt text
- raw synthetic WAV
- processed synthetic WAV
- spectrogram PNG
- NFE value
- procedural weak label
Here, **NFE** (number of function evaluations) means the number of LongCat inference / ODE steps used during synthesis.
## Synthetic Data Disclosure
Everything in this dataset is synthetic or derived from synthetic artifacts.
- The prompt text comes from a repo-local TSV used for this experiment.
- Raw speech was generated with `meituan-longcat/LongCat-AudioDiT-3.5B`.
- Processed speech was produced by applying one `ClearVoice` `MossFormer2_SR_48K` pass.
- Spectrogram PNGs were rendered from those processed synthetic audio files.
- No human speech recordings are distributed in this dataset bundle.
- Weak labels are procedural NFE-derived buckets, not human annotations.
## What One Row Contains
The manifest rows include:
- `run_id`
- `sentence_id`
- `category`
- `text`
- `nfe`
- `seed`
- `weak_label`
- `sentence_file`
- `run_root`
- `raw_output_dir`
- `processed_output_dir`
- `raw_wav`
- `processed_wav`
- `spectrogram_png`
Primary files in the repo:
- `manifest.tsv`
- `manifest.jsonl`
- `manifest_remote_ready.tsv`
- `summary.json`
- `sentences/`
- `runs/`
Under `runs/<run_id>/` you get the actual synthetic artifacts for that run.
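A minimal sketch of reading the manifest, assuming `manifest.tsv` has a header row with the column names listed above and that the path columns are relative to the dataset root. Neither assumption is confirmed by this card, and `read_manifest` and `artifact_paths` are hypothetical helper names.

```python
import csv
from pathlib import Path


def read_manifest(lines):
    """Parse tab-separated manifest lines into a list of dicts, one per run.

    `lines` is any iterable of strings, for example an open file handle.
    Assumes the first line is a header naming the columns.
    """
    rows = list(csv.DictReader(lines, delimiter="\t"))
    for row in rows:
        # TSV fields arrive as strings, so cast the numeric columns.
        row["nfe"] = int(row["nfe"])
        row["seed"] = int(row["seed"])
    return rows


def artifact_paths(row, root="."):
    """Resolve the per-run artifact columns against a local root directory.

    Assumes the path columns are relative to `root`.
    """
    base = Path(root)
    keys = ("raw_wav", "processed_wav", "spectrogram_png")
    return {key: base / row[key] for key in keys}
```

Typical use would be `with open("manifest.tsv", newline="", encoding="utf-8") as f: rows = read_manifest(f)`. If the paths in `manifest.tsv` turn out to be absolute or machine-specific, `manifest_remote_ready.tsv` may be the better starting point, though its exact contents are not described here.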
## How The Dataset Was Created
This dataset was built in three stages.
### 1. Prompt and manifest build
Source prompt sheet:
- `tmp/audio_quality_dataset/short_sentences_200.tsv`
Prompt structure:
- `20` categories
- `10` sentences per category
- `200` total sentences
Manifest settings:
- NFE start: `4`
- NFE stop: `30`
- NFE step: `2`
- seed: `1024`
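As a sanity check on the settings above, the run grid can be sketched as the cross product of sentences and NFE values. This is an illustration rather than the actual build script: `nfe_grid` and `build_runs` are hypothetical names, and the integer sentence IDs are placeholders.

```python
NFE_START, NFE_STOP, NFE_STEP = 4, 30, 2
SEED = 1024
NUM_SENTENCES = 200


def nfe_grid(start=NFE_START, stop=NFE_STOP, step=NFE_STEP):
    """Return the inclusive NFE grid: 4, 6, 8, ..., 30."""
    return list(range(start, stop + 1, step))


def build_runs(sentence_ids, seed=SEED):
    """Pair every sentence with every NFE value under one fixed seed."""
    grid = nfe_grid()
    return [
        {"sentence_id": sid, "nfe": nfe, "seed": seed}
        for sid in sentence_ids
        for nfe in grid
    ]


# Placeholder sentence IDs; the real IDs come from the prompt sheet.
runs = build_runs(range(NUM_SENTENCES))
print(len(nfe_grid()), len(runs))
```

With `14` NFE settings and `200` sentences this gives `14 * 200 = 2800` runs, matching the total reported in the overview.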
### 2. Synthetic audio generation
For each manifest row:
- raw synthetic speech: `LongCat-AudioDiT-3.5B`
- processed synthetic speech: one `ClearVoice` `MossFormer2_SR_48K` pass
### 3. Spectrogram rendering
Each processed synthetic clip was rendered as:
- grayscale PNG
- `1024 x 512`
- fixed dB range `[-120, 0]`
- `n_fft = 1024`
- `hop = 256`
Repo revision for this dataset export:
- `064a6bd4df88b3222459350d74341933dcfda075`
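The rendering settings in stage 3 imply some simple arithmetic, sketched below using only the standard library. Several details are assumptions not stated in this card: the 0 dB reference (amplitude `1.0`), no padding when counting frames, and the polarity of the grayscale mapping (0 dB as white). How the STFT grid is resized or cropped to `1024 x 512` is also unspecified, so that step is left out.

```python
import math

N_FFT = 1024
HOP = 256
DB_MIN, DB_MAX = -120.0, 0.0


def frame_count(num_samples, n_fft=N_FFT, hop=HOP):
    """Number of full STFT frames, assuming no padding."""
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop


def frequency_bins(n_fft=N_FFT):
    """Number of non-negative frequency bins of a real-input FFT."""
    return n_fft // 2 + 1


def amplitude_to_db(amplitude, floor_db=DB_MIN):
    """Convert a linear magnitude to dB relative to 1.0, floored at floor_db."""
    if amplitude <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(amplitude))


def db_to_gray(db, db_min=DB_MIN, db_max=DB_MAX):
    """Clip dB to the fixed range and map it linearly to an 8-bit gray level."""
    clipped = min(max(db, db_min), db_max)
    return round(255 * (clipped - db_min) / (db_max - db_min))
```

Because the dB range is fixed rather than normalized per clip, pixel intensities stay comparable across runs and NFE settings.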
## Weak Label Semantics
The weak labels are procedural buckets derived only from NFE bands:
- `bad_like`: `nfe <= 12`
- `mid_band`: `14 <= nfe <= 22`
- `good_like`: `nfe >= 24`
Counts:
- `bad_like`: `1000`
- `mid_band`: `1000`
- `good_like`: `800`
These labels are **not** human perceptual judgments. They are synthetic proxy labels designed for downstream experiments.
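The bucketing rule above can be sketched as a small function. `weak_label` is a hypothetical name, and raising on values that fall between bands (such as `13` or `23`) is a choice made here, since the published grid only contains even NFE values.

```python
from collections import Counter


def weak_label(nfe):
    """Map an NFE value to its procedural weak label."""
    if nfe <= 12:
        return "bad_like"
    if 14 <= nfe <= 22:
        return "mid_band"
    if nfe >= 24:
        return "good_like"
    # Values such as 13 or 23 are outside every documented band.
    raise ValueError(f"NFE {nfe} falls between the documented bands")


# Reproduce the label counts: 200 sentences for each NFE in 4, 6, ..., 30.
counts = Counter(weak_label(nfe) for nfe in range(4, 31, 2) for _ in range(200))
print(dict(counts))
```

Five NFE settings fall in each of the first two bands and four in the last, which is where the `1000 / 1000 / 800` split comes from.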
## Relationship To The Published Model
This dataset was used to train the published autoencoder:
- `TashaSkyUp/audio-quality-ae-spectrogram-patches-gpu3090-best-20260409`
That model uses:
- `spectrogram_png` inputs only
- sentence-level train/validation splitting before patch extraction
It does **not** use the dataset weak labels for training.
If you build new models from this dataset, splitting by `sentence_id` is the safer default to avoid text leakage across train and validation.
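One way to split by `sentence_id` is sketched below. The `val_fraction` and `seed` defaults are illustrative and are not the settings used for the published model, whose exact split is not documented here. `split_by_sentence` is a hypothetical helper that takes manifest rows as dicts.

```python
import random


def split_by_sentence(rows, val_fraction=0.2, seed=0):
    """Split rows so that each sentence_id lands entirely in one partition."""
    sentence_ids = sorted({row["sentence_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(sentence_ids)
    # Hold out a fraction of the sentences, at least one.
    n_val = max(1, round(len(sentence_ids) * val_fraction))
    val_ids = set(sentence_ids[:n_val])
    train = [row for row in rows if row["sentence_id"] not in val_ids]
    val = [row for row in rows if row["sentence_id"] in val_ids]
    return train, val
```

Splitting on sentences rather than rows keeps all `14` NFE variants of a prompt on the same side, so a model cannot score well on validation by recognizing the text content of a spectrogram it saw during training.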
## Limitations
This dataset should not be treated as:
- a corpus of real human speech
- a benchmark with human-rated quality labels
- a general-purpose speech-quality dataset
It primarily captures the behavior of one synthetic generation path:
- LongCat synthesis
- one ClearVoice post-pass
- fixed spectrogram rendering settings
## Licensing And Attribution
This dataset is marked `license: other` because it is a repo-local experiment export and does not assert a new standalone permissive license over the generated artifacts.
Synthetic generation in this workflow depended on:
- `LongCat-AudioDiT` from Meituan, specifically `meituan-longcat/LongCat-AudioDiT-3.5B`
- `ClearVoice`, specifically `MossFormer2_SR_48K`
This Hugging Face dataset repo is not an official upstream release of either dependency. Check upstream terms before redistributing or reusing generated artifacts at scale.
Provider:
TashaSkyUp



