
TashaSkyUp/audio-quality-dataset-nfe4-30-step2

Source: Hugging Face | Updated 2026-04-09 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/TashaSkyUp/audio-quality-dataset-nfe4-30-step2
Resource description:
---
pretty_name: Audio Quality Dataset NFE 4-30 Step 2
license: other
size_categories:
- 1K<n<10K
tags:
- synthetic
- audio
- spectrogram
- text-to-speech
- weak-labels
- synthetic-speech
- research
language:
- en
---

# Audio Quality Dataset: NFE 4-30 Step 2

## Overview

This dataset publishes synthetic speech artifacts and derived spectrograms used for repo-local audio-quality experiments.

At a glance:

- `2800` synthetic runs
- `200` short English prompt sentences
- `14` NFE settings: `4, 6, 8, ..., 30`
- fixed seed `1024`

Each row represents one synthetic run and includes:

- prompt text
- raw synthetic WAV
- processed synthetic WAV
- spectrogram PNG
- NFE value
- procedural weak label

Here, **NFE** means the number of LongCat inference / ODE steps used during synthesis.

## Synthetic Data Disclosure

Everything in this dataset is synthetic or derived from synthetic artifacts.

- The prompt text comes from a repo-local TSV used for this experiment.
- Raw speech was generated with `meituan-longcat/LongCat-AudioDiT-3.5B`.
- Processed speech was produced by applying one `ClearVoice` `MossFormer2_SR_48K` pass.
- Spectrogram PNGs were rendered from those processed synthetic audio files.
- No human speech recordings are distributed in this dataset bundle.
- Weak labels are procedural NFE-derived buckets, not human annotations.

## What One Row Contains

The manifest rows include:

- `run_id`
- `sentence_id`
- `category`
- `text`
- `nfe`
- `seed`
- `weak_label`
- `sentence_file`
- `run_root`
- `raw_output_dir`
- `processed_output_dir`
- `raw_wav`
- `processed_wav`
- `spectrogram_png`

Primary files in the repo:

- `manifest.tsv`
- `manifest.jsonl`
- `manifest_remote_ready.tsv`
- `summary.json`
- `sentences/`
- `runs/`

Under `runs/<run_id>/` you get the actual synthetic artifacts for that run.

## How The Dataset Was Created

This dataset was built in three stages.

### 1. Prompt and manifest build

Source prompt sheet:

- `tmp/audio_quality_dataset/short_sentences_200.tsv`

Prompt structure:

- `20` categories
- `10` sentences per category
- `200` total sentences

Manifest settings:

- NFE start: `4`
- NFE stop: `30`
- NFE step: `2`
- seed: `1024`

### 2. Synthetic audio generation

For each manifest row:

- raw synthetic speech: `LongCat-AudioDiT-3.5B`
- processed synthetic speech: one `ClearVoice` `MossFormer2_SR_48K` pass

### 3. Spectrogram rendering

Each processed synthetic clip was rendered as:

- grayscale PNG
- `1024 x 512`
- fixed dB range `[-120, 0]`
- `n_fft = 1024`
- `hop = 256`

Repo revision for this dataset export:

- `064a6bd4df88b3222459350d74341933dcfda075`

## Weak Label Semantics

The weak labels are procedural buckets derived only from NFE bands:

- `bad_like`: `nfe <= 12`
- `mid_band`: `14 <= nfe <= 22`
- `good_like`: `nfe >= 24`

Counts:

- `bad_like`: `1000`
- `mid_band`: `1000`
- `good_like`: `800`

These labels are **not** human perceptual judgments. They are synthetic proxy labels designed for downstream experiments.

## Relationship To The Published Model

This dataset was used to train the published autoencoder:

- `TashaSkyUp/audio-quality-ae-spectrogram-patches-gpu3090-best-20260409`

That model uses:

- `spectrogram_png` inputs only
- sentence-level train/validation splitting before patch extraction

It does **not** use the dataset weak labels for training.

If you build new models from this dataset, splitting by `sentence_id` is the safer default to avoid text leakage across train and validation.
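The `sentence_id` split recommended above can be sketched as follows. This is a minimal illustration, not the code used to train the published model: it builds an in-memory stand-in for the manifest (200 sentences times 14 NFE settings) instead of reading the real files, and the `s000`-style sentence ids, the `val_fraction` of 0.2, and the shuffle `seed` are assumptions chosen for the example. The `weak_label` helper only restates the NFE bands listed in the card.

```python
import random
from collections import Counter


def weak_label(nfe: int) -> str:
    """Map an NFE value to the procedural bucket stated in the card."""
    if nfe <= 12:
        return "bad_like"
    if 14 <= nfe <= 22:
        return "mid_band"
    if nfe >= 24:
        return "good_like"
    # Values such as 13 or 23 are not on the 4..30 step-2 grid.
    raise ValueError(f"NFE {nfe} falls outside the documented bands")


def split_by_sentence(rows, val_fraction=0.2, seed=0):
    """Split rows so each sentence_id lands entirely in train or validation."""
    sentence_ids = sorted({row["sentence_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(sentence_ids)
    n_val = max(1, round(len(sentence_ids) * val_fraction))
    val_ids = set(sentence_ids[:n_val])
    train = [row for row in rows if row["sentence_id"] not in val_ids]
    val = [row for row in rows if row["sentence_id"] in val_ids]
    return train, val


# Stand-in manifest: hypothetical sentence ids crossed with the NFE grid.
rows = [
    {"sentence_id": f"s{idx:03d}", "nfe": nfe, "weak_label": weak_label(nfe)}
    for idx in range(200)
    for nfe in range(4, 31, 2)
]

label_counts = Counter(row["weak_label"] for row in rows)
train_rows, val_rows = split_by_sentence(rows)
```

With the real data, the same `split_by_sentence` function could be applied to rows parsed from `manifest.jsonl`, assuming each row carries the `sentence_id` field listed in the manifest columns. Splitting on sentences rather than rows keeps all 14 NFE variants of a prompt on the same side of the split.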
## Limitations

This dataset should not be treated as:

- a corpus of real human speech
- a benchmark with human-rated quality labels
- a general-purpose speech-quality dataset

It primarily captures the behavior of one synthetic generation path:

- LongCat synthesis
- one ClearVoice post-pass
- fixed spectrogram rendering settings

## Licensing And Attribution

This dataset is marked `license: other` because it is a repo-local experiment export and does not assert a new standalone permissive license over the generated artifacts.

Synthetic generation in this workflow depended on:

- `LongCat-AudioDiT` from Meituan, specifically `meituan-longcat/LongCat-AudioDiT-3.5B`
- `ClearVoice`, specifically `MossFormer2_SR_48K`

This Hugging Face dataset repo is not an official upstream release of either dependency. Check upstream terms before redistributing or reusing generated artifacts at scale.
Provider:
TashaSkyUp