bnovikov/gemma-4-e4b-audio-qa

Name: bnovikov/gemma-4-e4b-audio-qa
Creator: bnovikov
Published: 2026-04-18 01:28:50
License: No description available

Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/bnovikov/gemma-4-e4b-audio-qa


Resource description:

---
license: cc-by-4.0
task_categories:
- audio-classification
- question-answering
language:
- en
tags:
- audio
- audio-understanding
- multimodal
- instruction-tuning
- gemma
pretty_name: Gemma-4 E4B Audio-QA Training Mix
size_categories:
- 10K<n<100K
---

# Gemma-4 E4B Audio-QA Training Mix

An audio question-answering dataset with 91,196 training rows and 4,804 validation rows, assembled from four public upstream datasets and formatted as ChatML-style conversations for instruction-tuning an audio-language model. This is the exact training data used for **`bnovikov/gemma-4-e4b-audio-v3`**.

**Important: this repository contains only the metadata and prompts/answers. The audio files are NOT hosted here.** Each `audio_path` is a source-tagged ID like `librispeech/3664-11714-0019.wav`: the prefix identifies the upstream dataset, and the remainder is the upstream filename. Download the four upstream datasets yourself and prepend your local root for each source in your loader. See "Using the dataset" below.

## Why JSONL-only (no audio bundled)

Two upstream sources have ambiguous redistribution rights: ClothoAQA's audio from Freesound has per-clip licensing, and MusicQA's music is derived from pipelines that pull from sources without bulk-redistribution grants. Rather than ship a partial audio bundle that crosses licensing boundaries, this dataset ships the textual contribution only: questions, answers, source tags, and source-relative audio IDs. See the upstream datasets' own pages for their distribution terms.

## Data

| split | rows | sources |
|---|---:|---|
| `train.jsonl` | 91,196 | MusicQA 47,495 · ClothoAQA 19,947 · FSD50K-QA 14,246 · LibriSpeech-QA 9,508 |
| `val.jsonl` | 4,804 | MusicQA 2,505 · ClothoAQA 1,053 · FSD50K-QA 754 · LibriSpeech-QA 492 |

Train and val are split by **audio clip** (not by row), so no audio appears in both. Some clips are paired with multiple questions (~7k in train, ~400 in val). This is standard for QA datasets.
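The clip-level split and the per-source row counts can be checked locally once `train.jsonl` and `val.jsonl` are downloaded. The following is a minimal sketch, not part of the dataset's tooling: the helper names `load_rows` and `split_report` are hypothetical, and it assumes both files sit in the current directory and follow the row schema described in this card.

```python
import json
from collections import Counter
from pathlib import Path


def load_rows(jsonl_path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def split_report(train_rows, val_rows):
    """Summarise per-source row counts and clips that appear in both splits."""
    train_clip_counts = Counter(row["audio_path"] for row in train_rows)
    val_clips = {row["audio_path"] for row in val_rows}
    return {
        "train_by_source": Counter(row["source"] for row in train_rows),
        "val_by_source": Counter(row["source"] for row in val_rows),
        # Clips that carry more than one question in the training split.
        "train_multi_question_clips": sum(
            1 for n in train_clip_counts.values() if n > 1
        ),
        # Should be empty if the split is truly by audio clip.
        "shared_clips": sorted(set(train_clip_counts) & val_clips),
    }


if __name__ == "__main__":
    # Assumed file locations; adjust to wherever the JSONL files were saved.
    train_path, val_path = Path("train.jsonl"), Path("val.jsonl")
    if train_path.exists() and val_path.exists():
        report = split_report(load_rows(train_path), load_rows(val_path))
        print(report["train_by_source"])
        print(report["val_by_source"])
        print("clips in both splits:", len(report["shared_clips"]))
```

An empty `shared_clips` list is what the clip-level split described above should produce.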
### Row schema

```json
{
  "audio_path": "clotho_aqa/14430.wav",
  "messages": [
    {"role": "system", "content": "You are an audio understanding assistant..."},
    {"role": "user", "content": "how many steps are taken?"},
    {"role": "assistant", "content": "five"}
  ],
  "source": "clotho_aqa"
}
```

The `audio_path` prefix determines which upstream source the file comes from: `librispeech/`, `clotho_aqa/`, `musicqa/`, `fsd50k/`. The system prompt is the same for every row.

### How each source was built

| source field | upstream | Q/A provenance |
|----------------|----------|----------------|
| `clotho_aqa` | ClothoAQA | Q&A taken from upstream, used as-is |
| `musicqa` | MusicQA | Q&A taken from upstream, with a light quality filter (rows whose answers contain "not specified", "unknown", "unclear", "likely <name>", or attribution-style phrases like "hit song", "famous artist" were dropped) |
| `librispeech_qa` | LibriSpeech (train-clean-100) | Five transcription-style templates applied to each clip; answers are the LibriSpeech transcripts with light sentence-casing normalization |
| `fsd50k_qa` | FSD50K (dev set) | Four sound-identification templates; answers are constructed from FSD50K multi-label tags (e.g., `["Harmonica", "Musical instrument", "Music"]` → `"Harmonica, Musical instrument, and Music"`) |

## Using the dataset

1. **Download each upstream dataset** (one-time, external):
   - LibriSpeech `train-clean-100`: <https://www.openslr.org/12/>
   - ClothoAQA: <https://zenodo.org/record/6473207>
   - MusicQA: <https://huggingface.co/datasets/mu-llama/MusicQA>
   - FSD50K: <https://zenodo.org/record/4060432>
2. **In your data loader**, map the `audio_path` prefix to your local root for that source.
A minimal example:

```python
import json
import os

# Local root directory for each audio_path prefix; replace with your own paths.
ROOTS = {
    # LibriSpeech nests files under speaker/chapter folders, so the flat join
    # in resolve() will not locate them without a recursive index.
    "librispeech": "/abs/path/to/LibriSpeech/train-clean-100",
    "clotho_aqa": "/abs/path/to/ClothoAQA/audio",
    "musicqa": "/abs/path/to/MusicQA/audio",
    "fsd50k": "/abs/path/to/FSD50K/dev_audio",
}


def resolve(audio_path: str) -> str:
    """Turn a source-tagged ID such as 'clotho_aqa/14430.wav' into a local path."""
    prefix, filename = audio_path.split("/", 1)
    return os.path.join(ROOTS[prefix], filename)


with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        abs_path = resolve(row["audio_path"])
        # ... load audio, feed to model ...
```

LibriSpeech's default layout nests audio files under speaker/chapter subfolders, so resolving there typically needs a recursive basename index rather than a flat join. The other three sources are flat.

## Licenses

- **This repository (JSONL):** CC-BY 4.0 for the metadata, structure, system prompt, and the LibriSpeech-QA / FSD50K-QA question templates we authored. The answer text in `clotho_aqa` and `musicqa` rows is copied (with light filtering for MusicQA) from those upstream datasets and remains subject to their source licenses.
- **Audio (not included):** each upstream dataset has its own terms. Summary:
  - LibriSpeech: CC BY 4.0
  - FSD50K: CC-BY 4.0 or CC0 per sample
  - ClothoAQA audio: per-clip Creative Commons variants from Freesound
  - MusicQA: check the upstream repo's terms before redistributing

Users should not treat the audio as covered by this dataset's license.
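Returning to the LibriSpeech layout noted under "Using the dataset": the recursive basename index could be sketched as below. This is an illustration under stated assumptions, not the loader used to train the model. The helper names `build_stem_index` and `resolve_librispeech` are hypothetical. The sketch matches on the filename stem rather than the full name because LibriSpeech as distributed by OpenSLR ships `.flac` files while the IDs in this dataset end in `.wav`; the card does not say whether the clips were converted, so stem matching is an assumption that covers both cases.

```python
import os
from pathlib import Path


def build_stem_index(root):
    """Walk `root` once and map each audio file's stem to its full path."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: audio is stored as FLAC (upstream default) or WAV.
            if name.lower().endswith((".flac", ".wav")):
                index[Path(name).stem] = os.path.join(dirpath, name)
    return index


def resolve_librispeech(audio_path, index):
    """Resolve an ID like 'librispeech/3664-11714-0019.wav' via the stem index."""
    prefix, filename = audio_path.split("/", 1)
    if prefix != "librispeech":
        raise ValueError(f"not a LibriSpeech ID: {audio_path}")
    # Raises KeyError if the clip is missing from the local copy.
    return index[Path(filename).stem]
```

Building the index once up front avoids a directory walk for every row. If the resolved file is FLAC, the audio loader in use needs to read that format, or the files need a one-time conversion.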
## Citation

If you use this mix, please also cite the four upstream datasets:

```bibtex
@inproceedings{panayotov2015librispeech,
  title={Librispeech: an ASR corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={ICASSP},
  year={2015}
}

@inproceedings{lipping2022clothoaqa,
  title={Clotho-{AQA}: A Crowdsourced Dataset for Audio Question Answering},
  author={Lipping, Samuel and Sudarsanam, Parthasaarathy and Drossos, Konstantinos and Virtanen, Tuomas},
  booktitle={EUSIPCO},
  year={2022}
}

@article{liu2024musicqa,
  title={Music Understanding {LLaMA}: Advancing Text-to-Music Generation with Question Answering and Caption},
  author={Liu, Shansong and Hussain, Atin Sakkeer and Sun, Chenshuo and Shan, Ying},
  year={2024}
}

@article{fonseca2021fsd50k,
  title={{FSD50K}: An Open Dataset of Human-Labeled Sound Events},
  author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
  journal={IEEE/ACM TASLP},
  year={2021}
}
```
