M64/sid-music

Hugging Face · Updated 2026-01-03 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/M64/sid-music

Description:

---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- music
- audio
- commodore-64
- sid
- chiptune
size_categories:
- 100M<n<1B
---

# SID Music Dataset

Register dumps from 2,418 Commodore 64 SID files for training music generation models. Each file contributes 9,000 frames, corresponding to 3 minutes of the SID file at 50 frames per second.

## Dataset Description

- **Source**: [HVSC](https://hvsc.c64.org/) (High Voltage SID Collection)
- **Size**: 1GB of register dump sequences
- **Format**: Hex-encoded SID register states at 50Hz
- **Songs**: 2,418 files from 15 legendary composers

## Composers Included

| Composer | Songs |
|----------|-------|
| DRAX (Thomas Mogensen) | 1042 |
| Laxity (Thomas E. Petersen) | 274 |
| Jeroen Tel | 163 |
| Thomas Detert | 162 |
| Reyn Ouwehand | 124 |
| David Whittaker | 98 |
| Ben Daglish | 86 |
| Johannes Bjerregaard | 84 |
| Rob Hubbard | 78 |
| Jonathan Dunn | 67 |
| Matt Gray | 47 |
| Charles Deenen | 46 |
| Chris Hülsbeck | 42 |
| Mark Cooksey | 39 |
| Martin Galway | 34 |
| **Total** | **2,418** |

## Data Format

Each frame is one line holding 25 SID registers encoded as 50 hex characters:

```
B0080005410A306011C0064108200016800D41082000B4031F
B0084005410A30601100074108200016C00D41082000B4031F
B0088005410A30601140074108200016000E41082000B4031F
...
<end>
```

- 50 hex characters = 25 bytes (SID registers $D400-$D418)
- `<end>` marks song boundaries
- 50 frames = 1 second of audio

## Register Layout

```
Bytes 0-6:   Voice 1 (freq, pulse width, control, envelope)
Bytes 7-13:  Voice 2
Bytes 14-20: Voice 3
Bytes 21-24: Filter + Volume
```

## Usage

### Quick Start with SidGPT

```bash
# Clone SidGPT
git clone https://github.com/M64GitHub/SidGPT
cd SidGPT
pip install torch numpy tqdm

# Download this dataset
wget https://huggingface.co/datasets/M64/sid-music/resolve/main/training.txt.gz
gunzip training.txt.gz
mv training.txt training/data/sid/input.txt

# Tokenize & Train
cd training/data/sid && python prepare.py && cd ../..
python train.py config/train_sid.py
```

### Or Use Pre-trained Model

Skip training entirely and use the pre-trained model directly:

- [SID-GPT 25M Model](https://huggingface.co/M64/sid-gpt-25m)

### Manual / Custom Training

If using your own training setup:

1. **Download**: `training.txt.gz` (~100MB compressed, ~1GB uncompressed)
2. **Format**: Character-level, 22-token vocabulary
3. **Tokenize**: Map characters to indices:

   ```python
   vocab = ['\n', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
            'A', 'B', 'C', 'D', 'E', 'F', '<', '>', 'd', 'e', 'n']
   char_to_idx = {c: i for i, c in enumerate(vocab)}
   ```

4. **Train**: Any GPT/transformer architecture works. Recommended:
   - Block size: 1020+ tokens (20+ frames of context, at 51 tokens per frame including the newline)
   - Character-level prediction (no BPE)

### Statistics

- Total characters: ~1,000,000,000
- Vocabulary: 22 tokens (`0-9`, `A-F`, `<`, `>`, `d`, `e`, `n`, `\n`)
- Average song length: 9,000 frames (~3 minutes)

## License

MIT License. Original SID files from HVSC are © their respective composers. This dataset contains derived register dumps for research purposes.

## Citation

```bibtex
@misc{sidmusicdataset2026,
  author = {Mario Schallner},
  title = {SID Music Dataset: C64 Register Dumps for ML},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/M64/sid-music}
}
```

## Related

- [SID-GPT Model](https://huggingface.co/M64/sid-gpt-25m)
- [SidGPT GitHub](https://github.com/M64GitHub/SidGPT)
- [HVSC](https://hvsc.c64.org/)
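
## Example: Decoding and Tokenizing a Frame

The data format, register layout, and character-level tokenization described above can be sketched together in a few lines of Python. This is a minimal illustration, not part of SidGPT: the helper names `decode_frame`, `split_registers`, and `tokenize` are hypothetical, and the low-byte-first reading of the frequency assumes the standard SID register map, in which $D400 holds the low byte and $D401 the high byte.

```python
# Vocabulary copied from the dataset card (22 character-level tokens).
VOCAB = ['\n', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
         'A', 'B', 'C', 'D', 'E', 'F', '<', '>', 'd', 'e', 'n']
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

FRAME_HEX_CHARS = 50  # 25 registers x 2 hex characters each


def decode_frame(line: str) -> bytes:
    """Convert one 50-character hex line into 25 raw register bytes."""
    line = line.strip()
    if len(line) != FRAME_HEX_CHARS:
        raise ValueError(
            f"expected {FRAME_HEX_CHARS} hex characters, got {len(line)}"
        )
    return bytes.fromhex(line)


def split_registers(frame: bytes) -> dict:
    """Slice a 25-byte frame following the Register Layout section."""
    return {
        "voice1": frame[0:7],
        "voice2": frame[7:14],
        "voice3": frame[14:21],
        "filter_volume": frame[21:25],
    }


def tokenize(text: str) -> list:
    """Map each character to its vocabulary index (no BPE)."""
    return [CHAR_TO_IDX[c] for c in text]


# First sample frame from the Data Format section.
sample = "B0080005410A306011C0064108200016800D41082000B4031F"

regs = split_registers(decode_frame(sample))

# Assumption: 16-bit frequency stored low byte first (standard SID map).
voice1_freq = int.from_bytes(regs["voice1"][0:2], "little")

# One frame plus its trailing newline is 51 tokens.
tokens = tokenize(sample + "\n")

print(regs["voice1"].hex().upper())
print(voice1_freq)
print(len(tokens))
```

Because every frame occupies 51 tokens, the recommended block size of 1020 tokens corresponds to exactly 20 frames, or 0.4 seconds of audio at 50 frames per second.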

Provided by:

M64
