shangeth/libriasr-mimi-codes
Source: Hugging Face · Updated 2026-03-12 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/shangeth/libriasr-mimi-codes
Resource description:
---
language:
- en
license: cc-by-4.0
tags:
- audio
- text-to-speech
- mimi
- librispeech
- multi-speaker
- speech-synthesis
- codec
task_categories:
- text-to-speech
pretty_name: LibriSpeech ASR — Kyutai Mimi Encoded
size_categories:
- 10K<n<100K
---
# LibriSpeech ASR — Kyutai Mimi Encoded
[LibriSpeech ASR](https://www.openslr.org/12) (train.clean.100) pre-encoded with the [Kyutai Mimi](https://huggingface.co/kyutai/mimi) neural audio codec.
Instead of raw waveforms, every utterance is stored as a compact matrix of discrete codec tokens (8 codebooks × L frames). The codes can be fed directly into any language-model-style audio generation pipeline, so no Mimi encoder pass is needed at training time.
## What's inside
```
manifest.jsonl # metadata — one JSON record per utterance
spk_index.json # { "speaker_id": [idx, idx, ...] } — speaker-to-utterance index
shards/
├── shard_0000.pt # packed dict of { idx -> (8, L) int16 code tensor }
├── shard_0001.pt
└── ...
```
Each `manifest.jsonl` record:
```json
{
"idx": 0,
"text": "He was in a confused state of mind.",
"codes_file": "shards/shard_0000.pt:0",
"speaker_id": "1234",
"n_frames": 198
}
```
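A minimal sketch of reading a manifest record and resolving its `codes_file` reference. It assumes the part after the colon is the key into the shard dict and that shards were written with `torch.save`; neither is confirmed here, so check the key type (int vs. string) against a real shard. The helper names `parse_codes_file`, `read_manifest`, and `load_codes` are hypothetical.

```python
import json
from pathlib import Path


def parse_codes_file(codes_file: str) -> tuple[str, int]:
    """Split 'shards/shard_0000.pt:0' into ('shards/shard_0000.pt', 0)."""
    shard_path, _, key = codes_file.rpartition(":")
    return shard_path, int(key)


def read_manifest(path):
    """Yield one dict per non-empty line of a JSONL manifest."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def load_codes(root, record):
    """Return the (8, L) code tensor for one manifest record.

    Assumption: each shard is a dict saved with torch.save whose keys
    match the integer after the colon in `codes_file`.
    """
    import torch  # imported lazily so the parsing helpers work without PyTorch

    shard_path, key = parse_codes_file(record["codes_file"])
    shard = torch.load(Path(root) / shard_path, map_location="cpu")
    return shard[key]


# The example record from this card, parsed without touching any shard file.
record = json.loads(
    '{"idx": 0, "text": "He was in a confused state of mind.", '
    '"codes_file": "shards/shard_0000.pt:0", "speaker_id": "1234", "n_frames": 198}'
)
shard_path, key = parse_codes_file(record["codes_file"])
print(shard_path, key)
```

`load_codes` reloads the whole shard on every call; in a training loop it is usually worth caching the most recently opened shard.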
`spk_index.json` maps each speaker ID to the list of utterance indices for that speaker, useful for sampling reference audio in speaker-conditioned tasks.
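As an illustration of reference sampling, the sketch below picks a second utterance from the same speaker while excluding the target utterance. The function name `sample_reference` and the toy index are hypothetical; the real mapping would come from loading `spk_index.json` with `json.load`.

```python
import random


def sample_reference(spk_index, speaker_id, target_idx, rng=random):
    """Pick a random utterance index from the same speaker, excluding the target."""
    candidates = [i for i in spk_index[speaker_id] if i != target_idx]
    if not candidates:
        return None  # this speaker has no other utterance to use as reference
    return rng.choice(candidates)


# Toy index in the documented shape: { "speaker_id": [idx, idx, ...] }.
toy_index = {"1234": [0, 17, 42], "5678": [3]}

rng = random.Random(0)  # seeded generator for reproducible sampling
ref = sample_reference(toy_index, "1234", target_idx=0, rng=rng)
print(ref)
```

Returning `None` for single-utterance speakers leaves the fallback policy (skip the sample, or reuse the target itself) to the caller.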
## Dataset details
| Property | Value |
|---|---|
| Source | [LibriSpeech ASR train.clean.100](https://www.openslr.org/12) |
| Speakers | ~251 |
| Utterances | ~28,000 |
| Total duration | ~100 hours |
| Codec | [Kyutai Mimi](https://huggingface.co/kyutai/mimi) |
| Codec sample rate | 24,000 Hz |
| Codec frame rate | 12.5 fps |
| Codebooks | 8 |
| Token dtype | int16 |
| License | CC BY 4.0 |
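The figures in the table are enough to convert a record's `n_frames` into a duration and a token count. The sketch below does that arithmetic for the 198-frame example record; the constant and function names are illustrative, and the byte count assumes the int16 storage listed above.

```python
SAMPLE_RATE_HZ = 24_000   # codec sample rate
FRAME_RATE_HZ = 12.5      # codec frames per second
N_CODEBOOKS = 8           # rows of each (8, L) code matrix
BYTES_PER_TOKEN = 2       # int16


def frames_to_seconds(n_frames: int) -> float:
    """Duration of an utterance given its number of codec frames."""
    return n_frames / FRAME_RATE_HZ


def token_count(n_frames: int) -> int:
    """Total number of discrete tokens in an (8, n_frames) code matrix."""
    return N_CODEBOOKS * n_frames


# Each codec frame covers this many waveform samples at 24 kHz.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

n_frames = 198
seconds = frames_to_seconds(n_frames)
tokens = token_count(n_frames)
size_bytes = tokens * BYTES_PER_TOKEN
print(samples_per_frame, seconds, tokens, size_bytes)
```

For the example record this works out to 1,920 samples per frame, 15.84 seconds of audio, 1,584 tokens, and 3,168 bytes of code data.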
## What you can use this for
- Multi-speaker / voice-cloning TTS research
- Speaker-conditioned codec language models
- Speaker representation learning
- Audio tokenization benchmarks
- Any task that benefits from a diverse, multi-speaker English speech corpus in discrete token form
Provider:
shangeth



