mwei/Belle_1.4M-SLAM-Omni
Hugging Face · Updated 2026-02-03 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mwei/Belle_1.4M-SLAM-Omni
Description:
---
license: gpl-3.0
dataset_info:
  features:
  - name: split_name
    dtype: string
  - name: index
    dtype: int64
  - name: round
    dtype: int64
  - name: question
    dtype: string
  - name: question_audio
    struct:
    - name: array
      sequence: float32
    - name: path
      dtype: string
    - name: sampling_rate
      dtype: int64
  - name: answer
    dtype: string
  - name: answer_cosyvoice_speech_token
    sequence: int64
  - name: answer_snac
    dtype: string
  splits:
  - name: train
    num_bytes: 800059817200
    num_examples: 1400398
  download_size: 792877562556
  dataset_size: 800059817200
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
language:
- zh
size_categories:
- 1M<n<10M
---
# Belle_1.4M
*This dataset is prepared for the reproduction of [SLAM-Omni](https://arxiv.org/abs/2412.15649).*
This is a **multi-round Chinese spoken dialogue** training dataset. For code and usage examples, please refer to the related GitHub repository: [X-LANCE/SLAM-LLM (examples/s2s)](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/s2s)
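As a quick way to look at the data without fetching everything, the following is a minimal sketch that streams a few training examples with the Hugging Face `datasets` library. It assumes the dataset is reachable under the repository name shown on this card and that the fields match the metadata above. `stream_examples` and `summarize` are hypothetical helper names introduced here for illustration; they are not part of SLAM-LLM.

```python
from itertools import islice

REPO_ID = "mwei/Belle_1.4M-SLAM-Omni"  # repository name as given on this card


def stream_examples(limit=2, repo_id=REPO_ID):
    """Return an iterator over the first `limit` training examples."""
    # Imported lazily so that `summarize` works even without the library.
    from datasets import load_dataset

    # streaming=True reads examples on the fly instead of downloading all shards.
    dataset = load_dataset(repo_id, split="train", streaming=True)
    return islice(dataset, limit)


def summarize(example):
    """Return a compact summary of one example (hypothetical helper)."""
    audio = example["question_audio"]
    return {
        "split_name": example["split_name"],
        "index": example["index"],
        "round": example["round"],
        "question": example["question"],
        "answer": example["answer"],
        "audio_samples": len(audio["array"]),
        "sampling_rate": audio["sampling_rate"],
        "speech_tokens": len(example["answer_cosyvoice_speech_token"]),
    }


# Usage (needs network access and the `datasets` package):
# for example in stream_examples(limit=2):
#     print(summarize(example))
```

Streaming is used here because the metadata lists a download size of 792877562556 bytes (roughly 793 GB), which makes a full download impractical for a first look.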
## 🔧 Modifications
1. **Data Filtering**: We removed excessively long samples.
2. **Speech Response Tokens**: We used [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) to synthesize the semantic speech tokens for each speech response. These tokens are stored in the `answer_cosyvoice_speech_token` field and serve as model training targets.
3. **User Instruction Speech**: We synthesized speech for the user instructions with CosyVoice, randomly selecting timbres from the 1,010 Chinese prompts in the [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval) subset to ensure diversity.
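The card does not state which length criteria were used for filtering. Purely as an illustration of how a length check over these fields could look, the sketch below computes the duration of the question audio and the number of response speech tokens. The function names and both thresholds are hypothetical placeholders, not the authors' values.

```python
def question_duration_seconds(example):
    """Duration of the synthesized question audio in seconds."""
    audio = example["question_audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def is_within_limits(example, max_question_seconds=30.0, max_answer_tokens=1500):
    """Hypothetical length check; the default thresholds are placeholders."""
    return (
        question_duration_seconds(example) <= max_question_seconds
        and len(example["answer_cosyvoice_speech_token"]) <= max_answer_tokens
    )
```

A predicate of this shape can be passed to the `filter` method of a `datasets` dataset if further length-based pruning is wanted for a particular training setup.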
## 🙏 Acknowledgment
The original dataset was sourced from [Belle_train_3.5M_CN](https://huggingface.co/datasets/BelleGroup/train_3.5M_CN). We thank the Belle Group for their open-source contribution.
## 📄 Citation
If you find our work helpful, please consider citing:
```bibtex
@article{chen2024slam,
title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
journal={arXiv preprint arXiv:2412.15649},
year={2024}
}
```
Provider:
mwei



