mwei/Belle_1.4M-SLAM-Omni

Name: mwei/Belle_1.4M-SLAM-Omni
Creator: mwei
Published: 2026-02-03 11:00:56
License: not stated in the listing (the dataset card declares gpl-3.0)

Source: Hugging Face. Updated 2026-02-03. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/mwei/Belle_1.4M-SLAM-Omni

Resource description:

---
license: gpl-3.0
dataset_info:
  features:
  - name: split_name
    dtype: string
  - name: index
    dtype: int64
  - name: round
    dtype: int64
  - name: question
    dtype: string
  - name: question_audio
    struct:
    - name: array
      sequence: float32
    - name: path
      dtype: string
    - name: sampling_rate
      dtype: int64
  - name: answer
    dtype: string
  - name: answer_cosyvoice_speech_token
    sequence: int64
  - name: answer_snac
    dtype: string
  splits:
  - name: train
    num_bytes: 800059817200
    num_examples: 1400398
  download_size: 792877562556
  dataset_size: 800059817200
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
language:
- zh
size_categories:
- 1M<n<10M
---

# Belle_1.4M

*This dataset is prepared for the reproduction of [SLAM-Omni](https://arxiv.org/abs/2412.15649).*

This is a **multi-round Chinese spoken dialogue** training dataset. For code and usage examples, please refer to the related GitHub repository: [X-LANCE/SLAM-LLM (examples/s2s)](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/s2s)

## 🔧 Modifications

1. **Data Filtering**: We removed samples with excessively long data.
2. **Speech Response Tokens**: We used [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) to synthesize corresponding semantic speech tokens for the speech response. These tokens, represented as `answer_cosyvoice_speech_token`, are included as model training targets.
3. **User Instruction Speech**: Synthesized speech for user instructions using CosyVoice, with timbres randomly selected from 1,010 Chinese prompts in the [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval) subset to ensure diversity.

## 🙏 Acknowledgment

The original dataset was sourced from [Belle_train_3.5M_CN](https://huggingface.co/datasets/BelleGroup/train_3.5M_CN). We thank the Belle Group for their open-source contribution.

## 📄 Citation

If you find our work helpful, please consider citing:

```bibtex
@article{chen2024slam,
  title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
  author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
  journal={arXiv preprint arXiv:2412.15649},
  year={2024}
}
```
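## Example: checking a row against the declared schema

The card lists eight columns but does not show how to read them, and the train split is roughly 800 GB, so a full download is rarely the first step. The sketch below is not taken from the SLAM-LLM repository. It transcribes the column names from `dataset_info.features` and checks a row against them. It also shows one possible way to stream rows with the Hugging Face `datasets` library. The helper names (`schema_problems`, `stream_rows`), the `SYNTHETIC_ROW` values, and the assumption that streamed rows decode to plain Python `dict`/`list`/`str`/`int` values are all illustrative and should be verified against real data.

```python
from __future__ import annotations

from typing import Any, Iterator

# Column names and the Python types we ASSUME they decode to, transcribed
# from the `dataset_info.features` block of the dataset card.
EXPECTED_COLUMNS: dict[str, type] = {
    "split_name": str,
    "index": int,
    "round": int,
    "question": str,
    "question_audio": dict,  # struct with array, path, sampling_rate
    "answer": str,
    "answer_cosyvoice_speech_token": list,  # sequence of int64
    "answer_snac": str,
}
AUDIO_FIELDS = ("array", "path", "sampling_rate")

# A made-up row for illustration only; none of these values come from the dataset.
SYNTHETIC_ROW: dict[str, Any] = {
    "split_name": "train",
    "index": 0,
    "round": 0,
    "question": "placeholder question",
    "question_audio": {"array": [0.0, 0.1], "path": "placeholder.wav", "sampling_rate": 16000},
    "answer": "placeholder answer",
    "answer_cosyvoice_speech_token": [1, 2, 3],
    "answer_snac": "placeholder",
}


def schema_problems(row: dict[str, Any]) -> list[str]:
    """Return human-readable mismatches between `row` and the card's schema."""
    problems: list[str] = []
    for name, expected in EXPECTED_COLUMNS.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(row[name]).__name__}"
            )
    audio = row.get("question_audio")
    if isinstance(audio, dict):
        for field in AUDIO_FIELDS:
            if field not in audio:
                problems.append(f"missing question_audio.{field}")
    return problems


def stream_rows(repo_id: str = "mwei/Belle_1.4M-SLAM-Omni") -> Iterator[dict[str, Any]]:
    """Lazily iterate over the train split instead of downloading it in full.

    Assumes the `datasets` package is installed and the repository is reachable.
    """
    from datasets import load_dataset  # lazy import: optional dependency

    yield from load_dataset(repo_id, split="train", streaming=True)
```

Calling `schema_problems(SYNTHETIC_ROW)` returns an empty list. To inspect real data, one could take the first row with `next(stream_rows())` and pass it to the same function. A non-empty result would suggest that the decoded types differ from the assumptions above.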

Provider: mwei
