mwei/Belle_1.4M-SLAM-Omni

Hugging Face · Updated 2026-02-03 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mwei/Belle_1.4M-SLAM-Omni
Resource description:
---
license: gpl-3.0
dataset_info:
  features:
  - name: split_name
    dtype: string
  - name: index
    dtype: int64
  - name: round
    dtype: int64
  - name: question
    dtype: string
  - name: question_audio
    struct:
    - name: array
      sequence: float32
    - name: path
      dtype: string
    - name: sampling_rate
      dtype: int64
  - name: answer
    dtype: string
  - name: answer_cosyvoice_speech_token
    sequence: int64
  - name: answer_snac
    dtype: string
  splits:
  - name: train
    num_bytes: 800059817200
    num_examples: 1400398
  download_size: 792877562556
  dataset_size: 800059817200
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
language:
- zh
size_categories:
- 1M<n<10M
---

# Belle_1.4M

*This dataset is prepared for the reproduction of [SLAM-Omni](https://arxiv.org/abs/2412.15649).*

This is a **multi-round Chinese spoken dialogue** training dataset. For code and usage examples, please refer to the related GitHub repository: [X-LANCE/SLAM-LLM (examples/s2s)](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/s2s)

## 🔧 Modifications

1. **Data Filtering**: We removed excessively long samples.
2. **Speech Response Tokens**: We used [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) to synthesize semantic speech tokens for each speech response. These tokens, stored in `answer_cosyvoice_speech_token`, are included as model training targets.
3. **User Instruction Speech**: We synthesized speech for the user instructions with CosyVoice, choosing timbres at random from the 1,010 Chinese prompts in the [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval) subset to ensure diversity.

## 🙏 Acknowledgment

The original dataset was sourced from [Belle_train_3.5M_CN](https://huggingface.co/datasets/BelleGroup/train_3.5M_CN). We thank the Belle Group for their open-source contribution.

## 📄 Citation

If you find our work helpful, please consider citing:

```bibtex
@article{chen2024slam,
  title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
  author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
  journal={arXiv preprint arXiv:2412.15649},
  year={2024}
}
```
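
## Loading sketch (illustrative, not part of the original card)

The metadata lists about 800 GB for the `train` split (roughly 571 KB per example on average), so a full download is rarely what you want for a first look. The following is a minimal sketch, not the authors' pipeline: it streams the split with the Hugging Face `datasets` library and groups records into dialogues. The helper names (`stream_train_split`, `group_rounds`, `mean_bytes_per_example`) are hypothetical, and the grouping assumes that `index` identifies a dialogue and `round` orders the turns within it, which the card does not state explicitly.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

REPO_ID = "mwei/Belle_1.4M-SLAM-Omni"  # dataset id as given on the card


def stream_train_split():
    """Return a lazily streamed view of the train split (needs network access)."""
    # Imported here so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over remote shards instead of downloading them all.
    return load_dataset(REPO_ID, split="train", streaming=True)


def group_rounds(records: Iterable[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Group records into dialogues.

    ASSUMPTION: `index` identifies a dialogue and `round` orders its turns.
    """
    dialogues: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for rec in records:
        dialogues[rec["index"]].append(rec)
    for turns in dialogues.values():
        turns.sort(key=lambda r: r["round"])
    return dict(dialogues)


def mean_bytes_per_example(
    num_bytes: int = 800_059_817_200,  # num_bytes from the card metadata
    num_examples: int = 1_400_398,     # num_examples from the card metadata
) -> float:
    """Average stored size of one example, from the card's own numbers."""
    return num_bytes / num_examples
```

To try it on a handful of records, something like `group_rounds(stream_train_split().take(8))` should work. Each record carries the question text, the `question_audio` struct (`array`, `path`, `sampling_rate`), the answer text, and the `answer_cosyvoice_speech_token` sequence described above. Grouping an unbounded stream keeps every record in memory, so limit the input first.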

Provider:
mwei