Multilingual-Alpaca-Speech
Source: ModelScope community. Updated 2025-07-15; indexed 2025-07-19.
Download link:
https://modelscope.cn/datasets/hongfeixue/Multilingual-Alpaca-Speech
Resource description:
## Multilingual Alpaca Speech Dataset
Multilingual Alpaca Speech is a high-quality speech instruction-following dataset covering Japanese (ja), German (de), and French (fr). It is built in three steps: Alpaca text data is filtered, translated into the target languages, and converted to speech with fish-speech v1.5. ASR validation is applied to the synthesized audio to ensure speech quality.
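The ASR validation step can be pictured as a round-trip check: transcribe the synthesized clip and keep it only if the transcript stays close to the text that was sent to the TTS system. The sketch below is a minimal illustration of that idea, not the authors' implementation. The character error rate metric, the function names, and the `max_cer` threshold are all assumptions made for this example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref, ignoring whitespace."""
    ref = "".join(ref.split())
    hyp = "".join(hyp.split())
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def passes_asr_check(target_text: str, asr_transcript: str,
                     max_cer: float = 0.1) -> bool:
    """Keep a synthesized clip only if its ASR transcript matches the target text.

    max_cer is a hypothetical threshold, not a value taken from the dataset.
    """
    return char_error_rate(target_text, asr_transcript) <= max_cer
```

A character-level metric is used here because Japanese text has no whitespace word boundaries; a word error rate may be a better fit for German and French.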
## File Structure
multilingual-alpaca-speech/ <br>
├── de.list # Metadata list for German dataset <br>
├── de.tar.gz # German speech-text data (2.3G) <br>
├── fr.list # Metadata list for French dataset <br>
├── fr.tar.gz # French speech-text data (2.6G) <br>
├── ja-30k.list # Metadata list for Japanese (30k samples) <br>
├── ja-30k.tar.gz # Japanese speech-text data (30k samples, 4.0G) <br>
├── ja.list # Metadata list for Japanese (base) <br>
├── ja.tar.gz # Japanese speech-text data (base, 1.5G) <br>
└── test.tar.gz # Test set for all languages (180M) <br>
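
To work with one language, the archive has to be unpacked and its metadata list read. The following is a rough sketch using only the Python standard library. It assumes that each `.list` file stores one JSON object per line, shaped like the data sample in this card; that layout is an assumption and is not documented here. The helper names `extract_archive` and `read_list` are hypothetical.

```python
import json
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, out_dir: str) -> Path:
    """Unpack a .tar.gz archive (e.g. de.tar.gz) into out_dir and return that directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out)
    return out


def read_list(list_path: str) -> list[dict]:
    """Read a metadata list, assuming one JSON object per non-empty line."""
    samples = []
    with open(list_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples
```

For example, `extract_archive("de.tar.gz", "data")` followed by `read_list("de.list")` would give the German audio files and their metadata, provided the assumed list format holds.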
## Data Sample Example
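Each sample pairs a speech file with text in a cross-lingual chain-of-thought (CoT) format: a transcription of the spoken instruction, its English translation, the English answer, and a back-translation of the answer into the source language. Below is a Japanese sample: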
```json
{
  "id": "alpaca-49791",
  "path": "Multilingual-Alpaca-Speech/ja/50000/alpaca-49791-s12-16k.wav",
  "text": "Transcribe: アカデミー主演男優賞を受賞した有名俳優の名前を教えてください \n Translation: Name a famous actor who has won an Oscar for Best Actor. \n Answer: Jamie Foxx has won an Oscar for Best Actor for his role in the 2004 drama film \"Ray\". \n Back-Translation: ジェイミー・フォックスが2004年のドラマ映画『Ray』でアカデミー賞主演男優賞を受賞した。"
}
```
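The `text` field packs the four CoT stages into one string, separated by newlines and introduced by the labels `Transcribe:`, `Translation:`, `Answer:`, and `Back-Translation:`. A minimal sketch for splitting it back into its parts is shown below. It assumes every sample uses the same labels as the example above, and the function name `split_cot_text` is hypothetical.

```python
import re

# A label counts only at the start of the string or right after a newline,
# so a label-like word inside an answer is not treated as a separator.
_LABEL_RE = re.compile(
    r"(?:^|\n)[ \t]*(Transcribe|Translation|Answer|Back-Translation):[ \t]*"
)


def split_cot_text(text: str) -> dict[str, str]:
    """Split a sample's "text" field into its labelled CoT segments."""
    matches = list(_LABEL_RE.finditer(text))
    fields = {}
    for idx, match in enumerate(matches):
        # A segment runs from the end of its label to the start of the next label.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        fields[match.group(1)] = text[match.end():end].strip()
    return fields
```

Applied to the sample above (after `json.loads`), `split_cot_text(sample["text"])["Answer"]` would return the English answer sentence about Jamie Foxx.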
## Evaluation
This dataset includes a test set (test.tar.gz) with:
- Translated OpenHermes and Alpaca test sets
- Speech-text pairs for Japanese, German, and French
### Example Code for Loading the Test Set
```python
from datasets import load_from_disk
import numpy as np

# Load the Japanese test subset (extract test.tar.gz first)
data = load_from_disk("test/openhermes_instruction_test_ja")['test']

# Iterate over samples
for item in data:
    # Audio waveform as a NumPy array
    print(np.array(item['context']['array']).shape)
    # Speech instruction
    print(item['speech_instruction'])
    # Model response
    print(item['answer'])
```
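When batching or filtering test samples it is often useful to know how long each clip is. The sketch below derives the duration from the audio entry of a sample. It assumes that `item['context']` is a decoded audio dictionary that may carry a `sampling_rate` key next to `array`, and it falls back to 16 kHz, a guess based on the `-16k.wav` suffix in the sample path. Neither assumption is confirmed by this card, and `audio_duration_seconds` is a hypothetical helper.

```python
def audio_duration_seconds(audio: dict, default_rate: int = 16000) -> float:
    """Return the clip length in seconds for a mono waveform.

    audio["array"] is the 1-D waveform (list or NumPy array).
    default_rate is an assumed fallback used when no sampling rate is stored.
    """
    rate = audio.get("sampling_rate") or default_rate
    return len(audio["array"]) / float(rate)
```

Inside the loop above, this could be called as `audio_duration_seconds(item['context'])`.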
## Citation
If you use this dataset, please cite our paper:
> @inproceedings{xue2025enhancing, <br>
> title={Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning}, <br>
> author={Xue, Hongfei and Tang, Yufeng and Liu, Hexin and Zhang, Jun and Geng, Xuelong and Xie, Lei}, <br>
> booktitle={Proc. ACM MM 2025}, <br>
> year={2025} <br>
> }
For questions or issues, contact hfxue@mial.nwpu.edu.cn.
Provider: maas
Created: 2025-07-14



