J017athan/Multimodal-Yue-Benchmark
Source: Hugging Face · Updated 2026-03-29 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/J017athan/Multimodal-Yue-Benchmark
Description:
---
license: mit
language:
- yue
tags:
- cantonese
- multimodal
- speech
- benchmark
pretty_name: Multimodal Yue Benchmark (audio + text)
size_categories:
- 10K<n<100K
dataset_info:
- config_name: gsm8k_hiugaai
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 173071396.88
    num_examples: 1249
  - name: test
    num_bytes: 9293462.0
    num_examples: 70
  download_size: 165211971
  dataset_size: 182364858.88
- config_name: gsm8k_hiumaan
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 139023756.8
    num_examples: 1249
  - name: test
    num_bytes: 7452854.0
    num_examples: 70
  download_size: 128833254
  dataset_size: 146476610.8
- config_name: gsm8k_wanlung
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 136425017.456
    num_examples: 1249
  - name: test
    num_bytes: 7371638.0
    num_examples: 70
  download_size: 124585085
  dataset_size: 143796655.456
- config_name: mmlu_hiugaai
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 514573984.064
    num_examples: 3516
  - name: test
    num_bytes: 25366480.0
    num_examples: 180
  download_size: 460645039
  dataset_size: 539940464.064
- config_name: mmlu_hiumaan
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 425052626.328
    num_examples: 3538
  - name: test
    num_bytes: 21197697.0
    num_examples: 182
  download_size: 369799127
  dataset_size: 446250323.328
- config_name: mmlu_wanlung
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 408528411.48
    num_examples: 3538
  - name: test
    num_bytes: 20642289.0
    num_examples: 182
  download_size: 350137445
  dataset_size: 429170700.48
configs:
- config_name: gsm8k_hiugaai
  data_files:
  - split: train
    path: gsm8k_hiugaai/train-*
  - split: test
    path: gsm8k_hiugaai/test-*
- config_name: gsm8k_hiumaan
  data_files:
  - split: train
    path: gsm8k_hiumaan/train-*
  - split: test
    path: gsm8k_hiumaan/test-*
- config_name: gsm8k_wanlung
  data_files:
  - split: train
    path: gsm8k_wanlung/train-*
  - split: test
    path: gsm8k_wanlung/test-*
- config_name: mmlu_hiugaai
  data_files:
  - split: train
    path: mmlu_hiugaai/train-*
  - split: test
    path: mmlu_hiugaai/test-*
  default: true
- config_name: mmlu_hiumaan
  data_files:
  - split: train
    path: mmlu_hiumaan/train-*
  - split: test
    path: mmlu_hiumaan/test-*
- config_name: mmlu_wanlung
  data_files:
  - split: train
    path: mmlu_wanlung/train-*
  - split: test
    path: mmlu_wanlung/test-*
---
# Multimodal Yue Benchmark
A Cantonese **audio + text** benchmark derived from **[BillBao/Yue-Benchmark](https://huggingface.co/datasets/BillBao/Yue-Benchmark)** (Yue-GSM8K and Yue-MMLU-style tasks). We kept the same Cantonese task content and added **TTS** audio in three Cantonese voices (`hiugaai`, `hiumaan`, `wanlung`).
## Subsets & splits
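| Config name | Task | Speaker | Splits |
|-------------|------|---------|--------|
| `mmlu_*` | multiple-choice (MMLU-style) | one config per speaker: `hiugaai`, `hiumaan`, `wanlung` | `train`, `test` |
| `gsm8k_*` | math word problems (GSM8K-style) | one config per speaker: `hiugaai`, `hiumaan`, `wanlung` | `train`, `test` |

Example (loads `mmlu_hiugaai`, which is also the default config):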
```python
from datasets import load_dataset
ds = load_dataset("J017athan/Multimodal-Yue-Benchmark", "mmlu_hiugaai", split="train")
```
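
Because the config names follow a `<task>_<speaker>` pattern, it can be convenient to generate them rather than type each one. The following is a minimal sketch, not part of the dataset itself: the helper names `config_names` and `load_all_test_splits` are hypothetical, and the second helper assumes the `datasets` library is installed and that network access is available.

```python
from itertools import product

# Task prefixes and speaker suffixes as they appear in the config names.
TASKS = ("gsm8k", "mmlu")
SPEAKERS = ("hiugaai", "hiumaan", "wanlung")


def config_names(tasks=TASKS, speakers=SPEAKERS):
    """Return every `<task>_<speaker>` config name (hypothetical helper)."""
    return [f"{task}_{speaker}" for task, speaker in product(tasks, speakers)]


def load_all_test_splits(repo_id="J017athan/Multimodal-Yue-Benchmark"):
    """Load the `test` split of every config (hypothetical helper).

    This downloads data, so it needs network access.
    """
    # Imported lazily so that config_names() works without `datasets` installed.
    from datasets import load_dataset

    return {
        name: load_dataset(repo_id, name, split="test")
        for name in config_names()
    }
```

Calling `config_names()` yields the six config names; `load_all_test_splits()` returns a dict keyed by those names.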
## Format
Each row has two fields: **`conversations`**, a ShareGPT-style list of turns of the form `{"from": "human"|"gpt", "value": "..."}`, and **`audio`**, which is embedded in the Parquet files when the data is uploaded with the default uploader. The layout is compatible with LLaMA-Factory.
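
As an illustration of the `conversations` layout, the sketch below separates the human turns from the gpt turns. The helper `split_turns` and the sample row are hypothetical: the placeholder strings stand in for real Cantonese text, and the `audio` field is left as `None` here because how it decodes depends on the installed `datasets` version.

```python
def split_turns(conversations):
    """Group ShareGPT-style turns into (human prompts, gpt replies)."""
    prompts = [turn["value"] for turn in conversations if turn["from"] == "human"]
    replies = [turn["value"] for turn in conversations if turn["from"] == "gpt"]
    return prompts, replies


# Hypothetical row for illustration only; real rows come from load_dataset(...).
example_row = {
    "conversations": [
        {"from": "human", "value": "<question text>"},
        {"from": "gpt", "value": "<answer text>"},
    ],
    "audio": None,  # placeholder; real rows carry the TTS audio here
}

prompts, replies = split_turns(example_row["conversations"])
```

With a real row, `prompts` holds the question text and `replies` holds the reference answer that accompanies the audio.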
## Upstream
The task text comes from the original **[Yue-Benchmark](https://huggingface.co/datasets/BillBao/Yue-Benchmark)**.
Provider:
J017athan



