J017athan/Multimodal-Yue-Benchmark

Name: J017athan/Multimodal-Yue-Benchmark
Creator: J017athan
Published: 2026-03-29 12:25:00
License: MIT (per the dataset card metadata; the listing itself gives no description)

Source: Hugging Face. Updated: 2026-03-29. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/J017athan/Multimodal-Yue-Benchmark


Resource description:

---
license: mit
language:
- yue
tags:
- cantonese
- multimodal
- speech
- benchmark
pretty_name: Multimodal Yue Benchmark (audio + text)
size_categories:
- 10K<n<100K
dataset_info:
- config_name: gsm8k_hiugaai
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 173071396.88
    num_examples: 1249
  - name: test
    num_bytes: 9293462.0
    num_examples: 70
  download_size: 165211971
  dataset_size: 182364858.88
- config_name: gsm8k_hiumaan
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 139023756.8
    num_examples: 1249
  - name: test
    num_bytes: 7452854.0
    num_examples: 70
  download_size: 128833254
  dataset_size: 146476610.8
- config_name: gsm8k_wanlung
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 136425017.456
    num_examples: 1249
  - name: test
    num_bytes: 7371638.0
    num_examples: 70
  download_size: 124585085
  dataset_size: 143796655.456
- config_name: mmlu_hiugaai
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 514573984.064
    num_examples: 3516
  - name: test
    num_bytes: 25366480.0
    num_examples: 180
  download_size: 460645039
  dataset_size: 539940464.064
- config_name: mmlu_hiumaan
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 425052626.328
    num_examples: 3538
  - name: test
    num_bytes: 21197697.0
    num_examples: 182
  download_size: 369799127
  dataset_size: 446250323.328
- config_name: mmlu_wanlung
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 408528411.48
    num_examples: 3538
  - name: test
    num_bytes: 20642289.0
    num_examples: 182
  download_size: 350137445
  dataset_size: 429170700.48
configs:
- config_name: gsm8k_hiugaai
  data_files:
  - split: train
    path: gsm8k_hiugaai/train-*
  - split: test
    path: gsm8k_hiugaai/test-*
- config_name: gsm8k_hiumaan
  data_files:
  - split: train
    path: gsm8k_hiumaan/train-*
  - split: test
    path: gsm8k_hiumaan/test-*
- config_name: gsm8k_wanlung
  data_files:
  - split: train
    path: gsm8k_wanlung/train-*
  - split: test
    path: gsm8k_wanlung/test-*
- config_name: mmlu_hiugaai
  data_files:
  - split: train
    path: mmlu_hiugaai/train-*
  - split: test
    path: mmlu_hiugaai/test-*
  default: true
- config_name: mmlu_hiumaan
  data_files:
  - split: train
    path: mmlu_hiumaan/train-*
  - split: test
    path: mmlu_hiumaan/test-*
- config_name: mmlu_wanlung
  data_files:
  - split: train
    path: mmlu_wanlung/train-*
  - split: test
    path: mmlu_wanlung/test-*
---

# Multimodal Yue Benchmark

Cantonese **audio + text** benchmark derived from **[BillBao/Yue-Benchmark](https://huggingface.co/datasets/BillBao/Yue-Benchmark)** (Yue-GSM8K & Yue-MMLU-style tasks). We kept the same task content in Cantonese and added **TTS** for three Cantonese speakers (`hiugaai`, `hiumaan`, `wanlung`).

## Subsets & splits

| Config name | Task | Speaker | Splits |
|-------------|------|---------|--------|
| `mmlu_*` | multiple-choice (MMLU-style) | per speaker | `train`, `test` |
| `gsm8k_*` | math word problems (GSM8K-style) | per speaker | `train`, `test` |

Example:

```python
from datasets import load_dataset

ds = load_dataset("J017athan/Multimodal-Yue-Benchmark", "mmlu_hiugaai", split="train")
```

## Format

Each row has **`conversations`** (ShareGPT-style list of `{"from": "human"|"gpt", "value": "..."}`) and **`audio`** (embedded in Parquet when using the default uploader). Compatible with LLaMA-Factory.

## Upstream

Original **[Yue-Benchmark](https://huggingface.co/datasets/BillBao/Yue-Benchmark)**.
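
## Usage sketch

The config naming (`<task>_<speaker>`) and the ShareGPT-style `conversations` field described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original card: the helper names `config_names`, `split_turns`, and `load_config` are hypothetical, and the split sizes are copied by hand from the card's `dataset_info` metadata. Only `load_config` needs the `datasets` package and network access, so it is imported lazily.

```python
from itertools import product

REPO_ID = "J017athan/Multimodal-Yue-Benchmark"
TASKS = ("gsm8k", "mmlu")
SPEAKERS = ("hiugaai", "hiumaan", "wanlung")

# (train, test) example counts copied from the card's dataset_info metadata.
SPLIT_SIZES = {
    "gsm8k_hiugaai": (1249, 70),
    "gsm8k_hiumaan": (1249, 70),
    "gsm8k_wanlung": (1249, 70),
    "mmlu_hiugaai": (3516, 180),
    "mmlu_hiumaan": (3538, 182),
    "mmlu_wanlung": (3538, 182),
}


def config_names():
    """Return every `<task>_<speaker>` config name listed on the card."""
    return [f"{task}_{speaker}" for task, speaker in product(TASKS, SPEAKERS)]


def split_turns(conversations):
    """Split a ShareGPT-style turn list into (human prompts, gpt replies)."""
    prompts = [turn["value"] for turn in conversations if turn["from"] == "human"]
    replies = [turn["value"] for turn in conversations if turn["from"] == "gpt"]
    return prompts, replies


def load_config(name, split="test"):
    """Load one config (hypothetical helper; requires `datasets` and network)."""
    from datasets import load_dataset  # lazy import keeps the helpers above offline-friendly

    return load_dataset(REPO_ID, name, split=split)


if __name__ == "__main__":
    total = sum(train + test for train, test in SPLIT_SIZES.values())
    print(config_names())
    print("total examples across all configs:", total)
```

Summing the listed split sizes gives 15,093 examples across the six configs, which is consistent with the declared `10K<n<100K` size category. Assuming rows follow the documented schema, something like `prompts, replies = split_turns(load_config("gsm8k_wanlung")[0]["conversations"])` would separate the question text from the reference answer.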

Provider:

J017athan
