
J017athan/Multimodal-Yue-Benchmark

Hugging Face | Updated 2026-03-29 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/J017athan/Multimodal-Yue-Benchmark
Description:
---
license: mit
language:
- yue
tags:
- cantonese
- multimodal
- speech
- benchmark
pretty_name: Multimodal Yue Benchmark (audio + text)
size_categories:
- 10K<n<100K
dataset_info:
- config_name: gsm8k_hiugaai
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 173071396.88
    num_examples: 1249
  - name: test
    num_bytes: 9293462.0
    num_examples: 70
  download_size: 165211971
  dataset_size: 182364858.88
- config_name: gsm8k_hiumaan
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 139023756.8
    num_examples: 1249
  - name: test
    num_bytes: 7452854.0
    num_examples: 70
  download_size: 128833254
  dataset_size: 146476610.8
- config_name: gsm8k_wanlung
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 136425017.456
    num_examples: 1249
  - name: test
    num_bytes: 7371638.0
    num_examples: 70
  download_size: 124585085
  dataset_size: 143796655.456
- config_name: mmlu_hiugaai
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 514573984.064
    num_examples: 3516
  - name: test
    num_bytes: 25366480.0
    num_examples: 180
  download_size: 460645039
  dataset_size: 539940464.064
- config_name: mmlu_hiumaan
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 425052626.328
    num_examples: 3538
  - name: test
    num_bytes: 21197697.0
    num_examples: 182
  download_size: 369799127
  dataset_size: 446250323.328
- config_name: mmlu_wanlung
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: audio
    dtype: audio
  splits:
  - name: train
    num_bytes: 408528411.48
    num_examples: 3538
  - name: test
    num_bytes: 20642289.0
    num_examples: 182
  download_size: 350137445
  dataset_size: 429170700.48
configs:
- config_name: gsm8k_hiugaai
  data_files:
  - split: train
    path: gsm8k_hiugaai/train-*
  - split: test
    path: gsm8k_hiugaai/test-*
- config_name: gsm8k_hiumaan
  data_files:
  - split: train
    path: gsm8k_hiumaan/train-*
  - split: test
    path: gsm8k_hiumaan/test-*
- config_name: gsm8k_wanlung
  data_files:
  - split: train
    path: gsm8k_wanlung/train-*
  - split: test
    path: gsm8k_wanlung/test-*
- config_name: mmlu_hiugaai
  data_files:
  - split: train
    path: mmlu_hiugaai/train-*
  - split: test
    path: mmlu_hiugaai/test-*
  default: true
- config_name: mmlu_hiumaan
  data_files:
  - split: train
    path: mmlu_hiumaan/train-*
  - split: test
    path: mmlu_hiumaan/test-*
- config_name: mmlu_wanlung
  data_files:
  - split: train
    path: mmlu_wanlung/train-*
  - split: test
    path: mmlu_wanlung/test-*
---

# Multimodal Yue Benchmark

Cantonese **audio + text** benchmark derived from **[BillBao/Yue-Benchmark](https://huggingface.co/datasets/BillBao/Yue-Benchmark)** (Yue-GSM8K & Yue-MMLU-style tasks). We kept the same task content in Cantonese and added **TTS** for three Cantonese speakers (`hiugaai`, `hiumaan`, `wanlung`).

## Subsets & splits

| Config name | Task | Speaker | Splits |
|-------------|------|---------|--------|
| `mmlu_*` | multiple-choice (MMLU-style) | per speaker | `train`, `test` |
| `gsm8k_*` | math word problems (GSM8K-style) | per speaker | `train`, `test` |

Example:

```python
from datasets import load_dataset

ds = load_dataset("J017athan/Multimodal-Yue-Benchmark", "mmlu_hiugaai", split="train")
```

## Format

Each row has **`conversations`** (ShareGPT-style list of `{"from": "human"|"gpt", "value": "..."}`) and **`audio`** (embedded in Parquet when using the default uploader). Compatible with LLaMA-Factory.

## Upstream

Original **[Yue-Benchmark](https://huggingface.co/datasets/BillBao/Yue-Benchmark)**.
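
As an illustration of the row format and config naming described in this card, the following is a minimal sketch and not part of the dataset's own tooling. The helper names `config_name`, `split_turns`, and `load_config` are hypothetical; only the task prefixes, speaker names, repository ID, and the `from`/`value` keys come from the card itself.

```python
from typing import Dict, List, Tuple

# Task prefixes and speaker names as listed in the dataset card.
TASKS = ("mmlu", "gsm8k")
SPEAKERS = ("hiugaai", "hiumaan", "wanlung")


def config_name(task: str, speaker: str) -> str:
    """Build a config name such as "mmlu_hiugaai" (hypothetical helper)."""
    if task not in TASKS or speaker not in SPEAKERS:
        raise ValueError(f"unknown task or speaker: {task!r}, {speaker!r}")
    return f"{task}_{speaker}"


def split_turns(conversations: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Separate ShareGPT-style turns into (human prompts, gpt replies)."""
    human = [turn["value"] for turn in conversations if turn["from"] == "human"]
    gpt = [turn["value"] for turn in conversations if turn["from"] == "gpt"]
    return human, gpt


def load_config(task: str, speaker: str, split: str = "test"):
    """Load one config; requires the `datasets` package and network access."""
    # Imported lazily so the two helpers above can be used offline.
    from datasets import load_dataset

    return load_dataset(
        "J017athan/Multimodal-Yue-Benchmark",
        config_name(task, speaker),
        split=split,
    )
```

A typical use would be `row = load_config("gsm8k", "wanlung")[0]` followed by `prompts, replies = split_turns(row["conversations"])`. How `row["audio"]` is decoded depends on the installed `datasets` version and its audio backend, so that part is left out here.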
Provider:
J017athan