TaigiSpeech/TaigiSpeech

Name: TaigiSpeech/TaigiSpeech
Creator: TaigiSpeech
Published: 2026-03-24 04:07:32
License: CC BY 4.0

Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/TaigiSpeech/TaigiSpeech


Description:

---
language:
- nan
license: cc-by-4.0
task_categories:
- audio-classification
tags:
- spoken-language-understanding
- intent-classification
- taiwanese
- taigi
pretty_name: TaigiSpeech
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: speaker_id
    dtype: string
  - name: intent
    dtype:
      class_label:
        names:
          '0': BREATHING_CHEST_EMERG
          '1': CALL_CONTACT
          '2': CANCEL_ALERT
          '3': FALL_HELP
          '4': LIGHT_OFF
          '5': LIGHT_ON
          '6': PAIN_GENERAL
          '7': SOS_CALL
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/**
  - split: val
    path: data/val/**
  - split: test
    path: data/test/**
---

# TaigiSpeech

A spoken language understanding (SLU) dataset for Taiwanese (台語/Taigi) intent classification, designed for elder-care and smart-home voice command scenarios.

**Paper**: [TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild](https://arxiv.org/abs/2603.21478)

## Dataset Description

TaigiSpeech contains 3,000+ Taiwanese speech utterances from 21 speakers, each labeled with one of 8 intent classes. The dataset is designed to support research in spoken language understanding for Taiwanese, a low-resource language.

### Supported Tasks

- **Intent Classification**: Classify spoken Taiwanese commands into 8 intent categories.

### Languages

- Taiwanese Taigi (Taiwanese Hokkien / Southern Min)

## Dataset Structure

### Splits

| Split | Samples | Speakers | Notes |
|-------|---------|----------|-------|
| Train | 1,600 | 10 | 200 per intent (balanced) |
| Val | 519 | 5 | ~64–67 per intent |
| Test | 960 | 6 | 120 per intent (balanced) |

Speakers are **disjoint** across splits (no speaker overlap).

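
The split properties described above (no shared speakers, per-intent counts) can be checked with a few lines of plain Python. This is a minimal sketch, not part of the dataset's own tooling: the helper names `speaker_overlap` and `intent_counts` are hypothetical, and the toy rows below only stand in for the real splits.

```python
from collections import Counter
from itertools import combinations


def speaker_overlap(splits):
    """Map each pair of split names to the speaker ids they share (empty dict if disjoint)."""
    speakers = {
        name: {row["speaker_id"] for row in rows} for name, rows in splits.items()
    }
    overlaps = {}
    for a, b in combinations(sorted(speakers), 2):
        shared = speakers[a] & speakers[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


def intent_counts(rows):
    """Count how many utterances carry each intent label."""
    return Counter(row["intent"] for row in rows)


# Toy rows standing in for the real splits (hypothetical speaker ids and sizes).
toy_splits = {
    "train": [
        {"speaker_id": "p001", "intent": "LIGHT_ON"},
        {"speaker_id": "p002", "intent": "LIGHT_OFF"},
    ],
    "val": [{"speaker_id": "p003", "intent": "LIGHT_ON"}],
    "test": [{"speaker_id": "p004", "intent": "SOS_CALL"}],
}

print(speaker_overlap(toy_splits))  # an empty dict means the speakers are disjoint
print(intent_counts(toy_splits["train"]))
```

The same helpers should work on the loaded dataset if each split is passed as an iterable of rows, ideally after keeping only the `speaker_id` and `intent` columns (for example with `Dataset.select_columns`) so that no audio is decoded. In that case `intent` would be an integer class id rather than a string.
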
### Intent Classes

| Intent | Description |
|--------|-------------|
| `SOS_CALL` | Emergency help request |
| `FALL_HELP` | Fall-related assistance |
| `BREATHING_CHEST_EMERG` | Breathing or chest emergency |
| `PAIN_GENERAL` | General pain report |
| `CALL_CONTACT` | Call a contact person |
| `LIGHT_ON` | Turn on lights |
| `LIGHT_OFF` | Turn off lights |
| `CANCEL_ALERT` | Cancel an alert |

### Data Fields

Each sample in `metadata.jsonl` contains:

- `file_name` (str): Relative path to the audio file (resolved as `audio` column by HF).
- `speaker_id` (str): Anonymized speaker identifier (e.g., `p001`).
- `intent` (str): One of 8 intent labels.

### Audio Specifications

- **Format**: WAV
- **Sample Rate**: 48 kHz
- **Channels**: Mono

### Directory Layout

```
TaigiSpeech/
├── README.md
├── metadata/
│   ├── p001_profile.json
│   └── ...
└── data/
    ├── train/
    │   ├── metadata.jsonl
    │   └── audio/
    ├── val/
    │   ├── metadata.jsonl
    │   └── audio/
    └── test/
        ├── metadata.jsonl
        └── audio/
```

### Speaker Profiles

The `metadata/` directory contains anonymized speaker profiles with demographic information: age, gender, education level, hometown, native language(s), and language fluency ratings.

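
For working with the files directly rather than through the Hugging Face loader, the sketch below reads a split's `metadata.jsonl` and attaches the integer class id that the dataset card lists for each intent. It assumes one JSON object per line with the three fields described under Data Fields; the function name `read_metadata` and the added `intent_id` key are hypothetical.

```python
import json

# Class ids in the order given by the dataset card's `class_label` names.
INTENT_NAMES = [
    "BREATHING_CHEST_EMERG",
    "CALL_CONTACT",
    "CANCEL_ALERT",
    "FALL_HELP",
    "LIGHT_OFF",
    "LIGHT_ON",
    "PAIN_GENERAL",
    "SOS_CALL",
]
INTENT_TO_ID = {name: i for i, name in enumerate(INTENT_NAMES)}


def read_metadata(path):
    """Parse a JSON Lines metadata file and add an integer `intent_id` to each row."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            # Raises KeyError if a label is not one of the 8 documented intents.
            row["intent_id"] = INTENT_TO_ID[row["intent"]]
            rows.append(row)
    return rows
```

Given the directory layout above, a call such as `read_metadata("data/train/metadata.jsonl")` would return one dict per utterance, with `file_name` relative to the split directory.
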
## Speaker Demographics

- **Number of Speakers**: 21 (p001–p022, excluding p018)
- **Age Range**: 20–78 years (majority 54+)
- **Gender**: Mixed male and female
- **Regions**: Keelung, Taipei, New Taipei, Yilan, Yunlin, Taichung, Tainan, Chiayi
- **Recording Devices**: iPad, MacBook, external USB microphones

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("TaigiSpeech/TaigiSpeech")
```

## Citation

If you use this dataset, please cite:

```bibtex
@article{chang2026taigispeech,
  title   = {TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild},
  author  = {Chang, Kai-Wei and Lin, Yi-Cheng and Chou, Huang-Cheng and Ren, Wenze and Huang, Yu-Han and Tsai, Yun-Shao and Chen, Chien-Cheng and Tsao, Yu and Liao, Yuan-Fu and Narayanan, Shrikanth and Glass, James and Lee, Hung-yi},
  journal = {arXiv preprint arXiv:2603.21478},
  year    = {2026}
}
```

## License

This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
