typhoon-ai/thai-dialect-isan-dataset

Name: typhoon-ai/thai-dialect-isan-dataset
Creator: typhoon-ai
Published: 2025-11-26 14:16:07
License: no description provided

Source: Hugging Face. Updated 2025-11-26; indexed 2026-02-07.

Download link:

https://hf-mirror.com/datasets/typhoon-ai/thai-dialect-isan-dataset


Resource description:

---
language:
- th
license: apache-2.0
task_categories:
- automatic-speech-recognition
tags:
- audio
- speech-processing
- isan-dialect
pretty_name: Thai Dialect Isan Speech Corpus
size_categories:
- 10k<n<100k
dataset_info:
  features:
  - name: id
    dtype: string
  - name: audio
    struct:
    - name: array
      sequence: float32
    - name: path
      dtype: string
    - name: sampling_rate
      dtype: int64
  - name: raw
    dtype: string
  - name: thai_spelling
    dtype: string
  - name: isan_spelling
    dtype: string
  - name: name
    dtype: string
  - name: district
    dtype: string
  - name: province
    dtype: string
  - name: age
    dtype: string
  - name: gender
    dtype: string
  - name: question_id
    dtype: string
  - name: question
    dtype: string
  - name: duration
    dtype: float64
  splits:
  - name: train
    num_bytes: 19226874522
    num_examples: 9987
  - name: test
    num_bytes: 892537211
    num_examples: 500
  download_size: 8839972206
  dataset_size: 20119411733
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Dataset Card for Thai Dialect Isan Speech Corpus

## Table of Contents

- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Data Statistics](#data-statistics)
- [Key Features](#key-features)
- [Usage](#usage)
- [Additional Information](#additional-information)

## Dataset Description

This dataset contains audio recordings of **Isan (Northeastern Thai)** speech, paired with rich transcriptions and demographic metadata. It is designed to support Automatic Speech Recognition (ASR), dialect study, and text normalization tasks for the Isan language. The recordings are spontaneous responses to specific questions in two domains (General and Finance), given by speakers from different provinces in Northeastern Thailand.

- **Language:** Isan (Northeastern Thai)
- **Total Examples:** 10,487
- **Input:** Audio (WAV)
- **Output:** Transcriptions (Isan spelling, Thai spelling, and raw annotated format)
- **License:** CC-BY-4.0

## Dataset Structure

### Data Splits

| Split | Examples |
| ----- | :------: |
| Train |  9,987   |
| Test  |   500    |

### Data Fields

Each data point contains the following fields:

- **id** (`string`): A unique identifier for the dataset entry.
- **audio** (`audio`): A dictionary containing the path to the audio file, the decoded audio array, and the sampling rate.
- **raw** (`string`): The raw transcription containing dialect-to-standard annotation tokens.
  - *Format:* `[Isan Word]<Standard Thai Word>`
  - *Example:* `"ข้อย[กะ]<ก็>บ่ค่อยมี[แฮง]<แรง>"`
- **thai_spelling** (`string`): The transcription normalized to Standard Thai spelling.
  - *Example:* `"ข้อยก็บ่ค่อยมีแรง"`
- **isan_spelling** (`string`): The transcription written in Isan spelling, which follows the dialect's pronunciation.
  - *Example:* `"ข้อยกะบ่ค่อยมีแฮง"`
- **name** (`string`): The original filename, often containing metadata codes (e.g., `opentyphoon;is;x_061;gen;0049.wav`).
- **district** (`string`): The district (Amphoe) where the speaker resides.
- **province** (`string`): The province (Changwat) where the speaker resides.
- **age** (`string`): The age of the speaker.
- **gender** (`string`): The gender of the speaker. Values include `"m"` (male), `"f"` (female), and `"x"` (not specified).
- **question_id** (`string`): The ID of the prompt question asked to the speaker.
- **question** (`string`): The text of the question asked to the speaker.
- **duration** (`float`): The duration of the audio clip in seconds.

## Key Features

### 1. Rich Annotation Format

The `raw` field maps each Isan dialect word to its Standard Thai counterpart. This is valuable for dialect normalization and translation tasks.

- **Format:** `[dialect spelling]<standard Thai spelling>`
- **Example:** `[เฮา]<เรา>` indicates the speaker said "hao" (Isan), which corresponds to "rao" (Thai for "we/us").

### 2. Demographic Diversity

The dataset includes speakers from multiple key provinces in the Isan region, allowing for analysis of regional accent variations. Provinces include:

- Khon Kaen
- Udon Thani
- Ubon Ratchathani
- Chaiyaphum
- Roi Et
- Maha Sarakham
- Kalasin
- Nong Bua Lam Phu
- Bueng Kan

### 3. Prompted Speech

Recordings are responses to specific questions (found in the `question` field), which gives each utterance its context. This helps in analyzing semantic understanding and sentiment in the local dialect.

## Usage

### Loading the Dataset

```python
from datasets import load_dataset
import IPython.display as ipd

# Load the dataset.
# Repo ID as written in the original card; this listing is published
# under typhoon-ai/thai-dialect-isan-dataset.
dataset = load_dataset("scb10x/thai-dialect-isan-dataset")

# Select a sample
example = dataset['train'][0]

# Print transcriptions
print(f"Transcript (Isan): {example['isan_spelling']}")
print(f"Transcript (Thai): {example['thai_spelling']}")

# Listen to audio (renders a player when run in a notebook)
ipd.Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])
```
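The `[dialect spelling]<standard Thai spelling>` format of the `raw` field can be split back into the two spellings with a regular expression. The following is a minimal sketch, not part of the dataset or the `datasets` library: `split_raw` and `dialect_pairs` are hypothetical helper names, and the code assumes that annotation tokens are never nested and always appear as a complete `[...]<...>` pair.

```python
import re
from typing import List, Tuple

# One annotation token: [dialect spelling]<standard Thai spelling>.
# Assumption: brackets are never nested inside a token.
PAIR = re.compile(r"\[([^\[\]]*)\]<([^<>]*)>")


def split_raw(raw: str) -> Tuple[str, str]:
    """Return (isan_text, thai_text) derived from a `raw` transcription."""
    isan_text = PAIR.sub(r"\1", raw)  # keep the dialect side of each token
    thai_text = PAIR.sub(r"\2", raw)  # keep the Standard Thai side
    return isan_text, thai_text


def dialect_pairs(raw: str) -> List[Tuple[str, str]]:
    """Return the (isan_word, thai_word) pairs annotated in `raw`."""
    return PAIR.findall(raw)


# Example string taken from the field description in the card.
raw = "ข้อย[กะ]<ก็>บ่ค่อยมี[แฮง]<แรง>"
isan_text, thai_text = split_raw(raw)
print(isan_text)
print(thai_text)
print(dialect_pairs(raw))
```

On the card's own example, the two derived strings match the documented `isan_spelling` and `thai_spelling` values. Whether that holds for every row in the corpus is not stated in the card, so it is worth checking against the stored fields before relying on it.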


Provider:

typhoon-ai
