typhoon-ai/thai-dialect-isan-dataset
Source: Hugging Face. Updated 2025-11-26; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/typhoon-ai/thai-dialect-isan-dataset
---
language:
- th
license: apache-2.0
task_categories:
- automatic-speech-recognition
tags:
- audio
- speech-processing
- isan-dialect
pretty_name: Thai Dialect Isan Speech Corpus
size_categories:
- 10k<n<100k
dataset_info:
  features:
  - name: id
    dtype: string
  - name: audio
    struct:
    - name: array
      sequence: float32
    - name: path
      dtype: string
    - name: sampling_rate
      dtype: int64
  - name: raw
    dtype: string
  - name: thai_spelling
    dtype: string
  - name: isan_spelling
    dtype: string
  - name: name
    dtype: string
  - name: district
    dtype: string
  - name: province
    dtype: string
  - name: age
    dtype: string
  - name: gender
    dtype: string
  - name: question_id
    dtype: string
  - name: question
    dtype: string
  - name: duration
    dtype: float64
  splits:
  - name: train
    num_bytes: 19226874522
    num_examples: 9987
  - name: test
    num_bytes: 892537211
    num_examples: 500
  download_size: 8839972206
  dataset_size: 20119411733
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
# Dataset Card for Thai Dialect Isan Speech Corpus
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Data Statistics](#data-statistics)
- [Key Features](#key-features)
- [Usage](#usage)
- [Additional Information](#additional-information)
## Dataset Description
This dataset contains audio recordings of **Isan (Northeastern Thai)** speech, paired with rich transcriptions and demographic metadata. It is designed to support Automatic Speech Recognition (ASR), dialect study, and text normalization tasks for the Isan language.
The dataset features spontaneous responses to specific questions, covering two domains (General and Finance), recorded by speakers from different provinces in Northeastern Thailand.
- **Language:** Isan (Northeastern Thai)
- **Total Examples:** 10,487
- **Input:** Audio (WAV)
- **Output:** Transcriptions (Isan spelling, Thai spelling, and Raw annotated format)
- **License:** CC-BY-4.0
## Dataset Structure
### Data Splits
| Split | Examples |
| ----- | :------: |
| Train | 9,987 |
| Test | 500 |
### Data Fields
Each data point contains the following fields:
- **id** (`string`): A unique identifier for the dataset entry.
- **audio** (`audio`): A dictionary containing the path to the audio file, the decoded audio array, and the sampling rate.
- **raw** (`string`): The raw transcription, in which each dialect word is annotated with its Standard Thai equivalent.
  - *Format:* `[Isan Word]<Standard Thai Word>`
  - *Example:* `"ข้อย[กะ]<ก็>บ่ค่อยมี[แฮง]<แรง>"`
- **thai_spelling** (`string`): The transcription normalized to Standard Thai spelling.
  - *Example:* `"ข้อยก็บ่ค่อยมีแรง"`
- **isan_spelling** (`string`): The transcription written in Isan spelling, which follows the dialect's pronunciation.
  - *Example:* `"ข้อยกะบ่ค่อยมีแฮง"`
- **name** (`string`): The original filename, often containing metadata codes (e.g., `opentyphoon;is;x_061;gen;0049.wav`).
- **district** (`string`): The district (Amphoe) where the speaker resides.
- **province** (`string`): The province (Changwat) where the speaker resides.
- **age** (`string`): The age of the speaker (stored as a string, per the dataset schema).
- **gender** (`string`): The gender of the speaker. Values include `"m"` (male), `"f"` (female), and `"x"` (not specified).
- **question_id** (`string`): The ID of the prompt question asked to the speaker.
- **question** (`string`): The text of the question asked to the speaker.
- **duration** (`float`): The duration of the audio clip in seconds.
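
As a quick sanity check on the fields above, the `duration` value can be compared with the length implied by the decoded audio. The sketch below assumes that `duration` equals `len(array) / sampling_rate` up to a small tolerance, which this card does not state explicitly. The helper names, the tolerance, and the synthetic 16 kHz record are hypothetical and not part of the dataset.

```python
import math


def duration_from_audio(audio: dict) -> float:
    """Return the clip length in seconds implied by the decoded samples."""
    return len(audio["array"]) / audio["sampling_rate"]


def duration_matches(example: dict, tolerance: float = 0.05) -> bool:
    """Compare the `duration` field with the decoded audio (assumed relation)."""
    implied = duration_from_audio(example["audio"])
    return math.isclose(implied, example["duration"], abs_tol=tolerance)


# Synthetic record shaped like a dataset row (not real data):
# 32,000 silent samples at an assumed 16 kHz sampling rate.
fake_example = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "duration": 2.0,
}
print(duration_matches(fake_example))
```

Rows for which the check fails would be worth inspecting by hand before training, since the cause may simply be rounding in the stored metadata.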
## Key Features
### 1. Rich Annotation Format
The `raw` field provides a unique mapping between Isan dialect and Standard Thai. This is valuable for dialect normalization and translation tasks.
- **Format:** `[dialect spelling]<standard Thai spelling>`
- **Example:** `[เฮา]<เรา>` indicates that the speaker said "hao" (Isan), which corresponds to "rao" (Standard Thai for "we/us").
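
The annotation format lends itself to simple pattern matching. The following is a minimal sketch, assuming every annotation is a well-formed `[dialect]<standard>` token with no nesting. The function names are illustrative and not part of the dataset or any library. Applied to the `raw` example from the Data Fields section, it should reproduce the `isan_spelling` and `thai_spelling` examples given there.

```python
import re

# Matches one annotation token of the form [dialect spelling]<standard Thai spelling>.
ANNOTATION = re.compile(r"\[([^\]]*)\]<([^>]*)>")


def to_isan(raw: str) -> str:
    """Keep the dialect spelling of every annotated word."""
    return ANNOTATION.sub(r"\1", raw)


def to_thai(raw: str) -> str:
    """Keep the Standard Thai spelling of every annotated word."""
    return ANNOTATION.sub(r"\2", raw)


def dialect_pairs(raw: str) -> list[tuple[str, str]]:
    """List the (dialect, standard) word pairs found in a raw transcription."""
    return ANNOTATION.findall(raw)


raw = "ข้อย[กะ]<ก็>บ่ค่อยมี[แฮง]<แรง>"
print(to_isan(raw))
print(to_thai(raw))
print(dialect_pairs(raw))
```

Collecting `dialect_pairs` over a whole split would give a rough Isan-to-Thai lexicon, although real transcriptions may contain edge cases that this pattern does not handle.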
### 2. Demographic Diversity
The dataset includes speakers from multiple key provinces in the Isan region, allowing for analysis of regional accent variations. Provinces include:
- Khon Kaen
- Udon Thani
- Ubon Ratchathani
- Chaiyaphum
- Roi Et
- Maha Sarakham
- Kalasin
- Nong Bua Lam Phu
- Bueng Kan
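
To see how recordings are distributed across provinces, the `province` and `duration` fields can be aggregated. The sketch below works on any iterable of row-like dictionaries. The sample rows are hypothetical, and the exact spelling or script of the `province` values in the dataset is an assumption that should be checked against the data.

```python
from collections import defaultdict


def hours_by_province(records) -> dict[str, float]:
    """Sum clip durations (in seconds) per province and convert to hours."""
    totals = defaultdict(float)
    for record in records:
        totals[record["province"]] += record["duration"]
    return {province: seconds / 3600 for province, seconds in totals.items()}


# Hypothetical rows for illustration only; real values come from the dataset.
sample_rows = [
    {"province": "Khon Kaen", "duration": 1800.0},
    {"province": "Khon Kaen", "duration": 1800.0},
    {"province": "Roi Et", "duration": 900.0},
]
print(hours_by_province(sample_rows))
```

With the real data, passing something like `dataset["train"].remove_columns("audio")` should avoid decoding audio while iterating, because only the metadata columns are needed here.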
### 3. Prompted Speech
Recordings are responses to specific questions (found in the `question` field), providing context for the speech. This helps in analyzing semantic understanding and sentiment in the local dialect.
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
import IPython.display as ipd

# Load the dataset (repository ID as given in this page's title and download link)
dataset = load_dataset("typhoon-ai/thai-dialect-isan-dataset")

# Select a sample
example = dataset["train"][0]

# Print transcriptions
print(f"Transcript (Isan): {example['isan_spelling']}")
print(f"Transcript (Thai): {example['thai_spelling']}")

# Listen to audio (renders a player when run in a Jupyter/IPython notebook)
ipd.Audio(example["audio"]["array"], rate=example["audio"]["sampling_rate"])
```
Provider: typhoon-ai



