thai-isan-dialect-dataset

Name: thai-isan-dialect-dataset
Creator: maas
Published: 2025-12-05 16:57:30
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/scb10x/thai-isan-dialect-dataset


Resource description:

# Dataset Card for Thai Dialect Isan Speech Corpus

## Table of Contents

- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Data Statistics](#data-statistics)
- [Key Features](#key-features)
- [Usage](#usage)
- [Additional Information](#additional-information)

## Dataset Description

This dataset contains audio recordings of **Isan (Northeastern Thai)** speech, paired with rich transcriptions and demographic metadata. It is designed to support Automatic Speech Recognition (ASR), dialect study, and text normalization tasks for the Isan language.

The dataset features spontaneous responses to specific questions, covering two domains (General and Finance), recorded by speakers from different provinces in Northeastern Thailand.

- **Language:** Isan (Northeastern Thai)
- **Total Examples:** 10,487
- **Input:** Audio (WAV)
- **Output:** Transcriptions (Isan spelling, Thai spelling, and Raw annotated format)
- **License:** CC-BY-4.0

## Dataset Structure

### Data Splits

| Split | Examples |
| ----- | :------: |
| Train | 9,987 |
| Test | 500 |

### Data Fields

Each data point contains the following fields:

- **id** (`string`): A unique identifier for the dataset entry.
- **audio** (`audio`): A dictionary containing the path to the audio file, the decoded audio array, and the sampling rate.
- **raw** (`string`): The raw transcription containing dialect-to-standard annotation tokens.
  - *Format:* `[Isan Word]<Standard Thai Word>`
  - *Example:* `"ข้อย[กะ]<ก็>บ่ค่อยมี[แฮง]<แรง>"`
- **thai_spelling** (`string`): The transcription normalized to Standard Thai spelling.
  - *Example:* `"ข้อยก็บ่ค่อยมีแรง"`
- **isan_spelling** (`string`): The transcription written in Isan spelling (phonetic to the dialect).
  - *Example:* `"ข้อยกะบ่ค่อยมีแฮง"`
- **name** (`string`): The original filename, often containing metadata codes (e.g., `opentyphoon;is;x_061;gen;0049.wav`).
- **district** (`string`): The district (Amphoe) where the speaker resides.
- **province** (`string`): The province (Changwat) where the speaker resides.
- **age** (`int`): The age of the speaker.
- **gender** (`string`): The gender of the speaker. Values include `"m"` (male), `"f"` (female), and `"x"` (not specified).
- **question_id** (`string`): The ID of the prompt question asked to the speaker.
- **question** (`string`): The text of the question asked to the speaker.
- **duration** (`float`): The duration of the audio clip in seconds.

## Key Features

### 1. Rich Annotation Format

The `raw` field provides a unique mapping between Isan dialect and Standard Thai. This is valuable for dialect normalization and translation tasks.

- **Format:** `[dialect spelling]<standard Thai spelling>`
- **Example:** `[เฮา]<เรา>` indicates the speaker said "hao" (Isan) which corresponds to "rao" (Thai for "We/Us").

### 2. Demographic Diversity

The dataset includes speakers from multiple key provinces in the Isan region, allowing for analysis of regional accent variations. Provinces include:

- Khon Kaen
- Udon Thani
- Ubon Ratchathani
- Chaiyaphum
- Roi Et
- Maha Sarakham
- Kalasin
- Nong Bua Lam Phu
- Beung Kan

### 3. Prompted Speech

Recordings are responses to specific questions (found in the `question` field), providing context for the speech. This helps in analyzing semantic understanding and sentiment in the local dialect.

## Usage

### Loading the Dataset

```python
from datasets import load_dataset
import IPython.display as ipd

# Load the dataset
dataset = load_dataset("scb10x/thai-dialect-isan-dataset")

# Select a sample
example = dataset['train'][0]

# Print transcriptions
print(f"Transcript (Isan): {example['isan_spelling']}")
print(f"Transcript (Thai): {example['thai_spelling']}")

# Listen to audio
ipd.Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])
```
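
### Parsing the `raw` Field

The `[Isan Word]<Standard Thai Word>` format of the `raw` field can be split into its two spellings with a regular expression. The following is a minimal sketch, not the dataset authors' tooling. It assumes that annotation tokens are never nested and that all text outside a token is shared by both spellings. The helper names `split_raw` and `word_pairs` are hypothetical, and real `raw` strings may contain markup that the card does not describe.

```python
import re

# One annotation token of the form [Isan word]<Standard Thai word>.
# Assumption: tokens are not nested and brackets do not appear inside a word.
ANNOTATION = re.compile(r"\[([^\[\]]*)\]<([^<>]*)>")


def split_raw(raw: str) -> tuple[str, str]:
    """Return (isan_spelling, thai_spelling) reconstructed from a raw string."""
    isan = ANNOTATION.sub(lambda m: m.group(1), raw)  # keep the bracketed word
    thai = ANNOTATION.sub(lambda m: m.group(2), raw)  # keep the angle-bracket word
    return isan, thai


def word_pairs(raw: str) -> list[tuple[str, str]]:
    """Return the (Isan, Standard Thai) word pairs annotated in a raw string."""
    return ANNOTATION.findall(raw)


# Example string taken from the Data Fields section of this card.
raw_example = "ข้อย[กะ]<ก็>บ่ค่อยมี[แฮง]<แรง>"
isan_text, thai_text = split_raw(raw_example)
pairs = word_pairs(raw_example)

print(isan_text)
print(thai_text)
print(pairs)
```

For the card's own example, the two reconstructed strings match the documented `isan_spelling` and `thai_spelling` values, so the same check could serve as a consistency test across rows. The extracted pairs are one possible starting point for a dialect-to-standard lexicon.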

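### Summarizing Clips by Province

The `province` and `duration` fields make it straightforward to check how recordings are distributed across the region before analyzing accent variation. The sketch below uses plain Python and assumes each record is a mapping with the `province` and `duration` keys described in the card. The function name `summarize_by_province` is hypothetical, and the sample records are invented placeholders, not rows from the dataset.

```python
from collections import defaultdict


def summarize_by_province(records):
    """Return {province: (clip_count, total_seconds)} for an iterable of records."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for rec in records:
        counts[rec["province"]] += 1
        seconds[rec["province"]] += rec["duration"]
    return {province: (counts[province], seconds[province]) for province in counts}


# Hypothetical records for illustration only; values are not taken from the dataset.
sample_records = [
    {"province": "Khon Kaen", "gender": "f", "duration": 4.0},
    {"province": "Khon Kaen", "gender": "m", "duration": 6.5},
    {"province": "Roi Et", "gender": "x", "duration": 3.0},
]

summary = summarize_by_province(sample_records)
for province, (n_clips, total_s) in sorted(summary.items()):
    print(f"{province}: {n_clips} clips, {total_s:.1f} s")
```

When applying this to the loaded dataset, iterating over full rows also decodes the audio for each one. Selecting only the needed columns first, for example with `dataset["train"].select_columns(["province", "duration"])`, should avoid that cost.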

Provider: maas
Created: 2025-11-27
