Murple/ksponspeech

Name: Murple/ksponspeech
Creator: Murple
Published: 2022-11-14 02:41:37
License: No description available

Source: Hugging Face | Updated: 2022-11-14 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/Murple/ksponspeech


Resource overview:

The KsponSpeech dataset contains 969 hours of Korean conversational speech from approximately 2,000 native Korean speakers, recorded in a clean environment. All data were built by recording two people conversing freely and manually transcribing the audio. The transcriptions are dual, giving both an orthographic and a phonetic (pronunciation) form, and include disfluency tags that mark the spontaneity of the speech, such as filler words, repeated words, and word fragments. The dataset is primarily used for automatic speech recognition (ASR) and has been publicly released on the South Korean government's open data platform.

Provider:

Murple

Original information summary

Dataset overview

Dataset name

Name: KsponSpeech

Dataset attributes

Language: Korean (ko)
Language creators: crowdsourced
Multilinguality: monolingual
Annotation creators: expert-generated
Size: 10K<n<100K
Source data: original
Task category: automatic-speech-recognition

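Dataset description

Summary: 969 hours of general open-domain dialog speech, spoken by about 2,000 native Korean speakers in a clean environment. The data were built by recording two people conversing freely and manually transcribing the utterances. The transcription is dual, covering orthography and pronunciation, and includes disfluency tags for spontaneous speech such as filler words, repeated words, and word fragments.
Supported task: automatic speech recognition
Language: Korean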

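Dataset structure

Data instances: Each instance contains audio information (path, array, sampling rate), a text transcription, and a unique ID.
Data fields:
- Audio: the audio file path, the decoded audio array, and the sampling rate.
- Text: the transcription of the audio file.
- ID: the unique identifier of the data sample.
Data splits: a training set, a validation set, and two evaluation sets (eval.clean and eval.other).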
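
To make the instance structure concrete, the following sketch builds a synthetic instance and checks that it has the shape described above. The key names (`audio`, `path`, `array`, `sampling_rate`, `text`, `id`), the 16 kHz rate, and all sample values are assumptions for illustration; the names used in the published dataset may differ.

```python
from typing import Any, Dict, List

# Hypothetical key names, inferred from the field descriptions above.
AUDIO_KEYS = ("path", "array", "sampling_rate")
TOP_LEVEL_KEYS = ("audio", "text", "id")


def find_schema_problems(instance: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the instance looks well-formed."""
    problems = [f"missing field: {k}" for k in TOP_LEVEL_KEYS if k not in instance]
    audio = instance.get("audio")
    if isinstance(audio, dict):
        problems += [f"missing audio key: {k}" for k in AUDIO_KEYS if k not in audio]
        rate = audio.get("sampling_rate")
        if "sampling_rate" in audio and (not isinstance(rate, int) or rate <= 0):
            problems.append("sampling_rate must be a positive integer")
    elif "audio" in instance:
        problems.append("audio must be a mapping")
    return problems


# Synthetic example; the path, ID, text, and samples are made up.
example = {
    "audio": {
        "path": "example/utterance_000001.wav",
        "array": [0.0, 0.01, -0.02, 0.0],
        "sampling_rate": 16000,
    },
    "text": "안녕하세요",
    "id": "utterance_000001",
}

print(find_schema_problems(example))
```

A check like this is cheap to run over a handful of samples before training, and it fails loudly if the real field names turn out to differ from the ones assumed here.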

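Dataset creation

Source data: The data were built by recording two people conversing freely and manually transcribing the utterances.
Annotations: Dual transcription (orthography and pronunciation), plus disfluency tags for spontaneous speech.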
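
The page does not specify how the dual transcription and disfluency tags are written in the text field. As a rough sketch, assume a dual span is written as `(orthographic)/(phonetic)` and that a disfluent token carries a trailing `/` or `+`. Under those assumptions, which should be checked against the actual data, one form of the transcript can be selected like this:

```python
import re

# Assumed notation for a dual-transcribed span: "(orthographic)/(phonetic)".
DUAL_SPAN = re.compile(r"\(([^()]*)\)/\(([^()]*)\)")
# Assumed disfluency markers: "/" or "+" attached to the end of a token.
DISFLUENCY_MARK = re.compile(r"(?<=\S)[/+](?=\s|$)")


def select_transcription(text: str, mode: str = "orthographic") -> str:
    """Keep one side of every dual span and leave the rest of the text unchanged."""
    if mode not in ("orthographic", "phonetic"):
        raise ValueError("mode must be 'orthographic' or 'phonetic'")
    group = 1 if mode == "orthographic" else 2
    return DUAL_SPAN.sub(lambda m: m.group(group), text)


def strip_disfluency_marks(text: str) -> str:
    """Remove the marker characters but keep the disfluent words themselves."""
    text = DISFLUENCY_MARK.sub("", text)
    return " ".join(text.split())


# Made-up utterance written in the assumed notation.
raw = "아/ 그+ 그게 (3개)/(세 개) 있어요"
orthographic = strip_disfluency_marks(select_transcription(raw, "orthographic"))
phonetic = strip_disfluency_marks(select_transcription(raw, "phonetic"))
print(orthographic)
print(phonetic)
```

Whether to keep or drop the disfluent words themselves is a modeling choice; this sketch keeps them and removes only the markers.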

Citation information

@Article{app10196936,
  AUTHOR = {Bang, Jeong-Uk and Yun, Seung and Kim, Seung-Hi and Choi, Mu-Yeol and Lee, Min-Kyu and Kim, Yeo-Jeong and Kim, Dong-Hyun and Park, Jun and Lee, Young-Jik and Kim, Sang-Hun},
  TITLE = {KsponSpeech: Korean Spontaneous Speech Corpus for Automatic Speech Recognition},
  JOURNAL = {Applied Sciences},
  VOLUME = {10},
  YEAR = {2020},
  NUMBER = {19},
  ARTICLE-NUMBER = {6936},
  URL = {https://www.mdpi.com/2076-3417/10/19/6936},
  ISSN = {2076-3417},
  ABSTRACT = {This paper introduces a large-scale spontaneous speech corpus of Korean, named KsponSpeech. This corpus contains 969 h of general open-domain dialog utterances, spoken by about 2000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. The transcription provides a dual transcription consisting of orthography and pronunciation, and disfluency tags for spontaneity of speech, such as filler words, repeated words, and word fragments. This paper also presents the baseline performance of an end-to-end speech recognition model trained with KsponSpeech. In addition, we investigated the performance of standard end-to-end architectures and the number of sub-word units suitable for Korean. We investigated issues that should be considered in spontaneous speech recognition in Korean. KsponSpeech is publicly available on an open data hub site of the Korea government.},
  DOI = {10.3390/app10196936}
}

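Curated summary

Dataset introduction

Construction

The Murple/ksponspeech dataset was built by recording pairs of speakers, drawn from about 2,000 native Korean speakers, conversing freely in a clean environment, and then manually transcribing the utterances. It contains 969 hours of general open-domain dialog speech. The transcription is dual, covering both orthography and pronunciation, and includes disfluency tags for spontaneous speech such as filler words, repeated words, and word fragments.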

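Characteristics

The dataset stands out for its large volume of spontaneous Korean dialog, with transcriptions that give both orthography and pronunciation and that annotate disfluencies. This makes it especially valuable for automatic speech recognition, because it helps models learn to handle the disfluencies found in natural speech.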

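Usage

When using Murple/ksponspeech, researchers can access each audio file's path, the decoded audio array, and the sampling rate. Each sample carries a unique ID, which simplifies retrieval and management. The dataset is divided into a training set, a validation set, and two evaluation sets (eval.clean and eval.other) to support different evaluation needs.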
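
Because each instance exposes the decoded array together with its sampling rate, clip length follows directly from the two. The sketch below computes per-clip seconds and total hours over any iterable of instances. It assumes the hypothetical keys `audio`, `array`, and `sampling_rate`, and it uses synthetic silent clips rather than real data.

```python
from typing import Any, Dict, Iterable


def clip_seconds(instance: Dict[str, Any]) -> float:
    """Duration of one clip: number of samples divided by samples per second."""
    audio = instance["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def total_hours(instances: Iterable[Dict[str, Any]]) -> float:
    """Sum of clip durations, converted from seconds to hours."""
    return sum(clip_seconds(x) for x in instances) / 3600.0


def make_clip(sample_id: str, seconds: float, rate: int = 16000) -> Dict[str, Any]:
    """Build a synthetic silent clip; every value here is made up."""
    return {
        "id": sample_id,
        "text": "",
        "audio": {
            "path": f"{sample_id}.wav",
            "array": [0.0] * int(seconds * rate),
            "sampling_rate": rate,
        },
    }


clips = [make_clip("a", 1.5), make_clip("b", 2.5)]
print(clip_seconds(clips[0]), total_hours(clips))
```

Running the same two functions over each split gives a quick sanity check of how the 969 hours are distributed across training, validation, and evaluation data.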

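Background and challenges

Background overview

The KsponSpeech speech dataset was introduced in 2020 by Jeong-Uk Bang and colleagues. It is an important resource for Korean automatic speech recognition, containing 969 hours of free conversation recorded by about 2,000 native Korean speakers in a clean environment, together with the corresponding transcriptions. The transcriptions give both orthography and pronunciation, and they mark features of natural speech with disfluency tags such as filler words, repeated words, and word fragments. KsponSpeech was created to advance research on spontaneous speech recognition in Korean, and it serves as a data resource for work in that area.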

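Current challenges

Building KsponSpeech involved several challenges. First, the conversations had to be natural and varied. Second, accurately transcribing and annotating a large volume of speech takes substantial time and labor. Third, personal information in the recordings had to be handled and protected. On the research side, open problems include how to use the data effectively to train high-performing automatic speech recognition models, how to handle the disfluencies of spontaneous speech, and how to overcome performance limits that bias in the data may cause.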

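Common scenarios

Typical use cases

In automatic speech recognition, the typical use of Murple/ksponspeech is building and training end-to-end speech recognition models. The dataset provides a large amount of spontaneous Korean dialog with transcriptions, which helps models learn to handle disfluent elements of natural speech such as pauses, repetitions, and hesitations.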

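Derived and related work

Work building on Murple/ksponspeech includes exploring the number of sub-word units suitable for Korean, improving the performance of end-to-end architectures, and evaluating how different models perform on spontaneous speech recognition. These studies have further advanced Korean automatic speech recognition.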
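
Comparing models on spontaneous speech requires an error metric. This page does not name one, so the sketch below assumes character error rate (CER), a common choice for languages written in syllable blocks: the Levenshtein distance between reference and hypothesis divided by the reference length. Ignoring spaces is a further assumption, and the example strings are made up.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance computed row by row with two rolling lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str,
                         ignore_spaces: bool = True) -> float:
    """CER = edit distance / number of reference characters."""
    if ignore_spaces:
        reference = reference.replace(" ", "")
        hypothesis = hypothesis.replace(" ", "")
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted syllable out of five reference syllables.
print(character_error_rate("안녕하세요", "안녕하세오"))
```

Scores depend on whether the orthographic or phonetic transcript is used as the reference and on how disfluency tags are treated, so those choices should be fixed before comparing systems.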

Recent research on the dataset