SEACrowd/vivos

Name: SEACrowd/vivos
Creator: SEACrowd
Published: 2024-06-24 13:27:25
License: cc-by-sa-4.0 (per the dataset card details below)

Source: Hugging Face. Updated 2024-06-24; indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/SEACrowd/vivos

Resource summary:

Vivos is a Vietnamese speech corpus consisting of 15 hours of recorded audio, intended for automatic speech recognition (ASR) tasks. This corpus was recorded by more than 50 Vietnamese native speakers.

Provider:

SEACrowd

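Summary of original information

Vivos dataset overview

Basic information

Name: Vivos
Language: Vietnamese (vie)
Task category: speech recognition (speech-recognition)
Tags: speech recognition (speech-recognition)
License: Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)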

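Dataset description

Vivos is a Vietnamese speech corpus containing 15 hours of recorded speech, prepared for automatic speech recognition tasks. It was recorded by more than 50 native Vietnamese speakers.

Supported tasks

Speech recognition

Dataset version

Source version: 1.0.0
SEACrowd version: 2024.06.20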

Dataset loading

Using the `datasets` library

```python
import datasets

dset = datasets.load_dataset("SEACrowd/vivos", trust_remote_code=True)
```

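Using the `seacrowd` library

```python
import seacrowd as sc

# Load the dataset using the default config
dset = sc.load_dataset("vivos", schema="seacrowd")

# Check all available subsets (config names) of the dataset
print(sc.available_config_names("vivos"))

# Load the dataset using a specific config
dset = sc.load_dataset_by_config_name(config_name="<config_name>")
```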
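As a quick sanity check on the stated 15 hours, the total duration of a loaded split can be estimated from the decoded audio. The following is a minimal sketch, not part of the SEACrowd documentation. It assumes each example exposes an `audio` field holding a decoded `array` and a `sampling_rate` (the layout the Hugging Face `Audio` feature commonly yields). The helper name `total_hours` and the synthetic records are hypothetical.

```python
def total_hours(examples):
    """Sum the duration of all examples and return it in hours.

    Assumes each example looks like
    {"audio": {"array": <sequence of samples>, "sampling_rate": <int>}}.
    """
    total_seconds = 0.0
    for example in examples:
        audio = example["audio"]
        # Number of samples divided by samples-per-second gives seconds.
        total_seconds += len(audio["array"]) / audio["sampling_rate"]
    return total_seconds / 3600.0


# Synthetic stand-in for a loaded split: two one-second clips at 16 kHz.
fake_split = [
    {"audio": {"array": [0.0] * 16000, "sampling_rate": 16000}},
    {"audio": {"array": [0.0] * 16000, "sampling_rate": 16000}},
]
hours = total_hours(fake_split)
```

With a real split, the analogous call would be something like `total_hours(dset["train"])`, assuming the split name and field layout match what is actually returned.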

Citation

If you use the Vivos dataloader, please cite the following:

@inproceedings{luong-vu-2016-non,
  title = "A non-expert {K}aldi recipe for {V}ietnamese Speech Recognition System",
  author = "Luong, Hieu-Thi and Vu, Hai-Quan",
  editor = "Murakami, Yohei and Lin, Donghui and Ide, Nancy and Pustejovsky, James",
  booktitle = "Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies ({WLSI}/{OIAF}4{HLT}2016)",
  month = dec,
  year = "2016",
  address = "Osaka, Japan",
  publisher = "The COLING 2016 Organizing Committee",
  url = "https://aclanthology.org/W16-5207",
  pages = "51--55",
  abstract = "In this paper we describe a non-expert setup for Vietnamese speech recognition system using Kaldi toolkit. We collected a speech corpus over fifteen hours from about fifty Vietnamese native speakers and using it to test the feasibility of our setup. The essential linguistic components for the Automatic Speech Recognition (ASR) system was prepared basing on the written form of the language instead of expertise knowledge on linguistic and phonology as commonly seen in rich resource languages like English. The modeling of tones by integrating them into the phoneme and using the phonetic decision tree is also discussed. Experimental results showed this setup for ASR systems does yield competitive results while still have potentials for further improvements.",
}

@article{lovenia2024seacrowd,
  title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
  author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
  year={2024},
  eprint={2406.10118},
  journal={arXiv preprint arXiv: 2406.10118}
}

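Collected summary

Dataset introduction

Construction

High-quality speech datasets underpin progress in speech recognition research. The core VIVOS material comes from recordings contributed by more than 50 native Vietnamese volunteer speakers, totalling 15 hours of speech. The collection aimed to cover natural spoken expression, giving automatic speech recognition work realistic and varied material for training and evaluation. The dataset accompanies a non-expert Kaldi recipe in which the linguistic components are prepared from the written form of Vietnamese rather than from specialist linguistic knowledge, which lowers the barrier to entry for researchers.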

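Features

As a resource focused on Vietnamese speech recognition, VIVOS provides 15 hours of transcribed continuous Vietnamese speech for model training. Its design emphasizes practicality: the accompanying recipe integrates tone information into the phoneme inventory and uses a phonetic decision tree, which addresses the recognition difficulties that arise because Vietnamese is a tonal language. According to the source paper, ASR systems built with this setup yield competitive results while leaving room for further improvement.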
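To illustrate what integrating tone into written-form units can look like, here is a minimal sketch. It is not the authors' recipe: the tone labels, the `VOWELS` set, and the helpers `split_tone` and `tonal_units` are assumptions chosen for illustration. It relies only on Unicode decomposition to separate the five tone diacritics from the base letters, then tags each vowel letter with the syllable's tone.

```python
import unicodedata

# Combining marks that encode the five marked Vietnamese tones.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}

# Vowel letters of the Vietnamese alphabet, normalized to composed form.
VOWELS = set(unicodedata.normalize("NFC", "aăâeêioôơuưy"))


def split_tone(syllable):
    """Return (toneless syllable, tone label) for one written syllable."""
    tone = "ngang"  # the unmarked level tone
    kept = []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # base letters and vowel-quality marks stay
    return unicodedata.normalize("NFC", "".join(kept)), tone


def tonal_units(syllable):
    """Map a syllable to letter units, tagging each vowel letter with the tone."""
    base, tone = split_tone(syllable)
    return [f"{ch}_{tone}" if ch in VOWELS else ch for ch in base]
```

Tagging every vowel letter is a simplification made here for brevity. A real lexicon would more likely group letters into phoneme-like units before attaching the tone.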

Usage

The dataset can be loaded in two ways. With the `datasets` library, call `load_dataset("SEACrowd/vivos", trust_remote_code=True)`. With the `seacrowd` library, which targets Southeast Asian language research, call `sc.load_dataset("vivos", schema="seacrowd")` to load the default config, or list the available subsets with `sc.available_config_names("vivos")` and then load one by its config name. Further usage guidance for the library is available on the SEACrowd data hub's GitHub page.

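Background and challenges

Background overview

As automatic speech recognition has advanced, building high-quality speech datasets for comparatively low-resource languages has become key to further progress. VIVOS was introduced in 2016 by Hieu-Thi Luong and Hai-Quan Vu to provide basic training material for Vietnamese ASR systems. It contains 15 hours of recordings from more than 50 native speakers. The central research question is how to build an effective speech recognition model from the written form of the language alone, without expert linguistic knowledge, and in particular how to model Vietnamese tones. The work supplies data for research on Southeast Asian language processing.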

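Current challenges

VIVOS targets two core difficulties in Vietnamese ASR: modeling the interaction between phonemes and tones accurately in a tonal language, and coping with the limited generalization that results from a small amount of training data. Building the corpus also posed practical problems. Dozens of native speakers had to be coordinated for recording, with consistent recording conditions and audio quality. In addition, without detailed phonological guidance, the phoneme set and the tone integration scheme had to be designed from the written form, which complicated the preparation of linguistic resources. Together these difficulties show how demanding it is to build a usable speech dataset under resource constraints.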

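Common scenarios

Typical use cases

VIVOS is commonly used to train and evaluate automatic speech recognition models for Vietnamese. It contains 15 hours of Vietnamese speech recorded by more than 50 native speakers, covering a range of pronunciations and intonation. Researchers typically build recognition systems on it, performing acoustic and language modeling with speech recognition toolkits such as Kaldi, to improve recognition accuracy on Vietnamese. This use demonstrates the dataset's practical value and provides a basis for later speech technology research.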
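Recognition accuracy in this kind of evaluation is commonly reported as word error rate (WER). The following is a minimal, self-contained sketch of that metric, assuming whitespace tokenization; the function name `word_error_rate` is hypothetical, and this is not the scoring code used by the paper or by SEACrowd. Because written Vietnamese separates syllables with spaces, the whitespace tokens counted here are syllables.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, since the denominator is the reference length only.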

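Derived work

Several lines of work build on VIVOS. Luong and Vu's 2016 non-expert Kaldi recipe was the first to use the dataset to build a Vietnamese speech recognition system and to test the feasibility of the setup. Subsequent studies have reportedly explored multimodal fusion methods that combine text and speech data to improve recognition robustness. The SEACrowd project later integrated VIVOS into its data hub for Southeast Asian languages, supporting cross-lingual benchmarking. These efforts have added to the literature on Vietnamese speech processing and encouraged further models and toolchains for low-resource languages.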

Recent research on the dataset