gigaspeech2

Name: gigaspeech2
Creator: maas
Published: 2026-05-23 15:40:47
License: No description provided

ModelScope community: updated 2026-05-23, indexed 2024-06-29

Download link:

https://modelscope.cn/datasets/AI-ModelScope/gigaspeech2

Resource description:

# Dataset Card for GigaSpeech 2

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Preparation Scripts](#preparation-scripts)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
- [Terms of Access](#terms-of-access)

## Dataset Description

GigaSpeech 2 is an evolving, large-scale, multi-domain, and multilingual ASR corpus focusing on low-resource languages. GigaSpeech 2 raw comprises about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese. GigaSpeech 2 refined consists of 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese.

- **Homepage:** https://github.com/SpeechColab/GigaSpeech2
- **Repository:** https://github.com/SpeechColab/GigaSpeech2
- **Paper:** https://export.arxiv.org/pdf/2406.11546
- **Leaderboard:** https://github.com/SpeechColab/GigaSpeech2#leaderboard
- **ModelScope:** https://modelscope.cn/datasets/AI-ModelScope/gigaspeech2
- **Point of Contact:** [gigaspeech@speechcolab.org](mailto:gigaspeech@speechcolab.org)

### Preparation Scripts

```bash
pip install lhotse
lhotse prepare gigaspeech2 [OPTIONS] CORPUS_DIR OUTPUT_DIR
```

### Supported Tasks and Leaderboards

- `automatic-speech-recognition`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. Evaluation metrics include Character Error Rate (CER) for Thai, and Word Error Rate (WER) for Indonesian and Vietnamese.
  The task has an active leaderboard, which can be found at https://github.com/SpeechColab/GigaSpeech2#leaderboard and ranks models based on their WER.

### Languages

GigaSpeech 2 contains audio and transcription data in Thai, Indonesian, and Vietnamese.

## Dataset Structure

```
GigaSpeech 2
├── data
│   ├── id
│   │   ├── md5
│   │   ├── dev.tar.gz
│   │   ├── dev.tsv
│   │   ├── test.tar.gz
│   │   ├── test.tsv
│   │   ├── train
│   │   │   ├── 0.tar.gz
│   │   │   ├── 1.tar.gz
│   │   │   └── ...
│   │   ├── train_raw.tsv
│   │   └── train_refined.tsv
│   ├── th
│   │   ├── md5
│   │   ├── dev.tar.gz
│   │   ├── dev.tsv
│   │   ├── test.tar.gz
│   │   ├── test.tsv
│   │   ├── train
│   │   │   ├── 0.tar.gz
│   │   │   ├── 1.tar.gz
│   │   │   └── ...
│   │   ├── train_raw.tsv
│   │   └── train_refined.tsv
│   └── vi
│       ├── md5
│       ├── dev.tar.gz
│       ├── dev.tsv
│       ├── test.tar.gz
│       ├── test.tsv
│       ├── train
│       │   ├── 0.tar.gz
│       │   ├── 1.tar.gz
│       │   └── ...
│       ├── train_raw.tsv
│       └── train_refined.tsv
├── metadata.json
└── README.md
```

### Data Instances

```
Audio file (.wav):
  Channels: 1
  Sample Rate: 16000
  Sample Encoding: 16-bit Signed Integer PCM

Transcript file (.tsv):
  <segment_id>\t<text>\n
```

### Data Fields

- segment_id (string) - string id of the segment.
- text (string) - transcription of the segment.

### Data Splits

The dataset has three subsets for each language: train, dev, and test. The train set has two configurations: raw and refined. train_raw contains all the data from train_refined.

#### Transcribed Training Subsets Size

|                      | Thai (hours) | Indonesian (hours) | Vietnamese (hours) |
|:--------------------:|:------------:|:------------------:|:------------------:|
| GigaSpeech 2 raw     | 12901.8      | 8112.9             | 7324.0             |
| GigaSpeech 2 refined | 10262.0      | 5714.0             | 6039.0             |

GigaSpeech 2 raw contains all the data from GigaSpeech 2 refined.
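The transcript layout and the raw/refined relationship described above can be sketched in Python. This is a minimal illustration, not part of the official tooling: the function names `load_transcripts` and `dropped_by_refinement` are hypothetical, and the sketch assumes the `.tsv` files are UTF-8 encoded, have no header row, and hold exactly one `<segment_id>\t<text>` pair per line.

```python
def load_transcripts(tsv_path):
    """Read a `<segment_id>\\t<text>` file into a dict keyed by segment id.

    Assumes UTF-8 text and no header row (both are assumptions of this sketch).
    """
    transcripts = {}
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so the text field is kept whole.
            segment_id, sep, text = line.partition("\t")
            if not sep:
                continue  # skip lines without a tab separator
            transcripts[segment_id] = text
    return transcripts


def dropped_by_refinement(raw, refined):
    """Return the sorted segment ids found in raw but not in refined."""
    return sorted(set(raw) - set(refined))
```

With the directory layout shown earlier, one might call `load_transcripts("data/th/train_raw.tsv")` and `load_transcripts("data/th/train_refined.tsv")`, then pass both results to `dropped_by_refinement` to list the segments that refinement filtered out. Since the raw set is stated to contain all refined data, every refined segment id should also appear in the raw dict.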

#### Transcribed Evaluation Subsets

|                   | Thai (hours) | Indonesian (hours) | Vietnamese (hours) |
|:-----------------:|:------------:|:------------------:|:------------------:|
| GigaSpeech 2 dev  | 10.0         | 10.0               | 10.2               |
| GigaSpeech 2 test | 10.0         | 10.0               | 11.0               |

## Dataset Creation

### Source Data

* GigaSpeech 2 raw: about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese.
* GigaSpeech 2 refined: 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese.
* GigaSpeech 2 DEV & TEST: about 10 hours for DEV and about 10 hours for TEST per language, **transcribed by professional human annotators**, challenging and realistic.

### Annotations

#### Who are the annotators?

The development (DEV) and test (TEST) subsets are annotated by professional human annotators.

### Licensing Information

SpeechColab does not own the copyright of the audio files. For researchers and educators who wish to use the audio files for non-commercial research and/or educational purposes, we can provide access through our site under certain conditions and terms.

In general, when training a machine learning model on a given dataset, the license of the model is **independent** of that of the dataset. That is to say, speech recognition models trained on the GigaSpeech dataset may be eligible for commercial license, provided they abide by the 'Fair Use' terms of the underlying data and do not violate any explicit copyright restrictions. This is likely to be true in most use-cases. However, it is your responsibility to verify the appropriate model license for your specific use-case by confirming that the dataset usage abides by the Fair Use terms. SpeechColab is not responsible for the license of any machine learning model trained on the GigaSpeech dataset.

### Citation Information

Please cite this paper if you find this work useful:

```bibtex
@inproceedings{gigaspeech2,
  title={GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement},
  author={Yifan Yang and Zheshu Song and Jianheng Zhuo and Mingyu Cui and Jinpeng Li and Bo Yang and Yexing Du and Ziyang Ma and Xunying Liu and Ziyuan Wang and Ke Li and Shuai Fan and Kai Yu and Wei-Qiang Zhang and Guoguo Chen and Xie Chen},
  booktitle={Proc. ACL},
  year={2025},
  address={Vienna},
}
```

## Terms of Access

The "Researcher" has requested permission to use the GigaSpeech 2 database (the "Database") at Tsinghua University. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:

1. Researcher shall use the Database only for non-commercial research and educational purposes.
2. The SpeechColab team and Tsinghua University make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
3. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the SpeechColab team and Tsinghua University, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted audio files that he or she may create from the Database.
4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
5. The SpeechColab team and Tsinghua University reserve the right to terminate Researcher's access to the Database at any time.
6. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.


Provider: maas

Created: 2024-06-22
