warisqr007/GAPS

Name: warisqr007/GAPS
Creator: warisqr007
Published: 2026-02-24 19:16:24
License: No description available

Hugging Face · Updated 2026-02-24 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/warisqr007/GAPS

Resource description:

---
dataset_info:
  features:
  - name: original_non_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: parallel_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: golden_speaker_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
  - name: utterance_id
    dtype: string
  splits:
  - name: l2arctic
    num_bytes: 13478053168.228
    num_examples: 26813
  - name: indictts
    num_bytes: 104948748505.534
    num_examples: 149933
  download_size: 116460391826
  dataset_size: 118426801673.762
configs:
- config_name: default
  data_files:
  - split: l2arctic
    path: data/l2arctic-*
  - split: indictts
    path: data/indictts-*
license: cc-by-4.0
task_categories:
- audio-to-audio
- automatic-speech-recognition
- audio-classification
language:
- en
tags:
- Speech
- Accent-Conversion
- golden-speaker
- accented-english
- speech-synthesis
- streaming-accent-conversion
size_categories:
- 100K<n<1M
---

# GAPS: Golden-Aligned Parallel Speech Corpus

## Overview

**GAPS (Golden-Aligned Parallel Speech)** is a multi-corpus dataset designed for **foreign accent conversion**. The dataset provides **parallel speech triplets** consisting of:

- **Original non-native speech**
- **Parallel native speech**
- **Golden speaker speech**: synthetic speech that preserves the non-native speaker's **timbre and timing** (including pauses) while exhibiting **native pronunciation**

along with the corresponding **text transcript**.

GAPS is constructed to support both **offline accent conversion** and **streaming, low-latency pronunciation correction**, and is used in our work on **streaming foreign accent conversion for voice anonymization**.
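
The timing claim above (golden speaker speech keeps the non-native speaker's duration, pauses included) can be spot-checked per sample. The following is a minimal sketch, not part of the dataset tooling: it assumes decoded audio is a mapping with `array` and `sampling_rate` entries (as in the card's usage example), takes its default column names from the `dataset_info` features, and uses an arbitrary 50 ms tolerance. The helper names `duration_seconds` and `timing_preserved` are hypothetical.

```python
from typing import Any, Mapping


def duration_seconds(audio: Mapping[str, Any]) -> float:
    """Duration of one decoded audio entry, in seconds."""
    return len(audio["array"]) / audio["sampling_rate"]


def timing_preserved(
    sample: Mapping[str, Any],
    original_key: str = "original_non_native_audio",  # assumed from dataset_info
    golden_key: str = "golden_speaker_audio",          # assumed from dataset_info
    tolerance_s: float = 0.05,                          # arbitrary 50 ms tolerance
) -> bool:
    """Return True if the golden speaker audio is about as long as the original."""
    original = duration_seconds(sample[original_key])
    golden = duration_seconds(sample[golden_key])
    return abs(original - golden) <= tolerance_s
```

Applied to a loaded row, `timing_preserved(sample)` returns a boolean; a run of `False` results would point to a column-name mismatch or to utterances whose synthesized length drifted.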

---

## Dataset Structure

GAPS is released as a single Hugging Face dataset with **two splits**, corresponding to the source corpora:

- `l2arctic`
- `indictts`

Each split contains the following columns:

| Column name                 | Type   | Description                       |
|-----------------------------|--------|-----------------------------------|
| `original_non_native_audio` | Audio  | Original non-native speech        |
| `parallel_native_audio`     | Audio  | Parallel native speech            |
| `golden_speaker_audio`      | Audio  | Golden speaker speech (synthetic) |
| `transcript`                | string | Text transcription                |
| `speaker_id`                | string | Speaker identifier                |
| `utterance_id`              | string | Utterance identifier              |

All audio is **single-channel, 16 kHz**.

Note: See also [GAPS-nptel](https://huggingface.co/datasets/warisqr007/GAPS-nptel), which extends the same technique to the [NPTEL lecture corpus](https://huggingface.co/datasets/ai4bharat/NPTEL).

---

## Dataset Statistics

| Split     | Speakers | Duration (approx.) |
|-----------|----------|--------------------|
| l2arctic  | 24       | TBD hours          |
| indictts  | 25       | TBD hours          |
| **Total** | 49       | TBD hours          |

*(Statistics will be updated soon.)*

---

## Data Construction

### Source Corpora

GAPS is built on top of two publicly available speech datasets:

- **L2-ARCTIC**: non-native English speech with parallel native references from the CMU Arctic corpus
- **IndicTTS**: Indian-accented English speech

The original datasets are **not redistributed in raw form**. GAPS provides **processed, aligned, and synthesized derivatives**, following the original licenses.

---

### Golden Speaker Generation

Golden speaker utterances are generated **entirely offline** using a **two-stage, reference-free accent conversion pipeline**, redesigned for **duration preservation** and **streaming compatibility**.

For each non-native / native utterance pair:

**1. Content Extraction**

Linguistic content representations are extracted independently from the native and non-native utterances using a speaker-independent content encoder.

**2. Silence-Aware DTW Alignment**

- Voice Activity Detection (VAD) is applied to remove silence regions.
- Dynamic Time Warping (DTW) is performed in the content embedding space.
- Native content embeddings are temporally aligned to the non-native utterance.
- Silence segments are re-inserted to preserve the original non-native timing and rhythm.

**3. Golden Speaker Synthesis**

- Aligned native content embeddings provide **native pronunciation**.
- Non-native speaker embeddings provide **speaker identity (timbre)**.
- Duration and rhythm follow the **non-native utterance**.
- Waveforms are synthesized using a zero-shot voice conversion system and a neural vocoder.

The resulting golden speaker speech differs from the original non-native speech **only in accent**, making it suitable as supervision for pronunciation correction and accent translation.

---

## Intended Use

GAPS is intended for research on:

- Foreign accent conversion (FAC)
- Accent-aware speaker anonymization
- Streaming pronunciation correction
- Accent analysis and evaluation

The dataset is **not intended for commercial use**, unless explicitly permitted under the original licenses.

---

## Example Usage

```python
from datasets import load_dataset

ds = load_dataset("warisqr007/GAPS")

# Access a specific split
sample = ds["l2arctic"][0]

# Audio is decoded on access; the column name follows the `dataset_info` features
audio = sample["original_non_native_audio"]
print(audio["sampling_rate"], audio["array"].shape)
print(sample["transcript"])
```

## Licenses and Usage Terms

Each subset of GAPS follows the same license as its original dataset.

### L2-ARCTIC

- License: **CC BY-NC 4.0**
- Summary: https://creativecommons.org/licenses/by-nc/4.0/
- Full license: https://creativecommons.org/licenses/by-nc/4.0/legalcode

This processed dataset follows the same license.
For any usage not covered by this license, please contact the dataset authors and **cite the L2-ARCTIC paper**.

### IndicTTS

- License: **CC BY-NC 4.0**
- Dataset: https://www.iitm.ac.in/donlab/indictts/database

This processed dataset follows the same license.
For any usage not covered by this license, please contact the dataset authors and **cite the IndicTTS paper**.

## Citation

If you use GAPS in your research, please cite:

### GAPS (this dataset)

```bibtex
@article{gaps2026,
  title   = {GAPS: Golden-Aligned Parallel Speech Corpus for Accent Conversion and Anonymization},
  author  = {TBD},
  journal = {TBD},
  year    = {2026}
}
```

*(Placeholder; update once the paper is public.)*

### L2-ARCTIC

```bibtex
@inproceedings{zhao2018l2,
  title={L2-ARCTIC: A Non-native English Speech Corpus},
  author={Zhao, Guanlong and Sonsaat, Sinem and Silpachai, Alif and Lucic, Ivana and Chukharev-Hudilainen, Evgeny and Levis, John and Gutierrez-Osuna, Ricardo},
  booktitle={Proc. Interspeech},
  pages={2783--2787},
  year={2018}
}
```

### IndicTTS

```bibtex
@inproceedings{baby2016resources,
  title={Resources for Indian languages},
  author={Baby, A. and Thomas, A. L. and N. N. L and Murthy, H. A.},
  booktitle={Community-based Building of Language Resources (TSD)},
  pages={37--43},
  year={2016}
}
```

### CMU Arctic

```bibtex
@inproceedings{kominek2004cmu,
  title={The CMU Arctic speech databases},
  author={Kominek, John and Black, Alan W},
  booktitle={SSW},
  pages={223--224},
  year={2004}
}
```
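
---

The silence-aware DTW alignment described under Golden Speaker Generation can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the Euclidean frame distance, the frame-level boolean VAD masks, the `None` silence placeholder, and the function names `frame_distance`, `dtw_path`, and `silence_aware_align` are all hypothetical. The card does not specify the content encoder, the distance metric, or the VAD.

```python
import math
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two content embeddings (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(src: List[Frame], tgt: List[Frame]) -> List[Tuple[int, int]]:
    """Standard DTW; returns the optimal (src_index, tgt_index) warping path."""
    n, m = len(src), len(tgt)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = frame_distance(src[i - 1], tgt[j - 1]) + best_prev
    # Backtrack from the end of both sequences to the start.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


def silence_aware_align(
    native: List[Frame],
    native_voiced: List[bool],
    non_native: List[Frame],
    non_native_voiced: List[bool],
) -> List[Optional[Frame]]:
    """Warp native frames onto the non-native timeline, keeping its pauses."""
    # 1. Drop silent frames from both utterances (VAD masks are assumed given).
    nat = [f for f, v in zip(native, native_voiced) if v]
    voiced_idx = [t for t, v in enumerate(non_native_voiced) if v]
    non = [non_native[t] for t in voiced_idx]
    # 2. Start from an all-silence timeline as long as the non-native utterance.
    aligned: List[Optional[Frame]] = [None] * len(non_native)
    # 3. Fill voiced positions with DTW-aligned native frames. When several
    #    native frames map to one non-native frame, the last one on the path wins.
    for nat_i, non_j in dtw_path(nat, non):
        aligned[voiced_idx[non_j]] = nat[nat_i]
    return aligned


# Toy example: 1-D "embeddings", with one silent frame in the non-native utterance.
native = [[0.0], [1.0], [2.0]]
non_native = [[0.0], [0.1], [9.0], [1.0], [2.1], [2.0]]
non_native_voiced = [True, True, False, True, True, True]
aligned = silence_aware_align(native, [True] * 3, non_native, non_native_voiced)
print(aligned)  # [[0.0], [0.0], None, [1.0], [2.0], [2.0]]
```

The output has one entry per non-native frame, so duration and pauses follow the non-native utterance while the voiced positions carry native content, which is the property the synthesis stage relies on.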

Provider:

warisqr007
