warisqr007/GAPS
Source: Hugging Face. Updated 2026-02-24; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/warisqr007/GAPS
Resource description:
---
dataset_info:
  features:
  - name: original_non_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: parallel_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: golden_speaker_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
  - name: utterance_id
    dtype: string
  splits:
  - name: l2arctic
    num_bytes: 13478053168.228
    num_examples: 26813
  - name: indictts
    num_bytes: 104948748505.534
    num_examples: 149933
  download_size: 116460391826
  dataset_size: 118426801673.762
configs:
- config_name: default
  data_files:
  - split: l2arctic
    path: data/l2arctic-*
  - split: indictts
    path: data/indictts-*
license: cc-by-4.0
task_categories:
- audio-to-audio
- automatic-speech-recognition
- audio-classification
language:
- en
tags:
- Speech
- Accent-Conversion
- golden-speaker
- accented-english
- speech-synthesis
- streaming-accent-conversion
size_categories:
- 100K<n<1M
---
# GAPS: Golden-Aligned Parallel Speech Corpus
## Overview
**GAPS (Golden-Aligned Parallel Speech)** is a multi-corpus dataset designed for **foreign accent conversion**.
Each example is a **parallel speech triplet** paired with its **text transcript**. The triplet consists of:
- **Original non-native speech**
- **Parallel native speech**
- **Golden speaker speech**: synthetic speech that preserves the non-native speaker's **timbre and timing** (including pauses) while exhibiting **native pronunciation**
GAPS is constructed to support both **offline accent conversion** and **streaming, low-latency pronunciation correction**, and is used in our work on **streaming foreign accent conversion for voice anonymization**.
---
## Dataset Structure
GAPS is released as a single Hugging Face dataset with **two splits**, corresponding to the source corpora:
- `l2arctic`
- `indictts`
Each split contains the following columns:
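| Column name                 | Type   | Description                       |
|-----------------------------|--------|-----------------------------------|
| `original_non_native_audio` | Audio  | Original non-native speech        |
| `parallel_native_audio`     | Audio  | Parallel native speech            |
| `golden_speaker_audio`      | Audio  | Golden speaker speech (synthetic) |
| `transcript`                | string | Text transcription                |
| `speaker_id`                | string | Speaker identifier                |
| `utterance_id`              | string | Utterance identifier              |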
All audio is **single-channel, 16 kHz**.
Note: See also **GAPS-nptel** (https://huggingface.co/datasets/warisqr007/GAPS-nptel), which extends the same technique to the NPTEL lecture corpus (https://huggingface.co/datasets/ai4bharat/NPTEL).
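The schema described above can be checked per sample with a small helper. This is a minimal sketch, not part of the dataset tooling: the column names come from the dataset metadata, while `check_sample` and the assumption that each audio value exposes `"array"` and `"sampling_rate"` keys (as in the usage example in this card) are illustrative.

```python
EXPECTED_RATE = 16_000
AUDIO_COLUMNS = (
    "original_non_native_audio",
    "parallel_native_audio",
    "golden_speaker_audio",
)
TEXT_COLUMNS = ("transcript", "speaker_id", "utterance_id")


def check_sample(sample):
    """Return a list of problems found in one sample; an empty list means it looks fine."""
    problems = []
    for name in AUDIO_COLUMNS:
        audio = sample.get(name)
        if audio is None:
            problems.append(f"missing audio column: {name}")
            continue
        if audio["sampling_rate"] != EXPECTED_RATE:
            problems.append(f"{name}: sampling rate is {audio['sampling_rate']}")
        array = audio["array"]
        if len(array) == 0:
            problems.append(f"{name}: empty waveform")
        elif hasattr(array[0], "__len__"):
            # A mono waveform is one-dimensional, so its elements are scalars.
            problems.append(f"{name}: waveform is not single-channel")
    for name in TEXT_COLUMNS:
        if not isinstance(sample.get(name), str):
            problems.append(f"missing or non-string column: {name}")
    return problems
```

Calling `check_sample(ds["l2arctic"][0])` on a loaded split would return an empty list if the sample matches the documented layout.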
---
## Dataset Statistics
| Split | Speakers | Duration (approx.) |
|-----------|----------|--------------------|
| l2arctic | 24 | TBD hours |
| indictts | 25 | TBD hours |
| **Total** | 49 | TBD hours |
*(Statistics will be updated soon.)*
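Until the figures above are published, speaker counts and durations can be estimated directly from the data. The following is a rough sketch under stated assumptions: `split_statistics` is a hypothetical helper, it measures the `original_non_native_audio` column only, and it decodes every utterance it is given, which is slow on a full split.

```python
def split_statistics(samples, column="original_non_native_audio"):
    """Count speakers and utterances and sum the audio duration of an iterable of samples."""
    speakers = set()
    utterances = 0
    total_seconds = 0.0
    for sample in samples:
        audio = sample[column]
        # Duration of one utterance = number of waveform samples / sampling rate.
        total_seconds += len(audio["array"]) / audio["sampling_rate"]
        speakers.add(sample["speaker_id"])
        utterances += 1
    return {
        "speakers": len(speakers),
        "utterances": utterances,
        "hours": total_seconds / 3600.0,
    }
```

Running it on a subset first, for example a few hundred utterances, gives a quick sanity check before committing to a pass over a whole split.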
---
## Data Construction
### Source Corpora
GAPS is built on top of two publicly available speech datasets:
- **L2-ARCTIC**: non-native English speech with parallel native references from the CMU ARCTIC corpus
- **IndicTTS**: Indian-accented English speech
The original datasets are **not redistributed in raw form**.
GAPS provides **processed, aligned, and synthesized derivatives**, following the original licenses.
---
### Golden Speaker Generation
Golden speaker utterances are generated **entirely offline** using a **two-stage, reference-free accent conversion pipeline**, redesigned for **duration preservation** and **streaming compatibility**.
For each non-native / native utterance pair:
**1. Content Extraction**
Linguistic content representations are extracted independently from the native and non-native utterances using a speaker-independent content encoder.
**2. Silence-Aware DTW Alignment**
- Voice Activity Detection (VAD) is applied to remove silence regions.
- Dynamic Time Warping (DTW) is performed in the content embedding space.
- Native content embeddings are temporally aligned to the non-native utterance.
- Silence segments are re-inserted to preserve the original non-native timing and rhythm.
**3. Golden Speaker Synthesis**
- Aligned native content embeddings provide **native pronunciation**.
- Non-native speaker embeddings provide **speaker identity (timbre)**.
- Duration and rhythm follow the **non-native utterance**.
- Waveforms are synthesized using a zero-shot voice conversion system and a neural vocoder.
The resulting golden speaker speech differs from the original non-native speech **only in accent**, making it suitable as supervision for pronunciation correction and accent translation.
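The silence-aware alignment in step 2 can be illustrated in plain Python. This is a minimal sketch, not the pipeline used to build GAPS: the cosine distance, the boolean per-frame speech masks standing in for VAD output, the `silence_frame` placeholder, the rule of keeping the last matched native frame, and all function names are assumptions made for the example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two frame embeddings (1.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dtw_path(src, tgt):
    """Return the minimum-cost monotonic alignment as (src_index, tgt_index) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def align_to_non_native(native, native_is_speech, non_native, non_native_is_speech, silence_frame):
    """Warp native content frames onto the non-native timeline, keeping its silences."""
    # 1) Drop silence frames from both utterances.
    nat = [f for f, s in zip(native, native_is_speech) if s]
    non = [f for f, s in zip(non_native, non_native_is_speech) if s]
    if not nat or not non:
        raise ValueError("both utterances need at least one speech frame")
    # 2) DTW in the content embedding space.
    chosen = {}
    for i, j in dtw_path(nat, non):
        chosen[j] = i  # keep the last native frame matched to non-native frame j
    warped = [nat[chosen[j]] for j in range(len(non))]
    # 3) Re-insert silence where the non-native utterance was silent.
    out, k = [], 0
    for is_speech in non_native_is_speech:
        if is_speech:
            out.append(warped[k])
            k += 1
        else:
            out.append(silence_frame)
    return out
```

The output always has as many frames as the non-native utterance, which is the property that lets the synthesized golden speaker audio follow the non-native timing and pauses.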
---
## Intended Use
GAPS is intended for research on:
- Foreign accent conversion (FAC)
- Accent-aware speaker anonymization
- Streaming pronunciation correction
- Accent analysis and evaluation
The dataset is **not intended for commercial use**, unless explicitly permitted under the original licenses.
---
## Example Usage
```python
from datasets import load_dataset

# Loads both splits; the full download is roughly 116 GB
ds = load_dataset("warisqr007/GAPS")

# Access a specific split
sample = ds["l2arctic"][0]

# Audio is decoded when the column is accessed
audio = sample["original_non_native_audio"]
print(audio["sampling_rate"], audio["array"].shape)
print(sample["transcript"])
```
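Because the full dataset is over 100 GB, it can be convenient to stream a few samples instead of downloading everything. The sketch below assumes the `streaming=True` mode of `datasets.load_dataset`; `stream_split` is a hypothetical helper name, and the exact type returned for audio columns may vary with the installed `datasets` version.

```python
def stream_split(split="l2arctic", limit=3):
    """Yield up to `limit` samples from one split without a full download."""
    # Imported here so the helper can be defined even if `datasets` is not installed.
    from datasets import load_dataset

    ds = load_dataset("warisqr007/GAPS", split=split, streaming=True)
    for sample in ds.take(limit):
        yield sample


# Example (requires network access):
# for sample in stream_split("indictts", limit=2):
#     print(sample["utterance_id"], sample["transcript"])
```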
## Licenses and Usage Terms
Each subset of GAPS follows the same license as its original dataset.
### L2-ARCTIC
- License: **CC BY-NC 4.0**
- Summary: https://creativecommons.org/licenses/by-nc/4.0/
- Full license: https://creativecommons.org/licenses/by-nc/4.0/legalcode
This processed dataset follows the same license.
For any usage not covered by this license, please contact the dataset authors and **cite the L2-ARCTIC paper**.
### IndicTTS
- License: **CC BY-NC 4.0**
- Dataset: https://www.iitm.ac.in/donlab/indictts/database
This processed dataset follows the same license.
For any usage not covered by this license, please contact the dataset authors and **cite the IndicTTS paper**.
## Citation
If you use GAPS in your research, please cite:
### GAPS (this dataset)
```bibtex
@article{gaps2026,
title = {GAPS: Golden-Aligned Parallel Speech Corpus for Accent Conversion and Anonymization},
author = {TBD},
journal = {TBD},
year = {2026}
}
```
*(Placeholder — update once the paper is public.)*
### L2-ARCTIC
```bibtex
@inproceedings{zhao2018l2,
title={L2-ARCTIC: A Non-native English Speech Corpus},
author={Zhao, Guanlong and Sonsaat, Sinem and Silpachai, Alif and Lucic, Ivana and Chukharev-Hudilainen, Evgeny and Levis, John and Gutierrez-Osuna, Ricardo},
booktitle={Proc. Interspeech},
pages={2783--2787},
year={2018}
}
```
### IndicTTS
```bibtex
@inproceedings{baby2016resources,
title={Resources for Indian languages},
author={Baby, A. and Thomas, A. L. and N. N. L and Murthy, H. A.},
booktitle={Community-based Building of Language Resources (TSD)},
pages={37--43},
year={2016}
}
```
### CMU Arctic
```bibtex
@inproceedings{kominek2004cmu,
title={The CMU Arctic speech databases},
author={Kominek, John and Black, Alan W},
booktitle={SSW},
pages={223--224},
year={2004}
}
```
Provider: warisqr007



