warisqr007/GAPS-nptel
Source: Hugging Face (updated 2026-03-31, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/warisqr007/GAPS-nptel
---
dataset_info:
  features:
  - name: original_non_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: parallel_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: golden_speaker_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
  - name: utterance_id
    dtype: string
  splits:
  - name: train
    num_bytes: 1462052810048.54
    num_examples: 1423460
  download_size: 1499627567221
  dataset_size: 1462052810048.54
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- audio-to-audio
- automatic-speech-recognition
- audio-classification
language:
- en
tags:
- Accent-conversion
- streaming-accent-conversion
- speech-synthesis
- speech
- accents
- golden-speaker
- accented-english
size_categories:
- 1M<n<10M
---
# GAPS: Golden-Aligned Parallel Speech Corpus
## Overview
**GAPS (Golden-Aligned Parallel Speech)** is a multi-corpus dataset designed for **foreign accent conversion**.
The dataset provides **parallel speech triplets** consisting of:
- **Original non-native speech**
- **Parallel native speech**
- **Golden speaker speech** — synthetic speech that preserves the non-native speaker’s **timbre and timing** (including pauses) while exhibiting **native pronunciation**
along with the corresponding **text transcript**.
GAPS is constructed to support both **offline accent conversion** and **streaming, low-latency pronunciation correction**, and is used in our work on **streaming foreign accent conversion for voice anonymization**.
---
## Dataset Structure
This repository extends [**GAPS**](https://huggingface.co/datasets/warisqr007/GAPS) to the NPTEL lecture corpus.
The dataset contains the following main columns:
| Column name | Type | Description |
|-----------------------------|--------|-------------|
| `original_non_native_audio` | Audio | Original non-native speech |
| `parallel_native_audio` | Audio | Parallel native speech |
| `golden_speaker_audio` | Audio | Golden speaker speech (synthetic) |
| `transcript` | string | Text transcription |
| `speaker_id` | string | Speaker identifier |
| `utterance_id` | string | Utterance identifier |
All audio is **single-channel, 16 kHz**.
---
## Dataset Statistics
| Speakers | Duration (approx.) |
|----------|--------------------|
| TBD | TBD hours |
*(Statistics will be updated soon.)*
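Until the table is filled in, the two figures can be estimated from the data itself. The sketch below is only an illustration: `corpus_stats` is a hypothetical helper, and it assumes each sample exposes the `speaker_id` column and an audio column that decodes to a dict with `array` and `sampling_rate` keys.

```python
def corpus_stats(samples, audio_column="original_non_native_audio"):
    """Count utterances, distinct speakers, and total hours over an iterable of samples."""
    speakers = set()
    total_seconds = 0.0
    utterances = 0
    for sample in samples:
        audio = sample[audio_column]
        speakers.add(sample["speaker_id"])
        # Duration in seconds = number of samples / sampling rate.
        total_seconds += len(audio["array"]) / audio["sampling_rate"]
        utterances += 1
    return {
        "utterances": utterances,
        "speakers": len(speakers),
        "hours": total_seconds / 3600.0,
    }
```

Because the full corpus is large, it is probably more practical to pass a bounded slice of a streamed split (for example via `itertools.islice`) and treat the result as a rough estimate.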
---
## Data Construction
### Source Corpora
GAPS-nptel extends GAPS to include **NPTEL (BhasaAnuvaad)**, which contains lecture speech from Indian English speakers.
The original datasets are **not redistributed in raw form**.
GAPS provides **processed, aligned, and synthesized derivatives**, following the original licenses.
---
### Golden Speaker Generation
Golden speaker utterances are generated **entirely offline** using a **two-stage, reference-free accent conversion pipeline**, redesigned for **duration preservation** and **streaming compatibility**.
For each non-native / native utterance pair:
**1. Content Extraction**
Linguistic content representations are extracted independently from the native and non-native utterances using a speaker-independent content encoder.
**2. Silence-Aware DTW Alignment**
- Voice Activity Detection (VAD) is applied to remove silence regions.
- Dynamic Time Warping (DTW) is performed in the content embedding space.
- Native content embeddings are temporally aligned to the non-native utterance.
- Silence segments are re-inserted to preserve the original non-native timing and rhythm.
**3. Golden Speaker Synthesis**
- Aligned native content embeddings provide **native pronunciation**.
- Non-native speaker embeddings provide **speaker identity (timbre)**.
- Duration and rhythm follow the **non-native utterance**.
- Waveforms are synthesized using a zero-shot voice conversion system and neural vocoder.
The resulting golden speaker speech differs from the original non-native speech **only in accent**, making it suitable as supervision for pronunciation correction and accent translation.
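The silence-aware alignment in step 2 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the card does not name the content encoder, the VAD, or the distance metric, so the sketch takes precomputed frame embeddings and boolean speech masks as inputs, and it assumes cosine distance, averaging of native frames that map to the same non-native frame, and zero vectors for re-inserted silence. The function names `dtw_path` and `align_native_to_nonnative` are hypothetical.

```python
import numpy as np


def dtw_path(cost):
    """Return the minimum-cost monotonic path through a (rows, cols) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    # Backtrack from the last cell to the first cell.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda cell: acc[cell])
        path.append((i - 1, j - 1))
    return path[::-1]


def align_native_to_nonnative(native, nonnative, native_speech, nonnative_speech):
    """Warp native content embeddings onto the non-native timeline.

    native, nonnative: (frames, dim) content embeddings.
    native_speech, nonnative_speech: boolean masks, True for speech frames.
    Returns an array with one row per non-native frame.
    """
    # Drop silence frames before alignment.
    nat = native[native_speech]
    non = nonnative[nonnative_speech]

    # Cosine distance between every non-native and native speech frame (assumption).
    nat_unit = nat / (np.linalg.norm(nat, axis=1, keepdims=True) + 1e-8)
    non_unit = non / (np.linalg.norm(non, axis=1, keepdims=True) + 1e-8)
    cost = 1.0 - non_unit @ nat_unit.T

    # Average the native frames matched to each non-native speech frame.
    warped = np.zeros((len(non), nat.shape[1]))
    counts = np.zeros(len(non))
    for i, j in dtw_path(cost):
        warped[i] += nat[j]
        counts[i] += 1
    warped /= counts[:, None]

    # Re-insert silence at the original non-native positions (zeros are an assumption).
    aligned = np.zeros((len(nonnative), native.shape[1]))
    aligned[nonnative_speech] = warped
    return aligned
```

The output has exactly as many frames as the non-native utterance, which is what lets the synthesized golden speaker audio keep the original pauses and rhythm.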
---
## Intended Use
GAPS is intended for research on:
- Foreign accent conversion (FAC)
- Accent-aware speaker anonymization
- Streaming pronunciation correction
- Accent analysis and evaluation
The dataset is **not intended for commercial use**, unless explicitly permitted under the original licenses.
---
## Example Usage
```python
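from datasets import load_dataset

# The corpus is roughly 1.5 TB, so stream the train split instead of downloading it all
ds = load_dataset("warisqr007/GAPS-nptel", split="train", streaming=True)

# Take the first example from the stream
sample = next(iter(ds))

# Audio is decoded when the column is accessed
audio = sample["original_non_native_audio"]
print(audio["sampling_rate"], audio["array"].shape)
print(sample["transcript"])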
```
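For accent conversion training, a sample can be turned into a source/target pair with the non-native audio as input and the golden speaker audio as supervision. The helper below is a hedged sketch: `make_fac_pair` and its `tolerance_s` threshold are hypothetical, and it assumes each audio column decodes to a dict with `array` and `sampling_rate` keys. Since the golden speaker audio is described as duration-preserving, the helper rejects pairs whose lengths differ by more than the tolerance and trims the rest to a common length.

```python
import numpy as np


def make_fac_pair(sample, tolerance_s=0.05):
    """Return (source, target, sampling_rate) for one accent-conversion training pair."""
    src = sample["original_non_native_audio"]
    tgt = sample["golden_speaker_audio"]
    if src["sampling_rate"] != tgt["sampling_rate"]:
        raise ValueError("source and target sampling rates differ")
    sr = src["sampling_rate"]

    source = np.asarray(src["array"], dtype=np.float32)
    target = np.asarray(tgt["array"], dtype=np.float32)

    # The two signals should be time-aligned; flag pairs that drift apart.
    if abs(len(source) - len(target)) / sr > tolerance_s:
        raise ValueError("source and target durations differ beyond the tolerance")

    # Trim both to the shorter length so they can be compared frame by frame.
    n = min(len(source), len(target))
    return source[:n], target[:n], sr
```

Calling `make_fac_pair(sample)` on a sample loaded as in the example above would yield two equal-length float32 arrays and the shared sampling rate.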
## Licenses and Usage Terms
Each subset of GAPS follows the same license as its original dataset.
### NPTEL / BhasaAnuvaad
- License: **CC BY-NC 4.0**
- Summary: https://creativecommons.org/licenses/by-nc/4.0/
- Full license: https://creativecommons.org/licenses/by-nc/4.0/legalcode
- Hugging Face dataset: https://huggingface.co/datasets/ai4bharat/NPTEL
This processed dataset follows the same license.
For any usage not covered by this license, please contact the dataset authors and **cite the BhasaAnuvaad paper**.
## Citation
If you use GAPS in your research, please cite:
### GAPS-NPTEL (this dataset)
```bibtex
@misc{quamer2026phonos,
title={PHONOS: PHOnetic Neutralization for Online Streaming Applications},
author={Waris Quamer and Mu-Ruei Tseng and Ghady Nasrallah and Ricardo Gutierrez-Osuna},
year={2026},
eprint={2603.27001},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2603.27001},
}
```
### NPTEL / BhasaAnuvaad
```bibtex
@article{jain2024bhasaanuvaad,
title = {BhasaAnuvaad: A Speech Translation Dataset for 14 Indian Languages},
author = {Jain, Sparsh and Sankar, Ashwin and Choudhary, Devilal and Suman, Dhairya and Narasimhan, Nikhil and Khan, Mohammed Safi Ur Rahman and Kunchukuttan, Anoop and Khapra, Mitesh M and Dabre, Raj},
journal = {arXiv preprint arXiv:2411.04699},
year = {2024}
}
```
Provided by: warisqr007