warisqr007/GAPS-nptel

Name: warisqr007/GAPS-nptel
Creator: warisqr007
Published: 2026-03-31 04:06:53
License: No description available

Hugging Face · Updated 2026-03-31 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/warisqr007/GAPS-nptel


Resource overview:

---
dataset_info:
  features:
  - name: original_non_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: parallel_native_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: golden_speaker_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
  - name: utterance_id
    dtype: string
  splits:
  - name: train
    num_bytes: 1462052810048.54
    num_examples: 1423460
  download_size: 1499627567221
  dataset_size: 1462052810048.54
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- audio-to-audio
- automatic-speech-recognition
- audio-classification
language:
- en
tags:
- Accent-conversion
- streaming-accent-conversion
- speech-synthesis
- speech
- accents
- golden-speaker
- accented-english
size_categories:
- 100K<n<1M
---

# GAPS: Golden-Aligned Parallel Speech Corpus

## Overview

**GAPS (Golden-Aligned Parallel Speech)** is a multi-corpus dataset designed for **foreign accent conversion**. The dataset provides **parallel speech triplets** consisting of:

- **Original non-native speech**
- **Parallel native speech**
- **Golden speaker speech**: synthetic speech that preserves the non-native speaker’s **timbre and timing** (including pauses) while exhibiting **native pronunciation**

along with the corresponding **text transcript**.

GAPS is constructed to support both **offline accent conversion** and **streaming, low-latency pronunciation correction**, and is used in our work on **streaming foreign accent conversion for voice anonymization**.

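Because the golden speaker audio is described as preserving the non-native speaker's timing, one quick sanity check on a loaded triplet is to compare durations. The following is a minimal sketch, not part of the dataset tooling: it assumes each audio field decodes to a mapping with `array` and `sampling_rate` keys and uses the column names from the card metadata. The helper names `duration_seconds` and `timing_preserved` and the 50 ms tolerance are hypothetical choices made for illustration.

```python
def duration_seconds(audio):
    """Duration of one decoded audio field, in seconds."""
    return len(audio["array"]) / audio["sampling_rate"]


def timing_preserved(sample, tolerance_s=0.05):
    """True if golden-speaker and original durations differ by at most tolerance_s.

    The 50 ms default is an arbitrary illustrative threshold, not a value
    taken from the dataset card.
    """
    original = duration_seconds(sample["original_non_native_audio"])
    golden = duration_seconds(sample["golden_speaker_audio"])
    return abs(original - golden) <= tolerance_s


# Synthetic stand-in for one dataset row (all-zero audio), used only to
# exercise the helper without downloading anything.
fake_sample = {
    "original_non_native_audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "golden_speaker_audio": {"array": [0.0] * 32160, "sampling_rate": 16000},
}
print(timing_preserved(fake_sample))
```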

---

## Dataset Structure

This repository extends [**GAPS**](https://huggingface.co/datasets/warisqr007/GAPS) to the NPTEL lecture corpus.

The dataset contains the following main columns (names as listed in the card metadata):

| Column name                 | Type   | Description |
|-----------------------------|--------|-------------|
| `original_non_native_audio` | Audio  | Original non-native speech |
| `parallel_native_audio`     | Audio  | Parallel native speech |
| `golden_speaker_audio`      | Audio  | Golden speaker speech (synthetic) |
| `transcript`                | string | Text transcription |

All audio is **single-channel, 16 kHz**.

---

## Dataset Statistics

| Speakers | Duration (approx.) |
|----------|--------------------|
| TBD      | TBD hours          |

*(Statistics will be updated soon.)*

---

## Data Construction

### Source Corpora

GAPS-nptel extends GAPS to include **NPTEL (BhasaAnuvaad)**, which contains lecture speech from Indian English speakers.

The original datasets are **not redistributed in raw form**. GAPS provides **processed, aligned, and synthesized derivatives**, following the original licenses.

---

### Golden Speaker Generation

Golden speaker utterances are generated **entirely offline** using a **two-stage, reference-free accent conversion pipeline**, redesigned for **duration preservation** and **streaming compatibility**.

For each non-native / native utterance pair:

**1. Content Extraction**

Linguistic content representations are extracted independently from the native and non-native utterances using a speaker-independent content encoder.

**2. Silence-Aware DTW Alignment**

- Voice Activity Detection (VAD) is applied to remove silence regions.
- Dynamic Time Warping (DTW) is performed in the content embedding space.
- Native content embeddings are temporally aligned to the non-native utterance.
- Silence segments are re-inserted to preserve the original non-native timing and rhythm.

**3. Golden Speaker Synthesis**

- Aligned native content embeddings provide **native pronunciation**.
- Non-native speaker embeddings provide **speaker identity (timbre)**.
- Duration and rhythm follow the **non-native utterance**.
- Waveforms are synthesized using a zero-shot voice conversion system and a neural vocoder.

The resulting golden speaker speech differs from the original non-native speech **only in accent**, making it suitable as supervision for pronunciation correction and accent translation.

---

## Intended Use

GAPS is intended for research on:

- Foreign accent conversion (FAC)
- Accent-aware speaker anonymization
- Streaming pronunciation correction
- Accent analysis and evaluation

The dataset is **not intended for commercial use**, unless explicitly permitted under the original licenses.

---

## Example Usage

```python
from datasets import load_dataset

# Without `split`, load_dataset returns a DatasetDict, so request the train split.
# The download is roughly 1.5 TB; for a quick look, pass streaming=True
# and read the first row with next(iter(ds)) instead of ds[0].
ds = load_dataset("warisqr007/GAPS-nptel", split="train")

sample = ds[0]

# Audio is decoded lazily on access.
# Column names follow the dataset_info features in the card metadata.
audio = sample["original_non_native_audio"]
print(audio["sampling_rate"], audio["array"].shape)
print(sample["transcript"])
```

## Licenses and Usage Terms

Each subset of GAPS follows the same license as its original dataset.

### NPTEL / BhasaAnuvaad

- License: **CC BY-NC 4.0**
- Summary: https://creativecommons.org/licenses/by-nc/4.0/
- Full license: https://creativecommons.org/licenses/by-nc/4.0/legalcode
- Hugging Face dataset: https://huggingface.co/datasets/ai4bharat/NPTEL

This processed dataset follows the same license. For any usage not covered by this license, please contact the dataset authors and **cite the BhasaAnuvaad paper**.

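## Illustrative Sketch: Silence-Aware DTW Alignment

The Silence-Aware DTW Alignment step described under Golden Speaker Generation can be illustrated with a toy example. The card does not specify the content encoder, the VAD, the distance measure, or the DTW variant, so everything below is an assumption made for illustration: frames are plain Python lists, the speech/silence masks are given as inputs, the distance is cosine distance, and `None` stands in for a re-inserted silence frame. The function names are hypothetical, and the sketch assumes both utterances contain at least one speech frame.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two frame embeddings (1.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def dtw_path(src, tgt):
    """Textbook DTW; returns the (src_idx, tgt_idx) pairs on a minimum-cost path."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the last cell to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return path[::-1]


def align_native_to_non_native(native, native_is_speech, non_native, non_native_is_speech):
    """Return one native frame (or None for silence) per non-native frame."""
    # 1. Drop silence frames from both utterances using the given VAD masks.
    nat = [f for f, s in zip(native, native_is_speech) if s]
    non_idx = [k for k, s in enumerate(non_native_is_speech) if s]
    non = [non_native[k] for k in non_idx]
    # 2. DTW in the embedding space; keep one native frame per non-native frame
    #    (the last match on the path wins, a simple arbitrary choice).
    match = {}
    for nat_i, non_j in dtw_path(nat, non):
        match[non_j] = nat_i
    # 3. Re-insert silence so the output follows the non-native timing.
    aligned = [None] * len(non_native)
    for j, k in enumerate(non_idx):
        aligned[k] = nat[match[j]]
    return aligned


# Toy 2-D "embeddings": the non-native speaker holds the first sound longer
# and pauses in different places than the native speaker.
native = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
native_is_speech = [True, False, True]
non_native = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
non_native_is_speech = [False, True, True, False, True]

aligned = align_native_to_non_native(native, native_is_speech, non_native, non_native_is_speech)
print(aligned)
```

The output has exactly one entry per non-native frame, which is the property the pipeline relies on for duration preservation: the native content is stretched or compressed to fit the non-native speech regions, and the non-native pauses stay where they were.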
## Citation

If you use GAPS in your research, please cite:

### GAPS-NPTEL (this dataset)

```bibtex
@misc{quamer2026phonos,
  title={PHONOS: PHOnetic Neutralization for Online Streaming Applications},
  author={Waris Quamer and Mu-Ruei Tseng and Ghady Nasrallah and Ricardo Gutierrez-Osuna},
  year={2026},
  eprint={2603.27001},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.27001},
}
```

### NPTEL / BhasaAnuvaad

```bibtex
@article{jain2024bhasaanuvaad,
  title   = {BhasaAnuvaad: A Speech Translation Dataset for 14 Indian Languages},
  author  = {Jain, Sparsh and Sankar, Ashwin and Choudhary, Devilal and Suman, Dhairya and Narasimhan, Nikhil and Khan, Mohammed Safi Ur Rahman and Kunchukuttan, Anoop and Khapra, Mitesh M and Dabre, Raj},
  journal = {arXiv preprint arXiv:2411.04699},
  year    = {2024}
}
```
