nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset
---
license: cc-by-nc-4.0
language:
- te
pretty_name: Chaganti Voice with CC Dataset
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- telugu
- speech
- chaganti
- tts
- asr
- pravachanam
- voice-cloning
size_categories:
- n<1K
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 4407288414
    num_examples: 778
  download_size: 4362087261
  dataset_size: 4407288414
configs:
- config_name: default
  data_files:
  - split: train
    path: train-*.parquet
---
# Dataset Card for Chaganti Voice with CC Dataset
A curated Telugu speech dataset of Sri Chaganti Koteswara Rao's spiritual discourses (*pravachanams*), paired with auto-generated transcriptions. Built for training or fine-tuning **ASR** and **TTS / Voice Cloning** models in the Telugu language.
## Dataset Details
### Dataset Description
This **4.36 GB** dataset contains high-quality audio clips of Sri Chaganti Koteswara Rao's Telugu spiritual discourses, chunked and aligned with YouTube's auto-generated captions. Silence has been trimmed from each clip so that the data can be fed directly into ML training pipelines.
- **Curated by:** [nikhilsaipagidimarri](https://huggingface.co/nikhilsaipagidimarri)
- **Funded by:** N/A
- **Shared by:** nikhilsaipagidimarri
- **Language(s) (NLP):** Telugu (`te`)
- **License:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
- **Total Audio Duration:** over 5 hours
- **Total Size:** 4.36 GB (778 instances)
### Dataset Sources
- **Repository:** [nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset](https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset)
- **Paper:** N/A
- **Demo:** N/A
## Uses
### Direct Use
This dataset is intended for:
- Training and fine-tuning **Telugu ASR** (Automatic Speech Recognition) models
- Training and fine-tuning **Telugu TTS** (Text-to-Speech) models
- **Voice cloning** research using a single high-quality Telugu speaker
- Academic research on low-resource Telugu speech
### Out-of-Scope Use
- **Commercial use** of any kind — prohibited under CC BY-NC 4.0
- Multi-speaker speech tasks — this dataset contains a single speaker only
- Speaker diarization or speaker verification tasks
- Any use that violates [YouTube's Terms of Service](https://www.youtube.com/t/terms)
## Dataset Structure
This dataset is stored as **Parquet shards**, so you do not need to download raw `.wav` files manually. The Hugging Face `datasets` library decodes the audio on access and returns each clip as a NumPy array ready for model training.
### File Layout
```text
data/
  train-00000-of-00020.parquet
  train-00001-of-00020.parquet
  train-00002-of-00020.parquet
  train-00003-of-00020.parquet
  train-00004-of-00020.parquet
  train-00005-of-00020.parquet
  train-00006-of-00020.parquet
  train-00007-of-00020.parquet
  train-00008-of-00020.parquet
  ...
  train-00019-of-00020.parquet
```
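If only part of the data is needed, individual shards can be requested by name. The sketch below assumes the `data/train-XXXXX-of-00020.parquet` naming shown above; `shard_paths` and `load_first_shards` are hypothetical helper names, not part of the dataset or of the `datasets` library.

```python
def shard_paths(num_shards: int = 20, prefix: str = "data/train") -> list[str]:
    """Build the shard file names, assuming the layout listed above."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}.parquet" for i in range(num_shards)]


paths = shard_paths()
print(paths[0])   # data/train-00000-of-00020.parquet
print(paths[-1])  # data/train-00019-of-00020.parquet


def load_first_shards(repo_id: str, count: int = 2):
    """Load only the first `count` shards instead of the whole dataset."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=shard_paths()[:count], split="train")
```

Calling `load_first_shards("nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset")` should fetch roughly a tenth of the full download, provided the repository really uses this layout.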
### Data Instance
Each loaded instance contains:
- **`audio`** — Dictionary containing the raw audio waveform array, sampling rate, and reference path.
- **`transcription`** — Telugu text of what is spoken in the clip.
**Example:**
```python
{
    'audio': {
        'path': 'utt_001.wav',
        'array': array([ 0.0000000e+00, -3.0517578e-05, ...]),
        'sampling_rate': 16000
    },
    'transcription': 'దశావతారాలలో అత్యంత ప్రముఖమైనటువంటి...'
}
```
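Given this layout, the length of a clip follows from the number of samples and the sampling rate. A minimal sketch, assuming each row is decoded into the dictionary shape shown above (the `fake_sample` row is synthetic and only stands in for a real one):

```python
def clip_duration_seconds(sample: dict) -> float:
    """Return the clip length in seconds for one decoded dataset row."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a decoded row: 2 seconds of silence at 16 kHz.
fake_sample = {
    "audio": {"path": "utt_001.wav", "array": [0.0] * 32000, "sampling_rate": 16000},
    "transcription": "...",
}
print(clip_duration_seconds(fake_sample))  # 2.0
```

Summing this value over all rows is one way to check the total duration reported for the dataset.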
### Loading the Dataset
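Because the data is stored as Parquet, the dataset can be loaded with a single call, or streamed by passing `streaming=True` to `load_dataset` so that the full 4.36 GB is not downloaded up front. To load it in Python: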
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset", split="train")
# Access the first sample
sample = dataset[0]
print("Text:", sample["transcription"])
print("Audio Array:", sample["audio"]["array"])
```
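For a quick look at the data without downloading everything, the same call can be used in streaming mode. The sketch below also resamples the audio on the fly with `cast_column`; the 16 kHz target and the `stream_resampled` helper name are assumptions for illustration, so check the sampling rate your model expects.

```python
def stream_resampled(repo_id: str, target_sr: int = 16000, limit: int = 3):
    """Yield (transcription, sampling_rate, num_samples) for the first `limit` rows."""
    # Imported lazily; nothing is downloaded until the generator is iterated.
    from datasets import Audio, load_dataset

    ds = load_dataset(repo_id, split="train", streaming=True)
    # Ask the library to decode every clip at the target sampling rate.
    ds = ds.cast_column("audio", Audio(sampling_rate=target_sr))
    for row in ds.take(limit):
        audio = row["audio"]
        yield row["transcription"], audio["sampling_rate"], len(audio["array"])


# Example usage (requires network access and the `datasets` package):
# for text, sr, n in stream_resampled("nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset"):
#     print(sr, n, text[:40])
```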
## Dataset Creation
### Curation Rationale
There is a significant lack of high-quality, publicly available Telugu speech datasets — especially from a consistent, clear single speaker. Sri Chaganti Koteswara Rao's discourses are widely available on YouTube with auto-generated captions, making them a strong candidate for building an aligned Telugu speech corpus for ASR/TTS research.
### Source Data
#### Data Collection and Processing
1. Audio was collected from publicly available YouTube videos of Sri Chaganti Koteswara Rao
2. Transcriptions were sourced from YouTube's auto-generated Telugu captions
3. Audio was precisely chunked to align with transcript segments (~25–40 seconds per clip)
4. Silence was trimmed from each clip using automated VAD (Voice Activity Detection)
**Total audio duration across all clips: over 5 hours**
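The card does not say which VAD tool was used for step 4. As an illustration of the idea only, the sketch below trims leading and trailing silence with a simple per-frame energy threshold; this is a stand-in technique, not the author's pipeline, and the `frame_ms` and `threshold` values are arbitrary assumptions.

```python
def trim_silence(samples, sampling_rate, frame_ms=20, threshold=0.01):
    """Drop leading and trailing frames whose RMS energy is below `threshold`."""
    frame_len = max(1, int(sampling_rate * frame_ms / 1000))
    # RMS energy of each non-overlapping frame.
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append((sum(x * x for x in frame) / len(frame)) ** 0.5)
    voiced = [i for i, e in enumerate(energies) if e >= threshold]
    if not voiced:
        return []  # the whole clip is below the threshold
    first, last = voiced[0], voiced[-1]
    # Keep everything from the first to the last voiced frame, pauses included.
    return samples[first * frame_len:(last + 1) * frame_len]


# Synthetic clip at 1 kHz: 40 silent samples, 60 "speech" samples, 40 silent samples.
clip = [0.0] * 40 + [0.5] * 60 + [0.0] * 40
print(len(trim_silence(clip, sampling_rate=1000)))  # 60
```

A production pipeline would more likely use a trained VAD model, since a fixed energy threshold can misclassify quiet speech or background noise.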
**Source Videos:**
| # | YouTube URL |
|---|---|
| 1 | [https://youtu.be/3G1lu0h7vz8](https://youtu.be/3G1lu0h7vz8) |
| 2 | [https://youtu.be/MAeew4uIHs4](https://youtu.be/MAeew4uIHs4) |
| 3 | [https://youtu.be/4h1R32q3bC8](https://youtu.be/4h1R32q3bC8) |
| 4 | [https://youtu.be/TEO3wNtQswk](https://youtu.be/TEO3wNtQswk) |
| 5 | [https://youtu.be/iz_OilT9hAA](https://youtu.be/iz_OilT9hAA) |
| 6 | [https://youtu.be/pdwNwdxqbcI](https://youtu.be/pdwNwdxqbcI) |
#### Who are the source data producers?
The original audio content was created and delivered by **Sri Chaganti Koteswara Rao**, a renowned Telugu scholar and spiritual speaker. The videos are published by **Sri Chaganti Media** on YouTube.
### Annotations
#### Annotation process
Transcriptions were obtained from YouTube's auto-generated Telugu caption system and aligned to audio chunks programmatically. No manual correction has been applied.
#### Who are the annotators?
Auto-generated by YouTube's speech recognition system. No human annotators were involved.
#### Personal and Sensitive Information
The dataset contains the voice and speech of a public figure (Sri Chaganti Koteswara Rao) delivering public spiritual discourses. No private, sensitive, or personally identifiable information is present beyond the speaker's publicly shared content.
## Bias, Risks, and Limitations
- Transcriptions are **auto-generated** and may contain errors in Telugu text, especially for Sanskrit shlokas and proper nouns
- All audio is from a **single male speaker**, so models trained on this data are unlikely to generalize to other speakers or genders
- The dataset contains **over 5 hours of audio** (778 utterances): enough for fine-tuning a pre-trained model, but too little to train a large model from scratch
- Speaker accent and vocabulary are domain-specific (spiritual/religious discourse)
### Recommendations
- This dataset is best used for **fine-tuning** pre-trained models
- Transcription quality should be verified before use in high-accuracy ASR pipelines
- Models trained on this data should not be used for commercial voice synthesis or impersonation
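One way to verify transcription quality is to correct a small sample of captions by hand and measure the character error rate (CER) of the auto-generated text against those corrections. The sketch below is a plain Levenshtein-based CER; the function name is hypothetical, and it compares Unicode code points, so Telugu vowel signs and other combining marks count as separate characters.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between the strings divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# `reference` is the hand-corrected text, `hypothesis` the auto-generated caption.
print(char_error_rate("abcd", "abxd"))  # 0.25
```

Clips whose CER exceeds a chosen cutoff on the checked sample can then be corrected or dropped before training.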
## Citation
**BibTeX:**
```bibtex
@dataset{chaganti_voice_cc_2026,
  author = {nikhilsaipagidimarri},
  title = {Chaganti Voice with CC Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset},
  license = {CC BY-NC 4.0}
}
```
**APA:**
nikhilsaipagidimarri. (2026). *Chaganti Voice with CC Dataset* [Dataset]. Hugging Face. [https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset](https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset)
## Glossary
- **ASR**: Automatic Speech Recognition, converting spoken audio to text
- **TTS**: Text-to-Speech, converting text to spoken audio
- **Pravachanam**: A Telugu/Sanskrit term for a spiritual discourse or lecture
- **VAD**: Voice Activity Detection, identifying which parts of an audio signal contain speech; used here to trim silence from clips
## More Information
For questions, issues, or contributions, please open a discussion on the [dataset repository](https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset/discussions).
## Dataset Card Authors
[nikhilsaipagidimarri](https://huggingface.co/nikhilsaipagidimarri)
## Dataset Card Contact
Reach out via the Hugging Face discussion tab on this dataset's repository page.
## Acknowledgements
All credit for the original discourses goes to **Sri Chaganti Koteswara Rao** and **Sri Chaganti Media** for making these spiritual teachings publicly available on YouTube.