
nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset
Resource description:
---
license: cc-by-nc-4.0
language:
- te
pretty_name: Chaganti Voice with CC Dataset
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- telugu
- speech
- chaganti
- tts
- asr
- pravachanam
- voice-cloning
size_categories:
- n<1K
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 4407288414
    num_examples: 778
  download_size: 4362087261
  dataset_size: 4407288414
configs:
- config_name: default
  data_files:
  - split: train
    path: train-*.parquet
---

# Dataset Card for Chaganti Voice with CC Dataset

A curated Telugu speech dataset of Sri Chaganti Koteswara Rao's spiritual discourses (*pravachanams*), paired with auto-generated transcriptions. Built for training or fine-tuning **ASR** and **TTS / Voice Cloning** models in the Telugu language.

## Dataset Details

### Dataset Description

This **4.36 GB** dataset contains high-quality audio clips of Sri Chaganti Koteswara Rao's Telugu spiritual discourses, precisely chunked and aligned with auto-generated YouTube transcriptions. Silence has been trimmed from each clip to suit ML training pipelines.
- **Curated by:** [nikhilsaipagidimarri](https://huggingface.co/nikhilsaipagidimarri)
- **Funded by:** N/A
- **Shared by:** nikhilsaipagidimarri
- **Language(s) (NLP):** Telugu (`te`)
- **License:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
- **Total Audio Duration:** 5+ hours
- **Total Size:** 4.36 GB (778 instances)

### Dataset Sources

- **Repository:** [nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset](https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset)
- **Paper:** N/A
- **Demo:** N/A

## Uses

### Direct Use

This dataset is intended for:

- Training and fine-tuning **Telugu ASR** (Automatic Speech Recognition) models
- Training and fine-tuning **Telugu TTS** (Text-to-Speech) models
- **Voice cloning** research using a single high-quality Telugu speaker
- Academic research on low-resource Telugu speech

### Out-of-Scope Use

- **Commercial use** of any kind, which is prohibited under CC BY-NC 4.0
- Multi-speaker speech tasks, since this dataset contains a single speaker only
- Speaker diarization or speaker verification tasks
- Any use that violates [YouTube's Terms of Service](https://www.youtube.com/t/terms)

## Dataset Structure

This dataset is stored as **Parquet shards**, so you do not need to download raw `.wav` files manually; the Hugging Face `datasets` library decodes the audio and exposes it as NumPy arrays ready for model training.

### File Layout

```text
data/
  train-00000-of-00020.parquet
  train-00001-of-00020.parquet
  train-00002-of-00020.parquet
  train-00003-of-00020.parquet
  train-00004-of-00020.parquet
  train-00005-of-00020.parquet
  train-00006-of-00020.parquet
  train-00007-of-00020.parquet
  train-00008-of-00020.parquet
  ...
  train-00019-of-00020.parquet
```

### Data Instance

Each loaded instance contains:

- **`audio`**: Dictionary containing the raw audio waveform array, sampling rate, and reference path.
- **`transcription`**: Telugu text of what is spoken in the clip.

**Example:**

```python
{
    'audio': {
        'path': 'utt_001.wav',
        'array': array([ 0.0000000e+00, -3.0517578e-05, ...]),
        'sampling_rate': 16000
    },
    'transcription': 'దశావతారాలలో అత్యంత ప్రముఖమైనటువంటి...'
}
```

### Loading the Dataset

Because the data is stored as Parquet, you can load it directly in Python:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset", split="train")

# Access the first sample
sample = dataset[0]
print("Text:", sample["transcription"])
print("Audio Array:", sample["audio"]["array"])
```

## Dataset Creation

### Curation Rationale

There is a significant lack of high-quality, publicly available Telugu speech datasets, especially from a consistent, clear single speaker. Sri Chaganti Koteswara Rao's discourses are widely available on YouTube with auto-generated captions, making them a strong candidate for building an aligned Telugu speech corpus for ASR/TTS research.

### Source Data

#### Data Collection and Processing

1. Audio was collected from publicly available YouTube videos of Sri Chaganti Koteswara Rao
2. Transcriptions were sourced from YouTube's auto-generated Telugu captions
3. Audio was precisely chunked to align with transcript segments (~25–40 seconds per clip)
4. Silence was trimmed from each clip using automated VAD (Voice Activity Detection)

**Total audio duration across all clips: 5+ hours**

**Source Videos:**

| # | YouTube URL |
|---|---|
| 1 | [https://youtu.be/3G1lu0h7vz8](https://youtu.be/3G1lu0h7vz8) |
| 2 | [https://youtu.be/MAeew4uIHs4](https://youtu.be/MAeew4uIHs4) |
| 3 | [https://youtu.be/4h1R32q3bC8](https://youtu.be/4h1R32q3bC8) |
| 4 | [https://youtu.be/TEO3wNtQswk](https://youtu.be/TEO3wNtQswk) |
| 5 | [https://youtu.be/iz_OilT9hAA](https://youtu.be/iz_OilT9hAA) |
| 6 | [https://youtu.be/pdwNwdxqbcI](https://youtu.be/pdwNwdxqbcI) |

#### Who are the source data producers?

The original audio content was created and delivered by **Sri Chaganti Koteswara Rao**, a renowned Telugu scholar and spiritual speaker. The videos are published by **Sri Chaganti Media** on YouTube.

### Annotations

#### Annotation process

Transcriptions were obtained from YouTube's auto-generated Telugu caption system and aligned to audio chunks programmatically. No manual correction has been applied.

#### Who are the annotators?

Auto-generated by YouTube's speech recognition system. No human annotators were involved.

#### Personal and Sensitive Information

The dataset contains the voice and speech of a public figure (Sri Chaganti Koteswara Rao) delivering public spiritual discourses. No private, sensitive, or personally identifiable information is present beyond the speaker's publicly shared content.
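The clip lengths quoted under Data Collection and Processing (roughly 25–40 seconds per chunk, more than 5 hours in total) can be sanity-checked from loaded instances. The following is a minimal sketch, assuming each instance follows the layout shown in the Data Instance example (`audio["array"]` and `audio["sampling_rate"]`). The helper names `clip_duration_seconds` and `summarize_durations` are hypothetical, and the two synthetic samples stand in for real rows so the snippet runs without downloading anything.

```python
from typing import Iterable, Mapping


def clip_duration_seconds(sample: Mapping) -> float:
    """Duration of one clip: number of waveform samples / sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def summarize_durations(samples: Iterable[Mapping]) -> dict:
    """Aggregate clip count, total hours, and shortest/longest clip in seconds."""
    durations = [clip_duration_seconds(s) for s in samples]
    if not durations:
        return {"clips": 0, "total_hours": 0.0, "min_s": 0.0, "max_s": 0.0}
    return {
        "clips": len(durations),
        "total_hours": sum(durations) / 3600,
        "min_s": min(durations),
        "max_s": max(durations),
    }


# Synthetic stand-ins (assumption): silent 2 s and 3 s clips at 16 kHz.
fake_samples = [
    {"audio": {"array": [0.0] * 32000, "sampling_rate": 16000}, "transcription": ""},
    {"audio": {"array": [0.0] * 48000, "sampling_rate": 16000}, "transcription": ""},
]

summary = summarize_durations(fake_samples)
print(summary)
```

On the real data, the same function could be passed the `dataset` object from the loading example, at the cost of decoding every clip once.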
## Bias, Risks, and Limitations

- Transcriptions are **auto-generated** and may contain errors in Telugu text, especially for Sanskrit shlokas and proper nouns
- All audio is from a **single male speaker**, so models trained on this data are unlikely to generalize to other speakers or genders
- Dataset contains **more than 5 hours of audio** (778 utterances), which suits fine-tuning but not training large-scale models entirely from scratch
- Speaker accent and vocabulary are domain-specific (spiritual/religious discourse)

### Recommendations

- This dataset is best used for **fine-tuning** pre-trained models
- Transcription quality should be verified before use in high-accuracy ASR pipelines
- Models trained on this data should not be used for commercial voice synthesis or impersonation

## Citation

**BibTeX:**

```bibtex
@dataset{chaganti_voice_cc_2026,
  author    = {nikhilsaipagidimarri},
  title     = {Chaganti Voice with CC Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset},
  license   = {CC BY-NC 4.0}
}
```

**APA:**

nikhilsaipagidimarri. (2026). *Chaganti Voice with CC Dataset* [Dataset]. Hugging Face. [https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset](https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset)

## Glossary

- **ASR**: Automatic Speech Recognition, converting spoken audio to text
- **TTS**: Text-to-Speech, converting text to spoken audio
- **Pravachanam**: A Telugu/Sanskrit term for a spiritual discourse or lecture
- **VAD**: Voice Activity Detection, used here for automated removal of silence from audio

## More Information

For questions, issues, or contributions, please open a discussion on the [dataset repository](https://huggingface.co/datasets/nikhilsaipagidimarri/Chaganti_Voice_with_CC_Dataset/discussions).
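The recommendation to verify transcription quality can be partly automated before any manual review. One rough heuristic, which is an illustration and not part of the dataset's own tooling, is to flag transcriptions whose share of Telugu-script characters (Unicode block U+0C00 to U+0C7F) is low, since such rows may contain caption debris or other scripts. The function names `telugu_ratio` and `needs_review` and the `0.9` threshold are assumptions chosen for this sketch.

```python
def telugu_ratio(text: str) -> float:
    """Share of letters, digits, and Telugu signs that lie in the Telugu block.

    Whitespace and non-Telugu punctuation are ignored; empty input yields 0.0.
    """
    counted = 0
    telugu = 0
    for ch in text:
        if "\u0c00" <= ch <= "\u0c7f":
            # Telugu letters, vowel signs, virama, and Telugu digits.
            telugu += 1
            counted += 1
        elif ch.isalnum():
            # Letters or digits from any other script.
            counted += 1
    return telugu / counted if counted else 0.0


def needs_review(text: str, threshold: float = 0.9) -> bool:
    """Flag a transcription for manual checking (threshold is an assumption)."""
    return telugu_ratio(text) < threshold


print(telugu_ratio("దశావతారాలలో అత్యంత"))
print(needs_review("hello world"))
```

A check like this only catches script-level problems; misrecognized Telugu words, which the card notes are likely for Sanskrit shlokas and proper nouns, still require a human reader.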
## Dataset Card Authors

[nikhilsaipagidimarri](https://huggingface.co/nikhilsaipagidimarri)

## Dataset Card Contact

Reach out via the Hugging Face discussion tab on this dataset's repository page.

## Acknowledgements

All credit for the original discourses goes to **Sri Chaganti Koteswara Rao** and **Sri Chaganti Media** for making these spiritual teachings publicly available on YouTube.
Provider:
nikhilsaipagidimarri