sreerag/svara-indic-curriculum-tokenized

Name: sreerag/svara-indic-curriculum-tokenized
Creator: sreerag
Published: 2026-04-17 07:33:29
License: No description provided

Source: Hugging Face. Updated: 2026-04-17. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/sreerag/svara-indic-curriculum-tokenized

Description:

---
language:
- ml
- hi
tags:
- tts
- text-normalization
- curriculum-learning
- neucodec
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
---

# svara-indic-curriculum-tokenized

Malayalam + Hindi TTS dataset with text normalization (TN+TTS format), structured for curriculum learning.

## Stats

- **Total:** 3,430 records
- **Malayalam:** 1,754 samples
- **Hindi:** 1,676 samples

## Curriculum Difficulty

| Tier | Count | Categories |
|------|-------|-----------|
| 1 (Easy) | 1,025 | Simple cardinals, clean prose |
| 2 (Medium) | 1,445 | Currency, units, ordinals, time |
| 3 (Hard) | 960 | Dates, phone numbers, mixed, complex |

## Format

Each record contains tokenized sequences in the Svara TN+TTS format:

```
<|tts|> <start_text> {speaker}: {raw_text} <end_text>   ← masked
<think> {normalised_text} </think>                      ← TN loss
<start_audio> {codes} <end_audio>                       ← TTS loss
```

## Columns

- `input_ids`: full token sequence
- `labels`: prefix masked with -100; loss on think + audio
- `attention_mask`
- `seq_len`
- `input_text`: raw text with digits/symbols
- `text`: normalized text (spelled out)
- `speaker_id`: voice name
- `language`: `ml` or `hi`
- `difficulty`: curriculum tier (1/2/3)
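## Example: staging by difficulty

The `difficulty` and `labels` columns suggest one way the data could be consumed. The following is a minimal sketch, not the author's training code. It assumes a cumulative schedule (tier 1, then tiers 1 and 2, then all three), which the card does not specify. The helper names `curriculum_stages` and `supervised_fraction` are hypothetical, and the sample records are synthetic.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

IGNORE_INDEX = -100  # label value the card says marks the masked prefix


def curriculum_stages(records: Iterable[dict]) -> Iterator[List[dict]]:
    """Yield cumulative training pools, easiest tier first.

    Assumption: a cumulative schedule; the dataset card does not define one.
    """
    by_tier: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        by_tier[int(record["difficulty"])].append(record)

    pool: List[dict] = []
    for tier in sorted(by_tier):
        pool = pool + by_tier[tier]  # build a new list so earlier stages stay unchanged
        yield pool


def supervised_fraction(labels: List[int]) -> float:
    """Fraction of positions that contribute to the loss (label != -100)."""
    if not labels:
        return 0.0
    kept = sum(1 for label in labels if label != IGNORE_INDEX)
    return kept / len(labels)


if __name__ == "__main__":
    # Synthetic records that only borrow the card's column names; not real data.
    toy = [
        {"difficulty": 3, "labels": [-100, -100, 7, 8]},
        {"difficulty": 1, "labels": [-100, 5, 6, 9]},
        {"difficulty": 2, "labels": [-100, -100, -100, 4]},
    ]
    for stage, pool in enumerate(curriculum_stages(toy), start=1):
        print(stage, [r["difficulty"] for r in pool])
    print(supervised_fraction(toy[0]["labels"]))
```

With the Hugging Face `datasets` library installed, the records returned by `load_dataset("sreerag/svara-indic-curriculum-tokenized", split="train")` could presumably be passed to `curriculum_stages` directly, since iterating over a split yields one dictionary per row.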

Provider:

sreerag
