
sreerag/svara-indic-curriculum-tokenized

Hugging Face | Updated 2026-04-17 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/sreerag/svara-indic-curriculum-tokenized
Resource description:
---
language:
- ml
- hi
tags:
- tts
- text-normalization
- curriculum-learning
- neucodec
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
---

# svara-indic-curriculum-tokenized

Malayalam + Hindi TTS dataset with text normalization (TN+TTS format), structured for curriculum learning.

## Stats

- **Total:** 3,430 records
- **Malayalam:** 1,754 samples
- **Hindi:** 1,676 samples

## Curriculum Difficulty

| Tier | Count | Categories |
|------|-------|------------|
| 1 (Easy) | 1,025 | Simple cardinals, clean prose |
| 2 (Medium) | 1,445 | Currency, units, ordinals, time |
| 3 (Hard) | 960 | Dates, phone numbers, mixed, complex |

## Format

Each record contains tokenized sequences in the Svara TN+TTS format:

```
<|tts|> <start_text> {speaker}: {raw_text} <end_text>   ← masked
<think> {normalised_text} </think>                      ← TN loss
<start_audio> {codes} <end_audio>                       ← TTS loss
```

## Columns

- `input_ids`: full token sequence
- `labels`: -100 masked prefix, loss on think+audio
- `attention_mask`
- `seq_len`
- `input_text`: raw text with digits/symbols
- `text`: normalized text (spelled out)
- `speaker_id`: voice name
- `language`: ml or hi
- `difficulty`: curriculum tier (1/2/3)
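
The masking and tier columns described above can be illustrated with a small, self-contained Python sketch. It is not the dataset author's preprocessing code. The token ids are made up for the example, and the choice to start the loss at the `<think>` token itself (rather than the token after it) is an assumption, since the card only says the prefix is masked with -100. The helper names `build_labels` and `curriculum_order` are hypothetical.

```python
from typing import Dict, List, Sequence

# Value the card says fills the masked prefix in `labels`.
IGNORE_INDEX = -100


def build_labels(input_ids: Sequence[int], think_start_id: int) -> List[int]:
    """Mask every token before the first <think> token with IGNORE_INDEX.

    Assumption: the <think> token and everything after it (normalized text
    and audio codes) are kept as training targets.
    """
    ids = list(input_ids)
    if think_start_id not in ids:
        raise ValueError("sequence has no <think> token")
    start = ids.index(think_start_id)
    return [IGNORE_INDEX] * start + ids[start:]


def curriculum_order(records: Sequence[Dict], max_tier: int) -> List[Dict]:
    """Keep records with difficulty <= max_tier, easiest tier first.

    sorted() is stable, so records within a tier keep their original order.
    """
    kept = [r for r in records if r["difficulty"] <= max_tier]
    return sorted(kept, key=lambda r: r["difficulty"])


# Toy sequence with hypothetical ids: 50 stands in for <think>.
toy_ids = [1, 10, 11, 12, 2, 50, 51, 3, 60, 61, 4]
toy_labels = build_labels(toy_ids, think_start_id=50)

# Toy records that only carry the `difficulty` and `language` columns.
toy_records = [
    {"language": "ml", "difficulty": 3},
    {"language": "hi", "difficulty": 1},
    {"language": "ml", "difficulty": 2},
    {"language": "hi", "difficulty": 1},
]
stage_two = curriculum_order(toy_records, max_tier=2)
```

With a real tokenizer, the id for `<think>` would have to be looked up rather than hard-coded. Because the dataset already ships a `labels` column, a helper like `build_labels` is mainly useful for checking that the stored labels match the masking rule described in the card.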
Provider:
sreerag