sreerag/svara-indic-curriculum-tokenized
Source: Hugging Face. Updated 2026-04-17. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/sreerag/svara-indic-curriculum-tokenized
Resource description:
---
language:
- ml
- hi
tags:
- tts
- text-normalization
- curriculum-learning
- neucodec
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
---
# svara-indic-curriculum-tokenized
A Malayalam and Hindi text-to-speech (TTS) dataset with text normalization (TN) targets, tokenized in the TN+TTS format and organized into difficulty tiers for curriculum learning.
## Stats
- **Total:** 3,430 records
- **Malayalam:** 1,754 samples
- **Hindi:** 1,676 samples
## Curriculum Difficulty
| Tier | Count | Categories |
|------|-------|-----------|
| 1 — Easy | 1,025 | Simple cardinals, clean prose |
| 2 — Medium | 1,445 | Currency, units, ordinals, time |
| 3 — Hard | 960 | Dates, phone numbers, mixed, complex |
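
One possible way to use the tiers is to train in cumulative stages: tier 1 first, then tiers 1 and 2, then all three. The card does not prescribe a schedule, so the sketch below is an assumption; `curriculum_stages` and the toy records are hypothetical, and only the `difficulty` field comes from the dataset.

```python
from collections import defaultdict


def curriculum_stages(records):
    """Return [(tier, pool)] where each pool holds all records up to that tier."""
    by_tier = defaultdict(list)
    for rec in records:
        by_tier[rec["difficulty"]].append(rec)

    stages = []
    pool = []
    for tier in sorted(by_tier):
        # Each stage keeps the easier tiers and adds the next harder one.
        pool = pool + by_tier[tier]
        stages.append((tier, list(pool)))
    return stages


# Placeholder records, not taken from the dataset.
toy = [
    {"id": "a", "difficulty": 2},
    {"id": "b", "difficulty": 1},
    {"id": "c", "difficulty": 3},
]

stage_sizes = [(tier, len(pool)) for tier, pool in curriculum_stages(toy)]
print(stage_sizes)  # [(1, 1), (2, 2), (3, 3)]
```

A strictly sequential schedule (one tier per stage, without carrying earlier tiers forward) would work the same way by appending `by_tier[tier]` instead of the cumulative pool.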
## Format
Each record contains tokenized sequences in the Svara TN+TTS format:
```
<|tts|> <start_text> {speaker}: {raw_text} <end_text> ← masked
<think> {normalised_text} </think> ← TN loss
<start_audio> {codes} <end_audio> ← TTS loss
```
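
The masking in the layout above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: `build_example` and the token ids are hypothetical, and it assumes the `<think>` and audio delimiter tokens are themselves part of the loss, which the card does not state. The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_example(prefix_ids, think_ids, audio_ids):
    """Concatenate the three segments and mask the prompt prefix in the labels.

    prefix_ids: tokens for `<|tts|> <start_text> {speaker}: {raw_text} <end_text>`
    think_ids:  tokens for `<think> {normalised_text} </think>`
    audio_ids:  tokens for `<start_audio> {codes} <end_audio>`
    """
    input_ids = prefix_ids + think_ids + audio_ids
    labels = [IGNORE_INDEX] * len(prefix_ids) + think_ids + audio_ids
    return input_ids, labels


# Made-up token ids, for illustration only.
input_ids, labels = build_example([1, 2, 3], [4, 5], [6, 7, 8])
print(input_ids)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(labels)     # [-100, -100, -100, 4, 5, 6, 7, 8]
```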
## Columns
- `input_ids` — full token sequence
- `labels` — training targets; the prompt prefix is set to -100 (ignored), so loss applies only to the `<think>` and audio segments
- `attention_mask`
- `seq_len`
- `input_text` — raw text with digits/symbols
- `text` — normalized text (spelled out)
- `speaker_id` — voice name
- `language` — ml or hi
- `difficulty` — curriculum tier (1/2/3)
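
As a rough sketch of how these columns might be used, the snippet below loads the train split with the Hugging Face `datasets` library and filters by `language` and `difficulty`. The helper names `keep_row` and `load_subset` are hypothetical, and the code assumes `difficulty` is stored as an integer (or an integer-like value) and `language` as the strings `ml` or `hi`.

```python
def keep_row(row, language=None, max_tier=3):
    """Return True if the row matches the language and is at or below max_tier."""
    if language is not None and row["language"] != language:
        return False
    return int(row["difficulty"]) <= max_tier


def load_subset(language=None, max_tier=3):
    """Download the train split and keep matching rows (needs network access)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    ds = load_dataset("sreerag/svara-indic-curriculum-tokenized", split="train")
    return ds.filter(lambda row: keep_row(row, language, max_tier))
```

For example, `load_subset(language="ml", max_tier=2)` would return the Malayalam records from tiers 1 and 2 under these assumptions.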
Provider:
sreerag



