OmAlve/vaarta-sft-dataset
Source: Hugging Face. Updated 2026-03-24, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/OmAlve/vaarta-sft-dataset
Resource description:
---
license: other
language:
- mr
- en
- hi
tags:
- marathi
- sft
- instruction-tuning
- multilingual
- conversations
- romanized
size_categories:
- 100K<n<1M
---
# Vaarta SFT Dataset
Multilingual supervised fine-tuning dataset used to train the [Vaarta](https://huggingface.co/OmAlve/vaarta-llama-v2)
family of Marathi-first language models. ~175K examples in structured `messages` format.
No chat template is pre-applied; apply your own at training time.
## Format
Each example has three fields:
```python
{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "शिवाजी महाराज कोण होते?"},
        {"role": "assistant", "content": "शिवाजी महाराज हे मराठा साम्राज्याचे संस्थापक होते..."}
    ],
    "source": "k12_conversations",  # which dataset
    "language": "mr"                # "mr", "mr_roman", "en", "hi"
}
```
Multi-turn conversations (from k12) have multiple user/assistant turns after the system message.
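As a quick sanity check on the structure described above, the following minimal sketch validates one example and counts its user turns. The helper names `validate_example` and `count_turns` are hypothetical, and the checks assume that every example starts with a system message and uses only the four language codes listed in the format snippet.

```python
VALID_ROLES = {"system", "user", "assistant"}
VALID_LANGUAGES = {"mr", "mr_roman", "en", "hi"}  # codes listed in the card


def validate_example(example):
    """Return a list of problems found in one example (empty if none)."""
    problems = [
        f"missing field: {field}"
        for field in ("messages", "source", "language")
        if field not in example
    ]
    if problems:
        return problems
    if example["language"] not in VALID_LANGUAGES:
        problems.append(f"unexpected language: {example['language']}")
    messages = example["messages"]
    # Assumption: the system message always comes first.
    if not messages or messages[0].get("role") != "system":
        problems.append("first message is not a system message")
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role: {msg.get('role')}")
    return problems


def count_turns(example):
    """Count user turns; multi-turn conversations have more than one."""
    return sum(1 for m in example["messages"] if m["role"] == "user")


sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "शिवाजी महाराज कोण होते?"},
        {"role": "assistant", "content": "शिवाजी महाराज हे मराठा साम्राज्याचे संस्थापक होते..."},
    ],
    "source": "k12_conversations",
    "language": "mr",
}

print(validate_example(sample))  # an empty list means no problems were found
print(count_turns(sample))
```

Running the same function over a loaded split (for example inside `ds.filter`) would flag any rows that deviate from these assumptions.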
## Dataset Composition
| Source | Language | Size | Description |
|--------|----------|------|-------------|
| `k12_conversations` | `mr` | ~36K | Native multi-turn Marathi conversations (×4) |
| `k12_conversations_roman` | `mr_roman` | ~16K | Same conversations in Roman script (×2) |
| `alpaca_marathi` | `mr` | ~48K | Full Marathi Alpaca instruction dataset |
| `wikipedia_marathi` | `mr` | ~28K | Wikipedia articles → comprehension QA |
| `targeted_qa` | `mr` | 250 | Hand-curated Maharashtra factual QA (×10) |
| `targeted_qa_roman` | `mr_roman` | 250 | Same facts in Roman script (×10) |
| `openorca_english` | `en` | 10K | English QA from OpenOrca |
| `samanantar_en_mr` | `mr` | 12K | English→Marathi translation pairs |
| `samanantar_mr_en` | `en` | 4K | Marathi→English translation pairs |
| `sangraha_roman` | `mr_roman` | 8K | Romanized Marathi summaries |
| `wikipedia_hindi` | `hi` | 8K | Hindi Wikipedia comprehension QA |
| `aksharantar` | `mr`/`en` | 5K | Roman↔Devanagari transliteration |
**Total: ~175K examples, shuffled**
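The stated total can be cross-checked by summing the approximate per-source sizes from the table. This is only an arithmetic sketch: the figures are the rounded values shown above, taken at face value (the two `targeted_qa` rows are counted as 250 each, which is an assumption about how the ×10 note applies).

```python
# Approximate sizes in thousands of examples, copied from the table above.
sizes_k = {
    "k12_conversations": 36,
    "k12_conversations_roman": 16,
    "alpaca_marathi": 48,
    "wikipedia_marathi": 28,
    "targeted_qa": 0.25,        # assumption: 250 rows counted as listed
    "targeted_qa_roman": 0.25,  # assumption: 250 rows counted as listed
    "openorca_english": 10,
    "samanantar_en_mr": 12,
    "samanantar_mr_en": 4,
    "sangraha_roman": 8,
    "wikipedia_hindi": 8,
    "aksharantar": 5,
}

total_k = sum(sizes_k.values())
print(f"Total: ~{total_k:.1f}K examples")

# Share of each source in the mix, largest first.
for source, size in sorted(sizes_k.items(), key=lambda kv: -kv[1]):
    print(f"{source:26s} {100 * size / total_k:5.1f}%")
```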
## Usage
```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("OmAlve/vaarta-sft-dataset", split="train")

# Filter by language
marathi = ds.filter(lambda x: x["language"] == "mr")
roman = ds.filter(lambda x: x["language"] == "mr_roman")

# Apply your chat template at training time
# (the chosen tokenizer must define a chat template)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

def apply_template(example):
    # Render the message list into a single training string
    text = tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False
    )
    return {"text": text}

ds = ds.map(apply_template)
```
## Why `messages` Format?
Storing raw messages (not pre-formatted text) means:
- Compatible with any chat template (`chatml`, `llama`, `gemma`, etc.)
- Easy to filter by role, language, or source
- Works directly with `trl.SFTTrainer` and the Hugging Face `apply_chat_template` method
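To illustrate why raw messages are template-agnostic, here is a minimal sketch that renders a message list as ChatML by hand and filters messages by role. The helpers `to_chatml` and `drop_system` are hypothetical names introduced for this example; in practice a tokenizer's own `apply_chat_template` would normally do the rendering.

```python
def to_chatml(messages):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def drop_system(messages):
    """Filter by role: keep only the user and assistant turns."""
    return [m for m in messages if m["role"] != "system"]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "शिवाजी महाराज कोण होते?"},
    {"role": "assistant", "content": "शिवाजी महाराज हे मराठा साम्राज्याचे संस्थापक होते..."},
]

print(to_chatml(messages))
print([m["role"] for m in drop_system(messages)])
```

The same `messages` list could be passed unchanged to a different renderer, which is the practical benefit of not baking a template into the stored text.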
## System Prompts
Each example has a randomly sampled system prompt from a pool of 25 English prompts,
ranging from short ("You are a helpful assistant.") to detailed role descriptions
with Maharashtra-specific knowledge. Varying the prompt is meant to keep the model from overfitting to a single system prompt.
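The sampling step might look roughly like the sketch below. The pool shown here is hypothetical: only the first prompt appears in this card, the other two are invented placeholders, and the actual pool of 25 prompts and the sampling code used to build the dataset are not published here.

```python
import random

# Hypothetical pool: only the first prompt is taken from the dataset card.
SYSTEM_PROMPTS = [
    "You are a helpful assistant.",
    "You are a helpful assistant that answers in clear, simple language.",
    "You are a knowledgeable assistant familiar with Maharashtra's history and culture.",
]


def with_random_system_prompt(messages, rng):
    """Return a copy of `messages` whose system prompt is drawn from the pool."""
    prompt = rng.choice(SYSTEM_PROMPTS)
    # Drop any existing system message, then prepend the sampled one.
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": prompt}] + rest


rng = random.Random(0)  # seeded so the sampling is reproducible
conversation = [
    {"role": "user", "content": "शिवाजी महाराज कोण होते?"},
    {"role": "assistant", "content": "शिवाजी महाराज हे मराठा साम्राज्याचे संस्थापक होते..."},
]

augmented = with_random_system_prompt(conversation, rng)
print(augmented[0]["content"])
```

The same helper could be used to re-sample system prompts at training time if more prompt variety is wanted than the stored examples provide.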
## Training Context
- **Stage 1 — CPT data**: [`OmAlve/vaarta-cpt-dataset`](https://huggingface.co/datasets/OmAlve/vaarta-cpt-dataset)
- **Final model**: [`OmAlve/vaarta-llama-v2`](https://huggingface.co/OmAlve/vaarta-llama-v2)
- **Base model**: `meta-llama/Llama-3.2-3B`
## Licenses
Original licenses apply per source:
- k12-conversations: see [simonguest/marathi-k12-conversations](https://huggingface.co/datasets/simonguest/marathi-k12-conversations)
- Alpaca Marathi: see [smallstepai/marathi-instruction-tuning-alpaca](https://huggingface.co/datasets/smallstepai/marathi-instruction-tuning-alpaca)
- Wikipedia: CC BY-SA 4.0
- Samanantar: CC BY 4.0
- Sangraha: CC0 / CC BY
- Aksharantar: CC BY / CC0
- OpenOrca: MIT
Provided by:
OmAlve



