OmAlve/vaarta-cpt-dataset
Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OmAlve/vaarta-cpt-dataset
Resource description:
---
license: other
language:
- mr
- en
- hi
tags:
- marathi
- pretraining
- cpt
- multilingual
- devanagari
- romanized
size_categories:
- 100K<n<1M
---
# Vaarta CPT Dataset
Multilingual continued pretraining (CPT) corpus used to train the [Vaarta](https://huggingface.co/OmAlve/vaarta-llama-v2)
family of Marathi-first language models. Contains ~320K documents across 6 sources, 3 languages
(Marathi, English, Hindi), and 2 scripts (Devanagari and Latin/Roman).
## Dataset Composition
| Source | Language | Script | Documents | Description |
|--------|----------|--------|------|-------------|
| `marathi_wikipedia` | Marathi | Devanagari | ~90K | Full Marathi Wikipedia dump |
| `sangraha_verified` | Marathi | Devanagari | 80K | Curated high-quality Marathi web text |
| `sangraha_synthetic_deva` | Marathi | Devanagari | 50K | Synthetic Marathi Devanagari text |
| `sangraha_synthetic_roman` | Marathi | Latin/Roman | 30K | Romanized Marathi (e.g. "shivaji maharaj") |
| `english_wikipedia` | English | Latin | 50K | English Wikipedia for multilingual capability |
| `hindi_wikipedia` | Hindi | Devanagari | 20K | Hindi Wikipedia — shares Devanagari with Marathi |
**Total: ~320K documents, shuffled**
## Schema
```python
{
    "text": str,      # document content (capped at 8K chars)
    "source": str,    # one of the 6 source keys above
    "language": str,  # one of "mr", "mr_roman", "en", "hi"
}
```
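The schema above can be checked record by record. The following is a minimal sketch, not part of the dataset's tooling: `schema_problems` and `SOURCE_TO_LANGUAGE` are hypothetical names, the source-to-language mapping is inferred from the composition table rather than stated by the card, and the 8000-character limit is one reading of "8K chars" (pass `max_chars=8192` if the cap turns out to be a power of two).

```python
from typing import Any, Mapping

# Assumed mapping from source key to language tag, inferred from the
# composition table; the dataset card does not state it explicitly.
SOURCE_TO_LANGUAGE = {
    "marathi_wikipedia": "mr",
    "sangraha_verified": "mr",
    "sangraha_synthetic_deva": "mr",
    "sangraha_synthetic_roman": "mr_roman",
    "english_wikipedia": "en",
    "hindi_wikipedia": "hi",
}


def schema_problems(record: Mapping[str, Any], max_chars: int = 8000) -> list[str]:
    """Return human-readable problems; an empty list means the record looks valid."""
    problems = []
    # All three fields must be present and must be strings.
    for field in ("text", "source", "language"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field!r} is missing or not a str")
    if problems:
        return problems

    # Length cap on the document text (assumed to be 8000 characters).
    if len(record["text"]) > max_chars:
        problems.append(f"text has {len(record['text'])} chars, over {max_chars}")

    # The language tag should agree with the source key.
    expected = SOURCE_TO_LANGUAGE.get(record["source"])
    if expected is None:
        problems.append(f"unknown source {record['source']!r}")
    elif record["language"] != expected:
        problems.append(
            f"language {record['language']!r} does not match source "
            f"{record['source']!r} (expected {expected!r})"
        )
    return problems
```

Running it over a small sample, for example `[schema_problems(r) for r in ds.select(range(100))]`, is usually enough to confirm that the assumed mapping and length cap match the published data.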
## Usage
```python
from datasets import load_dataset
ds = load_dataset("OmAlve/vaarta-cpt-dataset", split="train")
# Filter by language
marathi_only = ds.filter(lambda x: x["language"] == "mr")
roman_marathi = ds.filter(lambda x: x["language"] == "mr_roman")
english_only = ds.filter(lambda x: x["language"] == "en")
# Filter by source
wikipedia_mr = ds.filter(lambda x: x["source"] == "marathi_wikipedia")
```
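Before filtering, it can help to see how documents are distributed across sources or languages. The sketch below uses a hypothetical helper, `value_shares`, that works on any iterable of strings, so it can be tried on a toy list first and then on a dataset column.

```python
from collections import Counter
from typing import Iterable


def value_shares(values: Iterable[str]) -> dict[str, float]:
    """Map each distinct value to its fraction of the total, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.most_common()}


# Toy input standing in for a dataset column.
toy_languages = ["mr", "mr", "mr", "en"]
print(value_shares(toy_languages))
```

Assuming `ds` was loaded as in the usage snippet, `value_shares(ds["source"])` or `value_shares(ds["language"])` would give the per-source or per-language fractions of the actual corpus.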
## Why This Corpus?
LLMs pretrained on English-dominant corpora tend to perform poorly on Marathi.
This corpus makes Marathi the dominant language (~78% of documents, counting both Devanagari
and Romanized Marathi) while retaining English and Hindi to limit catastrophic forgetting of
multilingual capability.
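The ~78% figure can be reproduced from the composition table. This is a back-of-the-envelope sketch: the counts are the rounded values from the table (the Marathi Wikipedia count is itself approximate), and `language_shares` is a hypothetical helper, not part of the dataset's tooling.

```python
# Approximate document counts (in thousands) copied from the composition table,
# paired with the language tag each source is assumed to carry.
SOURCES = {
    "marathi_wikipedia": ("mr", 90),
    "sangraha_verified": ("mr", 80),
    "sangraha_synthetic_deva": ("mr", 50),
    "sangraha_synthetic_roman": ("mr_roman", 30),
    "english_wikipedia": ("en", 50),
    "hindi_wikipedia": ("hi", 20),
}


def language_shares(sources):
    """Return the fraction of documents per language tag."""
    total = sum(count for _, count in sources.values())
    shares = {}
    for language, count in sources.values():
        shares[language] = shares.get(language, 0.0) + count / total
    return shares


shares = language_shares(SOURCES)
# Marathi in both scripts: (90 + 80 + 50 + 30) / 320
marathi_share = shares["mr"] + shares["mr_roman"]
print(f"Marathi share: {marathi_share:.1%}")
```

Under these numbers, Marathi accounts for 250K of 320K documents, English for 50K, and Hindi for 20K.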
Including `mr_roman` (Romanized Marathi) is critical — many Marathi speakers type in Roman
script (e.g. "shivaji maharaj kon hote?"), and without pretraining exposure, models fail to
understand or generate it.
## Training Context
This dataset was used for Stage 1 (CPT) of the Vaarta v2 training pipeline:
- **Stage 1 (CPT)**: 6000 steps, batch size 8 × 2 gradient-accumulation steps, seq_len=1024, LR=2e-4 with a cosine schedule
- **Stage 2 (SFT)**: ~178K instruction examples (see `OmAlve/vaarta-sft-dataset`)
- **Base model**: `meta-llama/Llama-3.2-3B`
- **Final model**: [`OmAlve/vaarta-llama-v2`](https://huggingface.co/OmAlve/vaarta-llama-v2)
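The Stage 1 hyperparameters imply a rough upper bound on the number of tokens seen during CPT. The sketch below makes several assumptions the card does not confirm: that "steps" means optimizer steps, that training ran on a single device (`num_devices=1`), and that every sequence is packed to the full 1024 tokens. `cpt_token_budget` is a hypothetical helper.

```python
def cpt_token_budget(steps, batch_size, grad_accum, seq_len, num_devices=1):
    """Return (sequences per optimizer step, tokens per step, total tokens).

    Assumes fully packed sequences, so the total is an upper bound.
    """
    sequences_per_step = batch_size * grad_accum * num_devices
    tokens_per_step = sequences_per_step * seq_len
    total_tokens = steps * tokens_per_step
    return sequences_per_step, tokens_per_step, total_tokens


# Values from the Stage 1 line above; num_devices=1 is an assumption.
seqs, tokens_per_step, total = cpt_token_budget(
    steps=6000, batch_size=8, grad_accum=2, seq_len=1024
)
print(f"{seqs} sequences/step, {tokens_per_step} tokens/step, {total:,} tokens total")
```

Under these assumptions the run processes at most 16 sequences (16,384 tokens) per step and about 98.3M tokens overall; a multi-device setup would scale the total by the device count.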
## Source Licenses
Original licenses apply per source:
- Wikipedia (all languages): [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- Sangraha: CC0 / CC BY — see [ai4bharat/sangraha](https://huggingface.co/datasets/ai4bharat/sangraha)
Provider:
OmAlve



