mbenco/slovak-sft
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mbenco/slovak-sft
Resource description:
---
language:
- sk
license: cc-by-4.0
task_categories:
- text-generation
tags:
- slovak
- sft
- instruction-following
- chat
pretty_name: Slovak SFT Dataset
size_categories:
- 10K<n<100K
---
# Slovak SFT Dataset
A supervised fine-tuning (SFT) dataset for Slovak language instruction following, constructed from two publicly available Slovak resources:
- [saillab/alpaca-slovak-cleaned](https://huggingface.co/datasets/saillab/alpaca-slovak-cleaned) — Slovak instruction-response pairs
- [TUKE-DeutscheTelekom/skquad](https://huggingface.co/datasets/TUKE-DeutscheTelekom/skquad) — Slovak question answering, rewritten into chat-style prompts
## Format
Each example follows the standard `messages` format with three messages (system, user, assistant):
```json
{
  "messages": [
    {"role": "system", "content": "Si užitočný slovenský asistent. Odpovedaj stručne, presne a po slovensky."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```
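A record in this layout can be checked with a few lines of Python. The sketch below is illustrative only: `is_valid_example` and `EXPECTED_ROLES` are hypothetical names, not part of the dataset or of any library, and the check assumes each JSONL line decodes to an object shaped like the example above.

```python
import json

# Role order taken from the example record above.
EXPECTED_ROLES = ["system", "user", "assistant"]


def is_valid_example(line: str) -> bool:
    """Return True if a JSONL line holds system, user, assistant messages in order."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict):
            return False
        if message.get("role") != role:
            return False
        if not isinstance(message.get("content"), str):
            return False
    return True
```

Running this over every line of a split file is a cheap sanity check before passing the data to a chat template.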
## Splits
| Split | File | Examples |
|-------|------|----------|
| train (full) | `slovak_sft_train.jsonl` | 29,962 |
| train 1k | `slovak_sft_train_1k.jsonl` | 1,000 |
| train 5k | `slovak_sft_train_5k.jsonl` | 5,000 |
| train 10k | `slovak_sft_train_10k.jsonl` | 10,000 |
| train 15k | `slovak_sft_train_15k.jsonl` | 15,000 |
| train 20k | `slovak_sft_train_20k.jsonl` | 20,000 |
| validation | `slovak_sft_val.jsonl` | 1,576 |
Smaller subsets are deterministic prefixes of the full training split (shuffled with seed 42), enabling direct scaling comparisons.
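The prefix property means every smaller subset is contained in every larger one. A minimal sketch of the idea, assuming Python's `random` module for the shuffle (the dataset's actual shuffling code is not shown here, so the helper names `shuffled` and `prefix_subsets` are hypothetical):

```python
import random


def shuffled(examples, seed=42):
    """Return a shuffled copy; the same seed always gives the same order."""
    out = list(examples)
    random.Random(seed).shuffle(out)
    return out


def prefix_subsets(train, sizes=(1000, 5000, 10000, 15000, 20000)):
    """Take the first n examples of the shuffled training list for each size n."""
    return {n: train[:n] for n in sizes if n <= len(train)}
```

Because each subset is a slice from the start of the same list, a model trained on the 5k file has seen everything in the 1k file, so differences between runs reflect the added data and not a different sample.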
## Construction Pipeline
1. **Normalization** — removed `<think>` reasoning traces, collapsed whitespace
2. **Deduplication** — removed duplicate prompt-response pairs
3. **Quality filtering** — length constraints (user: 12–1800 chars, assistant: 12–2200 chars), heuristic low-quality pattern rejection, Slovak lexical marker check
4. **Shuffle** — fixed seed (42) for deterministic train/validation split
5. **Split** — 95% train / 5% validation
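The steps above can be sketched roughly as follows. This is not the original build script: the `<think>` regex, the choice to collapse all whitespace runs to a single space, and taking validation from the head of the shuffled list are assumptions, and the heuristic low-quality rejection and Slovak lexical marker check are omitted because their rules are not specified. `normalize` and `build_splits` are hypothetical names.

```python
import random
import re

# Assumed pattern for reasoning traces: <think> ... </think>, possibly multi-line.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def normalize(text: str) -> str:
    """Strip <think> traces and collapse whitespace runs to one space (assumption)."""
    text = THINK_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def build_splits(pairs, seed=42, val_fraction=0.05):
    """Normalize, deduplicate, length-filter, shuffle, and split (user, assistant) pairs."""
    seen = set()
    kept = []
    for user, assistant in pairs:
        key = (normalize(user), normalize(assistant))
        if key in seen:  # drop exact duplicate prompt-response pairs
            continue
        seen.add(key)
        # Length limits from the quality-filtering step, in characters.
        if not (12 <= len(key[0]) <= 1800 and 12 <= len(key[1]) <= 2200):
            continue
        kept.append(key)
    random.Random(seed).shuffle(kept)
    n_val = int(len(kept) * val_fraction)  # floor; the rest goes to train
    return kept[n_val:], kept[:n_val]
```

Flooring the validation size is consistent with the split table: 5% of the 31,538 total examples is 1,576.9, which floors to 1,576 and leaves 29,962 for training. The real pipeline may round differently and still arrive at the same counts.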
## Usage
```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset("mbenco/slovak-sft")
```
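To train on one of the smaller subsets, the files can be selected explicitly with the `data_files` argument of `load_dataset`. The sketch below assumes the JSONL files sit at the repository root under the names listed in the split table; `subset_data_files` and `load_subset` are hypothetical helpers, not part of the dataset or the `datasets` library.

```python
def subset_data_files(train_size=None):
    """Map split names to file names; train_size is e.g. "5k", or None for the full split."""
    suffix = "" if train_size is None else f"_{train_size}"
    return {
        "train": f"slovak_sft_train{suffix}.jsonl",
        "validation": "slovak_sft_val.jsonl",
    }


def load_subset(train_size=None):
    """Load one training subset plus the validation file from the Hub (needs network)."""
    # Imported here so the mapping helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("mbenco/slovak-sft", data_files=subset_data_files(train_size))
```

For example, `load_subset("5k")` would request `slovak_sft_train_5k.jsonl` as the `train` split and `slovak_sft_val.jsonl` as `validation`.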
## Citation
If you use this dataset, please cite the paper (forthcoming).
Provider:
mbenco



