mbenco/slovak-sft

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mbenco/slovak-sft
Description:
---
language:
- sk
license: cc-by-4.0
task_categories:
- text-generation
tags:
- slovak
- sft
- instruction-following
- chat
pretty_name: Slovak SFT Dataset
size_categories:
- 10K<n<100K
---

# Slovak SFT Dataset

A supervised fine-tuning (SFT) dataset for Slovak language instruction following, constructed from two publicly available Slovak resources:

- [saillab/alpaca-slovak-cleaned](https://huggingface.co/datasets/saillab/alpaca-slovak-cleaned): Slovak instruction-response pairs
- [TUKE-DeutscheTelekom/skquad](https://huggingface.co/datasets/TUKE-DeutscheTelekom/skquad): Slovak question answering, rewritten into chat-style prompts

## Format

Each example follows the standard `messages` format with three turns:

```json
{
  "messages": [
    {"role": "system", "content": "Si užitočný slovenský asistent. Odpovedaj stručne, presne a po slovensky."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

## Splits

| Split | File | Examples |
|-------|------|----------|
| train (full) | `slovak_sft_train.jsonl` | 29,962 |
| train 1k | `slovak_sft_train_1k.jsonl` | 1,000 |
| train 5k | `slovak_sft_train_5k.jsonl` | 5,000 |
| train 10k | `slovak_sft_train_10k.jsonl` | 10,000 |
| train 15k | `slovak_sft_train_15k.jsonl` | 15,000 |
| train 20k | `slovak_sft_train_20k.jsonl` | 20,000 |
| validation | `slovak_sft_val.jsonl` | 1,576 |

Smaller subsets are deterministic prefixes of the full training split (shuffled with seed 42), enabling direct scaling comparisons.

## Construction Pipeline

1. **Normalization**: removed `<think>` reasoning traces, collapsed whitespace
2. **Deduplication**: removed duplicate prompt-response pairs
3. **Quality filtering**: length constraints (user: 12–1800 chars, assistant: 12–2200 chars), heuristic low-quality pattern rejection, Slovak lexical marker check
4. **Shuffle**: fixed seed (42) for deterministic train/validation split
5. **Split**: 95% train / 5% validation

## Usage

```python
from datasets import load_dataset

ds = load_dataset("mbenco/slovak-sft")
```

## Citation

If you use this dataset, please cite the paper (forthcoming).
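
## Example: Filter and Split Sketch

The deduplication, length-filtering, shuffling, and splitting steps of the construction pipeline can be illustrated with a short standard-library sketch. This is not the author's code. The function names (`get_turn`, `passes_length_filter`, `deduplicate`, `build_splits`) are hypothetical. The normalization step, the heuristic low-quality pattern rejection, and the Slovak lexical marker check are left out because the card does not specify them. Taking the validation set from the tail of the shuffled list and rounding its size down are assumptions, although rounding down is consistent with the reported counts (5% of 31,538 is 1,576.9, and the card lists 1,576 validation and 29,962 training examples).

```python
import random

# Character limits as stated in the dataset card.
USER_MIN, USER_MAX = 12, 1800
ASSISTANT_MIN, ASSISTANT_MAX = 12, 2200
SEED = 42
VAL_PERCENT = 5  # 95% train / 5% validation


def get_turn(example, role):
    """Return the content of the first message with the given role, or ''."""
    for message in example["messages"]:
        if message["role"] == role:
            return message["content"]
    return ""


def passes_length_filter(example):
    """Apply only the length constraints; the other heuristics are omitted."""
    user = get_turn(example, "user")
    assistant = get_turn(example, "assistant")
    return (USER_MIN <= len(user) <= USER_MAX
            and ASSISTANT_MIN <= len(assistant) <= ASSISTANT_MAX)


def deduplicate(examples):
    """Drop repeated (user, assistant) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for example in examples:
        key = (get_turn(example, "user"), get_turn(example, "assistant"))
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def build_splits(examples, seed=SEED, val_percent=VAL_PERCENT):
    """Deduplicate, filter, shuffle with a fixed seed, then split."""
    kept = [ex for ex in deduplicate(examples) if passes_length_filter(ex)]
    random.Random(seed).shuffle(kept)
    # Assumption: validation size is rounded down and taken from the tail.
    val_size = len(kept) * val_percent // 100
    cut = len(kept) - val_size
    return kept[:cut], kept[cut:]
```

Under the same assumptions, a prefix subset such as the 1k split would correspond to `train[:1000]` on the list returned by `build_splits`, which is what makes the smaller subsets directly comparable across sizes.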
Provider:
mbenco