Mitchins/gruk-sft

Source: Hugging Face | Updated: 2026-04-18 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/Mitchins/gruk-sft
Description:
---
license: mit
language:
- en
tags:
- caveman
- instruction-tuning
- sft
- gruk
- small-model
- dialect
pretty_name: Gruk SFT Dataset
size_categories:
- 100K<n<1M
---

# Gruk SFT Dataset v1.0

Instruction-tuning dataset for **Gruk**, a 250M-parameter caveman-dialect LLM.

**Model**: [Mitchins/gruk-250m](https://huggingface.co/Mitchins/gruk-250m)

**Demo**: [Mitchins/gruk (Space)](https://huggingface.co/spaces/Mitchins/gruk)

---

## What is this?

This is the SFT (supervised fine-tuning) dataset that shaped Gruk's voice and behavior. All responses are written in compressed caveman-dialect English:

> *"sky blue. sun light scatter. blue wave short. go everywhere. red go straight."*

The style is intentional: semantic compression plus a consistent persona, not random garbling.

---

## Dataset composition

**Total: 247,266 examples** across 3 component sets:

| Split | Examples | Description |
|-------|----------|-------------|
| `sft_v2` | 217,614 | Main expansion: SE-tech, SE-normal, Dolly factual, synthetic narrative/negation/open_weird |
| `sft_patch_v1` | 15,346 | Targeted semantic patch: reasoning, compression, semantic constraint examples |
| `sft_disc_v1` | 14,306 | Discipline examples: arithmetic, logical constraints, comparison, factual precision |

**Top buckets:**

| Bucket | Count |
|--------|-------|
| knowledge_technical | 66,124 |
| normal_english | 43,357 |
| instruct_existing | 37,048 |
| negation | 19,964 |
| open_weird | 19,628 |
| reasoning | 18,548 |
| narrative | 13,892 |
| factual | 12,480 |
| compression | 9,820 |
| semantic | 6,270 |

---

## Format

Each row is JSON with these fields:

```json
{
  "instruction": "Why is the sky blue?",
  "response": "sky blue. sun light scatter. blue wave short. go everywhere.",
  "type": "instruct",
  "source": "stackexchange",
  "bucket": "knowledge_technical",
  "dataset_split": "sft_v2"
}
```

| Field | Description |
|-------|-------------|
| `instruction` | Input prompt (plain English) |
| `response` | Caveman-style answer |
| `type` | `instruct` or `reasoning` (THINK/SAY format) |
| `source` | Origin: `stackexchange`, `alpaca`, `dolly`, `synthetic`, etc. |
| `bucket` | Training category |
| `dataset_split` | Which component set it came from |

---

## Reasoning format (THINK/SAY)

~20% of examples use a structured reasoning format:

```
THINK: water need oxygen and hydrogen
if no hydrogen, no water
no water, no life
SAY: no life
```

---

## Data sources

- **StackExchange** (tech + normal): filtered, converted to QA pairs
- **Alpaca** (Stanford): instruction following
- **Dolly** (Databricks): factual QA
- **Synthetic**: generated narrative, negation, constraint, open_weird examples
- **Discipline**: arithmetic, logical constraint, comparison tasks

All responses gruntified (rewritten to caveman style) using Qwen3.5-4B via vLLM.

---

## Training recipe

The model was trained in stages:

1. 250M LLaMA-3.2 pretrain on gruntified Wikipedia + TinyStories (~936k paragraphs)
2. SFT on `sft_v2` (expanded), producing `gruk-250m-v3-sft-expanded`
3. Targeted patch on `sft_patch_v1`, producing `gruk-250m-v3-sft-patch-v1` (champion)
4. `sft_disc_v1` was a separate discipline branch experiment

**Champion**: `gruk-250m-v3-sft-patch-v1` → scorecard 0.869 overall, demo eval 0.806 (GPU bfloat16)

---

## Version history

| Version | Date | Notes |
|---------|------|-------|
| v1.0 | 2025-04 | Initial public release. 247k examples. |
| v1.1 | planned | +120k targeted examples: normal_english, code, exact_math, factual, identity, howto |

---

## Citation / License

MIT License. Built by Mitch as a distillation/compression experiment. If you use this dataset, a mention is appreciated but not required.
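
---

## Example: reading rows and splitting THINK/SAY

The row format and the THINK/SAY layout described above can be handled with a short standard-library sketch. It is illustrative only, not the author's tooling. It assumes each row arrives as one JSON object per line, and that `reasoning` rows carry the THINK/SAY text in `response` with `THINK:` first and `SAY:` last. The helper names `parse_row` and `split_think_say` are made up for this example, and so are the second sample row's instruction and metadata values.

```python
import json
from collections import Counter

# Field names are taken from the dataset card; the helpers are hypothetical.
REQUIRED_FIELDS = ("instruction", "response", "type", "source", "bucket", "dataset_split")


def parse_row(line: str) -> dict:
    """Decode one JSON row and check that every documented field is present."""
    row = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    if missing:
        raise ValueError(f"row is missing fields: {missing}")
    return row


def split_think_say(response: str) -> tuple[str, str]:
    """Split a THINK/SAY response into (reasoning, answer).

    Assumption: the text starts with "THINK:" and the final answer follows the
    last "SAY:" marker. Anything else is returned unchanged as ("", response).
    """
    text = response.strip()
    if not text.startswith("THINK:") or "SAY:" not in text:
        return "", text
    thinking, _, answer = text[len("THINK:"):].rpartition("SAY:")
    return thinking.strip(), answer.strip()


# Two in-memory sample rows. The first mirrors the card's example; the second
# pairs the card's THINK/SAY text with an invented instruction and metadata.
sample_lines = [
    json.dumps({
        "instruction": "Why is the sky blue?",
        "response": "sky blue. sun light scatter. blue wave short. go everywhere.",
        "type": "instruct",
        "source": "stackexchange",
        "bucket": "knowledge_technical",
        "dataset_split": "sft_v2",
    }),
    json.dumps({
        "instruction": "Can there be life without hydrogen?",  # invented prompt
        "response": (
            "THINK: water need oxygen and hydrogen\n"
            "if no hydrogen, no water\n"
            "no water, no life\n"
            "SAY: no life"
        ),
        "type": "reasoning",
        "source": "synthetic",            # assumed value
        "bucket": "reasoning",            # assumed value
        "dataset_split": "sft_patch_v1",  # assumed value
    }),
]

rows = [parse_row(line) for line in sample_lines]
bucket_counts = Counter(row["bucket"] for row in rows)
reasoning_rows = [row for row in rows if row["type"] == "reasoning"]
thinking, answer = split_think_say(reasoning_rows[0]["response"])

print(dict(bucket_counts))
print(answer)
```

Using `rpartition` keeps everything before the last `SAY:` as reasoning, so a stray `SAY:` inside the THINK text does not truncate the answer. Whether the real data ever contains such cases is not stated in the card.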
Provider:
Mitchins