Mitchins/gruk-sft
Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Mitchins/gruk-sft
Resource description:
---
license: mit
language:
- en
tags:
- caveman
- instruction-tuning
- sft
- gruk
- small-model
- dialect
pretty_name: Gruk SFT Dataset
size_categories:
- 100K<n<1M
---
# Gruk SFT Dataset v1.0
Instruction-tuning dataset for **Gruk**, a 250M-parameter caveman-dialect LLM.

**Model**: [Mitchins/gruk-250m](https://huggingface.co/Mitchins/gruk-250m)

**Demo**: [Mitchins/gruk (Space)](https://huggingface.co/spaces/Mitchins/gruk)

---
## What is this?
This is the SFT (supervised fine-tuning) dataset that shaped Gruk's voice and behavior.
All responses are written in compressed caveman-dialect English:

> *"sky blue. sun light scatter. blue wave short. go everywhere. red go straight."*

The style is intentional: semantic compression plus a consistent persona, not random garbling.

---
## Dataset composition
**Total: 247,266 examples** across 3 component sets:

| Split | Examples | Description |
|-------|----------|-------------|
| `sft_v2` | 217,614 | Main expansion: SE-tech, SE-normal, Dolly factual, synthetic narrative/negation/open_weird |
| `sft_patch_v1` | 15,346 | Targeted semantic patch: reasoning, compression, semantic constraint examples |
| `sft_disc_v1` | 14,306 | Discipline examples: arithmetic, logical constraints, comparison, factual precision |

**Top buckets:**

| Bucket | Count |
|--------|-------|
| knowledge_technical | 66,124 |
| normal_english | 43,357 |
| instruct_existing | 37,048 |
| negation | 19,964 |
| open_weird | 19,628 |
| reasoning | 18,548 |
| narrative | 13,892 |
| factual | 12,480 |
| compression | 9,820 |
| semantic | 6,270 |
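The counts in the two tables can be cross-checked with a little arithmetic. The sketch below copies the numbers from this card into plain dictionaries (the variable names are illustrative, not part of the dataset) and prints the total, the share of each listed bucket, and how many examples the ten listed buckets leave unaccounted for, assuming each example carries exactly one bucket.

```python
# Counts copied from the tables in this card.
component_sets = {
    "sft_v2": 217_614,
    "sft_patch_v1": 15_346,
    "sft_disc_v1": 14_306,
}
top_buckets = {
    "knowledge_technical": 66_124,
    "normal_english": 43_357,
    "instruct_existing": 37_048,
    "negation": 19_964,
    "open_weird": 19_628,
    "reasoning": 18_548,
    "narrative": 13_892,
    "factual": 12_480,
    "compression": 9_820,
    "semantic": 6_270,
}

total = sum(component_sets.values())   # should match the stated 247,266
listed = sum(top_buckets.values())     # examples covered by the listed buckets

print(f"total examples:        {total:,}")
print(f"in listed buckets:     {listed:,}")
print(f"outside listed buckets: {total - listed:,}")

# Share of the full dataset held by each listed bucket.
for name, count in top_buckets.items():
    print(f"{name:<20} {count / total:6.1%}")
```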
---
## Format
Each row is a JSON object with these fields:
```json
{
  "instruction": "Why is the sky blue?",
  "response": "sky blue. sun light scatter. blue wave short. go everywhere.",
  "type": "instruct",
  "source": "stackexchange",
  "bucket": "knowledge_technical",
  "dataset_split": "sft_v2"
}
```
| Field | Description |
|-------|-------------|
| `instruction` | Input prompt (plain English) |
| `response` | Caveman-style answer |
| `type` | `instruct` or `reasoning` (THINK/SAY format) |
| `source` | Origin: `stackexchange`, `alpaca`, `dolly`, `synthetic`, etc. |
| `bucket` | Training category |
| `dataset_split` | Which component set it came from |
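As a minimal sketch of how a consumer might check rows against the schema above, the snippet below parses one JSON string and verifies that the six documented fields are present. The helper name `parse_row`, the assumption that rows arrive as one JSON object per string, and the strict check on `type` are illustrative choices, not something this card specifies.

```python
import json

# Field names come from the table in this card; the checks themselves are assumptions.
REQUIRED_FIELDS = ("instruction", "response", "type", "source", "bucket", "dataset_split")
KNOWN_TYPES = {"instruct", "reasoning"}

def parse_row(line: str) -> dict:
    """Parse one JSON row and check it against the documented fields."""
    row = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if row["type"] not in KNOWN_TYPES:
        raise ValueError(f"unexpected type: {row['type']!r}")
    return row

# The sample row shown in this card, serialized as a single JSON string.
sample = json.dumps({
    "instruction": "Why is the sky blue?",
    "response": "sky blue. sun light scatter. blue wave short. go everywhere.",
    "type": "instruct",
    "source": "stackexchange",
    "bucket": "knowledge_technical",
    "dataset_split": "sft_v2",
})

row = parse_row(sample)
print(row["bucket"], "|", row["response"])
```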
---
## Reasoning format (THINK/SAY)
~20% of examples use a structured reasoning format:
```
THINK:
water need oxygen and hydrogen
if no hydrogen, no water
no water, no life
SAY:
no life
```
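For downstream use it can help to separate the reasoning steps from the final answer. The following is a rough sketch, assuming the `THINK:` and `SAY:` markers always sit on their own lines as in the example above; `split_think_say` is a hypothetical helper name, and real rows may need more lenient handling.

```python
def split_think_say(response: str) -> tuple[list[str], str]:
    """Split a THINK/SAY response into reasoning steps and the final answer."""
    think_lines: list[str] = []
    say_lines: list[str] = []
    current = None  # list currently being filled; None until a marker is seen
    for raw in response.splitlines():
        line = raw.strip()
        if line == "THINK:":
            current = think_lines
        elif line == "SAY:":
            current = say_lines
        elif line and current is not None:
            current.append(line)
    if not say_lines:
        raise ValueError("no SAY section found")
    return think_lines, " ".join(say_lines)

# The example from this card.
example = """THINK:
water need oxygen and hydrogen
if no hydrogen, no water
no water, no life
SAY:
no life"""

steps, answer = split_think_say(example)
print(steps)
print(answer)
```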
---
## Data sources
- **StackExchange** (tech + normal): filtered and converted to QA pairs
- **Alpaca** (Stanford): instruction following
- **Dolly** (Databricks): factual QA
- **Synthetic**: generated narrative, negation, constraint, and open_weird examples
- **Discipline**: arithmetic, logical constraint, and comparison tasks

All responses were gruntified (rewritten in caveman style) using Qwen3.5-4B via vLLM.

---
## Training recipe
The model was trained in stages:
1. Pretraining of the 250M LLaMA-3.2 model on gruntified Wikipedia + TinyStories (~936k paragraphs)
2. SFT on `sft_v2` (expanded), which produced `gruk-250m-v3-sft-expanded`
3. Targeted patch on `sft_patch_v1`, which produced `gruk-250m-v3-sft-patch-v1` (champion)
4. `sft_disc_v1` was a separate discipline-branch experiment

**Champion**: `gruk-250m-v3-sft-patch-v1`, with a scorecard of 0.869 overall and a demo eval of 0.806 (GPU bfloat16)

---
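The SFT stages in the training recipe map onto the `dataset_split` field, so reproducing them starts with grouping rows by that field. The sketch below assumes rows are already loaded as dictionaries; `group_by_split` and `MAIN_STAGES` are hypothetical names, the three rows are tiny stand-ins rather than real dataset content, and the actual training scripts are not described in this card.

```python
from collections import defaultdict

# Order follows the recipe: main SFT first, then the targeted patch.
# sft_disc_v1 was a separate branch experiment, so it is left out of this sequence.
MAIN_STAGES = ["sft_v2", "sft_patch_v1"]

def group_by_split(rows):
    """Group row dicts by their `dataset_split` value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["dataset_split"]].append(row)
    return groups

# Illustrative stand-in rows; only the fields needed here are filled in.
rows = [
    {"instruction": "q1", "response": "a1", "dataset_split": "sft_v2"},
    {"instruction": "q2", "response": "a2", "dataset_split": "sft_patch_v1"},
    {"instruction": "q3", "response": "a3", "dataset_split": "sft_disc_v1"},
]

groups = group_by_split(rows)
for stage in MAIN_STAGES:
    print(stage, len(groups[stage]))
```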
## Version history
| Version | Date | Notes |
|---------|------|-------|
| v1.0 | 2025-04 | Initial public release. 247k examples. |
| v1.1 | planned | +120k targeted examples: normal_english, code, exact_math, factual, identity, howto |
---
## Citation / License
MIT License. Built by Mitch as a distillation/compression experiment.
If you use this dataset, a mention is appreciated but not required.

Provider: Mitchins



