Farseen0/opus-4.6-reasoning-sft-12k

Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Farseen0/opus-4.6-reasoning-sft-12k
Description:
---
language:
- en
license: apache-2.0
tags:
- reasoning
- chain-of-thought
- distillation
- sft
- claude-opus
- math
- logic
size_categories:
- 10K<n<100K
task_categories:
- text-generation
pretty_name: Opus 4.6 Reasoning SFT 12k
---

# Opus 4.6 Reasoning SFT 12k

A unified, pre-cleaned reasoning dataset built from 4 Claude Opus 4.6 distillation sources. Ready for supervised fine-tuning: just load and train.

## Why This Dataset Exists

The source datasets have different schemas, null values, and reasoning stored in non-standard keys that `apply_chat_template()` silently drops. This dataset fixes all of that:

- Reasoning traces merged into assistant content using `<think>...</think>` tags
- Null/empty content handled (broken samples dropped)
- All schemas unified to standard `messages` format
- Single dataset load replaces 4 separate downloads + complex formatting logic

## Quick Start

```python
from datasets import load_dataset

dataset = load_dataset("Farseen0/opus-4.6-reasoning-sft-12k", split="train")
print(f"{len(dataset)} samples ready for training")
```

### With TRL / Unsloth

```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Assumes `model` and `tokenizer` are already loaded
# (for example with transformers or Unsloth).
dataset = load_dataset("Farseen0/opus-4.6-reasoning-sft-12k", split="train")


# Apply your model's chat template
def format_to_text(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}


dataset = dataset.map(format_to_text, num_proc=8, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    # Recent TRL releases take `dataset_text_field` on SFTConfig
    # rather than on SFTTrainer.
    args=SFTConfig(output_dir="output", num_train_epochs=2, dataset_text_field="text"),
)
trainer.train()
```

## Dataset Details

| Stat | Value |
|------|-------|
| Total samples | 12,929 |
| With reasoning (`<think>` tags) | 12,611 (97.5%) |
| Without reasoning | 318 (2.5%) |
| Format | `messages` (list of `{role, content}` dicts) |
| Roles | `user`, `assistant` |
| Messages per sample | 2 (single-turn) |
| Content length p50 | 790 chars |
| Content length p90 | 4,310 chars |
| Content length p99 | 30,802 chars |
| Max content length | 62,560 chars |

## Sources

Built from 4 public datasets, all generated using Claude Opus 4.6:

| Source | Kept | Dropped | Reason for drops | License |
|--------|------|---------|------------------|---------|
| [Roman1111111/claude-opus-4.6-10000x](https://huggingface.co/datasets/Roman1111111/claude-opus-4.6-10000x) | 9,632 | 1 | Null content | MIT |
| [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | 2,151 | 9 | Problem text < 10 chars | Apache 2.0 |
| [TeichAI/Claude-Opus-4.6-Reasoning-887x](https://huggingface.co/datasets/TeichAI/Claude-Opus-4.6-Reasoning-887x) | 886 | 0 | n/a | Apache 2.0 |
| [crownelius/Opus4.6-No-Reasoning-260x](https://huggingface.co/datasets/crownelius/Opus4.6-No-Reasoning-260x) | 260 | 0 | n/a | Apache 2.0 |

### What was fixed per source

**Roman1111111**: Reasoning was stored as a separate `reasoning` key on message dicts (not in `content`). `apply_chat_template()` silently ignores this key, so all reasoning traces would be lost without preprocessing. One sample also had `content: null`. The generic system prompt was removed.

**Crownelius/Reasoning**: Flat format (`problem`/`thinking`/`solution` columns, not `messages`). Converted to chat format with thinking embedded in `<think>` tags. 9 samples were dropped because the problem text was broken or truncated (under 10 characters).

**TeichAI**: Messages had `thinking` and `name` keys alongside `content`. The `thinking` key is silently dropped by chat templates. 58 samples had `thinking: null` (direct answers without CoT) and were kept as-is. The system prompt was removed.

**Crownelius/No-Reasoning**: Flat format (`original_question`/`response`). Converted to chat format. These samples have no reasoning traces by design; they provide general assistant balance to prevent over-reliance on chain-of-thought.
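The per-source fixes share one core step: folding a separate reasoning field into `content` so that a chat template cannot drop it. The following is a minimal sketch of that step, not the actual build script. The function names `merge_reasoning` and `convert_sample` are hypothetical, and the sample data is made up; only the `reasoning`/`thinking` key names and the `<think>` tag convention come from the description above.

```python
from typing import Optional


def merge_reasoning(message: dict, reasoning_key: str = "reasoning") -> Optional[dict]:
    """Fold a separate reasoning field into `content` using <think> tags.

    Returns None when the message has no usable content, so the caller
    can drop the whole sample. (Hypothetical helper, for illustration.)
    """
    content = message.get("content")
    if content is None or not content.strip():
        return None  # null/empty content: treat the sample as broken

    reasoning = message.get(reasoning_key)
    if message.get("role") == "assistant" and reasoning and reasoning.strip():
        content = f"<think>\n{reasoning.strip()}\n</think>\n\n{content.strip()}"

    # Keep only the keys a chat template reads.
    return {"role": message["role"], "content": content}


def convert_sample(messages: list, reasoning_key: str = "reasoning") -> Optional[list]:
    """Drop system prompts, merge reasoning, and reject broken samples."""
    converted = []
    for message in messages:
        if message.get("role") == "system":
            continue  # system prompts are removed
        merged = merge_reasoning(message, reasoning_key)
        if merged is None:
            return None
        converted.append(merged)
    return converted


# Made-up sample in the style of a source that stores reasoning separately.
raw = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve 3x + 7 = 22 for x."},
    {"role": "assistant", "content": "x = 5", "reasoning": "3x = 15, so x = 5."},
]
print(convert_sample(raw))
```

For a source that uses a `thinking` key instead, the same sketch would be called with `reasoning_key="thinking"`; a `thinking: null` message passes through with its content unchanged.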
## Schema

```json
{
  "messages": [
    {"role": "user", "content": "What is 3x + 7 = 22? Solve for x."},
    {"role": "assistant", "content": "<think>\nI need to solve for x.\n3x + 7 = 22\n3x = 15\nx = 5\n</think>\n\nx = 5. Subtracting 7 from both sides gives 3x = 15, dividing by 3 gives x = 5."}
  ],
  "source": "roman",
  "has_reasoning": true
}
```

### Columns

| Column | Type | Description |
|--------|------|-------------|
| `messages` | `list[{role: str, content: str}]` | Standard chat format, ready for `apply_chat_template()` |
| `source` | `str` | Origin dataset: `roman`, `crownelius_reasoning`, `teichai`, `crownelius_no_reasoning` |
| `has_reasoning` | `bool` | Whether the assistant response contains `<think>` tags |

## Content Categories

- **Mathematics**: algebra, calculus, number theory, geometry, GSM8K, MATH (majority of dataset)
- **Logic**: puzzles, deductive reasoning, constraint satisfaction
- **Programming**: algorithm design, debugging, code optimization
- **Science**: physics, chemistry, biology problems
- **Bullshit detection**: identifying false claims (from TeichAI's Bullshit Bench)
- **General knowledge**: graduate-level topics in topology, cryptography, astrophysics (from No-Reasoning set)

## Model Compatibility

This dataset is model-agnostic. It uses the standard `messages` format, which works with any model's chat template:

- **Gemma 4**: `tokenizer.apply_chat_template()` (auto-converts `assistant` to `model` role)
- **Qwen 3.5**: Direct compatibility
- **Llama 3**: Direct compatibility
- **Mistral**: Direct compatibility

## Preprocessing Script

The dataset was built using [`scripts/build_dataset.py`](https://github.com/farseenshaikh/gemma4/blob/main/scripts/build_dataset.py).

To rebuild or modify:

```bash
python scripts/build_dataset.py --push Farseen0/opus-4.6-reasoning-sft-12k
```

## Acknowledgments

This dataset would not exist without the work of the original dataset creators:

- **Roman1111111**: [claude-opus-4.6-10000x](https://huggingface.co/datasets/Roman1111111/claude-opus-4.6-10000x)
- **Crownelius**: [Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) and [Opus4.6-No-Reasoning-260x](https://huggingface.co/datasets/crownelius/Opus4.6-No-Reasoning-260x)
- **TeichAI**: [Claude-Opus-4.6-Reasoning-887x](https://huggingface.co/datasets/TeichAI/Claude-Opus-4.6-Reasoning-887x)

## License

Apache 2.0 (most permissive common license across all sources). The Roman1111111 source uses MIT. Please review source dataset pages for full terms.
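Because every row records its origin in the `source` column, rows can be selected by source when the per-source license matters. The sketch below works on made-up in-memory rows; the `SOURCE_LICENSE` mapping and the `rows_with_license` helper are hypothetical names, with the license values copied from the sources listed in this card. With the loaded dataset, the same predicate could presumably be passed to `Dataset.filter`.

```python
from collections import Counter

# License per `source` value, as stated for each source in this card.
SOURCE_LICENSE = {
    "roman": "MIT",
    "crownelius_reasoning": "Apache 2.0",
    "teichai": "Apache 2.0",
    "crownelius_no_reasoning": "Apache 2.0",
}


def rows_with_license(rows, license_name):
    """Return only the rows whose source dataset carries `license_name`."""
    return [row for row in rows if SOURCE_LICENSE.get(row["source"]) == license_name]


# Made-up rows carrying only the metadata columns, for illustration.
rows = [
    {"source": "roman", "has_reasoning": True},
    {"source": "teichai", "has_reasoning": False},
    {"source": "crownelius_no_reasoning", "has_reasoning": False},
]

apache_rows = rows_with_license(rows, "Apache 2.0")
print(Counter(row["source"] for row in apache_rows))
```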
Provider:
Farseen0