
j0no12/unified-reasoning-dataset

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/j0no12/unified-reasoning-dataset
Resource description:
---
dataset_info:
  features:
  - name: thinking
    dtype: string
  - name: instruction
    dtype: string
  - name: response
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 106820084
    num_examples: 94860
  download_size: 53728052
  dataset_size: 106820084
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
- text-generation
language:
- en
license: apache-2.0
tags:
- reasoning
- chain-of-thought
- instruction-tuning
- math
- distillation
- synthetic
size_categories:
- 10K<n<100K
---

# j0no12/unified-reasoning-dataset

## Dataset Details

### Dataset Description

A unified, normalized reasoning dataset assembled from four high-quality sources. Each source was cleaned, schema-harmonized, quality-filtered (min 10 chars per field), and globally deduplicated on `(instruction, response)` pairs before merging. The result is a single 94,860-row dataset with consistent `instruction / thinking / response` structure, suitable for supervised fine-tuning of reasoning-capable language models.

- **Curated by:** j0no12
- **Language(s):** English
- **License:** Apache-2.0 (most restrictive license across sources)

### Dataset Sources

| Source | Repo | License | Approx. Rows |
|---|---|---|---|
| Opus-4.6-Reasoning (cleaned) | [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | Apache-2.0 | ~2,160 |
| Qwen3.5-Reasoning-700x | [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | Apache-2.0 | ~700 |
| GPT-5.4-Step-by-Step-Reasoning | [Roman1111111/gpt-5.4-step-by-step-reasoning](https://huggingface.co/datasets/Roman1111111/gpt-5.4-step-by-step-reasoning) | MIT | ~1,500 |
| Crow-8B Training Data (cleaned) | [Crownelius/Crow-8B-Training-Data-Clean](https://huggingface.co/datasets/Crownelius/Crow-8B-Training-Data-Clean) | Not stated | ~91,000 |

## Uses

### Direct Use

This dataset is intended for supervised fine-tuning (SFT) of language models to improve step-by-step reasoning, chain-of-thought generation, and instruction-following. It pairs naturally with models in the 1B–27B range such as Qwen3.5 (4B, 9B, 27B), GPT-OSS 20B, and similar reasoning-capable architectures.

### Out-of-Scope Use

This dataset should not be used for RLHF reward modeling without additional preference labels. It is not suitable for factual knowledge retrieval tasks, as responses are synthetically generated and may contain reasoning errors. Do not use for safety-critical applications without further human review.

## Dataset Structure

Each row contains four string fields:

| Field | Description |
|---|---|
| `instruction` | The problem, question, or prompt given to the model |
| `thinking` | The chain-of-thought / internal reasoning trace (may be empty for sources that did not include one) |
| `response` | The final answer or solution |
| `source` | Which source dataset the row originated from (`opus_reasoning`, `qwen_reasoning`, `gpt_reasoning`, `crow_8b`) |

The `thinking` field is populated for rows from `opus_reasoning` and any sources with explicit CoT traces. For rows where no reasoning trace was available in the source, `thinking` is an empty string.
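As a rough illustration of the schema above, the following sketch checks that a row carries the four string fields and tallies rows by `source` and by whether `thinking` is populated. It works on plain Python dicts. The sample rows and the helper names `validate_row` and `summarize` are invented for this example and are not part of the dataset or its tooling.

```python
from collections import Counter

# Field names and source labels as listed in the Dataset Structure table.
FIELDS = ("instruction", "thinking", "response", "source")
SOURCES = {"opus_reasoning", "qwen_reasoning", "gpt_reasoning", "crow_8b"}


def validate_row(row):
    """Return True if the row has exactly the four string fields and a known source."""
    if set(row) != set(FIELDS):
        return False
    if not all(isinstance(row[name], str) for name in FIELDS):
        return False
    return row["source"] in SOURCES


def summarize(rows):
    """Count rows per (source, has_trace), where has_trace means non-empty `thinking`."""
    counts = Counter()
    for row in rows:
        has_trace = bool(row["thinking"].strip())
        counts[(row["source"], has_trace)] += 1
    return counts


# Hypothetical sample rows, invented for this example.
sample_rows = [
    {"instruction": "What is 12 * 12?", "thinking": "12 * 12 = 144.",
     "response": "The answer is 144.", "source": "opus_reasoning"},
    {"instruction": "Name a prime number above 10.", "thinking": "",
     "response": "11 is a prime number.", "source": "crow_8b"},
]

all_valid = all(validate_row(row) for row in sample_rows)
summary = summarize(sample_rows)
```

To run the same checks on the real data, the rows could presumably be obtained with `datasets.load_dataset("j0no12/unified-reasoning-dataset", split="train")` and iterated as dicts; that step is left out here so the sketch runs without network access.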
## Dataset Creation

### Curation Rationale

Each of the four source datasets provides a distinct flavor of reasoning data: Claude Opus-style long-form problem solving, Qwen3.5 teacher-distilled CoT, GPT-style step-by-step logic, and a large general instruction corpus. Merging them gives broader coverage across reasoning styles, difficulty levels, and domains while keeping a consistent schema that eliminates preprocessing friction during fine-tuning.

### Data Collection and Processing

1. **Loading:** All four datasets were loaded via the Hugging Face `datasets` library (`train` split).
2. **Normalization:** Each dataset was mapped to a unified `{instruction, thinking, response, source}` schema. Fields were matched by priority lookup across common naming conventions (`problem`/`input`/`question` → `instruction`, `solution`/`output`/`answer` → `response`, etc.). Datasets with a `messages`/`conversations` list format were parsed by role.
3. **Quality filtering:** Rows with an `instruction` or `response` shorter than 10 characters after stripping were removed.
4. **Deduplication:** Rows were globally deduplicated on the `(instruction, response)` pair across all sources to remove cross-dataset overlap.
5. **Output:** The final dataset was serialized to Parquet and pushed via `push_to_hub`.

### Source Data Producers

- **Crownelius/Opus-4.6-Reasoning-3300x:** Synthetically generated using Claude Opus 4.6; pre-cleaned to remove refusals and low-quality completions.
- **Jackrong/Qwen3.5-reasoning-700x:** Distilled from Qwen3.5-27B (full-parameter) via Alibaba Cloud DashScope, seeded from Alibaba-Superior-Reasoning-Stage2 instructions. Covers math, logic, and general QA with long CoT traces.
- **Roman1111111/gpt-5.4-step-by-step-reasoning:** Ultra-high-density synthetic reasoning corpus generated with GPT-5.4 step-by-step prompting. Best suited for fine-tuning 2B–20B models.
- **Crownelius/Crow-8B-Training-Data-Clean:** Large general instruction dataset (~615K completion tokens, ~91K rows) generated via OpenRouter at ~$3 total cost; average 1 turn per example.

### Personal and Sensitive Information

All data is synthetically generated. No personally identifiable information (PII) is known to be present. Users should nonetheless apply their own PII screening if deploying in sensitive contexts.

## Bias, Risks, and Limitations

- **Synthetic origin:** All four sources are AI-generated. Reasoning chains may contain subtle logical errors, hallucinated facts, or stylistic artifacts from the teacher model.
- **English only:** All content is in English. Performance on multilingual fine-tuning is untested.
- **Math/logic skew:** The Qwen and GPT sources skew toward mathematical and logical reasoning. General instruction coverage is primarily provided by the Crow-8B source.
- **Empty `thinking` fields:** A significant portion of rows (primarily from `crow_8b`) have no reasoning trace. Practitioners training with a thinking loss mask should filter or weight accordingly.

### Recommendations

When fine-tuning, consider masking the loss on the `thinking` field for rows where it is empty rather than training on empty strings. For reasoning-focused training, filter to rows where `len(thinking) > 50` to ensure the model learns from substantive CoT traces.
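The quality-filtering and deduplication steps under Data Collection and Processing, together with the `len(thinking) > 50` filter recommended above, can be sketched in plain Python as follows. This is not the curator's actual script: the function names, the exact-string dedup key, and the choice to keep the first occurrence of a duplicate are assumptions made for illustration.

```python
MIN_FIELD_CHARS = 10     # minimum stripped length for instruction/response (from the card)
MIN_THINKING_CHARS = 50  # threshold from the Recommendations section


def quality_filter(rows, min_chars=MIN_FIELD_CHARS):
    """Drop rows whose stripped instruction or response is shorter than min_chars."""
    return [
        row for row in rows
        if len(row["instruction"].strip()) >= min_chars
        and len(row["response"].strip()) >= min_chars
    ]


def deduplicate(rows):
    """Keep the first row seen for each (instruction, response) pair.

    Assumption: exact string match on the pair, first occurrence wins.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["instruction"], row["response"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def with_substantive_thinking(rows, min_chars=MIN_THINKING_CHARS):
    """Keep rows where len(thinking) > min_chars."""
    return [row for row in rows if len(row["thinking"]) > min_chars]


# Hypothetical rows, invented for this example.
rows = [
    {"instruction": "Solve 2x + 6 = 10 for x.",
     "thinking": "Subtract 6 from both sides to get 2x = 4, then divide by 2.",
     "response": "x = 2 is the solution.", "source": "gpt_reasoning"},
    # Same (instruction, response) pair from another source: removed as a duplicate.
    {"instruction": "Solve 2x + 6 = 10 for x.", "thinking": "",
     "response": "x = 2 is the solution.", "source": "crow_8b"},
    # Instruction shorter than 10 characters: removed by the quality filter.
    {"instruction": "Hi", "thinking": "",
     "response": "Hello, how can I help you today?", "source": "crow_8b"},
    # Passes the filters but has no reasoning trace.
    {"instruction": "List three primary colors.", "thinking": "",
     "response": "Red, yellow, and blue.", "source": "crow_8b"},
]

cleaned = deduplicate(quality_filter(rows))
reasoning_only = with_substantive_thinking(cleaned)
```

Here `cleaned` keeps two rows and `reasoning_only` keeps only the row with a trace longer than 50 characters. Because the first occurrence wins, the order in which sources are concatenated decides which `source` label a cross-dataset duplicate retains; the card does not say how the original pipeline resolved this.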
## Citation

If you use this dataset, please also credit the original source datasets:

    @dataset{crownelius_opus_reasoning_2026,
      title  = {Opus-4.6-Reasoning-3300x (Cleaned)},
      author = {Crownelius},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x}
    }

    @dataset{jackrong_qwen_reasoning_2026,
      title  = {Qwen3.5-reasoning-700x},
      author = {Jackrong},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x}
    }

    @dataset{roman_gpt_reasoning_2026,
      title  = {GPT-5.4-Step-by-Step-Reasoning},
      author = {Roman1111111},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Roman1111111/gpt-5.4-step-by-step-reasoning}
    }

    @dataset{crownelius_crow8b_2026,
      title  = {Crow-8B Training Data (Cleaned)},
      author = {Crownelius},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Crownelius/Crow-8B-Training-Data-Clean}
    }

## Dataset Card Authors

j0no12

## Dataset Card Contact

https://huggingface.co/j0no12
Provider:
j0no12