
jon7009/SCoRe

Source: Hugging Face. Updated: 2026-03-13. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jon7009/SCoRe
Resource description:
---
license: apache-2.0
language:
- en
tags:
- cot
- chain-of-thought
- reasoning
- sft
- dpo
- instruction-tuning
- preference
- high-quality
- curated
- synthetic
- english
- text-generation
size_categories:
- 100K<n<1M
configs:
- config_name: sft
  data_files:
  - split: train
    path: "SCoRE_SFT_FINAL.jsonl"
  default: true
- config_name: alpaca_dpo
  data_files:
  - split: train
    path: "SCoRE.alpaca_dpo.json"
- config_name: chatml_dpo
  data_files:
  - split: train
    path: "SCoRE.chatml_dpo.json"
- config_name: sharegpt_dpo
  data_files:
  - split: train
    path: "SCoRE.sharegpt_dpo.json"
- config_name: trl_dpo
  data_files:
  - split: train
    path: "SCoRE.trl_dpo.json"
---

# Structured Chain of Reasoning

A matrix of 107 reasoning topics across 37 question forms, represented in 115,659 unique questions and 19,921 DPO pairs. Curated from the upper output distribution of GPT-OSS-120B and Qwen3-32B, guided by a curriculum and prompt architecture designed with frontier LLM assistance.

Each record is graded, filtered, and postprocessed to retain only high-quality reasoning chains. The result is a dataset that systematically captures the best reasoning these models can produce across a structured topic × form matrix they would not cover unprompted. This is curated best-of-distribution output, not raw model generation, and not an attempt to exceed the source models' reasoning ceiling.

## Domain

Reasoning frameworks, not math or code.

The dominant public CoT-SFT datasets (OpenR1-Math-220k, OpenThoughts3, NuminaMath, PRM800K) are overwhelmingly concentrated in mathematics, formal logic, and code, where answers are mechanically verifiable. General-purpose datasets (OpenHermes 2.5, Alpaca, FLAN, Tulu 3, MAGPIE) cover broader ground but provide little or no structured reasoning traces for soft analytical skills: recognizing cognitive biases, applying decision-theoretic frameworks, navigating ethical trade-offs, or performing second-order thinking.
This dataset covers 107 such concepts spanning cognitive psychology, epistemology, systems thinking, learning science, economics, ethics, and AI alignment, with every example containing a full chain-of-thought trace.

## Pipeline Architecture

Each training example is built through a deliberate multi-stage pipeline rather than a single monolithic generation call. The QA pair (question + reference answer) is generated first by GPT-OSS-120B from the topic × form matrix. The reasoning chain is then generated in a separate call and graded in a third call. Isolating each stage lets the model give full attention to one task at a time (generating a well-formed question, reasoning through it, and evaluating the result) rather than splitting focus across all three in a single prompt.

A second reasoning chain is then generated by Qwen3-32B for the original QA pair and graded again by GPT-OSS-120B. In total, the five API calls yield two independently graded CoT responses per QA pair, maximizing the quality signal available for both SFT and DPO at the expense of more API calls and compute.

## Calibrated self-grading

All entries are graded by GPT-OSS-120B on a five-criterion rubric (factual accuracy, CoT depth and logic, pedagogical clarity, teaching value, overall SFT usefulness), with each criterion scored 0–2 for a total of 0–10. A single grading model is used deliberately, so a score of 10 means the material meets or exceeds the grading model's own perceptual ceiling. The grading prompt instructs the model to output only a bare integer, and the score parser applies a multi-pass extraction strategy (exact match, regex extraction, fallback digit scan) to handle occasional formatting noise without ever misinterpreting a score.

## Open source

While the rubrics, grading, prompts, topics, and question formats all had some help with review and additional metrics or considerations from top-tier closed-source models, none of the pipeline data is synthetically generated from them.
In other words, the closed-source models cannot do the "teaching", but they can consult on the structure of the curriculum. All of the API calls used to generate this data went through a distillation-friendly provider (Groq), using models that have Apache 2.0 licenses and permit distillation. Because this training dataset (SCoRe, Structured Chain of Reasoning) is also Apache 2.0 licensed, you can use, modify, and distribute this material as long as you reference the three licenses appropriately.

## SFT and DPO Construction

For the SFT dataset, the highest-graded CoT between the two models is retained for each QA pair. For DPO, both responses are available as a preference pair. Because both were graded and only records meeting a quality threshold survive pruning, the rejected output is still competent reasoning. The preference signal is between good and better, not good and bad. This avoids the common DPO pitfall of training on low-quality rejected examples that teach the model what bad reasoning looks like rather than how to distinguish adequate reasoning from strong reasoning. DPO files contain extra metadata on accepted/rejected sources and grading.

© 2026 Jonathan Dilley. Licensed under the Apache License, Version 2.0.
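
The grading and selection steps described above can be illustrated with a minimal Python sketch. It is not the author's pipeline code, which the card does not show. The function names, the record field names, the `min_score` threshold of 8, the rule that an ambiguous reply is treated as unparseable, and the rule that tied scores produce no DPO pair are all assumptions made for illustration.

```python
import re
from typing import Optional

MAX_SCORE = 10  # five rubric criteria, each scored 0-2

# Pass 2 patterns: a labelled score ("Score: 8") or a ratio ("8/10").
_LABELLED = (
    re.compile(r"\bscore\s*[:=]?\s*(10|[0-9])\b(?!\.[0-9])", re.IGNORECASE),
    re.compile(r"\b(10|[0-9])\s*/\s*10\b"),
)
# Pass 3 pattern: integers that are not part of a decimal such as "6.5".
_STANDALONE_INT = re.compile(r"(?<![0-9.])[0-9]+(?![0-9])(?!\.[0-9])")


def parse_score(raw: str) -> Optional[int]:
    """Extract a 0-10 grade from grader output, or None if unclear."""
    text = raw.strip()
    # Pass 1: exact match, the reply is nothing but an integer in range.
    if re.fullmatch(r"[0-9]{1,2}", text) and int(text) <= MAX_SCORE:
        return int(text)
    # Pass 2: regex extraction of a labelled or ratio-style score.
    for pattern in _LABELLED:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    # Pass 3: fallback digit scan. Assumption: accept the result only if
    # exactly one distinct in-range integer appears; otherwise give up.
    found = {int(tok) for tok in _STANDALONE_INT.findall(text)
             if int(tok) <= MAX_SCORE}
    return found.pop() if len(found) == 1 else None


def build_records(prompt, graded, min_score=8):
    """Build (sft_record, dpo_record) from graded CoT candidates.

    graded: list of (source_model, cot_text, raw_grade) tuples.
    min_score is a hypothetical threshold; the card does not state one.
    """
    scored = []
    for source, cot, raw in graded:
        score = parse_score(raw)
        if score is not None and score >= min_score:
            scored.append((score, source, cot))
    if not scored:
        return None, None
    # Stable sort: on a tie, the candidate listed first stays first.
    scored.sort(key=lambda item: item[0], reverse=True)
    best = scored[0]
    sft = {"prompt": prompt, "response": best[2],
           "source": best[1], "score": best[0]}
    dpo = None
    # Assumption: a preference pair needs two survivors with different scores.
    if len(scored) == 2 and best[0] > scored[1][0]:
        worse = scored[1]
        dpo = {"prompt": prompt, "chosen": best[2], "rejected": worse[2],
               "chosen_score": best[0], "rejected_score": worse[0]}
    return sft, dpo
```

With two candidates graded `"9"` and `"Score: 10"`, the sketch keeps the second as the SFT response and emits a DPO pair with the first as `rejected`. Returning `None` for an ambiguous grader reply, rather than guessing, is one way to keep noisy output from being read as a score.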
Provider:
jon7009