five

MonumentalSystems/polymath-reasoning-v1

收藏
Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/MonumentalSystems/polymath-reasoning-v1
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 language: - en task_categories: - text-generation - question-answering tags: - reasoning - chain-of-thought - cot - thinking-tags - philosophy - ethics - synthetic - dialogue - counterfactual - logic - distillation size_categories: - n<1K pretty_name: Polymath Reasoning Corpus v1 configs: - config_name: default data_files: - split: train path: train.parquet --- # Polymath Reasoning Corpus v1 A multi-domain reasoning corpus that pairs explicit `<thinking>` chain-of-thought with substantive output across **eleven distinct cognitive formats** — chain-of-thought debates, multi-figure salons, counterfactual perspective pieces, logic problems with worked solutions, multi-step reasoning chains, math/science/code explanations, and challenging Q&A. The unifying angle is *how experts actually reason* across disciplines, not just *what they conclude*. Most CoT datasets show one path to one answer. This one shows **strategic reasoning** (what to concede, what to attack, what to leave unsaid), **collaborative synthesis** (3–4 thinkers reaching insights none could alone), and **counterfactual framing** (a 1700s natural philosopher encountering CRISPR for the first time). ## At a glance | | | |---|---| | Rows | 377 | | Words | 820,159 | | `<thinking>` blocks | 1,122 | | Categories | 11 | | Generator | GLM-4.7-flash via Z.AI | | License | CC-BY-4.0 | ### Category breakdown | Category | Rows | Format | |---|---:|---| | `cot_debate` | 95 | Two-figure debate with `<thinking>...</thinking>` before each spoken turn (6–7 exchanges) | | `salon` | 65 | 3–4 historical figures collaboratively exploring a question; multi-phase synthesis | | `science_explanation` | 46 | 700–1200-word deep-dive across 35 topics in bio/physics/chem/geo/neuro | | `perspective` | 31 | A historical figure encounters a modern concept (CRISPR, LLMs, gravitational waves...) | | `qa_pair` | 30 | Challenging cross-domain question with detailed reasoned answer | | `reasoning_chain` | 30 | 8–12 explicit reasoning steps with named alternatives and assumption-testing | | `logic_problem` | 23 | Knights/knaves, induction, probability, set theory — full worked solutions with multi-method verification | | `math_explanation` | 20 | Undergraduate/graduate math with full derivations + worked examples | | `code_walkthrough` | 18 | Working Python/Rust with stepwise explanation + complexity analysis | | `synthesis` | 18 | Cross-domain essays joining 3 disciplines (e.g., thermodynamics × economics × evolution) | | `debate` | 1 | Plain figure-vs-figure debate (legacy format) | ## Schema Each row contains: | Field | Type | Description | |---|---|---| | `id` | string | Source filename (without extension) | | `category` | string | One of the 11 categories above | | `text` | string | Full content | | `speakers` | list[string] | Distinct named speakers (for dialogue formats) | | `n_speakers` | int | `len(speakers)` | | `n_thinking_blocks` | int | Count of `<thinking>...</thinking>` blocks | | `char_count` | int | Character count | | `word_count` | int | Word count | | `source_model` | string | `"GLM-4.7-flash"` | | `source_provider` | string | `"Z.AI"` | ## Purpose and scope Reasoning datasets released after DeepSeek-R1 (Jan 2025) cluster heavily in math, coding, and verifiable science — where there's a single right answer and a clean reward signal. This corpus deliberately covers **underexplored cognitive territory**: - **Ethics & philosophy** — a category Bespoke Labs called out as missing in their 2025 reasoning datasets competition; no winner came from this space - **Counterfactual reasoning** — perspective pieces are literally counterfactual ("what would Aristotle make of CRISPR?") - **Strategic reasoning** — the `<thinking>` blocks in CoT debates include rhetorical choices (what to concede vs. attack) alongside object-level analysis - **Collaborative synthesis** — multi-figure salons demonstrate cross-domain idea combination, not just isolated problem-solving The CoT debate and salon formats are intended to support **reasoning distillation** into smaller models, providing structured cognitive exemplars for the kind of strategic + analytical reasoning that's hard to elicit with simple problem prompts. ## Dataset creation method All content was generated via the [text-pipeline](https://github.com/MonumentalSystems) synthetic generator (`synth_debates.py`) calling **GLM-4.7-flash** through Z.AI. For each category, a prompt template requires the model to: 1. Produce a substantial output (typically 700–2000 words) 2. Embed reasoning *before* outputs in `<thinking>...</thinking>` tags (CoT debates only) 3. Reference *specific* prior work, named experiments, or canonical results — not vague allusion 4. Pressure-test the conclusion (alternative paths, assumptions, what would change the answer) The 84-figure pool of historical thinkers spans 6th-century-BCE China to 20th-century physics (Aristotle, Hypatia, Maxwell, Noether, Mirzakhani, Wu, Du Fu, Murasaki Shikibu, Frederick Douglass, Rachel Carson, Kolmogorov, ...), with a deliberate effort to include non-Western and historically underrepresented contributors. Topics, figures, and synthesis-domain triples are drawn from finite curated pools (50–86 topics depending on category, 84 figures, 18 domain triples) and combined via a shuffled cycle so within a single run no combination repeats before all are seen. Topics drawn from **86 cross-domain prompts** (covering metaphysics, ethics, complexity, language/cognition, technology, history, aesthetics, and underexplored angles like "whether emergence is real or just a description of our ignorance"), **35 deep-science topics**, **30 challenging questions**, **23 logic problems**, **20 math topics**, and **18 code topics**. Full topic lists and prompt templates are in the source script. Generation parameters: `temperature=0.8`, `max_tokens=8192`, `thinking: {"type": "disabled"}` on Z.AI (which is itself a reasoning model — disabled to prevent the model's internal reasoning from consuming the output budget; the `<thinking>` blocks in the dataset are explicit content the model writes per the prompt template). Cleaning: the byte-level cleaner (`cleaner_v2.ByteCleaner`) ran on every file, preserving `<thinking>` tags, LaTeX (`$...$`, `\frac{}{}`), code fences (` ``` `), em-dashes, smart quotes, and Python indentation. No ASCII whitelist is applied — this is a byte-level corpus. ## Example uses - **Reasoning distillation** — fine-tune a small model on `cot_debate` rows where each exchange shows `<thinking>` then spoken response, teaching the small model to "think before speaking" with explicit strategic content - **Ethics & philosophy reasoning evaluation** — build benchmarks from `salon` and `perspective` rows where there is no single correct answer but reasoning quality varies - **Counterfactual reasoning training** — `perspective` rows are concrete counterfactual exercises (constraint: reason from a specific historical epistemic frame about a modern phenomenon) - **Multi-agent dialogue training** — `salon` rows show how 3–4 distinct voices can collaboratively reach a synthesis, useful for multi-agent / debate-based RLAIF - **Logic & math instruction** — `logic_problem` and `math_explanation` rows include multi-method verification (every problem solved by at least two independent approaches), useful for self-consistency training ## Sample structure A typical `cot_debate` row contains pairs like: ``` Ada Lovelace: <thinking>It is always a relief to speak with someone who commands such intellectual rigor; Dr. Wu does not suffer fools or loose abstractions. She speaks of symmetries and physical laws, yet she is too quick to dismiss the structural architecture of sound. To connect them, I must bridge the gap between the discrete nature of numbers and the continuous flow of a melody. I will use the specific example of the Differences Engine to show how her "continuous" universe is actually built from discrete steps.</thinking> "The notion that mathematics and music belong to entirely separate kingdoms—the one cold and logical, the other passionate and imprecise—is a failure of imagination, Madam Wu. ..." ``` A typical `logic_problem` row begins with the problem statement and walks through the full solution with rule citations (e.g., "by modus tollens", "by induction hypothesis") followed by independent verification. ## Limitations and biases - **All synthetic** — no human-written content. Subject to whatever biases GLM-4.7-flash carries; in particular, the model's depiction of historical figures is its reconstruction, not their actual writing. Treat the dialogue as *plausible-style*, not *attested*. - **Single generator** — all rows from one model family. A more robust corpus would mix generators; this is a v1 from one provider. - **English-only** despite the multicultural figure pool — Du Fu speaks in English prose, Murasaki Shikibu writes in modern English, etc. The historical-frame authenticity is stylistic, not linguistic. - **No ground-truth answers** for `salon`, `perspective`, `cot_debate`, `qa_pair` — these are reasoning *demonstrations*, not evaluation tasks with correct labels. - **Content filter incidents** — 1 of ~136 generations was rejected by Z.AI's content filter (Alexander Hamilton vs. Rachel Carson on a debate topic) and is absent from the dataset. - **5 of ~850 `[Internal: ...]` markers** in older CoT files (regenerated to `<thinking>` tags) had model formatting failures with no closing bracket and were left as-is rather than risk corrupting the surrounding content. - **Western philosophy overweight** — despite the deliberate inclusion of non-Western figures (Zhuangzi, Nagarjuna, Al-Biruni, Brahmagupta, Murasaki Shikibu, Du Fu, Ibn Khaldun, Hypatia, Avicenna, Lao Tzu, Hildegard von Bingen, Omar Khayyam), the topic pool and idiom of debate skew Western/analytical. - **No multimodal content** — text-only. ## License [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). Free for research, commercial use, and redistribution with attribution. ## Citation ```bibtex @misc{polymath_reasoning_v1_2026, title = {Polymath Reasoning Corpus v1}, author = {Monumental Systems}, year = {2026}, publisher = {Hugging Face}, url = {https://huggingface.co/datasets/MonumentalSystems/polymath-reasoning-v1} } ```
提供机构:
MonumentalSystems
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作