botcoinmoney/dacr-bench-results

Name: botcoinmoney/dacr-bench-results
Creator: botcoinmoney
Published: 2026-04-08 07:11:52
License: No description provided

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/botcoinmoney/dacr-bench-results

Description:

---
license: apache-2.0
language:
- en
tags:
- multi-hop-reasoning
- document-reasoning
- causal-reasoning
- benchmark-results
- fine-tuning
- synthetic-data
- transfer-learning
- adversarial-reasoning
- confidence-calibration
- dacr-bench
task_categories:
- question-answering
pretty_name: "DACR-Bench Results: Synthetic-to-Real Document Reasoning Transfer"
size_categories:
- n<1K
configs:
- config_name: questions
  data_files: "data/questions.parquet"
  description: "Per-question results with baseline and fine-tuned predictions (100 questions, 10 challenges)"
- config_name: challenges
  data_files: "data/challenges.parquet"
  description: "Per-challenge aggregate scores for baseline and fine-tuned models (10 challenges)"
---

# DACR-Bench Results: Synthetic-to-Real Document Reasoning Transfer

Evaluation results demonstrating that fine-tuning a 7B model on **4,421 procedurally generated** document reasoning traces **more than doubles accuracy on real arXiv papers** (18.9% → 40.0% on real documents in DACR-Bench), with gains concentrated in multi-hop reasoning, numerical computation, and causal authority resolution under conflicting information.

All results reported below are on **real arXiv documents only** (9 challenges, 90 questions) across 9 scientific domains, none of which appeared in training.

## Three Core Findings

### 1. Synthetic-to-Real Transfer Works for Document Reasoning

Training on procedurally generated documents with fictional entities and fabricated data produces measurable improvement on real arXiv papers across 9 scientific domains. The gains are **skill-specific**: multi-step reasoning improves dramatically while single-hop extraction stays flat, suggesting the model learns **decomposable reasoning patterns** rather than surface-level domain knowledge.

### 2. Causal Authority Resolution Is a Trainable Skill

When documents contain conflicting information (e.g., a preliminary estimate of "15 datasets" in the introduction vs. the actual "12 datasets" in the results), models must perform causal reasoning to determine which value is authoritative. The baseline falls for 92% of such conflicts. After training on procedurally generated examples with similar conflicts, the fine-tuned model correctly identifies the authoritative source **46% of the time** on real papers. This is comparable to the improvement WikiContradict achieves through prompting (10% → 44%), but ours generalizes to unseen documents without requiring conflict-aware prompts.

### 3. The Model Learns to Engage With Hard Questions It Previously Couldn't Attempt

The baseline produces valid JSON for only 26/100 questions, and on those it's 69% accurate. It's a capable reasoner that simply can't produce structured output for long, complex documents. The fine-tuned model answers 60/100 questions, including 42 the baseline couldn't attempt at all.

Of those 42 newly answerable questions:

- **52% answered correctly** overall
- **44% correct on multi-hop reasoning** (2+ hops, n=30)
- **44% correct on computation** (n=9)
- **71% correct on conflict-targeted** questions (n=7)
- **100% correct on conditional filtering** (n=3)

A model that only learned better formatting would score ~0% on multi-hop and computation questions. Instead, it demonstrates genuine reasoning on question types the baseline couldn't even produce output for.

The fine-tuning didn't make the model smarter at questions it could already answer (head-to-head accuracy is comparable: 72% baseline vs 78% fine-tuned on 18 shared questions). It **expanded the model's capability frontier**, enabling it to reason about and produce structured answers for complex multi-step problems over long documents.

## Results

All results are **deterministic** (`temperature=0, do_sample=False`). Fully reproducible.
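The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only percentages and counts quoted in this card and recomputes nothing from the raw data. The variable names are illustrative.

```python
# Figures as quoted in this card (not recomputed from the parquet files).
baseline_acc = 18.9    # baseline answer accuracy on real documents, in percent
finetuned_acc = 40.0   # fine-tuned answer accuracy on real documents, in percent

# Absolute gain in percentage points and relative gain as a ratio.
delta_pp = round(finetuned_acc - baseline_acc, 1)
ratio = round(finetuned_acc / baseline_acc, 1)

# Answered-question accounting from the third finding:
# questions answered by the fine-tuned model minus the newly answerable ones
# gives the questions both models answered.
finetuned_answered = 60
newly_answerable = 42
shared = finetuned_answered - newly_answerable

print(delta_pp, ratio, shared)
```

The three values line up with the reported gain of +21.1 points (2.1×) and with the 18 shared questions used for the head-to-head comparison.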
### Overall (Real arXiv Documents, 9 challenges, 90 questions)

Delta columns in this section are absolute differences in percentage points.

| Metric | Baseline (Qwen 2.5 7B) | Fine-tuned | Delta |
|--------|------------------------|------------|-------|
| Answer Accuracy | 18.9% | **40.0%** | **+21.1% (2.1×)** |
| Pass Rate (≥ 60%) | 0/9 | **4/9 (44%)** | **+44%** |
| Causal Authority Resolution | 7.7% | **46.2%** | **+38.5%** |

### By Reasoning Category (Real Documents)

| Category | What It Tests | Baseline | Fine-tuned | Delta |
|----------|--------------|----------|------------|-------|
| direct_extraction | Single fact lookup (1 hop) | 58.3% | 58.3% | 0.0% |
| **multi_hop_bridge** | Connect facts via bridge entity (2-3 hops) | 5.6% | **33.3%** | **+27.8%** |
| comparative | Compare entities on same attribute | 11.1% | 11.1% | 0.0% |
| **computation** | Extract numbers and compute derived value | 0.0% | **28.6%** | **+28.6%** |
| **conditional_filtered** | Apply filter condition before answering | 0.0% | **42.9%** | **+42.9%** |
| **conflict_targeted** | Identify authoritative source amid conflicting info | 7.7% | **46.2%** | **+38.5%** |
| **cross_section_synthesis** | Integrate 3+ document sections | 0.0% | **40.0%** | **+40.0%** |

### By Reasoning Depth (Real Documents)

| Hops | Baseline | Fine-tuned | Delta |
|------|----------|------------|-------|
| 1 (single lookup) | 43.8% | 59.4% | +15.6% |
| **2 (bridge entity)** | **7.3%** | **31.7%** | **+24.4%** |
| **3 (chain reasoning)** | **0.0%** | **23.1%** | **+23.1%** |
| **4+ (deep synthesis)** | **0.0%** | **25.0%** | **+25.0%** |

### By Domain (all real arXiv papers)

| Domain | Baseline | Fine-tuned | Delta |
|--------|----------|------------|-------|
| physics | 0.0% | **80.0%** | +80.0% |
| economics | 20.0% | **70.0%** | +50.0% |
| biology | 50.0% | **70.0%** | +20.0% |
| medicine | 20.0% | **60.0%** | +40.0% |
| chemistry | 20.0% | **50.0%** | +30.0% |

## What Is DACR-Bench?

DACR-Bench (Domain-Agnostic Causal Reasoning Benchmark) evaluates multi-hop reasoning over real technical documents. Each challenge gives the model a full scientific paper and 10 questions spanning 7 reasoning categories.

Key features:

- **Real documents.** ~60% of challenges use actual arXiv papers across biology, chemistry, physics, medicine, economics, CS/AI, NLP, materials science, and climate science.
- **Adversarial traps.** Conflicting information is planted at different document locations. The model must identify the causally authoritative source without being warned.
- **7 question categories** testing distinct reasoning skills: direct extraction, multi-hop bridging, comparison, computation, conditional filtering, trap resolution, and cross-section synthesis.
- **Graded complexity.** Questions range from 1-hop (simple lookup) to 4+ hops (deep multi-section synthesis), with difficulty labels (easy/medium/hard).
- **Structured output.** Models must produce JSON with answer, citation, and confidence per question, testing both reasoning ability and format compliance.
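
The structured-output requirement above can be illustrated with a minimal validity check. This is only a sketch: the card does not give the exact response schema, so the layout assumed here (a JSON object keyed by question ID, with `answer`, `citation`, and `confidence` fields) and the `parse_response` helper are hypothetical. The example payload is invented for illustration.

```python
import json
from typing import Optional

# Field names taken from the description above; the nesting is an assumption.
REQUIRED_KEYS = ("answer", "citation", "confidence")


def parse_response(raw: str) -> Optional[dict]:
    """Return the parsed response if it is valid and complete, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed or truncated JSON counts as "not answered"
    if not isinstance(data, dict):
        return None
    for entry in data.values():
        if not isinstance(entry, dict):
            return None
        if any(key not in entry for key in REQUIRED_KEYS):
            return None
        conf = entry["confidence"]
        # Confidence is described as a value in the range 0.0-1.0.
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            return None
        if not 0.0 <= conf <= 1.0:
            return None
    return data


# Hypothetical model outputs, for illustration only.
example = '{"q01": {"answer": "12 datasets", "citation": "Results section", "confidence": 0.9}}'
parsed = parse_response(example)
truncated = parse_response('{"q01": {"answer": "12 datasets"')
```

Under this sketch, a response cut off mid-object returns `None`, which is one way a model could fail a question on format alone before its reasoning is ever scored.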

### How DACR-Bench Compares to Existing Benchmarks

| Feature | MuSiQue | HotpotQA | BRIDGE | DocHop-QA | ConflictBank | **DACR-Bench** |
|---------|---------|----------|--------|-----------|-------------|----------------|
| Real documents | Wikipedia | Wikipedia | arXiv | PubMed | Wikidata | **arXiv** |
| Multi-hop (3+) | Yes | No (2 only) | Yes | Yes | No | **Yes** |
| Numerical computation | No | No | No | No | No | **Yes** |
| Adversarial traps | No | No | No | No | Partial | **Yes** |
| Citation grounding | No | Sentence | Evidence | No | No | **Yes** |
| Multi-domain | No | No | CS only | Biomed | Yes | **Yes (10+)** |

## Dataset Structure

### `questions` config (100 rows)

Each row is a single question with both model predictions:

| Column | Description |
|--------|-------------|
| `challenge_id` | Challenge identifier |
| `question_id` | Question ID (q01-q10) |
| `domain` | Scientific domain |
| `source_type` | `document` (real arXiv) or `engine` (synthetic) |
| `question_text` | The question |
| `category` | Reasoning category (7 types) |
| `hops` | Reasoning depth (1-4+) |
| `difficulty` | easy, medium, hard |
| `gold_answer` | Ground truth |
| `is_trap` | Whether question targets planted conflicting info |
| `reasoning_chain` | Step-by-step derivation (JSON) |
| `baseline_answer` / `baseline_correct` | Baseline model prediction and correctness |
| `finetuned_answer` / `finetuned_correct` | Fine-tuned model prediction and correctness |
| `*_confidence` | Model's stated confidence (0.0-1.0) |

### `challenges` config (10 rows)

Per-challenge aggregate scores with accuracy deltas.
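
The columns above are enough to rebuild a per-category breakdown restricted to real documents. The following is a minimal sketch that works on any iterable of row dictionaries, so it should also accept the loaded `questions` split, which yields one dict per row when iterated. `accuracy_by_category` is a hypothetical helper, and the toy rows are made up for illustration rather than taken from the dataset.

```python
from collections import defaultdict


def accuracy_by_category(rows, source_type="document"):
    """Map category -> (baseline accuracy, fine-tuned accuracy, n) for one source type."""
    # Per category: [baseline correct, fine-tuned correct, total questions]
    counts = defaultdict(lambda: [0, 0, 0])
    for row in rows:
        if row["source_type"] != source_type:
            continue  # skip synthetic ("engine") challenges by default
        tally = counts[row["category"]]
        tally[0] += bool(row["baseline_correct"])
        tally[1] += bool(row["finetuned_correct"])
        tally[2] += 1
    return {cat: (b / n, f / n, n) for cat, (b, f, n) in counts.items()}


# Toy rows for illustration only; these are not real dataset records.
toy_rows = [
    {"source_type": "document", "category": "computation",
     "baseline_correct": False, "finetuned_correct": True},
    {"source_type": "document", "category": "computation",
     "baseline_correct": False, "finetuned_correct": False},
    {"source_type": "engine", "category": "computation",
     "baseline_correct": True, "finetuned_correct": True},
]
breakdown = accuracy_by_category(toy_rows)
```

Filtering on `source_type` matters because the `questions` config holds all 10 challenges, while the reported tables cover only the 9 real-document ones.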

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Training data | [domain-agnostic-causal-reasoning-tuning](https://huggingface.co/datasets/botcoinmoney/domain-agnostic-causal-reasoning-tuning) (sft_reasoning_v2) |
| Train examples | 4,421 synthetic document reasoning traces |
| Method | QLoRA (4-bit NF4, r=32, alpha=64) |
| Loss | **Completion-only** (assistant response tokens only; critical for transfer) |
| Max sequence length | 13,000 tokens (no document truncation) |
| Epochs | 1 |
| Hardware | 1× NVIDIA H100 80GB, ~2.8 hours |
| Evaluation | Deterministic (temperature=0) |

The training data is generated by a decentralized challenge network where multiple frontier AI agents independently solve structured document reasoning challenges, producing naturally diverse reasoning traces graded by automated verifiers across 8 quality dimensions (answer accuracy, constraint satisfaction, citation accuracy, trap avoidance, reasoning depth, trace quality, composite score, pass/fail).
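
The completion-only loss listed in the table can be sketched as label masking: tokens belonging to the prompt (system text, document, questions) are hidden from the loss so that only assistant response tokens contribute. The snippet below is a generic illustration of that idea, not the project's training script. `mask_prompt_labels` is a hypothetical helper and the token IDs are invented.

```python
# -100 is the label value that PyTorch's CrossEntropyLoss ignores by default.
IGNORE_INDEX = -100


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, hiding the first prompt_len tokens from the loss."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must lie between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Hypothetical token IDs: four prompt tokens followed by three response tokens.
input_ids = [101, 102, 103, 104, 201, 202, 203]
labels = mask_prompt_labels(input_ids, prompt_len=4)
```

With a 13,000-token sequence limit, most tokens in each example are likely document text, so without masking the loss would be dominated by reproducing the input rather than by the reasoning trace.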

## Usage

```python
from datasets import load_dataset

# Per-question results (one row per question, with both models' predictions)
questions = load_dataset("botcoinmoney/dacr-bench-results", "questions", split="train")

# Multi-hop questions only (2 or more reasoning hops)
multi_hop = questions.filter(lambda x: x["hops"] >= 2)
b_acc = sum(multi_hop["baseline_correct"]) / len(multi_hop)
f_acc = sum(multi_hop["finetuned_correct"]) / len(multi_hop)
print(f"Multi-hop: baseline={b_acc:.1%}, finetuned={f_acc:.1%}")

# Trap questions (those targeting planted conflicting information)
traps = questions.filter(lambda x: x["is_trap"])
print(f"Trap evasion: baseline={sum(traps['baseline_correct'])/len(traps):.1%}, "
      f"finetuned={sum(traps['finetuned_correct'])/len(traps):.1%}")

# Per-challenge summaries
challenges = load_dataset("botcoinmoney/dacr-bench-results", "challenges", split="train")
```

## Reproduction

- Code: [botcoinmoney/synthetic-reasoning-transfer](https://github.com/botcoinmoney/synthetic-reasoning-transfer) (training script, benchmark runner, evaluation harness)
- Benchmark: [botcoinmoney/dacr-bench](https://github.com/botcoinmoney/dacr-bench)

## All Results Are On Real arXiv Documents

All results reported above are measured exclusively on 9 real arXiv papers (90 questions) across biology, chemistry, climate science, CS/AI, economics, materials science, medicine, NLP, and physics. None of these documents appeared in the training data. The training data consists entirely of procedurally generated documents with fictional entities, so the transfer to real scientific papers is genuine cross-domain generalization.

## Confidence Calibration Note

Neither baseline nor fine-tuned model shows meaningful confidence calibration on answered questions. Both assign ~90% confidence regardless of correctness. The confidence gap between correct and wrong answers is < 1% for both models. SFT does not teach uncertainty awareness; this likely requires RL-based training with the dataset's pre-computed reward signals.
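
The confidence gap described above can be measured as the mean stated confidence on correct answers minus the mean on wrong ones. This is a minimal sketch under two assumptions that the card does not confirm: the `*_confidence` columns are named `baseline_confidence` and `finetuned_confidence`, and unanswered questions carry a missing (`None`) confidence. `confidence_gap` is a hypothetical helper and the toy rows are invented.

```python
from statistics import mean


def confidence_gap(rows, model):
    """Mean confidence on correct answers minus mean confidence on wrong answers."""
    correct, wrong = [], []
    for row in rows:
        conf = row.get(f"{model}_confidence")
        if conf is None:
            continue  # assumed marker for questions the model did not answer
        (correct if row[f"{model}_correct"] else wrong).append(conf)
    if not correct or not wrong:
        return None  # gap is undefined without both groups
    return mean(correct) - mean(wrong)


# Toy rows for illustration only; these are not real dataset records.
toy_rows = [
    {"baseline_confidence": 0.92, "baseline_correct": True},
    {"baseline_confidence": 0.90, "baseline_correct": False},
    {"baseline_confidence": None, "baseline_correct": False},
]
gap = confidence_gap(toy_rows, "baseline")
```

A gap near zero, as reported for both models, means the stated confidence carries almost no information about whether an answer is right.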

## Citation

```bibtex
@dataset{dacr_bench_results_2026,
  title={DACR-Bench Results: Synthetic-to-Real Document Reasoning Transfer},
  author={botcoinmoney},
  year={2026},
  url={https://huggingface.co/datasets/botcoinmoney/dacr-bench-results}
}
```

Provided by:

botcoinmoney
