academic-chains-dev
收藏魔搭社区2025-10-30 更新2025-05-03 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/academic-chains-dev
下载链接
链接失效反馈官方服务:
资源简介:
<a href="https://github.com/bespokelabsai/curator/">
<img src="https://huggingface.co/datasets/marcodsn/academic-chains-dev/resolve/main/made_with_curator.png" alt="Made with Curator" width=200px>
</a>
# Dataset Card for Academic Reasoning and Intuition Chains (DEV Snapshot)
> [!Important]
> This dataset is a **snapshot** of our ongoing development work, submitted specifically for the [Reasoning Dataset Competition](https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition) (April 2025). It showcases our latest experimental pipeline, featuring expanded domain coverage, automated cleaning, and a novel LLM-based verification step targeting hypothetical reasoning. **For the most up-to-date version and future improvements, please refer to the `dev` revision of our main dataset repository: [`marcodsn/academic-chains`](https://huggingface.co/datasets/marcodsn/academic-chains?revision=dev).** This temporary `academic-chains-dev` repository will *not* be updated post-competition.
> [!Note]
> **Disclaimer:** Most of what you are going to read below was generated by Gemini 2.5 Pro as I'm almost out of time for this competition, so if this README sounds off compared to my other submission that's why!
## Dataset Description
* **GitHub (Development Branch):** [https://github.com/marcodsn/academic-chains/tree/feature/reorganize-repository](https://github.com/marcodsn/academic-chains/tree/feature/reorganize-repository)
* **Main Dataset (Ongoing Development):** [https://huggingface.co/datasets/marcodsn/academic-chains](https://huggingface.co/datasets/marcodsn/academic-chains)
* **Dataset (This Competition Snapshot):** [https://huggingface.co/datasets/marcodsn/academic-chains-dev](https://huggingface.co/datasets/marcodsn/academic-chains-dev) (this page)

*(The image above is an output from Llama-3.2-3B-Instruct tuned on this dataset, quantized to 8 bit and ran on llama.cpp; In our tests Qwen3-30B-A3B, Gemini 2.5 Pro and Claude Sonnet 3.7 with thinking enabled all got this simple question wrong)*
This dataset represents an evolution in our effort to capture genuine scientific thinking. It contains reasoning (and intuition) chains distilled from open-access research papers, moving beyond simple Q&A or summaries. Our core goal is to generate academically-grounded reasoning chains reflecting the underlying logical structure, argumentation, and crucially, the **hypothetical or intuitive thinking process** researchers might engage in *before* experimental results are confirmed.
This 'DEV' version significantly expands the scope, covering fields like Biology, Economics, Physics, Math, Computer Science, Finance, Statistics, and Electrical Engineering. It was generated using our latest, experimental pipeline incorporating multi-stage quality control, including automated filtering and an innovative LLM-based verification step specifically designed to assess the "hypothetical reasoning" quality.
## Dataset Structure
Each example in this dataset includes the following features:
* `arxiv_id`: Identifier for the source paper on arXiv.
* `paper_doi`: DOI link to the original paper (if available).
* `paper_authors`: List of authors of the paper.
* `paper_published_date`: Date of publication of the paper version used.
* `paper_updated_date`: Date of the last update of the paper version used.
* `conversations`: List of dictionaries representing the interaction. Each entry includes:
* `role`: Either "user" (for the prompt/question) or "assistant" (for the thinking process and answer).
* `content`: The text of the prompt or the assistant's response (including `<think>` tags).
* `entry_type`: Indicates whether the entry contains multiple short reasoning chains (`multi-short`) or a single long chain (`single-long`).
* `categories`: List of academic categories (e.g., 'cs.AI', 'econ.GN') the paper belongs to.
* `avg_thinking_tokens`: Average number of tokens within the `<think>` tags for this example, indicating reasoning complexity/length. Useful for training models with budgeted thinking.
* `model`: The LLM used to *generate* the reasoning chain for this example.
* `content_id`: A unique identifier for this specific reasoning chain instance.
* `verifier_results`: List containing results from the LLM-based verification step (see Quality Control below). Each entry includes:
* `classification`: Whether the example was deemed "Suitable" (hypothetical) or "Unsuitable" (result reporting) by a specific verifier model.
* `justification`: The verifier model's explanation for its classification.
* `model`: The LLM used as the *verifier*.
* `timestamp`: When the verification was performed.
* `suitability_score`: A normalized score (0-1) based on the `verifier_results`, averaging suitability across all verifiers ("Suitable" = 1, "Unsuitable" = 0).
* `suitability`: Final aggregated classification ("Suitable" or "Unsuitable") based on the `suitability_score` (threshold ≥ 0.5).
## Dataset Creation
### Source Data
Reasoning chains were derived from open-access research papers sourced via the [arXiv](https://arxiv.org) API. We expanded coverage significantly for this dataset to include: Quantitative Biology (q-bio), General Economics (econ.GN), Physics (physics), Mathematics (math), Computer Science (cs), Quantitative Finance (q-fin), Statistics (stat), and Electrical Engineering and Systems Science (eess). Text was extracted from source PDFs into Markdown format using our companion pipeline: [marcodsn/arxiv-markdown](https://huggingface.co/datasets/marcodsn/arxiv-markdown).
### Data Generation and Quality Control Pipeline
The creation pipeline for *this specific dataset* involved several stages, orchestrated partly using [Bespoke Curator](https://github.com/bespokelabsai/curator/):
1. **Metadata Gathering & Text Extraction:** Fetching paper metadata and extracting text as described above.
2. **Reasoning Chain Generation\*:** Using various LLMs (`gemini-2.5-flash-preview-04-17`, `gemini-2.0-flash`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `gemini-2.5-pro-exp-03-25`, `deepseek-ai/DeepSeek-V3`) prompted with few-shot examples.
* **Intuition Prompting:** The core instruction guided the LLM to adopt the researcher's perspective *before* experiments, focusing on hypotheses, explorations, and intuitions grounded in core concepts, aiming to avoid simple result summaries.
* Both multiple short (`multi-short`) and single long (`single-long`) chains were targeted per paper.
3. **Quality Control Step 1: Automated Filtering:** Initial automated checks removed examples with missing thinking tags, undesirable references ("the paper", "the text"), or incomplete generations. (Detailed breakdown below).
4. **Quality Control Step 2: LLM-Based Verification\*:** Recognizing that simply asking for intuition doesn't always prevent result reporting, we implemented a verification step. Multiple LLM classifiers (`google/gemma-3-27b-it-qat-q4_0-gguf`, `unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL`, `unsloth/Qwen3-30B-A3B-GGUF:Q5_K_M`) assessed whether each generated chain truly represented hypothetical/conceptual reasoning ("Suitable") or inappropriately reported confirmed results ("Unsuitable"). (Detailed results and analysis below).
5. **Final Formatting:** Structuring the data, including all metadata, generated content, filtering info, and verification results, into the final JSONL format.
\* *These steps utilize [Bespoke Curator](https://github.com/bespokelabsai/curator/)! Check our [generation](https://github.com/marcodsn/academic-chains/blob/feature/reorganize-repository/scripts/data_generation/curator_gemini.py) and [verification](https://github.com/marcodsn/academic-chains/blob/feature/reorganize-repository/scripts/data_processing/verify_dataset.py) scripts!*
> [!Note]
> The full code implementing this experimental generation and verification pipeline is available on the `feature/reorganize-repository` branch of our [academic-chains GitHub repository](https://github.com/marcodsn/academic-chains/tree/feature/reorganize-repository).
### Splits
This `academic-chains-dev` repository contains only the **`train` split (N=1975 examples)**, which represents the output of the full pipeline described above, including filtering and verification. Raw or intermediate data snapshots may be available in the `dev` revision of the main `marcodsn/academic-chains` dataset.
## Example Uses
This dataset is designed to train multi-domain reasoning models, particularly those aiming to emulate scientific intuition, hypothesis generation, and structured thinking. The inclusion of `avg_thinking_tokens` and the `<think>` tag format allows for training models capable of:
* Explicit Chain-of-Thought reasoning.
* Adjusting reasoning "effort" based on a specified budget, akin to features in models like OpenAI's oX series, Google's Gemini 2.5, or Anthropic's Claude Sonnet 3.7.
**Notes:** We hypothesize that mixing this reasoning dataset with high-quality instruction-following datasets could yield synergistic benefits for overall model capabilities. Testing this is part of our future work!
## Planned Evaluation
Evaluation of models fine-tuned *specifically on this dev dataset* is currently planned. Our intended methodology involves:
1. Fine-tuning accessible, performant models (e.g., variants of Llama-3.2-3B, Qwen2.5-7B) using efficient techniques like LoRA via the [unsloth](https://unsloth.ai/) library.
2. Using system prompts that leverage the dataset structure, e.g., `f"You are a helpful assistant. Think before answering and put your thoughts between the <think> and </think> tags. Your thinking budget for this conversation is {avg_thinking_tokens} tokens."`
3. Evaluating performance on relevant reasoning benchmarks, such as MMLU-Pro (focusing on the newly covered domains) or other complex QA datasets.
4. Qualitative analysis of generated reasoning chains.
Results will be shared in the main repository ([`marcodsn/academic-chains`](https://huggingface.co/datasets/marcodsn/academic-chains)) once available.
*(The example image below illustrates an output from Llama-3.2-3B-Instruct tuned on this dataset; Our model found the same general areas of concern as [this paper](https://www.nature.com/articles/s41586-024-07566-y) about the topic, paper not present in the training dataset)*

## Quality Control Step 1: Automated Filtering Results
Before the LLM-based verification, initial filtering was applied based on formatting and instruction following. This provides insights into the raw generation quality of different models *for this task*. The script used is [here](https://github.com/marcodsn/academic-chains/blob/feature/reorganize-repository/scripts/data_processing/process.py).
### Initial Model Distribution (Pre-Filtering)
| Model | Count |
|-----------------------------------------------------|-------|
| gemini-2.5-flash-preview-04-17 | 1818 |
| gemini-2.0-flash | 1000 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 200 |
| gemini-2.5-pro-exp-03-25 | 130 |
| deepseek-ai/DeepSeek-V3 | 38 |
| **Total Initial** | **3186**|
### Summary: Model Counts After Automated Filtering (Input to Verification)
| Model | Initial | Final (Post-Filter) | Change | Change % |
|-----------------------------------------------------|---------|--------------------|--------|----------|
| gemini-2.5-flash-preview-04-17 | 1818 | 1173 | -645 | -35.48% |
| gemini-2.0-flash | 1000 | 500 | -500 | -50.00% |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 200 | 179 | -21 | -10.50% |
| gemini-2.5-pro-exp-03-25 | 130 | 87 | -43 | -33.08% |
| deepseek-ai/DeepSeek-V3 | 38 | 36 | -2 | -5.26% |
| **Total** | **3186**| **1975** | **-1211**| **-38.79%**|
### Analysis on Automated Filtering
This initial filtering stage revealed challenges, particularly with Gemini models (using our prompts):
1. **Missing `<think>` Tags:** Some models failed to consistently follow the thinking tag instruction (e.g., 20% of `gemini-2.5-pro-exp-03-25` examples removed).
2. **Undesired References:** Frequent references to "the text" or "the paper" (violating the desired persona) led to significant filtering, especially for `gemini-2.0-flash` (39% loss).
3. **Incomplete Generations:** Truncated or improperly ended responses caused further removals (e.g., 17% for `gemini-2.0-flash`).
4. **Stronger Performers (Filtering):** Llama-4-Maverick and DeepSeek-V3 showed much better adherence to basic formatting and negative constraints in this stage, though DeepSeek's sample size was small.
## Quality Control Step 2: LLM-Based Verification Results (Beta)
**The Challenge:** Generating true *hypothetical* reasoning is hard! We observed that even with careful prompting, models sometimes defaulted to summarizing confirmed results from the paper.
**Our Solution:** We implemented an LLM-based verification step using multiple classifiers to explicitly evaluate this. Each generated reasoning chain was assessed: Is this primarily hypothetical/conceptual ("Suitable") or does it report specific quantitative results, confirmed outcomes, or experimental findings from the source paper ("Unsuitable")?
This dataset includes the `verifier_results`, `suitability_score`, and final `suitability` fields derived from this process. We used `google/gemma-3-27b-it-qat-q4_0-gguf`, `unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL`, and `unsloth/Qwen3-30B-A3B-GGUF:Q5_K_M` as verifiers. The prompt and criteria are [here](https://github.com/marcodsn/academic-chains/blob/feature/reorganize-repository/prompts/verifier.jsonl).
> [!Note]
> An entry is marked "Suitable" overall if its average `suitability_score` across verifiers is ≥ 0.5. The raw results allow users to apply stricter thresholds.
### Verification Results by Generator and Verifier Model
*(Detailed tables showing Suitable/Unsuitable counts for each Generator x Verifier pair are omitted here for brevity, but the analysis reflects them. The full data is in the dataset itself.)*
| Generator Model | Overall Suitability Rate (Avg. across Verifiers) |
|-----------------------------------------------------|-------------------------------------------------|
| deepseek-ai/DeepSeek-V3 | ~78.0% |
| gemini-2.0-flash | ~82.6% |
| gemini-2.5-flash-preview-04-17 | ~83.8% |
| gemini-2.5-pro-exp-03-25 | ~89.7% |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | ~53.8% |
| **OVERALL (Aggregated Post-Filtering)** | **~83.6%** |
*(Note: Suitability rates above are approximate averages based on the detailed per-verifier results.)*
### Verifier Model Agreement Analysis
Consistency between verifiers helps gauge the robustness of the "Suitability" classification:
| Verifier Pair | Agreement Rate |
|-----------------------------------------------------------------------------|----------------|
| `Mistral-Small-3.1` & `Qwen3-30B-A3B` | 84.5% |
| `gemma-3-27b` & `Mistral-Small-3.1` | 79.0% |
| `gemma-3-27b` & `Qwen3-30B-A3B` | 78.2% |
| **Unanimous Agreement (All 3 Models on 1971 shared items)** | **70.8%** |
### Analysis of Verification Results
1. **Targeted Quality Improvement:** The verification step successfully identifies and flags examples leaning towards result-reporting (~16.4% marked Unsuitable overall), enriching the dataset for the intended hypothetical reasoning task.
2. **Multi-Verifier Robustness:** Reasonable agreement rates (78-85%) suggest the "hypothetical vs. reporting" distinction is somewhat consistently interpreted by different models, though edge cases exist (evidenced by the ~71% unanimous agreement).
3. **Generator Performance (Verification):**
* **Gemini Models:** Showed high suitability rates *after* initial filtering, suggesting that when they follow basic instructions, they often grasp the hypothetical nuance reasonably well (especially `gemini-2.5-pro`).
* **DeepSeek-V3:** Maintained good quality through verification.
* **Llama-4-Maverick:** Despite strong initial filtering performance, struggled most with the *type* of reasoning, having the lowest suitability rate. This suggests difficulty with more subtle prompt nuances compared to basic instruction following.
4. **Dataset Utility:** Users can filter on `suitability == "Suitable"` for a high-precision hypothetical reasoning dataset or use the `suitability_score` for more nuanced selection.
## Limitations and Biases
* **Source Bias:** Inherits topic, style, and potential biases from the selected arXiv papers. Fields with less open-access representation might be underrepresented.
* **Extraction Fidelity:** LLM generation, even when prompted for grounding, can introduce errors (hallucination, misinterpretation). The verification step aims to reduce, but not eliminate, misaligned content.
* **Pipeline Status:** This data comes from our *experimental* pipeline. While functional, it's undergoing refinement. The verification models/prompts are also subject to improvement.
* **Scope Definition:** Defining "hypothetical reasoning" vs. "result reporting" can be subjective; the LLM verification provides a scalable proxy, but may not perfectly match human judgment in all cases.
## Scaling Plan
Our ongoing development roadmap includes:
1. **Expanded Domain Coverage:** Continue adding diverse scientific and potentially social science/humanities fields. (**Partially addressed in this dataset, with plans to expand further.**)
2. **Increased Volume:** Scale generation significantly as compute resources allow.
3. **Enhanced Quality Verification:** Refine the multi-verifier system, potentially incorporating human feedback loops and testing different verifier models/prompts. (**Multi-verifier implemented in this dataset, refinement ongoing.**)
4. **Multi-modal Reasoning:** Explore extending the approach to extract reasoning chains involving charts, diagrams, and mathematical equations present in papers.
5. **Improved Generator Models:** Leverage newer, potentially more capable models for generation as they become available.
## Acknowledgements
Huge thanks to my team at [Noetic Labs](https://huggingface.co/NoeticLabs) for their support! Massive appreciation to [HuggingFace](https://huggingface.co/), [Bespoke Labs](https://www.bespokelabs.ai/), and [Together AI](https://together.ai/) for organizing this competition and fostering innovation in reasoning datasets. And most importantly, profound gratitude to the Academic Community and the authors of countless open-access papers – your work makes projects like this possible. THANK YOU!
## Licensing Information
This dataset is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt).
## Citation Information
If you use this dataset concept or data, please cite the main dataset repository where development continues:
```
@misc{marcodsn_2025_academicchains,
title = {Academic Reasoning and Intuition Chains Dataset},
author = {Marco De Santis},
month = {April},
year = {2025},
url = {https://huggingface.co/datasets/marcodsn/academic-chains}
}
```
## Why Choose This Submission? 🚀
* **Tackles a Novel Reasoning Frontier:** We go beyond standard CoT by targeting the subtle, crucial skill of *hypothetical scientific reasoning and intuition* – training LLMs to think like researchers *before* the results are in.
* **Demonstrates Rapid Evolution & Rigor:** This dataset showcases a significantly enhanced pipeline developed *during* the competition, featuring broader domains, automated cleaning, and an innovative multi-verifier LLM system for targeted quality control.
* **Addresses a Core Challenge Head-On:** We didn't just prompt for intuition; we built a verification system to actively filter out result-reporting, yielding a dataset more precisely aligned with the intended reasoning type.
* **Data-Driven Insights & Transparency:** We provide detailed analyses of generator performance, filtering effectiveness, and verifier agreement, offering valuable insights into the nuances of generating specialized reasoning data. All code and prompts are shared.
* **Built for Scalability & Impact:** The pipeline improvements (Markdown extraction, multi-stage QC) are designed for creating larger, higher-quality datasets to train the next generation of scientifically-minded LLMs.
Choose this submission to recognize **significant progress**, **methodological innovation** in quality control, and a dedicated effort to create **high-fidelity, specialized reasoning data** for a challenging but vital domain. Let's make LLMs true scientific partners! 😎
## Development Updates (Leading to this DEV Snapshot)
> [!Note]
> **[23/04/2025]** Initiated larger scale generation pipeline using [arxiv-markdown](https://huggingface.co/datasets/marcodsn/arxiv-markdown) extraction.
> [!Note]
> **[25/04/2025]** Development focused on the [new branch](https://github.com/marcodsn/academic-chains/tree/feature/reorganize-repository) with improved structure and a more efficient [Curator-based pipeline](https://github.com/marcodsn/academic-chains/blob/feature/reorganize-repository/scripts/data_generation/curator_gemini.py).
> [!Note]
> **[26/04/2025]** Data generated by this new pipeline first shared via the `dev` revision on the main repo.
> [!Note]
> **[27/04/2025]** Work began on the LLM-based verifier ([script](https://github.com/marcodsn/academic-chains/blob/feature/reorganize-repository/scripts/data_processing/verify_dataset.py), [prompt](https://github.com/marcodsn/academic-chains/blob/feature/reorganize-repository/prompts/verifier.jsonl)).
> [!Note]
> **[28/04/2025]** Verification step successfully integrated; results included in this dataset and analyzed above. Multi-verifier approach added shortly after.
> [!Note]
> **[29/04/2025]** I graduated!! 🎉
> [!Note]
> **[30/04/2025]** This `academic-chains-dev` repository created as a distinct competition submission, snapshotting the verified data from the updated pipeline.
提供机构:
maas
创建时间:
2025-05-01



