ZonglinY/TOMATO-Star-SFT-Data-R1D-32B

Name: ZonglinY/TOMATO-Star-SFT-Data-R1D-32B
Creator: ZonglinY
Published: 2026-03-05 07:13:19
License: CC-BY-4.0

Source: Hugging Face. Updated 2026-03-05. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ZonglinY/TOMATO-Star-SFT-Data-R1D-32B


Description:

---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- science
- hypothesis-generation
- inspiration-retrieval
- sft
- llama-factory
- biomedical
size_categories:
- 100K<n<1M
---

# TOMATO-Star SFT Data (R1D-32B)

SFT training data for the two core tasks in MOOSE-Star: **Hypothesis Composition (HC)** and **Inspiration Retrieval (IR)**. All data is generated via **rejection sampling** with [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) as the teacher model, followed by reranker filtering.

All data is in **ShareGPT JSONL format**, directly compatible with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).

## Dataset Description

- **Paper**: [MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier](https://arxiv.org/abs/2603.03756)
- **Base Dataset**: [TOMATO-Star](https://huggingface.co/datasets/ZonglinY/TOMATO-Star)
- **Teacher Model**: [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) (for rejection sampling)
- **License**: CC-BY-4.0

## Files

### Hypothesis Composition (HC)

| File | Samples | Description |
|------|---------|-------------|
| `HC/normal_composition.jsonl` | 96,879 | Standard HC: generate hypothesis from research question + background + inspirations |
| `HC/bounded_composition.jsonl` | 17,669 | Bounded HC: generate hypothesis with imperfect (bounded) inspirations |
| `HC/dataset_info.json` | - | LLaMA-Factory dataset config |

**Recommended mixing**: Combine both files with bounded upsampled 1-2x (paper uses 1x for best results).
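The mixing recommendation above can be sketched in Python when a plain `cat` is not flexible enough, for example to try a 2x bounded upsample. This is a minimal sketch, not the authors' tooling: the function name `mix_hc_files`, the `bounded_factor` parameter, and the output filename are hypothetical, and it assumes that "1x" means each bounded record appears exactly once and that the factor is a whole number.

```python
import json


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def mix_hc_files(normal_path, bounded_path, out_path, bounded_factor=1):
    """Write all normal records once, then the bounded records
    `bounded_factor` times (assumption: whole-number upsampling).
    Returns the number of records written."""
    records = read_jsonl(normal_path)
    records += read_jsonl(bounded_path) * bounded_factor
    with open(out_path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


# Hypothetical usage, with paths relative to a local copy of the HC/ directory:
# mix_hc_files("normal_composition.jsonl", "bounded_composition.jsonl", "mixed.jsonl")
```

Given the sample counts in the table, a factor of 1 should produce 96,879 + 17,669 = 114,548 records. The sketch does not shuffle, on the assumption that the training framework shuffles examples itself.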
### Inspiration Retrieval (IR)

| File | Samples | Description |
|------|---------|-------------|
| `IR/train.jsonl` | 150,218 | 15-way multiple choice: select correct inspiration from 15 candidates |
| `IR/eval.jsonl` | 2,377 | Evaluation split (same format) |
| `IR/dataset_info.json` | - | LLaMA-Factory dataset config |

## Data Format

All files use ShareGPT multi-turn conversation format:

```json
{
  "conversations": [
    {"role": "user", "content": "[Task instruction + input data]"},
    {"role": "assistant", "content": "[Model response]"}
  ]
}
```

### HC Task Format

- **User**: System instruction for hypothesis composition + research question + background + inspirations (+ previous hypothesis components for bounded mode)
- **Assistant**: Hypothesis with Motivation, Mechanism, and Methodology sections

### IR Task Format

- **User**: Background information + 15 candidate papers (A-O), one correct + 14 hard negatives
- **Assistant**: Selected inspiration ID + reasoning

## Usage with LLaMA-Factory

```bash
# HC Training (mix normal + bounded with 1x upsample)
# 1. Combine: cat normal_composition.jsonl bounded_composition.jsonl > mixed.jsonl
# 2. Point dataset_info.json to the mixed file
# 3. Run LLaMA-Factory SFT

# IR Training
# dataset_info.json is pre-configured, just point LLaMA-Factory to the IR/ directory
```

### dataset_info.json (HC example)

```json
{
  "train": {
    "file_name": "normal_composition.jsonl",
    "formatting": "sharegpt",
    "columns": {"messages": "conversations"},
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```

## Training Details

| Config | HC | IR |
|--------|----|----|
| Base Model | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Qwen-7B |
| Chat Template | deepseekr1 | deepseekr1 |
| Cutoff Length | 8192 | 16384 |
| Learning Rate | 1e-5 | 1e-5 |
| Epochs | 1 | 1 |
| Training | Full-param (ZeRO-3) | Full-param (ZeRO-3) |

## Data Generation Pipeline

- **HC Normal**: Rejection sampling with DeepSeek-R1-Distill-Qwen-32B teacher → reranker filtering → SFT format conversion
- **HC Bounded**: Same pipeline but with bounded (imperfect) inspirations selected by SPECTER2 embedding similarity
- **IR**: Hard negative sampling (keyword overlap + embedding-based + random) → rejection sampling → SFT format conversion

## Citation

```bibtex
@article{yang2025moosestar,
  title={MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier},
  author={Yang, Zonglin and Bing, Lidong},
  year={2025}
}
```

## License

This dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
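As a sanity check before training, the ShareGPT record layout described under Data Format can be verified line by line. The following is a minimal sketch under stated assumptions: the function name `check_sharegpt_record` is hypothetical, and the rule that turns alternate `user` / `assistant` starting with `user` is inferred from the single example record rather than stated by the dataset card.

```python
import json


def check_sharegpt_record(line):
    """Return a list of problems found in one JSONL line.
    An empty list means the record matches the assumed layout."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    turns = record.get("conversations") if isinstance(record, dict) else None
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]

    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not a JSON object")
            continue
        # Assumption: roles alternate user, assistant, user, ...
        expected = "user" if i % 2 == 0 else "assistant"
        if turn.get("role") != expected:
            problems.append(
                f"turn {i}: expected role {expected!r}, got {turn.get('role')!r}"
            )
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty or non-string content")
    if len(turns) % 2 != 0:
        problems.append("conversation does not end with an assistant turn")
    return problems
```

Running this over each line of a file and collecting the non-empty results gives a quick report of malformed records without loading any training framework.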

Provider:

ZonglinY
