ZonglinY/TOMATO-Star-SFT-Data-R1D-32B
Hugging Face · updated 2026-03-05 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ZonglinY/TOMATO-Star-SFT-Data-R1D-32B
Description:
---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- science
- hypothesis-generation
- inspiration-retrieval
- sft
- llama-factory
- biomedical
size_categories:
- 100K<n<1M
---
# TOMATO-Star SFT Data (R1D-32B)
SFT training data for the two core tasks in MOOSE-Star: **Hypothesis Composition (HC)** and **Inspiration Retrieval (IR)**.
All data is generated via **rejection sampling** with [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) as the teacher model, followed by reranker filtering.
All data is in **ShareGPT JSONL format**, directly compatible with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
## Dataset Description
- **Paper**: [MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier](https://arxiv.org/abs/2603.03756)
- **Base Dataset**: [TOMATO-Star](https://huggingface.co/datasets/ZonglinY/TOMATO-Star)
- **Teacher Model**: [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) (for rejection sampling)
- **License**: CC-BY-4.0
## Files
### Hypothesis Composition (HC)
| File | Samples | Description |
|------|---------|-------------|
| `HC/normal_composition.jsonl` | 96,879 | Standard HC: generate hypothesis from research question + background + inspirations |
| `HC/bounded_composition.jsonl` | 17,669 | Bounded HC: generate hypothesis with imperfect (bounded) inspirations |
| `HC/dataset_info.json` | - | LLaMA-Factory dataset config |
**Recommended mixing**: Combine both files, repeating the bounded file 1-2x. The paper reports its best results at 1x, where each bounded sample appears once.
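The mixing step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: `mix_with_upsampling` and `bounded_factor` are hypothetical names, the factor is assumed to be an integer repeat count, and the shuffle is an optional convenience rather than something the dataset card specifies.

```python
import random


def mix_with_upsampling(normal, bounded, bounded_factor=1, seed=0):
    """Return the normal samples plus the bounded samples repeated
    `bounded_factor` times, shuffled with a fixed seed."""
    if bounded_factor < 1:
        raise ValueError("bounded_factor must be >= 1")
    mixed = list(normal) + list(bounded) * bounded_factor
    random.Random(seed).shuffle(mixed)  # assumption: shuffling is optional
    return mixed


# Toy stand-ins for the lines of the two JSONL files.
normal_lines = [f"normal-{i}" for i in range(5)]
bounded_lines = [f"bounded-{i}" for i in range(2)]

mixed = mix_with_upsampling(normal_lines, bounded_lines, bounded_factor=2)
print(len(mixed))  # 5 + 2 * 2 = 9

# Sizes implied by the table above for the real files.
size_1x = 96_879 + 17_669      # 114,548 samples
size_2x = 96_879 + 2 * 17_669  # 132,217 samples
```

With the real files, the same function would take the lines read from `HC/normal_composition.jsonl` and `HC/bounded_composition.jsonl`.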
### Inspiration Retrieval (IR)
| File | Samples | Description |
|------|---------|-------------|
| `IR/train.jsonl` | 150,218 | 15-way multiple choice: select correct inspiration from 15 candidates |
| `IR/eval.jsonl` | 2,377 | Evaluation split (same format) |
| `IR/dataset_info.json` | - | LLaMA-Factory dataset config |
## Data Format
All files use the ShareGPT conversation format, with each turn stored under `role` and `content` keys:
```json
{
  "conversations": [
    {"role": "user", "content": "[Task instruction + input data]"},
    {"role": "assistant", "content": "[Model response]"}
  ]
}
```
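A quick way to sanity-check a file against this layout is to parse each line and confirm the keys and roles. The sketch below assumes only the structure shown above; `validate_record` is a hypothetical helper, and the sample line is made up for illustration.

```python
import json

ALLOWED_ROLES = {"user", "assistant"}


def validate_record(line):
    """Parse one JSONL line and check it matches the layout shown above."""
    record = json.loads(line)
    turns = record["conversations"]
    if not turns:
        raise ValueError("record has no turns")
    for turn in turns:
        if turn["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {turn['role']!r}")
        if not isinstance(turn["content"], str):
            raise ValueError("content must be a string")
    return record


# Made-up sample line; real lines come from the HC/ and IR/ files.
sample_line = json.dumps({
    "conversations": [
        {"role": "user", "content": "[Task instruction + input data]"},
        {"role": "assistant", "content": "[Model response]"},
    ]
})

record = validate_record(sample_line)
roles = [turn["role"] for turn in record["conversations"]]
print(roles)  # ['user', 'assistant']
```

To check a whole file, call `validate_record` on every line of the opened JSONL file.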
### HC Task Format
- **User**: System instruction for hypothesis composition + research question + background + inspirations (+ previous hypothesis components for bounded mode)
- **Assistant**: Hypothesis with Motivation, Mechanism, and Methodology sections
### IR Task Format
- **User**: Background information + 15 candidate papers (A-O), one correct + 14 hard negatives
- **Assistant**: Selected inspiration ID + reasoning
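To make the 15-way setup concrete, the sketch below labels one correct paper and 14 negatives with the letters A-O. It illustrates the labeling only: the actual prompt wording, candidate ordering, and negative selection used in the dataset are not described here, so `build_candidates` and the toy titles are hypothetical.

```python
import random
import string

NUM_CANDIDATES = 15
LABELS = string.ascii_uppercase[:NUM_CANDIDATES]  # "ABCDEFGHIJKLMNO"


def build_candidates(correct, negatives, seed=0):
    """Shuffle one correct item with 14 negatives and label them A-O.

    Returns the label-to-item mapping and the label of the correct item.
    """
    if len(negatives) != NUM_CANDIDATES - 1:
        raise ValueError("expected exactly 14 negatives")
    pool = [correct] + list(negatives)
    random.Random(seed).shuffle(pool)
    labeled = dict(zip(LABELS, pool))
    gold_label = LABELS[pool.index(correct)]
    return labeled, gold_label


# Toy titles standing in for real candidate papers.
negatives = [f"negative paper {i}" for i in range(14)]
labeled, gold_label = build_candidates("gold paper", negatives)

print(labeled[gold_label])  # gold paper
chance_accuracy = 1 / NUM_CANDIDATES  # random guessing is right 1 time in 15
```

Random guessing scores about 6.7% on this task, which is a useful floor when reading IR accuracy numbers.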
## Usage with LLaMA-Factory
```bash
# HC training: mix normal + bounded at 1x (each file included once).
# 1. Combine the two files (run inside HC/):
cat normal_composition.jsonl bounded_composition.jsonl > mixed.jsonl
# 2. Point "file_name" in dataset_info.json to mixed.jsonl
# 3. Run LLaMA-Factory SFT

# IR training: IR/dataset_info.json is pre-configured,
# so just point LLaMA-Factory to the IR/ directory.
```
### dataset_info.json (HC example)
```json
{
  "train": {
    "file_name": "normal_composition.jsonl",
    "formatting": "sharegpt",
    "columns": {"messages": "conversations"},
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```
```
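When training on a mixed HC file, the same entry can be regenerated with a different `file_name`. The sketch below reproduces the structure shown above; `make_dataset_info` is a hypothetical helper and `mixed.jsonl` is the file name used in the usage notes, not a file shipped with the dataset.

```python
import json


def make_dataset_info(file_name, dataset_name="train"):
    """Build a dataset_info.json entry with the same fields as the example above."""
    return {
        dataset_name: {
            "file_name": file_name,
            "formatting": "sharegpt",
            "columns": {"messages": "conversations"},
            "tags": {
                "role_tag": "role",
                "content_tag": "content",
                "user_tag": "user",
                "assistant_tag": "assistant",
            },
        }
    }


info = make_dataset_info("mixed.jsonl")
info_json = json.dumps(info, indent=2)
print(info_json)
```

Writing `info_json` to `HC/dataset_info.json` would point the `train` entry at the mixed file.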
## Training Details
| Config | HC | IR |
|--------|----|----|
| Base Model | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Qwen-7B |
| Chat Template | deepseekr1 | deepseekr1 |
| Cutoff Length | 8192 | 16384 |
| Learning Rate | 1e-5 | 1e-5 |
| Epochs | 1 | 1 |
| Training | Full-param (ZeRO-3) | Full-param (ZeRO-3) |
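The table can be expressed as a pair of training configs. The sketch below builds them as Python dictionaries; the key names are assumed to match LLaMA-Factory's example SFT configs and should be checked against the installed version. The `dataset` name for IR, the DeepSpeed config path, and `output_dir` are hypothetical placeholders.

```python
import json

# Settings shared by both tasks, taken from the table above.
SHARED = {
    "model_name_or_path": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "full",
    "template": "deepseekr1",
    "learning_rate": 1e-5,
    "num_train_epochs": 1,
    # Assumption: a ZeRO-3 DeepSpeed config file at this path.
    "deepspeed": "examples/deepspeed/ds_z3_config.json",
}

# Per-task settings; dataset names and output paths are placeholders.
configs = {
    "HC": {**SHARED, "dataset_dir": "HC", "dataset": "train",
           "cutoff_len": 8192, "output_dir": "outputs/hc_sft"},
    "IR": {**SHARED, "dataset_dir": "IR", "dataset": "train",
           "cutoff_len": 16384, "output_dir": "outputs/ir_sft"},
}

for task, config in configs.items():
    print(task, json.dumps(config, indent=2))
```

The only table row that differs between the two tasks is the cutoff length, which is twice as long for IR (16384 vs 8192).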
## Data Generation Pipeline
- **HC Normal**: Rejection sampling with DeepSeek-R1-Distill-Qwen-32B teacher → reranker filtering → SFT format conversion
- **HC Bounded**: Same pipeline but with bounded (imperfect) inspirations selected by SPECTER2 embedding similarity
- **IR**: Hard negative sampling (keyword overlap + embedding-based + random) → rejection sampling → SFT format conversion
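Rejection sampling, the step shared by all three pipelines, can be sketched generically: draw responses from a teacher and keep one only if it passes a check. The code below is an illustration under stated assumptions, not the authors' pipeline. `generate` and `accept` are hypothetical callables, the toy teacher replays fixed answers instead of calling a model, and the acceptance rule (the answer matches a known gold label) is an assumed example.

```python
def rejection_sample(prompt, generate, accept, max_attempts=8):
    """Draw candidates from `generate` until `accept` approves one.

    Returns the first accepted candidate, or None if every attempt is rejected.
    """
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if accept(candidate):
            return candidate
    return None


# Toy teacher: replays a fixed sequence of answers instead of calling a model.
scripted_answers = iter(["A", "F", "C", "C"])


def toy_teacher(prompt):
    return next(scripted_answers)


gold_label = "C"  # assumed acceptance rule: the answer matches the gold label
kept = rejection_sample("Which candidate is the inspiration?",
                        toy_teacher, lambda answer: answer == gold_label)
print(kept)  # C (accepted on the third attempt)

# A teacher that never matches is rejected after max_attempts tries.
dropped = rejection_sample("prompt", lambda p: "A",
                           lambda answer: answer == gold_label, max_attempts=3)
print(dropped)  # None
```

Prompts for which no sample is accepted would simply be dropped from the training set in a scheme like this.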
## Citation
```bibtex
@article{yang2025moosestar,
  title={MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier},
  author={Yang, Zonglin and Bing, Lidong},
  journal={arXiv preprint arXiv:2603.03756},
  year={2026}
}
```
## License
This dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
Provided by:
ZonglinY



