
MiniByte-666/Dr.SCI

Source: Hugging Face. Updated 2026-03-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/MiniByte-666/Dr.SCI
Description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 100K<n<1M
---

# Dr. SCI Dataset (Reproduced)

<p align="center">
  <a href="https://arxiv.org/abs/2602.08321"><b>[📜 Original Paper]</b></a> •
  <a href="https://huggingface.co/datasets/MiniByte-666/Dr.SCI"><b>[🤗 Reproduced Dataset]</b></a> •
  <a href="https://github.com/MiniByte-666/Dr.SCI"><b>[💻 Reproduced Github]</b></a>
</p>

> **Disclaimer:** This is an **unofficial reproduction** of the Dr. SCI dataset introduced in
> *"Improving Data and Reward Design for Scientific Reasoning in Large Language Models"* [[arXiv]](https://arxiv.org/abs/2602.08321).
> A detailed implementation of the curation process is available in my [GitHub Repo](https://github.com/MiniByte-666/Dr.SCI).
> This work is **not** affiliated with or endorsed by the original authors. Please refer to the original paper for authoritative details.

This repository hosts my reproduced Dr. SCI dataset. Using this dataset, I reproduced strong performance gains on [Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) across scientific reasoning benchmarks, as shown in the table below. My reproduced results are slightly lower than those reported in the original paper, likely because I used [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) to generate SFT responses rather than the original authors' SFT data. That minor gap aside, the results suggest that the original method is robust and reproducible.

<table align="center">
  <thead>
    <tr>
      <th align="left">Model</th>
      <th align="center">GPQA-D</th>
      <th align="center">SuperGPQA</th>
      <th align="center">GPQA-G</th>
      <th align="center">HLE</th>
      <th align="center">MMLU-Pro</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="6" align="center"><i>Base Model</i></td>
    </tr>
    <tr>
      <td>Qwen3-4B-Base</td>
      <td align="center">36.7</td>
      <td align="center">28.5</td>
      <td align="center">5.62</td>
      <td align="center">0.92</td>
      <td align="center">50.6</td>
    </tr>
    <tr>
      <td colspan="6" align="center"><i>Thinking Mode</i></td>
    </tr>
    <tr>
      <td>o1-mini</td>
      <td align="center">60.0</td>
      <td align="center">45.2</td>
      <td align="center">25.8</td>
      <td align="center">5.68</td>
      <td align="center">80.3</td>
    </tr>
    <tr>
      <td>Qwen3-4B thinking</td>
      <td align="center">55.9</td>
      <td align="center">42.7</td>
      <td align="center">20.9</td>
      <td align="center">4.52</td>
      <td align="center">70.4</td>
    </tr>
    <tr>
      <td>Dr. SCI-4B-think (reported)</td>
      <td align="center">63.2</td>
      <td align="center">45.7</td>
      <td align="center">32.4</td>
      <td align="center">6.12</td>
      <td align="center">75.6</td>
    </tr>
    <tr>
      <td><b>Dr. SCI-4B-Think (reproduced)</b></td>
      <td align="center">62.7</td>
      <td align="center">44.8</td>
      <td align="center">31.2</td>
      <td align="center">5.86</td>
      <td align="center">74.8</td>
    </tr>
    <tr>
      <td colspan="6" align="center"><i>Non-thinking (Instruct) Mode</i></td>
    </tr>
    <tr>
      <td>gpt-4o</td>
      <td align="center">50.0</td>
      <td align="center">44.4</td>
      <td align="center">22.4</td>
      <td align="center">3.48</td>
      <td align="center">74.6</td>
    </tr>
    <tr>
      <td>Qwen3-4B non-thinking</td>
      <td align="center">41.7</td>
      <td align="center">32.0</td>
      <td align="center">9.74</td>
      <td align="center">4.44</td>
      <td align="center">58.0</td>
    </tr>
    <tr>
      <td>Dr. SCI-4B-instruct (reported)</td>
      <td align="center">56.6</td>
      <td align="center">43.6</td>
      <td align="center">24.3</td>
      <td align="center">5.36</td>
      <td align="center">71.0</td>
    </tr>
    <tr>
      <td><b>Dr. SCI-4B-Instruct (reproduced)</b></td>
      <td align="center">53.5</td>
      <td align="center">42.9</td>
      <td align="center">23.7</td>
      <td align="center">5.08</td>
      <td align="center">68.8</td>
    </tr>
  </tbody>
</table>

## Dataset Curation and Statistics

We implemented the data-processing pipeline by following the instructions in the original paper (see [GitHub](https://github.com/MiniByte-666/Dr.SCI)) and applied it to public scientific datasets, including [MegaScience](https://huggingface.co/datasets/MegaScience/MegaScience), [NaturalReasoning](https://huggingface.co/datasets/facebook/natural_reasoning), [WebInstruct-verified](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified), and [RaR-Science](https://huggingface.co/datasets/anisha2102/RaR-Science).

Our reproduced Dr. SCI dataset consists of **890,505** scientific reasoning samples, each paired with a subject label, reference answer, difficulty level, and rubrics for open-ended questions. Of these, **414,746** are verifiable questions and the remaining **475,759** are open-ended questions. This is comparable to, though somewhat smaller than, the original paper's 461K verifiable and 545K open-ended questions.

Each data sample has the following format:

```python
{
    "data_source": "Dr. SCI",
    "prompt": [{"role": "user", "content": "<CONTENT>"}],  # Question within an instruction template
    # "style" is only there for compatibility with verl and carries no meaning.
    # It does not indicate how the question is verified.
    "reward_model": {"style": "rule", "ground_truth": ground_truth, "rubric": list_of_rubrics},
    "extra_info": extra_info,  # extra_info for verification
}
```

where `extra_info` has the following format:

```python
{
    "question": "<QUESTION>",  # Original question
    "reference_answer": "<REF_ANSWER>",
    "subject": "<SUBJECT>",  # One of: 'math', 'physics', 'chemistry', 'biology', 'cs', 'medicine', 'economics', 'science'
    "match_rule": match_rule,  # True or False. True means verifiable, False means open-ended
    "from": "<ORI_DATASET>",  # Original source dataset: MegaScience, NaturalReasoning, etc.
    "difficulty": difficulty,  # Difficulty value, one of: 0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0
}
```

## Acknowledgements

All credit goes to the original authors of Dr. SCI; their paper is a model of clarity and reproducibility. For any questions about the methodology, please contact the original authors directly, as they are the real experts. I am just an enthusiast with no special insight to offer beyond "I followed the paper and it worked."

## Citation

If you find this work useful, **please cite the original paper**:

```bibtex
@article{chen2026improving,
  title={Improving Data and Reward Design for Scientific Reasoning in Large Language Models},
  author={Chen, Zijie and Lin, Zhenghao and Liu, Xiao and Lan, Zhenzhong and Gong, Yeyun and Cheng, Peng},
  journal={arXiv preprint arXiv:2602.08321},
  year={2026}
}
```
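As a usage note for the sample format described under "Dataset Curation and Statistics", the following is a minimal sketch of how records with that layout might be split into verifiable and open-ended subsets and filtered by difficulty. The helper names (`make_sample`, `split_by_verifiability`, `filter_by_difficulty`, `subject_counts`) and the three example records are hypothetical and exist only for illustration; they are not part of the dataset or the reproduction repository. Using an empty rubric list for verifiable questions is also an assumption.

```python
from collections import Counter
from typing import Any, Optional

Sample = dict[str, Any]


def make_sample(question: str, answer: str, subject: str, match_rule: bool,
                source: str, difficulty: float,
                rubric: Optional[list[str]] = None) -> Sample:
    """Build a record in the documented layout (hypothetical helper, made-up values)."""
    return {
        "data_source": "Dr. SCI",
        "prompt": [{"role": "user", "content": question}],
        # Assumption: verifiable questions carry an empty rubric list.
        "reward_model": {"style": "rule", "ground_truth": answer, "rubric": rubric or []},
        "extra_info": {
            "question": question,
            "reference_answer": answer,
            "subject": subject,
            "match_rule": match_rule,
            "from": source,
            "difficulty": difficulty,
        },
    }


def split_by_verifiability(samples: list[Sample]) -> tuple[list[Sample], list[Sample]]:
    """Return (verifiable, open_ended) based on extra_info["match_rule"]."""
    verifiable = [s for s in samples if s["extra_info"]["match_rule"]]
    open_ended = [s for s in samples if not s["extra_info"]["match_rule"]]
    return verifiable, open_ended


def filter_by_difficulty(samples: list[Sample], low: float, high: float) -> list[Sample]:
    """Keep samples whose difficulty lies in the closed interval [low, high]."""
    return [s for s in samples if low <= s["extra_info"]["difficulty"] <= high]


def subject_counts(samples: list[Sample]) -> Counter:
    """Count samples per subject label."""
    return Counter(s["extra_info"]["subject"] for s in samples)


# Illustrative records only; they do not come from the dataset.
SAMPLES = [
    make_sample("What is the SI unit of force?", "newton", "physics",
                True, "MegaScience", 0.25),
    make_sample("What is the chemical formula of water?", "H2O", "chemistry",
                True, "WebInstruct-verified", 0.75),
    make_sample("Explain why the sky appears blue.", "Rayleigh scattering ...", "physics",
                False, "NaturalReasoning", 1.0,
                rubric=["Mentions Rayleigh scattering"]),
]

if __name__ == "__main__":
    verifiable, open_ended = split_by_verifiability(SAMPLES)
    print(len(verifiable), "verifiable,", len(open_ended), "open-ended")
    print(subject_counts(filter_by_difficulty(SAMPLES, 0.5, 1.0)))
```

Records loaded from the Hub could be passed to the same helpers, assuming they are exposed as plain dicts with this layout.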
Provided by:
MiniByte-666