m1k-tokenized
Source: ModelScope community | Updated: 2025-11-27 | Indexed: 2025-04-26
Download link:
https://modelscope.cn/datasets/UCSC-VLAA/m1k-tokenized
Resource description:
<div align="center">
<h1>
<b>m1</b>: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models
</h1>
<p>
A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong medical reasoning within large language models.
</p>
</div>
## ⚡ Introduction
Hi! Welcome to the Hugging Face repository for m1 ([GitHub](https://github.com/UCSC-VLAA/m1), [Paper](https://huggingface.co/papers/2504.00869))!
**m1** is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time “thinking.” Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:
- **Fine-tuning on a small, high-quality set of verified medical reasoning examples**, showing that even with just 1K–23K examples, m1-7B *surpasses* models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B *rivals* 70B-scale models.
- **Scaling reasoning at inference using token budgets**, which consistently improves performance across medical QA tasks up to an optimal budget of roughly 4K tokens, beyond which performance may degrade due to overthinking.
- **Identifying medical knowledge as the key bottleneck**, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.
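
The token-budget idea in the list above can be illustrated with a small sketch. This is not the m1 implementation: it only assumes that "thinking" is capped at a fixed number of tokens, after which the model is prompted to answer. The names `next_token`, `END_OF_THINKING`, the `<eot>` stop token, and the toy model are all hypothetical and exist only for illustration.

```python
from typing import Callable, List

# Hypothetical marker that closes the reasoning phase and asks for an answer.
END_OF_THINKING = "Final Answer:"


def think_with_budget(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    thinking_budget: int,
    stop_token: str = "<eot>",
) -> List[str]:
    """Collect reasoning tokens until the model stops or the budget is spent.

    `next_token` is an assumed interface: it maps the context so far to the
    next token. Real inference stacks expose this differently.
    """
    thinking: List[str] = []
    while len(thinking) < thinking_budget:
        token = next_token(prompt + thinking)
        if token == stop_token:
            break  # the model finished reasoning on its own
        thinking.append(token)
    # Whether the budget was hit or the model stopped early,
    # close the reasoning phase and request the answer.
    return thinking + [END_OF_THINKING]


def make_toy_model(natural_length: int, prompt_len: int) -> Callable[[List[str]], str]:
    """Toy stand-in for a model that 'wants' to think for `natural_length` tokens."""
    def next_token(context: List[str]) -> str:
        produced = len(context) - prompt_len
        return "<eot>" if produced >= natural_length else f"step{produced}"
    return next_token


prompt = ["Q"]
capped = think_with_budget(prompt, make_toy_model(10, len(prompt)), thinking_budget=4)
early = think_with_budget(prompt, make_toy_model(2, len(prompt)), thinking_budget=4)
```

Here `capped` is cut off after four reasoning tokens, while `early` stops after two because the toy model ends its reasoning before the budget is reached. Sweeping `thinking_budget` over several values and measuring accuracy at each is one way to look for the kind of optimum the list describes.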
Provider:
maas
Created:
2025-04-21



