eduardofarina/seed-of-thought-healthcare
Source: Hugging Face | Updated: 2026-04-21 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/eduardofarina/seed-of-thought-healthcare
Resource description:
---
license: mit
tags:
- medical
- healthcare
- differential-diagnosis
- prompting
- seed-of-thought
- simplestrat
- medqa
size_categories:
- n<1K
---
# 🌱 Seed of Thought for Healthcare Differential Diagnosis
Implementation of the **SimpleStrat / Seed-of-Thought** methodology
([Wong et al., 2024, arXiv:2410.09038](https://arxiv.org/abs/2410.09038)) applied to clinical
differential diagnosis on [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options).
## 🧬 The Challenge
LLMs exhibit **mode collapse** in clinical diagnosis: they repeatedly predict the same top-1
diagnosis and ignore the rest of the differential. This is dangerous in medicine, where rare but
serious conditions must be considered. Standard temperature sampling doesn't help, because it adds
noise rather than systematic diversity.
## 🌱 Seed of Thought Methodology
The pipeline has 3 stages:
### Stage 1: Auto-Stratification
The LLM identifies **discriminating clinical properties** that split the answer space into roughly
equal halves. For 4-option MCQs, ideal properties split 2:2.
- Example: "Does the mechanism involve direct cellular damage?" → Options A,C satisfy; B,D don't
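The 2:2 criterion can be made concrete with a small balance score. The following is a minimal sketch, not the dataset author's code: the function name `split_balance` and the scoring formula are assumptions chosen for illustration. It maps a property's True/False assignment over the options to a value in [0, 1], where 1.0 means an even split.

```python
from typing import Dict


def split_balance(membership: Dict[str, bool]) -> float:
    """Score how evenly a property splits the options (hypothetical helper).

    Returns 1.0 for a perfectly even split (e.g. 2:2 over four options)
    and 0.0 when every option falls on the same side.
    """
    n = len(membership)
    if n == 0:
        raise ValueError("membership must not be empty")
    n_true = sum(membership.values())
    # |2 * n_true - n| is 0 for an even split and n when all options agree.
    return 1.0 - abs(2 * n_true - n) / n


# The example from the text: options A and C satisfy the property, B and D do not.
even_split = {"A": True, "B": False, "C": True, "D": False}
# A less useful property that only separates D from the rest (3:1).
skewed_split = {"A": True, "B": True, "C": True, "D": False}
```

Under this scoring, a pipeline could keep the candidate properties with the highest balance, though how the original implementation ranks properties is not stated here.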
### Stage 2: Heuristic Estimation
For each property, the LLM estimates the probability that it holds for the correct answer, using
calibrated Bayesian reasoning (inspired by Tetlock's superforecasting techniques).
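One way to turn per-property estimates into a distribution over strata is sketched below. It assumes the properties are treated as independent, which the card does not state; the function name `joint_distribution`, the property names, and the probabilities are all hypothetical.

```python
from itertools import product
from typing import Dict, Tuple


def joint_distribution(p_true: Dict[str, float]) -> Dict[Tuple[bool, ...], float]:
    """Build a distribution over strata from per-property probabilities.

    ASSUMPTION: properties are independent, so the probability of a stratum
    is the product of the per-property probabilities.
    """
    names = list(p_true)
    joint: Dict[Tuple[bool, ...], float] = {}
    for values in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, value in zip(names, values):
            prob *= p_true[name] if value else 1.0 - p_true[name]
        joint[values] = prob
    return joint


# Hypothetical estimates for two properties (values are illustrative only).
estimates = {"direct_cellular_damage": 0.6, "acute_onset": 0.3}
joint = joint_distribution(estimates)
```

With `k` properties this yields `2**k` strata, so the table stays small for the handful of properties used per question.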
### Stage 3: Probabilistic Prompting
A stratum (combination of property True/False values) is **sampled** from the joint distribution.
The LLM generates an answer **constrained** to that stratum. Different samples explore different
regions of the answer space.
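The sampling step can be sketched as follows. This is an illustrative outline rather than the released implementation: `sample_stratum`, `constraint_text`, and the prompt wording are assumptions, and the call to the LLM itself is omitted.

```python
import random
from typing import Dict, List, Tuple

Stratum = Tuple[bool, ...]


def sample_stratum(joint: Dict[Stratum, float], rng: random.Random) -> Stratum:
    """Draw one stratum with probability proportional to its weight."""
    strata = list(joint)
    weights = [joint[s] for s in strata]
    return rng.choices(strata, weights=weights, k=1)[0]


def constraint_text(properties: List[str], stratum: Stratum) -> str:
    """Render the sampled stratum as a prompt constraint (hypothetical wording)."""
    lines = [
        f"- {prop} -> {'yes' if value else 'no'}"
        for prop, value in zip(properties, stratum)
    ]
    return "Answer with a diagnosis consistent with:\n" + "\n".join(lines)


# Illustrative usage with one property and made-up probabilities.
rng = random.Random(0)
properties = ["Does the mechanism involve direct cellular damage?"]
joint = {(True,): 0.6, (False,): 0.4}
stratum = sample_stratum(joint, rng)
prompt_suffix = constraint_text(properties, stratum)
```

Repeating the draw for each of the samples per question is what spreads generations across strata instead of relying on temperature alone.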
## 📊 Results
| Metric | Baseline (temp=0.9) | Seed of Thought | Δ |
|--------|---------------------|-----------------|---|
| accuracy | 0.9333 | 0.9333 | +0.0000 |
| diversity_ratio | 0.1300 | 0.2533 | +0.1233 |
| unique_count | 1.3000 | 2.5333 | +1.2333 |
| coverage | 0.3417 | 0.5667 | +0.2250 |
| kl_divergence_from_uniform | 1.2773 | 0.8354 | -0.4419 |
### Setup
- **Model**: Qwen/Qwen2.5-72B-Instruct
- **Dataset**: MedQA-USMLE test split (30-question subset)
- **Samples/question**: 10
- **Strata**: 3
- **Baseline temp**: 0.9 | **SoT temp**: 0.7
- **Total calls**: 840 | **Time**: 202.8 min
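The per-question diversity metrics can be sketched as below. The table is consistent with `diversity_ratio = unique_count / samples` (1.3 / 10 = 0.13), but the KL definition here (direction `KL(p || uniform)`, natural log, four options) is an assumption, and `coverage` is left out because its definition is not given. The function name `diversity_metrics` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def diversity_metrics(samples: List[str], n_options: int = 4) -> Dict[str, float]:
    """Compute diversity metrics for the answers sampled on one question."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(samples)
    n = len(samples)
    unique_count = len(counts)
    # ASSUMPTION: KL(p || uniform) in nats, where p is the empirical answer
    # distribution. Options that were never sampled contribute zero.
    kl = sum((c / n) * math.log((c / n) * n_options) for c in counts.values())
    return {
        "unique_count": float(unique_count),
        "diversity_ratio": unique_count / n,
        "kl_divergence_from_uniform": kl,
    }


# Fully collapsed sampling: the same option on all 10 draws.
collapsed = diversity_metrics(["A"] * 10)
# Perfectly spread sampling over the four options.
spread = diversity_metrics(["A", "B", "C", "D"] * 2)
```

Under this definition a fully collapsed question scores `ln 4 ≈ 1.386`, so the baseline average of 1.2773 sits close to the collapsed extreme, while values near 0 indicate answers spread evenly across options.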
## 🔑 Key Insight
By partitioning the diagnostic space into orthogonal dimensions and sampling systematically,
Seed-of-Thought achieves:
1. **Higher coverage** (0.34 → 0.57): more plausible diagnoses considered
2. **Better diversity** (1.3 → 2.5 unique answers per question): genuinely different answers per sample
3. **Lower KL divergence** (1.28 → 0.84): more uniform distribution across valid options
4. **Maintained accuracy** (0.93 in both settings): correct answer still appears in generated set
## Citation
```bibtex
@article{wong2024simplestrat,
title={SimpleStrat: Diversifying Language Model Generation with Stratification},
author={Wong, Justin and Orlovskiy, Yury and Luo, Michael and Seshia, Sanjit A. and Gonzalez, Joseph E.},
journal={arXiv preprint arXiv:2410.09038},
year={2024}
}
```
Provider:
eduardofarina



