sourxv-17/llama3.1-medical-dataset
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sourxv-17/llama3.1-medical-dataset
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 153807514.8685528
    num_examples: 56631
  - name: test
    num_bytes: 8096275.923489889
    num_examples: 2981
  download_size: 69380635
  dataset_size: 161903790.7920427
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- medical
- qa
- instruction-tuning
- llm
- healthcare
- chatbot
- mcq
- education
pretty_name: 'Medical Instruction Tuning Dataset (QA + MCQ) '
size_categories:
- 10K<n<100K
---
# 🏥 LLaMA 3.1 Medical Dataset
A cleaned, filtered, and unified instruction-tuning dataset for fine-tuning LLaMA 3.1 8B on medical domain Q&A with safety guardrails.
Built to train a conversational medical assistant that:
- Answers health-related questions accurately
- Refuses out-of-scope queries
- Directs users to emergency services when a question describes an emergency
- Never diagnoses or replaces a licensed physician
---
## Dataset Summary
| Split | Examples |
|---|---|
| Train | 56,631 |
| Test | 2,981 |
| **Total** | **59,612** |
All examples use a **single unified system prompt** — a deliberate design choice to ensure the model learns one consistent identity across all training data, guardrails, and inference.
---
## Sources
| Source Dataset | Type | Role |
|---|---|---|
| `lavita/medical-qa-datasets` (ChatDoctor HealthcareMagic) | Conversational medical Q&A | Primary — real patient-doctor conversations |
| `medmcqa` | MCQ with clinical explanations | Secondary — explanations only, MCQ-only answers filtered |
| `pubmed_qa` (pqa_labeled) | Clinical evidence Q&A | Tertiary — evidence-based long answers |
---
## System Prompt
Every example in this dataset uses exactly this system prompt — no variation:
```
You are a medical assistant trained to provide accurate, evidence-based health
information. You only answer medical and health-related questions. For emergencies,
always direct the user to call emergency services immediately. You cannot diagnose
conditions or replace a licensed physician. Always recommend consulting a doctor
for personal medical decisions.
```
This is intentional. Using a single consistent prompt across all training examples ensures the model learns one identity, which is critical for guardrail reliability at inference time.
---
## Format
Each example is a pre-formatted LLaMA 3.1 chat template string stored in the `prompt` column:
```
<|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{answer}<|eot_id|>
```
> **Note:** The `\n\n` after each header token is required for `train_on_responses_only` masking to work correctly in Unsloth. Without it, the boundary detection silently fails and the model trains on the full sequence including user prompts.
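To make the boundary requirement concrete, here is a minimal string-level sketch of the template and of why the `\n\n` matters. The names `format_prompt` and `response_span` are hypothetical helpers, not part of the dataset's notebook, and the two marker strings are an assumption about what is passed to Unsloth's `train_on_responses_only`. Unsloth itself works on token IDs rather than raw strings, so this only illustrates the idea.

```python
# Placeholder prompt; the dataset uses the full system prompt quoted earlier.
SYSTEM_PROMPT = "You are a medical assistant."

# Assumed marker strings for the user and assistant turns (each ends in "\n\n").
USER_MARKER = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT_MARKER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def format_prompt(system_prompt: str, question: str, answer: str) -> str:
    """Render one example with a blank line after every header token."""
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        f"{USER_MARKER}{question}<|eot_id|>"
        f"{ASSISTANT_MARKER}{answer}<|eot_id|>"
    )


def response_span(prompt: str) -> str:
    """Return the text after the assistant marker, i.e. the part to train on."""
    idx = prompt.find(ASSISTANT_MARKER)
    if idx == -1:
        raise ValueError("assistant marker not found; check the \\n\\n boundaries")
    return prompt[idx + len(ASSISTANT_MARKER):]


example = format_prompt(SYSTEM_PROMPT, "What causes migraines?", "Migraines are ...")
trained_text = response_span(example)
```

If the template were rendered with a single `\n` after the headers, `response_span` would raise, which is the string-level analogue of the silent masking failure described in the note.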
### Columns
| Column | Type | Description |
|---|---|---|
| `prompt` | `string` | Full pre-formatted LLaMA 3.1 chat prompt |
| `instruction` | `string` | System prompt (identical across all rows) |
| `input` | `string` | User question |
| `output` | `string` | Assistant answer |
---
## Filtering Pipeline
Raw data went through a multi-stage cleaning pipeline before being included:
### Stage 1 — Hard filter (binary pass/fail)
- Question length ≥ 20 characters
- Answer length ≥ 50 characters and ≤ 3,000 characters
- MCQ-only answers dropped (e.g. `"A"`, `"The answer is B"`) — these teach the wrong output format for conversational Q&A
- Boilerplate-heavy responses removed (e.g. excessive "Chat Doctor" mentions)
- Answers that themselves contain more than 8 questions dropped (treated as non-answers)
- Unsafe content removed (self-harm, overdose instructions, etc.)
- Must contain at least one medical keyword
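The Stage 1 rules can be sketched as a single pass/fail function. This is an illustration under stated assumptions, not the notebook's code: the keyword set, the unsafe-content patterns, the MCQ regex, and the "Chat Doctor" mention limit are all hypothetical stand-ins, and "more than 8 questions" is approximated by counting question marks.

```python
import re

# Hypothetical lists; the dataset's real keyword and safety lists are not given here.
MEDICAL_KEYWORDS = {"pain", "symptom", "symptoms", "treatment", "infection",
                    "dose", "diagnosis", "fever"}
UNSAFE_PATTERNS = [
    re.compile(r"how to overdose", re.I),
    re.compile(r"ways to (harm|kill) (yourself|myself)", re.I),
]
# Matches answers that are only an option letter, e.g. "A" or "The answer is B".
MCQ_ONLY = re.compile(
    r"^\s*(the\s+)?(correct\s+)?(answer\s+is\s+)?\(?[A-E]\)?\.?\s*$", re.I
)
MAX_CHAT_DOCTOR_MENTIONS = 2  # assumed cutoff for "excessive" mentions


def passes_hard_filter(question: str, answer: str) -> bool:
    """Return True only if the pair survives every Stage 1 rule."""
    if len(question) < 20:
        return False
    if not 50 <= len(answer) <= 3000:
        return False
    if MCQ_ONLY.match(answer):
        return False
    if answer.lower().count("chat doctor") > MAX_CHAT_DOCTOR_MENTIONS:
        return False
    if answer.count("?") > 8:  # proxy for "more than 8 questions"
        return False
    text = f"{question} {answer}"
    if any(pattern.search(text) for pattern in UNSAFE_PATTERNS):
        return False
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & MEDICAL_KEYWORDS)
```

Cheap length checks run first so that the regex and keyword work is only done on pairs that could plausibly pass.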
### Stage 2 — Quality score (0–100, threshold ≥ 30)
Each example is scored on:
- **Length quality** (0–25 pts): question 30–500 chars, answer 100–2000 chars
- **Medical vocabulary richness** (0–25 pts): presence of clinical terms
- **Answer structure** (0–25 pts): numbered points, multi-line, multiple sentences
- **Boilerplate penalty** (−8 pts each): "thank you for consulting", "please post a query", etc.
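A rough sketch of how such a score could be computed follows. The card gives only the point ranges, so how points are awarded inside each range is an assumption here, as are the clinical term set and the choice to apply the penalty once per distinct phrase. The three listed components add up to at most 75 points, so any remaining points presumably come from criteria not listed; the sketch does not try to guess them.

```python
import re

# The two boilerplate phrases quoted in the list above.
BOILERPLATE_PHRASES = ("thank you for consulting", "please post a query")
# Hypothetical clinical vocabulary.
CLINICAL_TERMS = {"fever", "cough", "infection", "symptoms", "inflammation",
                  "antibiotic", "diagnosis", "chronic"}


def length_points(question: str, answer: str) -> int:
    """Assumed split: 10 pts for question length, 15 pts for answer length."""
    points = 0
    if 30 <= len(question) <= 500:
        points += 10
    if 100 <= len(answer) <= 2000:
        points += 15
    return points


def vocabulary_points(answer: str) -> int:
    """Assumed rule: 5 pts per distinct clinical term, capped at 25."""
    words = set(re.findall(r"[a-z]+", answer.lower()))
    return min(25, 5 * len(words & CLINICAL_TERMS))


def structure_points(answer: str) -> int:
    """Assumed rule: numbered points 10, multi-line 5, multiple sentences 10."""
    points = 0
    if re.search(r"^\s*\d+[.)]\s", answer, re.M):
        points += 10
    if "\n" in answer.strip():
        points += 5
    if len(re.findall(r"[A-Za-z][.!?](?:\s|$)", answer)) >= 2:
        points += 10
    return points


def quality_score(question: str, answer: str) -> int:
    """Sum the component scores, subtract 8 per boilerplate phrase, floor at 0."""
    score = (length_points(question, answer)
             + vocabulary_points(answer)
             + structure_points(answer))
    lowered = answer.lower()
    score -= 8 * sum(phrase in lowered for phrase in BOILERPLATE_PHRASES)
    return max(0, score)


def keep(question: str, answer: str, threshold: int = 30) -> bool:
    """Apply the Stage 2 threshold."""
    return quality_score(question, answer) >= threshold
```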
### Stage 3 — Prompt length gate
- Formatted prompts > 3,000 characters dropped (VRAM safety for T4 fine-tuning)
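The gate itself is a single comparison on the formatted string. A minimal sketch, where `within_length_gate` is a hypothetical helper name and the rows are synthetic:

```python
MAX_PROMPT_CHARS = 3000  # limit stated above


def within_length_gate(prompt: str, max_chars: int = MAX_PROMPT_CHARS) -> bool:
    """Keep prompts of at most max_chars characters; longer ones are dropped."""
    return len(prompt) <= max_chars


# Synthetic rows just below, at, and just above the limit.
rows = [{"prompt": "x" * 2999}, {"prompt": "x" * 3000}, {"prompt": "x" * 3001}]
kept = [row for row in rows if within_length_gate(row["prompt"])]
```

Character count is only a proxy for token count, so the real memory footprint of a 3,000-character prompt depends on the tokenizer.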
---
## Usage
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("sourxv-17/llama3.1-medical-dataset")
train_data = ds["train"]
test_data = ds["test"]
print(f"Train: {len(train_data):,}")
print(f"Test : {len(test_data):,}")
# Inspect a sample
print(train_data[0]["prompt"])
```
---
## Fine-Tuned Model
This dataset was used to fine-tune:
| Artifact | Hugging Face repository |
|---|---|
| LoRA adapters | `sourxv-17/llama3.1-medical-chatdoc-lora` |
| GGUF (Q4_K_M) | `sourxv-17/llama3.1-medical-chatdoc-gguf` |
**Training config:**
- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`
- Method: QLoRA (4-bit) + RSLoRA
- LoRA rank: r=32, alpha=32
- Epochs: 2
- Effective batch size: 16
- Learning rate: 2e-4 (cosine decay)
- `train_on_responses_only`: enabled
- NEFTune noise alpha: 5
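Two numbers follow from this config and are easy to check by hand. The sketch below computes the LoRA scaling factor with and without RSLoRA (rank-stabilized LoRA scales the update by `alpha / sqrt(r)` instead of `alpha / r`) and a rough optimizer step count. The step count assumes no sequence packing and that the last partial batch is kept; neither is stated in the card.

```python
import math

r, alpha = 32, 32

# Standard LoRA scales the low-rank update by alpha / r.
standard_scale = alpha / r
# RSLoRA scales it by alpha / sqrt(r), which is larger for r > 1.
rslora_scale = alpha / math.sqrt(r)

train_examples = 56_631
effective_batch = 16
epochs = 2

# Assumes one example per sequence and a kept final partial batch.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"standard scale: {standard_scale:.2f}, RSLoRA scale: {rslora_scale:.2f}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

With `r = alpha = 32`, enabling RSLoRA raises the effective scaling from 1 to about 5.66, which is worth keeping in mind when reusing the learning rate with a different rank.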
---
## Safety & Limitations
- ⚕️ **Not a medical device.** This dataset and any models trained on it are for research and informational purposes only.
- 🚫 **Cannot replace professional medical advice.** Models trained here should always recommend consulting a licensed physician.
- ⚠️ **Hallucination risk.** Medical LLMs can generate plausible-sounding but incorrect information. Always validate outputs against clinical sources.
- 🛡️ **Guardrails are soft constraints.** Fine-tuning on safety examples improves but does not guarantee refusal behavior. A post-generation filter layer is strongly recommended for production deployment.
- 📋 **India-context emergency numbers** (112, iCall 9152987821, Vandrevala 1860-2662-345) are used in guardrail training examples. Adapt for your deployment region.
---
## Dataset Creation
**Preparation notebook:** `dataset_prep.ipynb`
**Pipeline steps:**
1. Load 3 source datasets
2. Standardise to unified system prompt
3. Hard quality filter (binary)
4. Quality score filter (≥ 30/100)
5. Apply LLaMA 3.1 chat template with `\n\n` boundaries
6. Prompt length filter (≤ 3,000 chars)
7. Shuffle + single train/test split (95/5)
8. Push to Hub
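Step 7 can be sketched in plain Python. The seed and the helper name `shuffle_and_split` are hypothetical, and the notebook may well use a library routine instead; the point is only that a 5% test fraction over 59,612 rows, rounded up, reproduces the split sizes reported above.

```python
import math
import random


def shuffle_and_split(rows, test_fraction=0.05, seed=42):
    """Shuffle once with a fixed seed, then cut off the test portion."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = math.ceil(len(rows) * test_fraction)  # test size rounded up
    return rows[n_test:], rows[:n_test]


# Row indices stand in for the 59,612 filtered examples.
train_rows, test_rows = shuffle_and_split(range(59_612))
print(len(train_rows), len(test_rows))
```

Splitting once after shuffling keeps the test rows disjoint from the training rows, provided the filtered data contains no duplicate examples.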
---
## Citation
If you use this dataset, please cite it together with the original sources:
```bibtex
@misc{sourxv17-medical-dataset-2025,
title = {LLaMA 3.1 Medical Dataset},
author = {sourxv-17},
year = {2025},
url = {https://huggingface.co/datasets/sourxv-17/llama3.1-medical-dataset},
note = {Derived from ChatDoctor HealthcareMagic, MedMCQA, and PubMedQA}
}
@inproceedings{medmcqa,
title = {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
author = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
year = {2022}
}
@article{pubmedqa,
title = {PubMedQA: A Dataset for Biomedical Research Question Answering},
author = {Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William and Lu, Xinghua},
year = {2019}
}
```
---
## License
Apache 2.0 — see individual source datasets for their respective licenses.
MedMCQA and PubMedQA are for research use. ChatDoctor HealthcareMagic is CC BY 4.0.
Provider:
sourxv-17



