sourxv-17/llama3.1-medical-dataset

Hugging Face | Updated 2026-04-04 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/sourxv-17/llama3.1-medical-dataset
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 153807514.8685528
    num_examples: 56631
  - name: test
    num_bytes: 8096275.923489889
    num_examples: 2981
  download_size: 69380635
  dataset_size: 161903790.7920427
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- medical
- qa
- instruction-tuning
- llm
- healthcare
- chatbot
- mcq
- education
pretty_name: 'Medical Instruction Tuning Dataset (QA + MCQ) '
size_categories:
- 10K<n<100K
---

# 🏥 LLaMA 3.1 Medical Dataset

A cleaned, filtered, and unified instruction-tuning dataset for fine-tuning LLaMA 3.1 8B on medical domain Q&A with safety guardrails.

Built to train a conversational medical assistant that:

- Answers health-related questions accurately
- Refuses out-of-scope queries
- Escalates emergencies to services
- Never diagnoses or replaces a licensed physician

---

## Dataset Summary

| Split | Examples |
|---|---|
| Train | 56,631 |
| Test | 2,981 |
| **Total** | **59,612** |

All examples use a **single unified system prompt**. This is a deliberate design choice to ensure the model learns one consistent identity across all training data, guardrails, and inference.

---

## Sources

| Source Dataset | Type | Role |
|---|---|---|
| `lavita/medical-qa-datasets` (ChatDoctor HealthcareMagic) | Conversational medical Q&A | Primary: real patient-doctor conversations |
| `medmcqa` | MCQ with clinical explanations | Secondary: explanations only, MCQ-only answers filtered |
| `pubmed_qa` (pqa_labeled) | Clinical evidence Q&A | Tertiary: evidence-based long answers |

---

## System Prompt

Every example in this dataset uses exactly this system prompt, with no variation:

```
You are a medical assistant trained to provide accurate, evidence-based health information. You only answer medical and health-related questions. For emergencies, always direct the user to call emergency services immediately. You cannot diagnose conditions or replace a licensed physician. Always recommend consulting a doctor for personal medical decisions.
```

This is intentional. Using a single consistent prompt across all training examples ensures the model learns one identity, which is critical for guardrail reliability at inference time.

---

## Format

Each example is a pre-formatted LLaMA 3.1 chat template string stored in the `prompt` column:

```
<|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{answer}<|eot_id|>
```

> **Note:** The `\n\n` after each header token is required for `train_on_responses_only` masking to work correctly in Unsloth. Without it, the boundary detection silently fails and the model trains on the full sequence including user prompts.

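The template above can be assembled with ordinary string formatting. The following is a minimal sketch, not the code from `dataset_prep.ipynb`: the names `SYSTEM_PROMPT`, `INSTRUCTION_PART`, `RESPONSE_PART`, and `format_example` are hypothetical and chosen only for illustration. It shows where the `\n\n` boundaries sit relative to the header tokens.

```python
# Hypothetical helper names; the dataset's own preparation code is not shown in this card.
SYSTEM_PROMPT = (
    "You are a medical assistant trained to provide accurate, evidence-based "
    "health information. You only answer medical and health-related questions. "
    "For emergencies, always direct the user to call emergency services immediately. "
    "You cannot diagnose conditions or replace a licensed physician. "
    "Always recommend consulting a doctor for personal medical decisions."
)

# Header strings including the trailing "\n\n" that the note above calls out.
SYSTEM_PART = "<|start_header_id|>system<|end_header_id|>\n\n"
INSTRUCTION_PART = "<|start_header_id|>user<|end_header_id|>\n\n"
RESPONSE_PART = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"


def format_example(question: str, answer: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Build one training string in the layout shown in the Format section."""
    return (
        f"{SYSTEM_PART}{system_prompt}{EOT}"
        f"{INSTRUCTION_PART}{question}{EOT}"
        f"{RESPONSE_PART}{answer}{EOT}"
    )


example = format_example(
    "What are common symptoms of dehydration?",
    "Common symptoms include thirst, dark urine, fatigue, and dizziness.",
)
# Every header token should be followed by exactly one blank-line boundary.
assert example.count("<|end_header_id|>\n\n") == 3
```

When configuring response-only masking, the user and assistant header strings (with their trailing `\n\n`) are the boundaries the masking step has to find, so keeping them as shared constants is one way to avoid a mismatch between data preparation and training.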
### Columns

| Column | Type | Description |
|---|---|---|
| `prompt` | `string` | Full pre-formatted LLaMA 3.1 chat prompt |
| `instruction` | `string` | System prompt (identical across all rows) |
| `input` | `string` | User question |
| `output` | `string` | Assistant answer |

---

## Filtering Pipeline

Raw data went through a multi-stage cleaning pipeline before being included:

### Stage 1: Hard filter (binary pass/fail)

- Question length ≥ 20 characters
- Answer length ≥ 50 characters and ≤ 3,000 characters
- MCQ-only answers dropped (e.g. `"A"`, `"The answer is B"`), because these teach the wrong output format for conversational Q&A
- Boilerplate-heavy responses removed (e.g. excessive "Chat Doctor" mentions)
- Answers with > 8 questions dropped (non-answers)
- Unsafe content removed (self-harm, overdose instructions, etc.)
- Must contain at least one medical keyword

### Stage 2: Quality score (0–100, threshold ≥ 30)

Each example is scored on:

- **Length quality** (0–25 pts): question 30–500 chars, answer 100–2000 chars
- **Medical vocabulary richness** (0–25 pts): presence of clinical terms
- **Answer structure** (0–25 pts): numbered points, multi-line, multiple sentences
- **Boilerplate penalty** (−8 pts each): "thank you for consulting", "please post a query", etc.

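Some of the Stage 1 rules and the Stage 2 boilerplate penalty can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the notebook's implementation: the keyword list, the phrase list, the MCQ regular expression, and the function names are all hypothetical; "more than 8 questions" is approximated by counting question marks; and the penalty is assumed to apply once per distinct phrase. The unsafe-content and boilerplate-heavy checks are omitted because the card does not describe how they are detected.

```python
import re

# Hypothetical lists for illustration; the real ones are not given in this card.
MEDICAL_KEYWORDS = ("pain", "symptom", "treatment", "infection", "dose", "diagnosis")
BOILERPLATE_PHRASES = ("thank you for consulting", "please post a query")

# Matches answers that consist only of an option letter, e.g. "A" or "The answer is B".
MCQ_ONLY = re.compile(r"^\s*(the answer is\s*)?\(?[a-d]\)?\.?\s*$", re.IGNORECASE)


def passes_hard_filter(question: str, answer: str) -> bool:
    """Binary pass/fail check covering the length, MCQ, question-count, and keyword rules."""
    if MCQ_ONLY.match(answer):
        return False  # listed as its own rule, though such answers also fail the length rule
    if len(question) < 20:
        return False
    if not 50 <= len(answer) <= 3000:
        return False
    if answer.count("?") > 8:  # assumption: "questions" counted via question marks
        return False
    text = f"{question} {answer}".lower()
    return any(keyword in text for keyword in MEDICAL_KEYWORDS)


def boilerplate_penalty(answer: str) -> int:
    """Points to subtract: 8 per distinct boilerplate phrase found (assumed reading)."""
    lowered = answer.lower()
    return 8 * sum(phrase in lowered for phrase in BOILERPLATE_PHRASES)


question = "What are common symptoms of a sinus infection?"
good_answer = "Common symptoms include facial pain, nasal congestion, and a reduced sense of smell."
print(passes_hard_filter(question, good_answer))
print(passes_hard_filter(question, "The answer is B"))
print(boilerplate_penalty("Thank you for consulting. Please post a query if needed."))
```

The remaining Stage 2 components (length, vocabulary, structure) are left out because the card gives only their point ranges, not how points are assigned within each range.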
### Stage 3: Prompt length gate

- Formatted prompts > 3,000 characters dropped (VRAM safety for T4 fine-tuning)

---

## Usage

### Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("sourxv-17/llama3.1-medical-dataset")

train_data = ds["train"]
test_data = ds["test"]

print(f"Train: {len(train_data):,}")
print(f"Test : {len(test_data):,}")

# Inspect a sample
print(train_data[0]["prompt"])
```

---

## Fine-Tuned Model

This dataset was used to fine-tune:

| Model | Link |
|---|---|
| LoRA adapters | `sourxv-17/llama3.1-medical-chatdoc-lora` |
| GGUF (Q4_K_M) | `sourxv-17/llama3.1-medical-chatdoc-gguf` |

**Training config:**

- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`
- Method: QLoRA (4-bit) + RSLoRA
- LoRA rank: r=32, alpha=32
- Epochs: 2
- Effective batch size: 16
- Learning rate: 2e-4 (cosine decay)
- `train_on_responses_only`: enabled
- NEFTune noise alpha: 5

---

## Safety & Limitations

- ⚕️ **Not a medical device.** This dataset and any models trained on it are for research and informational purposes only.
- 🚫 **Cannot replace professional medical advice.** Models trained here should always recommend consulting a licensed physician.
- ⚠️ **Hallucination risk.** Medical LLMs can generate plausible-sounding but incorrect information. Always validate outputs against clinical sources.
- 🛡️ **Guardrails are soft constraints.** Fine-tuning on safety examples improves but does not guarantee refusal behavior. A post-generation filter layer is strongly recommended for production deployment.
- 📋 **India-context emergency numbers** (112, iCall 9152987821, Vandrevala 1860-2662-345) are used in guardrail training examples. Adapt for your deployment region.

---

## Dataset Creation

**Preparation notebook:** `dataset_prep.ipynb`

**Pipeline steps:**

1. Load 3 source datasets
2. Standardise to unified system prompt
3. Hard quality filter (binary)
4. Quality score filter (≥ 30/100)
5. Apply LLaMA 3.1 chat template with `\n\n` boundaries
6. Prompt length filter (≤ 3,000 chars)
7. Shuffle + single train/test split (95/5)
8. Push to Hub

---

## Citation

If you use this dataset, please cite the original sources:

```bibtex
@misc{sourxv17-medical-dataset-2025,
  title  = {LLaMA 3.1 Medical Dataset},
  author = {sourxv-17},
  year   = {2025},
  url    = {https://huggingface.co/datasets/sourxv-17/llama3.1-medical-dataset},
  note   = {Derived from ChatDoctor HealthcareMagic, MedMCQA, and PubMedQA}
}

@inproceedings{medmcqa,
  title  = {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
  author = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  year   = {2022}
}

@article{pubmedqa,
  title  = {PubMedQA: A Dataset for Biomedical Research Question Answering},
  author = {Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William and Lu, Xinghua},
  year   = {2019}
}
```

---

## License

Apache 2.0; see individual source datasets for their respective licenses. MedMCQA and PubMedQA are for research use. ChatDoctor HealthcareMagic is CC BY 4.0.
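
Steps 6 and 7 of the pipeline under Dataset Creation (the prompt length filter and the 95/5 shuffle-and-split) can be sketched in plain Python. This is an illustrative sketch only: the function names, the seed, and the rule of rounding the test size up are assumptions, since the card does not state how the split was computed. With that rounding rule, 59,612 examples yield 56,631 train and 2,981 test rows, which agrees with the Dataset Summary table.

```python
import math
import random

MAX_PROMPT_CHARS = 3000  # from the prompt length gate in the card
TEST_FRACTION = 0.05     # from the 95/5 split in the card


def length_gate(prompts):
    """Keep only formatted prompts of at most MAX_PROMPT_CHARS characters."""
    return [p for p in prompts if len(p) <= MAX_PROMPT_CHARS]


def split_sizes(n_total, test_fraction=TEST_FRACTION):
    """Return (n_train, n_test), rounding the test size up (assumed rule)."""
    n_test = math.ceil(n_total * test_fraction)
    return n_total - n_test, n_test


def shuffle_and_split(rows, seed=42, test_fraction=TEST_FRACTION):
    """Shuffle once with a fixed (hypothetical) seed, then cut a single train/test split."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train, _ = split_sizes(len(rows), test_fraction)
    return rows[:n_train], rows[n_train:]


print(split_sizes(59612))
train_rows, test_rows = shuffle_and_split(range(100))
print(len(train_rows), len(test_rows))
```

Shuffling before a single cut keeps the three source datasets mixed across both splits rather than leaving one source concentrated in the test set.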
Provider:
sourxv-17