Saminx22/medical_data_for_slm
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Saminx22/medical_data_for_slm
Resource description:
---
language:
- en
license: other
tags:
- medical
- biology
- pretraining
pretty_name: Medical SLM Pretraining Dataset
task_categories:
- text-generation
---
# 🏥 Medical SLM Pretraining Dataset Card
This dataset is a cleaned, quality-filtered collection of medical text designed for pretraining small language models (SLMs). It aggregates data from three sources (PubMed abstracts, PMC Open Access full-text articles, and clinical guidelines), focusing on general medicine and clinical guidelines.
## 📊 Dataset Summary
- **Total Documents:** ~44,400
- **Estimated Tokens:** ~44.7 Million
- **Primary Language:** English
- **Configurations:**
  - `documents`: Raw cleaned text records.
  - `chunks`: Tokenized and packed 1024-token sequences for training.
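
As a rough sketch of how the two configurations might be loaded with the Hugging Face `datasets` library: the repository ID and configuration names come from this card, while the `train` split name and the helper `load_config` are assumptions made for illustration. The card does not list column names, so inspect a record before relying on any field.

```python
# Configuration names and repository ID as given on this card.
CONFIGS = ("documents", "chunks")
REPO_ID = "Saminx22/medical_data_for_slm"


def load_config(name, split="train", streaming=True):
    """Load one configuration of the dataset (hypothetical helper).

    The "train" split name is an assumption; the card does not state it.
    Streaming avoids downloading the full corpus up front.
    """
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration {name!r}; expected one of {CONFIGS}")
    # Imported lazily so that validating a name does not require the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split, streaming=streaming)
```

With network access, `next(iter(load_config("chunks")))` returns the first record, which is a quick way to see the actual column names.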
## 🛠 Processing Pipeline
1. **Cleaning:** Removed boilerplate (copyrights, funding, URLs, DOIs) and normalized whitespace.
2. **Quality Filtering:** Removed documents that were too short (<300 chars), had low alphabetic ratios, or high digit/special character ratios.
3. **Deduplication:** Applied exact MD5 hashing to remove identical documents.
4. **Packing:** Documents are greedily packed into 1024-token chunks, separated by `<|endoftext|>`.
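
The four steps above can be sketched roughly as follows. This is not the code used to build the dataset. The 300-character minimum, MD5 hashing, 1024-token chunk length, and `<|endoftext|>` separator come from this card; the ratio thresholds, the regular expressions, and all function names are assumptions for illustration. Packing is shown as simple concatenate-and-split, which is one common reading of "greedy packing", and dropping the trailing partial chunk is likewise an assumption.

```python
import hashlib
import re

MIN_CHARS = 300          # from the card
CHUNK_LEN = 1024         # from the card
MIN_ALPHA_RATIO = 0.6    # assumed threshold; the card gives no value
MAX_DIGIT_RATIO = 0.3    # assumed threshold; the card gives no value

URL_RE = re.compile(r"https?://\S+")          # assumed URL pattern
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")     # assumed DOI pattern


def clean(text):
    """Strip URLs and DOIs, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = DOI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def passes_quality(text):
    """Reject short documents and those with too few letters or too many digits."""
    if len(text) < MIN_CHARS:
        return False
    alpha = sum(c.isalpha() for c in text) / len(text)
    digit = sum(c.isdigit() for c in text) / len(text)
    return alpha >= MIN_ALPHA_RATIO and digit <= MAX_DIGIT_RATIO


def dedup_exact(docs):
    """Keep the first occurrence of each document, compared by MD5 of its text."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def pack(token_lists, eot_id, chunk_len=CHUNK_LEN):
    """Concatenate tokenized documents, each followed by the end-of-text id,
    and cut the stream into fixed-length chunks."""
    buffer, chunks = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        buffer.append(eot_id)
        while len(buffer) >= chunk_len:
            chunks.append(buffer[:chunk_len])
            buffer = buffer[chunk_len:]
    return chunks  # any trailing partial chunk is dropped (assumption)
```

Here `eot_id` stands for whatever integer ID the chosen tokenizer assigns to `<|endoftext|>`; the card does not name the tokenizer.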
## 📚 Original Datasets & Credits
We gratefully acknowledge the creators of the original datasets used to compile this corpus:
### 1. PubMed Abstracts
- **Source:** [NCBI PubMed](https://huggingface.co/datasets/ncbi/pubmed)
- **Credit:** National Library of Medicine (NLM).
- **Usage:** Scientific abstracts summarizing biomedical research.
### 2. PMC Open Access (PMC OA)
- **Source:** [PMC Open Access Subset](https://huggingface.co/datasets/axiong/pmc_oa)
- **Credit:** National Institutes of Health (NIH) / PubMed Central.
- **Usage:** Full-text articles providing deep clinical context.
### 3. Clinical Guidelines
- **Source:** [EPFL-LLM Guidelines](https://huggingface.co/datasets/epfl-llm/guidelines)
- **Credit:** EPFL-LLM Team. Aggregated from WHO, CDC, NICE, and other health organizations.
- **Usage:** Authoritative medical standards and clinical practice protocols.
## ⚖️ License
This aggregate dataset is provided for research purposes. Users must adhere to the individual licenses of the source datasets (Creative Commons, NLM, etc.).
Provider:
Saminx22



