
Saminx22/medical_data_for_slm

Hugging Face | Updated 2026-04-03 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Saminx22/medical_data_for_slm
Resource description:
---
language:
- en
license: other
tags:
- medical
- biology
- pretraining
pretty_name: Medical SLM Pretraining Dataset
task_categories:
- text-generation
---

# 🏥 Medical SLM Pretraining Dataset Card

This dataset is a high-quality, cleaned collection of medical text designed for pretraining small language models (SLMs). It aggregates data from three primary authoritative sources, focusing on general medicine and clinical guidelines.

## 📊 Dataset Summary

- **Total Documents:** ~44,400
- **Estimated Tokens:** ~44.7 Million
- **Primary Language:** English
- **Configurations:**
  - `documents`: Raw cleaned text records.
  - `chunks`: Tokenized and packed 1024-token sequences for training.

## 🛠 Processing Pipeline

1. **Cleaning:** Removed boilerplate (copyrights, funding, URLs, DOIs) and normalized whitespace.
2. **Quality Filtering:** Removed documents that were too short (<300 chars), had low alphabetic ratios, or high digit/special character ratios.
3. **Deduplication:** Applied exact MD5 hashing to remove identical documents.
4. **Packing:** Greedy packing of documents into 1024-token chunks separated by `<|endoftext|>`.

## 📚 Original Datasets & Credits

We gratefully acknowledge the creators of the original datasets used to compile this corpus:

### 1. PubMed Abstracts

- **Source:** [NCBI PubMed](https://huggingface.co/datasets/ncbi/pubmed)
- **Credit:** National Library of Medicine (NLM).
- **Usage:** Scientific abstracts summarizing biomedical research.

### 2. PMC Open Access (PMC OA)

- **Source:** [PMC Open Access Subset](https://huggingface.co/datasets/axiong/pmc_oa)
- **Credit:** National Institutes of Health (NIH) / PubMed Central.
- **Usage:** Full-text articles providing deep clinical context.

### 3. Clinical Guidelines

- **Source:** [EPFL-LLM Guidelines](https://huggingface.co/datasets/epfl-llm/guidelines)
- **Credit:** EPFL-LLM Team. Aggregated from WHO, CDC, NICE, and other health organizations.
- **Usage:** Authoritative medical standards and clinical practice protocols.

## ⚖️ License

This aggregate dataset is provided for research purposes. Users must adhere to the individual licenses of the source datasets (Creative Commons, NLM, etc.).
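
## 🧪 Illustrative Pipeline Sketch

The quality filtering, deduplication, and packing steps listed under Processing Pipeline can be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual code. The 300-character minimum, the 1024-token chunk length, and the `<|endoftext|>` separator come from the card. The ratio thresholds (`MIN_ALPHA_RATIO`, `MAX_NOISE_RATIO`), the function names, and the whitespace tokenizer are assumptions made for the example, since the card does not state them. Packing is interpreted here as concatenating documents and cutting fixed-size chunks, which is one common reading of "greedy packing".

```python
import hashlib

MIN_CHARS = 300          # from the card: documents under 300 characters are dropped
MIN_ALPHA_RATIO = 0.6    # assumed value; the card gives no threshold
MAX_NOISE_RATIO = 0.3    # assumed value for digits/special characters
CHUNK_LEN = 1024         # from the card: 1024-token chunks
EOT = "<|endoftext|>"    # document separator named in the card


def passes_quality(text: str) -> bool:
    """Apply the length check and the (assumed) character-ratio checks."""
    if len(text) < MIN_CHARS:
        return False
    alpha = sum(ch.isalpha() for ch in text) / len(text)
    # "Noise" here means anything that is neither a letter nor whitespace.
    noise = sum(not (ch.isalpha() or ch.isspace()) for ch in text) / len(text)
    return alpha >= MIN_ALPHA_RATIO and noise <= MAX_NOISE_RATIO


def dedup_exact(docs):
    """Keep the first occurrence of each document, keyed by the MD5 of its text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def pack(docs, tokenize=str.split, chunk_len=CHUNK_LEN):
    """Concatenate tokenized documents, each followed by EOT, into fixed-size chunks.

    `tokenize` defaults to whitespace splitting as a stand-in for a real tokenizer.
    Returns the full chunks and the leftover tokens that did not fill a chunk.
    """
    buffer = []
    chunks = []
    for doc in docs:
        buffer.extend(tokenize(doc))
        buffer.append(EOT)
        while len(buffer) >= chunk_len:
            chunks.append(buffer[:chunk_len])
            buffer = buffer[chunk_len:]
    return chunks, buffer
```

A typical call order would be `pack(dedup_exact([d for d in docs if passes_quality(d)]))`. In practice the whitespace tokenizer would be replaced by the tokenizer of the model being pretrained, and the leftover buffer would be either padded or discarded.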
Provider:
Saminx22