
Khalilbraham/PKPD-Dataset

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Khalilbraham/PKPD-Dataset
Resource description:
---
pretty_name: PKPD Dataset
language:
- en
task_categories:
- text-generation
size_categories:
- 10K<n<100K
source_datasets:
- extended
annotations_creators:
- machine-generated
---

# PKPD Dataset

## Dataset summary

This dataset is a pharmacokinetics/pharmacodynamics (PK/PD) and pharmacometrics corpus built for domain-adaptive pretraining (DAPT).

It was created from automatically downloadable biomedical literature using:

- PubMed search via NCBI E-utilities
- PMID to PMCID mapping via the official PMC id conversion service
- Europe PMC / PMC open-access full-text XML retrieval
- JATS/XML parsing and heuristic filtering

The current released corpus contains **27,990 documents** and approximately **109.4 million estimated tokens**.

This release contains only the **core PubMed/PMC open-access article corpus**. Optional FDA guidance and open-source repository documentation were implemented in the pipeline but are **not included** in the current dataset export.

## Scope

The search strategy targets:

- pharmacokinetics
- pharmacodynamics
- PK/PD modeling
- population PK/PD
- nonlinear mixed effects modeling
- NONMEM / Monolix / SAEM / FOCE / NLME
- PBPK
- exposure-response
- dose selection
- model-informed drug development
- clinical pharmacology
- covariate modeling
- Bayesian PKPD

The corpus is intended for:

- domain-adaptive pretraining
- continued pretraining of biomedical or general LLMs
- information retrieval / RAG experiments
- corpus analysis for pharmacometrics language

## Source data and collection pipeline

### Source systems

Primary source systems used for this release:

1. **PubMed / NCBI E-utilities**
2. **PMC ID conversion API**
3. **Europe PMC fullTextXML endpoint**

Excluded from this release:

- paywalled journal scraping
- copyrighted textbook scraping
- FDA guidance pages
- open-source repository docs

### Date range

- **Search period:** 2010-01-01 to 2026-03-12

### Query families

The PubMed search used five overlapping query families:

1. `pkpd_core`
2. `population_pkpd`
3. `nlme_platforms`
4. `pbpk`
5. `exposure_response_midd`

Per-query unique PMID counts before cross-query deduplication:

| Query family | Unique PMIDs |
|---|---:|
| `pkpd_core` | 145,533 |
| `population_pkpd` | 6,847 |
| `nlme_platforms` | 3,932 |
| `pbpk` | 4,951 |
| `exposure_response_midd` | 12,331 |

After deduplication across query families, the search yielded:

- **156,274 unique PMIDs**

### Retrieval and filtering stages

Pipeline totals:

1. PubMed search: **156,274 unique PMIDs**
2. PMID to PMCID mapping: **66,948 PMCIDs**
3. Europe PMC / PMC XML retrieved: **49,097 XML articles**
4. Parsed JATS records: **49,097**
5. Final kept DAPT documents: **27,990**

Retrieval outcomes:

- PMCIDs with XML successfully materialized locally: **49,097**
- PMCIDs mapped but not available through Europe PMC fullTextXML: **18,215**
- Fetch failures: **0** at the end of the completed run

Filtering outcomes:

- Parsed input docs: **49,097**
- Final kept docs: **27,990**
- Rejected for low relevance: **19,052**
- Rejected for too short length: **2,054**
- Rejected as duplicates: **1**

## Data fields

Each record in the final JSONL contains:

- `id`: document identifier, usually PMCID-based
- `source`: source group
- `title`: article title
- `text`: cleaned training text

Example schema:

```json
{
  "id": "PMC10010492",
  "source": "core_pubmed_pmc",
  "title": "Integrative population pharmacokinetic/pharmacodynamic analysis of nemonoxacin capsule in Chinese patients with community-acquired pneumonia",
  "text": "..."
}
```

## Split / repartition

Current files on disk:

- `final_merged_dapt.jsonl`: **27,990** records
- `train.jsonl`: **27,431** records
- `eval.jsonl`: **559** records

Split proportions:

- **Train:** 27,431 / 27,990 = **98.0%**
- **Validation:** 559 / 27,990 = **2.0%**

Source repartition in the final release:

| Source | Documents | Share |
|---|---:|---:|
| `core_pubmed_pmc` | 27,990 | 100% |

Character / token scale:

- Total characters: **437,602,093**
- Average characters per kept document: **15,638.38**
- Rough token estimate: **109,400,619**

## Text extraction details

The XML parser keeps article components most useful for PKPD DAPT:

- title
- abstract
- methods
- modeling
- statistical analysis
- results
- discussion
- conclusion

The parser drops low-value or non-training sections when possible:

- references
- acknowledgements
- funding boilerplate
- author contributions
- supplementary boilerplate

Whitespace is normalized, and some inline citation clutter is removed.

## Quality notes

This corpus was built with **high recall rather than high precision**.

It is strong for:

- PK/PD language
- clinical pharmacology
- PBPK
- exposure-response
- dose optimization
- drug disposition and modeling methods

However, the query strategy is broad, and some retained articles are only **adjacent** to pharmacometrics rather than strictly within it. For example, some documents concern:

- broader translational pharmacology
- oncology therapeutics
- drug-protein binding
- formulation or delivery topics

This makes the dataset suitable for a **prototype DAPT corpus**, but not yet a perfectly clean pharmacometrics-only benchmark.
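
Because some retained articles are only adjacent to pharmacometrics, a second curation pass may be useful before any strict benchmarking. The following is a minimal sketch of one possible keyword-density filter. It is not the relevance heuristic used by the pipeline: the term list `CORE_TERMS`, the helper names, and the `min_density` threshold are hypothetical choices made for illustration, and the only assumption taken from this card is that each record has a `text` field.

```python
# Hypothetical term list, for illustration only; not the pipeline's own filter.
CORE_TERMS = (
    "pharmacokinetic",
    "pharmacodynamic",
    "nonmem",
    "monolix",
    "pbpk",
    "exposure-response",
    "population pk",
    "mixed-effects",
    "mixed effects",
)


def keyword_hits(text, terms=CORE_TERMS):
    """Count case-insensitive substring occurrences of each term in the text."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in terms)


def hits_per_1k_words(text, terms=CORE_TERMS):
    """Normalize the hit count by document length (hits per 1,000 words)."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return 1000.0 * keyword_hits(text, terms) / n_words


def is_core_pharmacometrics(record, min_density=2.0):
    """Keep a record only if its keyword density reaches the assumed threshold."""
    return hits_per_1k_words(record["text"]) >= min_density


if __name__ == "__main__":
    records = [
        {"id": "a", "text": "Population pharmacokinetic model built in NONMEM"},
        {"id": "b", "text": "Tumor growth was measured weekly in mice"},
    ]
    kept = [r["id"] for r in records if is_core_pharmacometrics(r)]
    print(kept)
```

Normalizing by word count keeps long review articles from passing on a handful of incidental mentions. Any threshold would need to be tuned against a manually labelled sample of the corpus.
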

## Intended uses

Recommended uses:

- domain-adaptive pretraining for LLMs
- continued pretraining of Qwen/Llama/Mistral-style causal LMs
- corpus mining and keyword analysis
- retrieval experiments on PKPD literature

Not recommended as-is for:

- strict pharmacometrics benchmarking without extra curation
- legal redistribution assumptions without checking article-level terms
- clinical decision support

## Licensing and redistribution note

This dataset is derived from **PMC / Europe PMC open-access full-text XML** and related PubMed metadata, but the corpus should **not** be interpreted as having a single unified license automatically inherited across all articles.

Important note:

- PMC / Europe PMC accessibility does **not** guarantee identical downstream redistribution terms for every document.
- Before making the dataset public, article-level licensing and redistribution conditions should be reviewed carefully.

For conservative use, private hosting is recommended until licensing is fully audited.

## Reproducibility

The dataset was generated by the local pipeline in:

- `scripts/01_search_pubmed.py`
- `scripts/02_map_pmids_to_pmcids.py`
- `scripts/03_fetch_fulltext_xml.py`
- `scripts/04_parse_jats_xml.py`
- `scripts/05_build_dapt_jsonl.py`
- `scripts/08_merge_and_report.py`

Summary reports used for this card:

- `data/reports/pubmed_search_summary.json`
- `data/reports/fulltext_retrieval_report.json`
- `data/reports/parsed_xml_report.json`
- `data/reports/core_pubmed_build_report.json`
- `data/reports/corpus_summary.json`

## Loading example

```python
from datasets import load_dataset

ds = load_dataset("Khalilbraham/PKPD-Dataset")
print(ds)
print(ds["train"][0].keys())
```

## Suggested citation

If you use this dataset, cite:

1. PubMed / NCBI E-utilities
2. PMC / Europe PMC
3. The dataset repository itself

You may also cite the associated local corpus-building pipeline if released separately.
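
For the article-level license review recommended in the licensing note, one possible starting point is to read the `<license>` element that JATS articles usually carry inside `<permissions>`. The sketch below assumes access to the original full-text XML, since the released JSONL records do not include license metadata. The function name `extract_license` and the sample document `SAMPLE_JATS` are hypothetical, and this is not a step of the pipeline described in this card.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the namespaced xlink:href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Tiny hand-written JATS fragment, for illustration only.
SAMPLE_JATS = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><permissions>
    <license license-type="open-access"
             xlink:href="https://creativecommons.org/licenses/by/4.0/">
      <license-p>This is an open access article under the CC BY license.</license-p>
    </license>
  </permissions></article-meta></front>
</article>"""


def extract_license(jats_xml):
    """Return the license type, URL, and statement text found in a JATS article.

    All fields are empty when no <license> element is present, which should be
    treated as "unknown terms", not as permission to redistribute.
    """
    root = ET.fromstring(jats_xml)
    lic = root.find(".//permissions/license")
    if lic is None:
        return {"license_type": None, "href": None, "text": ""}
    # Collapse whitespace across the nested license paragraphs.
    text = " ".join("".join(lic.itertext()).split())
    return {
        "license_type": lic.get("license-type"),
        "href": lic.get(XLINK_HREF),
        "text": text,
    }


print(extract_license(SAMPLE_JATS))
```

Tallying the extracted `href` values per PMCID would give a first view of which licenses are present in the corpus. Articles with a missing or non-standard license statement would still need manual review.
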
Provider:
Khalilbraham