Khalilbraham/PKPD-Dataset
Source: Hugging Face. Updated: 2026-03-24. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Khalilbraham/PKPD-Dataset
Resource description:

---
pretty_name: PKPD Dataset
language:
- en
task_categories:
- text-generation
size_categories:
- 10K<n<100K
source_datasets:
- extended
annotations_creators:
- machine-generated
---
# PKPD Dataset
## Dataset summary
This dataset is a pharmacokinetics/pharmacodynamics (PK/PD) and pharmacometrics corpus built for domain-adaptive pretraining (DAPT).
It was created from automatically downloadable biomedical literature using:
- PubMed search via NCBI E-utilities
- PMID to PMCID mapping via the official PMC id conversion service
- Europe PMC / PMC open-access full-text XML retrieval
- JATS/XML parsing and heuristic filtering
The current release contains **27,990 documents** and an estimated **109.4 million tokens**.
This release contains only the **core PubMed/PMC open-access article corpus**. Optional collectors for FDA guidance and open-source repository documentation are implemented in the pipeline, but their output is **not included** in the current dataset export.
## Scope
The search strategy targets:
- pharmacokinetics
- pharmacodynamics
- PK/PD modeling
- population PK/PD
- nonlinear mixed effects modeling
- NONMEM / Monolix / SAEM / FOCE / NLME
- PBPK
- exposure-response
- dose selection
- model-informed drug development
- clinical pharmacology
- covariate modeling
- Bayesian PKPD
The corpus is intended for:
- domain-adaptive pretraining
- continued pretraining of biomedical or general LLMs
- information retrieval / RAG experiments
- corpus analysis for pharmacometrics language
## Source data and collection pipeline
### Source systems
Primary source systems used for this release:
1. **PubMed / NCBI E-utilities**
2. **PMC ID conversion API**
3. **Europe PMC fullTextXML endpoint**
Excluded from this release:
- paywalled journal scraping
- copyrighted textbook scraping
- FDA guidance pages
- open-source repository docs
### Date range
- **Search period:** 2010-01-01 to 2026-03-12
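
As an illustration of how this date range could be applied, the sketch below builds an NCBI E-utilities `esearch` request URL restricted to the search period. The query term, the `retmax` value, and the helper name `build_esearch_url` are hypothetical; the actual query strings behind the query families are not given in this card.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, mindate="2010/01/01", maxdate="2026/03/12",
                      retmax=1000, retstart=0):
    """Build an esearch URL limited to a publication-date window.

    E-utilities expects dates as YYYY/MM/DD; datetype="pdat" selects the
    publication date. The default term and paging values are assumptions.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retstart": retstart,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"

# Hypothetical query term, for illustration only.
url = build_esearch_url('"population pharmacokinetics"[Title/Abstract]')
print(url)
```

The function only constructs the URL; issuing the request, paging through results, and respecting NCBI rate limits are left out of this sketch.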
### Query families
The PubMed search used five overlapping query families:
1. `pkpd_core`
2. `population_pkpd`
3. `nlme_platforms`
4. `pbpk`
5. `exposure_response_midd`
Per-query unique PMID counts before cross-query deduplication:
| Query family | Unique PMIDs |
|---|---:|
| `pkpd_core` | 145,533 |
| `population_pkpd` | 6,847 |
| `nlme_platforms` | 3,932 |
| `pbpk` | 4,951 |
| `exposure_response_midd` | 12,331 |
After deduplication across query families, the search yielded:
- **156,274 unique PMIDs**
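
The arithmetic behind the deduplication step can be checked directly from the table. The sketch below sums the per-family counts, compares them with the reported unique total, and shows a set union on toy data; the toy PMIDs and the helper name `merge_families` are made up for illustration.

```python
# Per-family counts and the deduplicated total, as reported in this card.
PER_FAMILY_COUNTS = {
    "pkpd_core": 145_533,
    "population_pkpd": 6_847,
    "nlme_platforms": 3_932,
    "pbpk": 4_951,
    "exposure_response_midd": 12_331,
}
REPORTED_UNIQUE = 156_274

summed = sum(PER_FAMILY_COUNTS.values())
# Listings removed because the same PMID was returned by more than one family.
redundant_listings = summed - REPORTED_UNIQUE

def merge_families(pmids_by_family):
    """Union the PMID lists of all query families into one set."""
    merged = set()
    for pmids in pmids_by_family.values():
        merged.update(pmids)
    return merged

# Toy data (hypothetical PMIDs) showing the same idea at small scale.
toy = {
    "pkpd_core": ["101", "102", "103"],
    "pbpk": ["103", "104"],
    "population_pkpd": ["102"],
}
merged_toy = merge_families(toy)
print(summed, redundant_listings, sorted(merged_toy))
```

Note that `redundant_listings` counts removed listings, not distinct overlapping PMIDs: a PMID returned by three families contributes two removals.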
### Retrieval and filtering stages
Pipeline totals:
1. PubMed search: **156,274 unique PMIDs**
2. PMID to PMCID mapping: **66,948 PMCIDs**
3. Europe PMC / PMC XML retrieved: **49,097 XML articles**
4. Parsed JATS records: **49,097**
5. Final kept DAPT documents: **27,990**
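
To make the attrition between stages easier to read, the totals above can be turned into stage-to-stage and cumulative yields. This is a small arithmetic sketch over the reported counts; the helper name `stage_yields` is an illustrative choice.

```python
# Stage counts copied from the pipeline totals in this card.
STAGES = [
    ("PubMed search (unique PMIDs)", 156_274),
    ("PMID to PMCID mapping", 66_948),
    ("Full-text XML retrieved", 49_097),
    ("Parsed JATS records", 49_097),
    ("Final kept DAPT documents", 27_990),
]

def stage_yields(stages):
    """Return (name, count, % of previous stage, % of first stage) per stage."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, 100 * count / previous, 100 * count / first))
        previous = count
    return rows

rows = stage_yields(STAGES)
for name, count, step_pct, overall_pct in rows:
    print(f"{name:32s} {count:>8,d}  {step_pct:6.1f}%  {overall_pct:6.1f}%")
```

Read this way, roughly 43% of PMIDs map to a PMCID, about 73% of those yield XML, and about 57% of parsed articles survive filtering, so the final corpus is close to 18% of the initial search hits.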
Retrieval outcomes:
- PMCIDs with XML successfully materialized locally: **49,097**
- PMCIDs mapped but not available through Europe PMC fullTextXML: **18,215**
- Fetch failures: **0** at the end of the completed run
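
For readers who want to spot-check individual articles, the sketch below builds the Europe PMC REST URL for the full-text XML of a given PMCID. The helper name `fulltext_xml_url` is hypothetical, and this is not necessarily how `scripts/03_fetch_fulltext_xml.py` constructs its requests.

```python
EUROPE_PMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def fulltext_xml_url(pmcid):
    """Return the Europe PMC fullTextXML URL for a PMCID such as 'PMC10010492'."""
    cleaned = pmcid.strip().upper()
    if not cleaned.startswith("PMC"):
        cleaned = "PMC" + cleaned  # accept bare numeric identifiers as well
    if not cleaned[3:].isdigit():
        raise ValueError(f"not a PMCID: {pmcid!r}")
    return f"{EUROPE_PMC_REST}/{cleaned}/fullTextXML"

print(fulltext_xml_url("PMC10010492"))
```

The block only builds the URL and performs no network request. Articles that are mapped to a PMCID but are not in the open-access subset are not served by this endpoint, which is presumably what the "mapped but not available" count above reflects.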
Filtering outcomes:
- Parsed input docs: **49,097**
- Final kept docs: **27,990**
- Rejected for low relevance: **19,052**
- Rejected for too short length: **2,054**
- Rejected as duplicates: **1**
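
The four filtering outcomes add up exactly to the parsed input. The sketch below checks that and illustrates one way such a heuristic filter could be structured. The keyword list, both thresholds, and the order of the checks are assumptions made for illustration; the card does not document the actual rules used in `scripts/05_build_dapt_jsonl.py`.

```python
import hashlib
import re

# Reported outcomes: kept + low relevance + too short + duplicates.
assert 27_990 + 19_052 + 2_054 + 1 == 49_097

# Hypothetical relevance vocabulary and thresholds (not taken from the pipeline).
PKPD_TERMS = ("pharmacokinetic", "pharmacodynamic", "nonmem", "pbpk",
              "exposure-response", "clearance")
MIN_CHARS = 2000
MIN_TERM_HITS = 3

def classify(text, seen_hashes):
    """Return 'too_short', 'low_relevance', 'duplicate', or 'keep'."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    if len(normalized) < MIN_CHARS:
        return "too_short"
    hits = sum(normalized.count(term) for term in PKPD_TERMS)
    if hits < MIN_TERM_HITS:
        return "low_relevance"
    # Exact-duplicate detection on the normalized text.
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return "duplicate"
    seen_hashes.add(digest)
    return "keep"

seen = set()
doc = "pharmacokinetic clearance model " * 100
print(classify(doc, seen), classify(doc, seen), classify("short", seen))
```

Exact hashing only catches verbatim duplicates after whitespace and case normalization; near-duplicate detection would need a different technique.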
## Data fields
Each record in the final JSONL contains:
- `id`: document identifier, usually PMCID-based
- `source`: source group
- `title`: article title
- `text`: cleaned training text
Example schema:
```json
{
"id": "PMC10010492",
"source": "core_pubmed_pmc",
"title": "Integrative population pharmacokinetic/pharmacodynamic analysis of nemonoxacin capsule in Chinese patients with community-acquired pneumonia",
"text": "..."
}
```
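
When reading the JSONL files directly, a light check that each line carries the four documented fields can catch truncated or malformed records early. The following is a minimal sketch using only the standard library; the helper name `parse_record` is illustrative.

```python
import json

REQUIRED_FIELDS = ("id", "source", "title", "text")

def parse_record(line):
    """Parse one JSONL line and verify the four documented string fields."""
    record = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field in REQUIRED_FIELDS:
        if not isinstance(record[field], str):
            raise TypeError(f"field {field!r} is not a string")
    return record

sample_line = json.dumps({
    "id": "PMC10010492",
    "source": "core_pubmed_pmc",
    "title": "Example title",
    "text": "...",
})
record = parse_record(sample_line)
print(record["id"], record["source"])
```

To validate a whole file, the same function can be applied to each line of `train.jsonl` or `eval.jsonl` in a loop.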
## Splits and source distribution
Current files on disk:
- `final_merged_dapt.jsonl`: **27,990** records
- `train.jsonl`: **27,431** records
- `eval.jsonl`: **559** records
Split proportions:
- **Train:** 27,431 / 27,990 = **98.0%**
- **Evaluation (`eval.jsonl`):** 559 / 27,990 = **2.0%**
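
The card does not say how the split was drawn. If a comparable holdout needs to be recreated, one possible approach is a deterministic hash-based assignment on the document `id`, sketched below. This is an assumption and not the author's method, so it will not reproduce the exact membership of `eval.jsonl`.

```python
import hashlib

EVAL_FRACTION = 0.02  # close to the reported 559 / 27,990 holdout share

def assign_split(doc_id, eval_fraction=EVAL_FRACTION):
    """Deterministically map a document id to 'train' or 'eval'."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "eval" if bucket < eval_fraction else "train"

print(assign_split("PMC10010492"))
print(f"reported eval share: {559 / 27_990:.2%}")
```

A hash-based split stays stable when documents are added or removed later, at the cost of hitting the target fraction only approximately.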
Source distribution in the final release:
| Source | Documents | Share |
|---|---:|---:|
| `core_pubmed_pmc` | 27,990 | 100% |
Character / token scale:
- Total characters: **437,602,093**
- Average characters per kept document: **15,638.38**
- Rough token estimate: **109,400,619**
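
The reported token figure is consistent with a heuristic of roughly four characters per token. The card does not state the exact rule, so the ratio in the sketch below is an assumption; real tokenizer counts will differ by model.

```python
CHARS_PER_TOKEN = 4.0  # assumed heuristic, not stated in the card

def estimate_tokens(n_chars, chars_per_token=CHARS_PER_TOKEN):
    """Rough token estimate from a character count."""
    return round(n_chars / chars_per_token)

TOTAL_CHARS = 437_602_093       # reported total characters
REPORTED_TOKENS = 109_400_619   # reported rough token estimate

estimated = estimate_tokens(TOTAL_CHARS)
print(estimated, REPORTED_TOKENS, TOTAL_CHARS / REPORTED_TOKENS)
```

The estimate from the corpus-level character count lands very close to the reported value; a small gap is expected if the pipeline estimated tokens per document and then summed.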
## Text extraction details
The XML parser keeps article components most useful for PKPD DAPT:
- title
- abstract
- methods
- modeling
- statistical analysis
- results
- discussion
- conclusion
The parser drops low-value or non-training sections when possible:
- references
- acknowledgements
- funding boilerplate
- author contributions
- supplementary boilerplate
Whitespace is normalized, and some inline citation clutter is removed.
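
The extraction rules above can be sketched with the standard-library XML parser. This is a simplified illustration, not the code in `scripts/04_parse_jats_xml.py`: the tag sets, the helper names, and the sample article are assumptions, and real JATS files contain many more element types.

```python
import re
import xml.etree.ElementTree as ET

# Assumed tag sets; the real parser may use different ones.
DROP_TAGS = {"ref-list", "ack", "fn-group", "supplementary-material"}
BLOCK_TAGS = {"article-title", "title", "p", "sec"}

def visible_text(elem):
    """Collect text under elem, skipping dropped tags and bibliographic xrefs."""
    parts = []
    is_citation = elem.tag == "xref" and elem.get("ref-type") == "bibr"
    if elem.tag not in DROP_TAGS and not is_citation:
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            parts.append(visible_text(child))
        if elem.tag in BLOCK_TAGS:
            parts.append(" ")  # keep block-level elements from running together
    if elem.tail:
        parts.append(elem.tail)  # tail text belongs to the parent, so keep it
    return "".join(parts)

def extract_text(xml_string):
    """Return title, abstract, and body text with whitespace normalized."""
    root = ET.fromstring(xml_string)
    chunks = []
    for path in (".//article-title", ".//abstract", ".//body"):
        elem = root.find(path)
        if elem is not None:
            chunks.append(re.sub(r"\s+", " ", visible_text(elem)).strip())
    return "\n\n".join(chunk for chunk in chunks if chunk)

# Made-up miniature article for demonstration.
SAMPLE = (
    "<article><front><article-meta><title-group>"
    "<article-title>Population PK of drug X</article-title></title-group>"
    "<abstract><p>We fit a model.</p></abstract></article-meta></front>"
    "<body><sec><title>Methods</title><p>NONMEM was used "
    '<xref ref-type="bibr" rid="r1">[1]</xref> for estimation.</p>'
    "<supplementary-material><p>File S1</p></supplementary-material></sec></body>"
    "<back><ack><p>Thanks.</p></ack><ref-list><ref id=\"r1\">Ref</ref></ref-list></back>"
    "</article>"
)
result = extract_text(SAMPLE)
print(result)
```

Because only the title, abstract, and body subtrees are visited, back matter such as references and acknowledgements is excluded by construction; `DROP_TAGS` handles the same elements when they appear inside the body.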
## Quality notes
This corpus was built with **high recall rather than high precision**. It is strong for:
- PK/PD language
- clinical pharmacology
- PBPK
- exposure-response
- dose optimization
- drug disposition and modeling methods
However, the query strategy is broad, and some retained articles are only **adjacent** to pharmacometrics rather than strictly within it. For example, some documents concern:
- broader translational pharmacology
- oncology therapeutics
- drug-protein binding
- formulation or delivery topics
This makes the dataset suitable for a **prototype DAPT corpus**, but not yet a perfectly clean pharmacometrics-only benchmark.
## Intended uses
Recommended uses:
- domain-adaptive pretraining for LLMs
- continued pretraining of Qwen/Llama/Mistral-style causal LMs
- corpus mining and keyword analysis
- retrieval experiments on PKPD literature
Not recommended as-is for:
- strict pharmacometrics benchmarking without extra curation
- legal redistribution assumptions without checking article-level terms
- clinical decision support
## Licensing and redistribution note
This dataset is derived from **PMC / Europe PMC open-access full-text XML** and related PubMed metadata, but the corpus should **not** be interpreted as having a single unified license automatically inherited across all articles.
Important note:
- PMC / Europe PMC accessibility does **not** guarantee identical downstream redistribution terms for every document.
- Before making the dataset public, article-level licensing and redistribution conditions should be reviewed carefully.
For conservative use, private hosting is recommended until licensing is fully audited.
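
An article-level audit could start from the license metadata embedded in each JATS file. The sketch below reads the `<license>` element under `<permissions>` and returns its `license-type` and `xlink:href` attributes. It assumes the license is encoded this way; some articles use other encodings (for example `ali:license_ref`) or carry only free-text license statements, which this sketch does not cover. The helper name and sample XML are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespaced attribute name for xlink:href as seen by ElementTree.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def license_info(xml_string):
    """Return the license type and URL declared in a JATS article, if any."""
    root = ET.fromstring(xml_string)
    lic = root.find(".//permissions/license")
    if lic is None:
        return None
    return {"type": lic.get("license-type"), "href": lic.get(XLINK_HREF)}

# Made-up miniature article for demonstration.
SAMPLE_XML = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink"><front><article-meta>'
    '<permissions><license license-type="open-access" '
    'xlink:href="https://creativecommons.org/licenses/by/4.0/">'
    "<license-p>Open access.</license-p></license></permissions>"
    "</article-meta></front></article>"
)
info = license_info(SAMPLE_XML)
print(info)
```

Grouping documents by the returned URL gives a first view of which licenses are present; records that return `None` would need manual review.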
## Reproducibility
The dataset was generated by the following local pipeline scripts:
- `scripts/01_search_pubmed.py`
- `scripts/02_map_pmids_to_pmcids.py`
- `scripts/03_fetch_fulltext_xml.py`
- `scripts/04_parse_jats_xml.py`
- `scripts/05_build_dapt_jsonl.py`
- `scripts/08_merge_and_report.py`
Summary reports used for this card:
- `data/reports/pubmed_search_summary.json`
- `data/reports/fulltext_retrieval_report.json`
- `data/reports/parsed_xml_report.json`
- `data/reports/core_pubmed_build_report.json`
- `data/reports/corpus_summary.json`
## Loading example
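```python
from datasets import load_dataset
# Download the dataset from the Hugging Face Hub (requires network access).
ds = load_dataset("Khalilbraham/PKPD-Dataset")
# Show the available splits and their sizes.
print(ds)
# Inspect the field names of the first training record.
print(ds["train"][0].keys())
```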
## Suggested citation
If you use this dataset, cite:
1. PubMed / NCBI E-utilities
2. PMC / Europe PMC
3. The dataset repository itself
You may also cite the associated local corpus-building pipeline if released separately.
Provider: Khalilbraham



