pdm95/open-cancer-kg
Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/pdm95/open-cancer-kg
Resource description:
---
license: cc-by-4.0
language:
- en
tags:
- cancer
- biomedical
- knowledge-graph
- embeddings
- pubmed
- clinical-trials
- drug-discovery
- literature-based-discovery
size_categories:
- 10K<n<100K
task_categories:
- other
- feature-extraction
pretty_name: Open Cancer Knowledge Graph (OCKG)
---
# Open Cancer Knowledge Graph (OCKG)
> *The first open, locally-runnable pipeline combining LLM-based structured extraction, vector embeddings, and cross-database linking of PubMed, ClinicalTrials.gov, and PubChem for cancer research gap detection - requiring no budget, no institutional access, and no proprietary tools.*
**Pipeline code on GitHub →** [github.com/DaniMihai95/open-cancer-kg](https://github.com/DaniMihai95/open-cancer-kg)
---
## Dataset name
**OCKG - Open Cancer Knowledge Graph v1.0**
---
## The problem
Cancer research is fragmented across three major public databases that have never been systematically cross-referenced at the document level:
- **PubMed** - 35M+ paper abstracts, unstructured text
- **ClinicalTrials.gov** - 500k+ registered trials, siloed
- **PubChem** - 100M+ chemical compounds, disconnected from literature
A compound tested in a 1994 breast cancer paper may share a biological pathway with a 2021 lung trial that failed for an unrelated reason. Because vocabulary differs, journals differ, and no system links them semantically, that connection is never made.
This is the *undiscovered public knowledge* problem (Swanson, 1986). This pipeline solves it automatically, at scale, across all cancer types simultaneously.
---
## Dataset statistics (v1.0)
| Source | Documents | Status |
|--------|-----------|--------|
| PubMed | 22,338 | ✅ complete |
| ClinicalTrials.gov | 19,979 | ✅ complete |
| PubChem | 92 | ✅ complete |
| **Total** | **42,409** | ✅ |
Additional outputs (not released publicly):
- 200,000+ Q&A pairs for LLM fine-tuning (5 per document)
- 10,346 research gap hypotheses flagged by the pipeline
- 14,163 cross-source connections found between documents sharing biology but never citing each other
- 104 high-confidence connections where documents share compound + cancer type + pathway
Known limitations:
- 2 corrupted records excluded (pipeline interruption during writing)
- ~15% of records may have incomplete entity extraction (vague abstracts)
- `followed_up` field is an LLM judgment from abstract text alone, not citation-verified
- First 2,090 PubMed records processed with qwen2.5:14b, remainder with qwen2.5:7b
---
## How it differs from existing systems
| System | LLM extraction | Embeddings | Cross-DB | Gap detection | Open/free | Cancer-focused |
|--------|:-:|:-:|:-:|:-:|:-:|:-:|
| Open Targets | ❌ | ❌ | Partial | ❌ | Partial | Partial |
| SemMedDB | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| SPOKE | ❌ | ❌ | ✅ | ❌ | Partial | ❌ |
| BioGPT | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| iKraph | ✅ | ❌ | Partial | ❌ | ❌ | ❌ |
| PKG2.0 | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
| **OCKG (this work)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
No existing public system combines all six properties.
---
## Top entities in the corpus
**Top compounds:**
| Compound | Documents |
|----------|-----------|
| doxorubicin | 1,212 |
| paclitaxel | 578 |
| cisplatin | 542 |
| curcumin | 428 |
| chitosan | 330 |
| melatonin | 327 |
| hyaluronic acid | 263 |
| docetaxel | 253 |
| gemcitabine | 246 |
| PARP inhibitors | 240 |
**Top cancer types:**
| Cancer Type | Documents |
|-------------|-----------|
| breast cancer | 2,007 |
| breast neoplasms | 1,413 |
| colorectal cancer | 1,377 |
| prostate cancer | 773 |
| lung cancer | 686 |
| ovarian cancer | 659 |
| melanoma | 633 |
| hepatocellular carcinoma | 624 |
| lung neoplasms | 467 |
| non-small cell lung cancer | 453 |
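Document counts like those in the tables above can be approximated from the released records with a simple tally. The snippet below is a minimal sketch, assuming each record exposes the list fields shown in the schema (`compounds`, `cancer_types`); the sample records and the `count_entities` helper are hypothetical and are not part of the pipeline. Note that lowercasing alone does not merge synonyms, which is why "breast cancer" and "breast neoplasms" appear as separate rows.

```python
from collections import Counter

# Hypothetical sample records; only the `compounds` field name comes from the schema.
records = [
    {"doc_id": "doc_1", "compounds": ["Doxorubicin", "cisplatin"]},
    {"doc_id": "doc_2", "compounds": ["doxorubicin"]},
    {"doc_id": "doc_3", "compounds": ["cisplatin", "cisplatin"]},
]

def count_entities(records, field):
    """Count how many documents mention each entity listed in `field`."""
    counts = Counter()
    for record in records:
        # A set counts each entity once per document, even if the
        # extractor listed the same name twice in one record.
        names = {name.strip().lower() for name in (record.get(field) or [])}
        counts.update(names)
    return counts

top_compounds = count_entities(records, "compounds").most_common(10)
print(top_compounds)
```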
---
## What each record contains
Every document - regardless of source - is structured into the same schema:
```json
{
  "doc_id": "pubmed_38291045",
  "source": "pubmed",
  "title": "...",
  "summary": "3-5 sentence plain-English summary",
  "document_type": "research_paper",
  "cancer_types": ["glioblastoma", "NSCLC"],
  "pathways_mentioned": ["PI3K/AKT/mTOR", "apoptosis"],
  "compounds": ["temozolomide", "bevacizumab"],
  "genes_proteins": ["EGFR", "p53", "KRAS"],
  "mechanism_of_action": "...",
  "experimental_result": {
    "effect": "inhibited tumor growth by 60%",
    "model": "xenograft mouse",
    "outcome": "positive",
    "followed_up": false
  },
  "potential_connections": [
    "Compound X blocks KRAS-G12C - never tested in pancreatic cancer"
  ],
  "similar_terms": ["kinase inhibitor", "targeted therapy"],
  "study_phase": "preclinical",
  "data_quality": "high",
  "embed_string": "...",
  "embedding": [0.021, -0.034, "..."]
}
```
The `followed_up: false` flag marks findings the LLM judged as never built upon - research gap candidates. The `embedding` field is a 768-dimensional semantic fingerprint (nomic-embed-text) enabling cosine similarity search across the entire corpus regardless of vocabulary, journal, or decade.
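To illustrate how these two fields might be used together, the sketch below ranks records by cosine similarity to a query vector and can optionally keep only gap candidates. It is a minimal example under stated assumptions: the three-dimensional vectors stand in for the real 768-dimensional embeddings, and the `doc_id` values and helper names (`cosine_similarity`, `is_gap_candidate`, `nearest`) are hypothetical rather than taken from the pipeline code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_a * norm_b)

def is_gap_candidate(record):
    """True when the LLM flagged the finding as not followed up."""
    result = record.get("experimental_result") or {}
    return result.get("followed_up") is False

def nearest(query_vec, records, top_k=3, only_gaps=False):
    """Return the top_k (score, doc_id) pairs, most similar first."""
    pool = [r for r in records if not only_gaps or is_gap_candidate(r)]
    scored = [(cosine_similarity(query_vec, r["embedding"]), r["doc_id"]) for r in pool]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy records with 3-dimensional vectors (assumption: real ones have 768 values).
records = [
    {"doc_id": "doc_a", "embedding": [1.0, 0.0, 0.0],
     "experimental_result": {"followed_up": True}},
    {"doc_id": "doc_b", "embedding": [0.9, 0.1, 0.0],
     "experimental_result": {"followed_up": False}},
    {"doc_id": "doc_c", "embedding": [0.0, 0.0, 1.0],
     "experimental_result": {"followed_up": False}},
]

query = [1.0, 0.0, 0.0]
all_hits = nearest(query, records, top_k=2)
gap_hits = nearest(query, records, top_k=2, only_gaps=True)
print(all_hits)
print(gap_hits)
```

In practice the query vector would need to come from the same embedding model as the stored vectors, otherwise the similarity scores are not meaningful.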
Q&A pairs are not included in this public release.
---
## Cross-source connections found
After processing all three sources, the pipeline identified **14,163 cross-source connections** - documents from different databases that share a compound, cancer type, or biological pathway without citing each other. Of these, **104 are high-confidence** connections that share compound, cancer type, and pathway simultaneously.
Example finding:
```
Confidence: 0.75
pubmed → pubmed_37326467
trials → trial_NCT05372640
Shared compound: abemaciclib
Shared cancer: breast cancer
Shared pathway: CDK4/6 pathway
```
Another finding:
```
Confidence: 0.55
trial → NCT06328387
pubmed → 9 separate papers
Shared compound: chloroquine
Shared pathway: autophagy
```
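The matching idea behind connections like the two above can be sketched as a set overlap on the schema's entity fields. This is an illustrative approximation only: the card does not describe how the pipeline computes its confidence score, so the sketch reports which fields overlap and does not attempt to reproduce it. The sample records, the `min_shared_fields` threshold, and the helper names are assumptions.

```python
from itertools import combinations

# Entity fields from the record schema that a connection can be based on.
SHARED_FIELDS = ("compounds", "cancer_types", "pathways_mentioned")

def normalized(record, field):
    """Lowercased, whitespace-trimmed set of the values in `field`."""
    return {value.strip().lower() for value in (record.get(field) or [])}

def find_connections(records, min_shared_fields=2):
    """Pair documents from different sources that overlap on several fields."""
    connections = []
    for a, b in combinations(records, 2):
        if a["source"] == b["source"]:
            continue  # only cross-source pairs are of interest here
        shared = {}
        for field in SHARED_FIELDS:
            overlap = normalized(a, field) & normalized(b, field)
            if overlap:
                shared[field] = sorted(overlap)
        if len(shared) >= min_shared_fields:
            connections.append({"docs": (a["doc_id"], b["doc_id"]), "shared": shared})
    return connections

# Hypothetical records; the IDs are placeholders, not real documents.
records = [
    {"doc_id": "pubmed_x", "source": "pubmed", "compounds": ["Abemaciclib"],
     "cancer_types": ["breast cancer"], "pathways_mentioned": ["CDK4/6 pathway"]},
    {"doc_id": "trial_y", "source": "trials", "compounds": ["abemaciclib"],
     "cancer_types": ["breast cancer"], "pathways_mentioned": ["CDK4/6 pathway"]},
    {"doc_id": "pubchem_z", "source": "pubchem", "compounds": ["chloroquine"],
     "cancer_types": [], "pathways_mentioned": ["autophagy"]},
]

connections = find_connections(records)
print(connections)
```

Comparing every pair is quadratic: for 42,409 documents that is roughly 900 million pairs, so a corpus-scale version would more likely index documents by entity first and compare only documents that share at least one key.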
Real-world example discovered by the pipeline:
> A completed clinical trial at MD Anderson (NCT00501410) tested cetuximab + dasatinib to overcome EGFR resistance in metastatic colorectal cancer. A separate PubMed paper (PMID 27636997) discovered that combining cetuximab with MEK1/2 inhibition creates a synthetic lethal effect in NRAS-mutant colorectal cancer - up to 1,300x more effective against resistant cells. Same cancer. Same drug. Same clinical problem. Different resistance mechanism. Neither cited the other.
---
## Setup
```bash
pip install requests tqdm
ollama pull qwen2.5:7b
ollama pull nomic-embed-text
```
Full pipeline code at: [github.com/DaniMihai95/open-cancer-kg](https://github.com/DaniMihai95/open-cancer-kg)
Optional - free NCBI API key for higher rate limits (10 req/sec vs 3):
1. Register at https://www.ncbi.nlm.nih.gov/account/
2. Account Settings → API Key Management → Generate
3. Use: `NCBI_API_KEY=your_key python pipeline.py ...`
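For context on what the key changes, the sketch below builds an NCBI E-utilities `esearch` request with and without an `api_key` parameter and derives the matching delay between requests from the limits quoted above. It is a hedged illustration, not the pipeline's code: how `pipeline.py` actually reads `NCBI_API_KEY` and throttles requests is an assumption here, and the helper names are hypothetical.

```python
import os
from urllib.parse import urlencode

# Public E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=20, api_key=None):
    """Build a PubMed search URL; the api_key parameter is added only if set."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

def min_request_interval(api_key=None):
    """Seconds to wait between requests: 10 req/sec with a key, 3 without."""
    return 1 / 10 if api_key else 1 / 3

# Assumption: the key is supplied through the NCBI_API_KEY environment variable.
api_key = os.environ.get("NCBI_API_KEY")
url = build_esearch_url("KRAS pancreatic cancer", api_key=api_key)
delay = min_request_interval(api_key)
print(url)
print(f"sleep {delay:.2f}s between requests")
```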
---
## Run order
```bash
# Test first
python pipeline.py --source pubmed --limit 100 --workers 2
# Full runs - fully resumable if interrupted
python pipeline.py --source pubmed --limit 50000 --workers 3
python pipeline.py --source trials --limit 20000 --workers 3
python pipeline.py --source pubchem --limit 10000 --workers 3
# Find cross-source connections
python pipeline.py --crossref
# Statistics
python pipeline.py --stats
```
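One common way to make a long extraction run resumable is to append one JSON record per line and, on restart, skip every `doc_id` that was already written. The sketch below shows that pattern under stated assumptions: the JSONL layout, file name, and helper names are hypothetical, and the real pipeline may checkpoint differently. It also tolerates a line truncated by an interruption, the kind of damage described under known limitations.

```python
import json
import os
import tempfile

def load_done_ids(path):
    """Return doc_ids already written, skipping lines that fail to parse."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["doc_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # truncated or malformed record
    return done

def ends_mid_line(path):
    """True if the file's last byte is not a newline (interrupted write)."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as fh:
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) != b"\n"

def process_all(doc_ids, path, extract):
    """Process only documents not yet in `path`; return how many were written."""
    done = load_done_ids(path)
    repair = ends_mid_line(path)
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        if repair:
            fh.write("\n")  # keep new records off the truncated line
        for doc_id in doc_ids:
            if doc_id in done:
                continue
            fh.write(json.dumps(extract(doc_id)) + "\n")
            fh.flush()
            written += 1
    return written

def fake_extract(doc_id):
    """Stand-in for the LLM extraction step."""
    return {"doc_id": doc_id}

with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "records.jsonl")
    ids = ["doc_1", "doc_2", "doc_3"]
    first = process_all(ids[:2], out, fake_extract)
    with open(out, "a", encoding="utf-8") as fh:
        fh.write('{"doc_id": "doc_3", "ti')  # simulate a write cut off mid-record
    second = process_all(ids, out, fake_extract)  # resumes: only doc_3 is redone
    final_ids = load_done_ids(out)

print(first, second, sorted(final_ids))
```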
---
## Actual performance measured
| Source | Docs | Time (qwen2.5:7b, RTX 4060 Ti 16GB) |
|--------|------|--------------------------------------|
| PubMed | 22,338 | ~55 hours |
| ClinicalTrials.gov | 19,979 | ~68 hours |
| PubChem | 92 | ~2 hours |
Workers=3, power-limited to 125W for sustained operation. Total GPU runtime: 125+ hours.
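For planning a run on similar hardware, the table can be turned into rough per-document figures. The calculation below uses only the numbers reported above; because the times are approximate, the results should be read as estimates (about 9 s per PubMed record, 12 s per trial, and 78 s per PubChem record).

```python
# (documents, hours) per source, copied from the table above.
runs = {
    "PubMed": (22_338, 55),
    "ClinicalTrials.gov": (19_979, 68),
    "PubChem": (92, 2),
}

throughput = {}
for source, (docs, hours) in runs.items():
    docs_per_hour = docs / hours
    seconds_per_doc = hours * 3600 / docs
    throughput[source] = (docs_per_hour, seconds_per_doc)
    print(f"{source}: {docs_per_hour:.0f} docs/hour, {seconds_per_doc:.1f} s/doc")

total_hours = sum(hours for _, hours in runs.values())
print(f"Total: {total_hours} hours")
```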
---
## Query your graph
```bash
# Find all documents mentioning a compound
python query.py --entity compound "sotorasib"
# Find cross-source connections
python query.py --connections "sotorasib"
# Semantic search
python query.py --search "KRAS mutation untested compound pancreatic"
# Export connections to CSV
python query.py --export-connections connections.csv
```
---
## Data sources and licensing
All source data is public domain:
| Source | Owner | License |
|--------|-------|---------|
| PubMed abstracts | US National Library of Medicine | Public domain |
| ClinicalTrials.gov | US federal government | Public domain |
| PubChem | NIH | Public domain |
This dataset (extracted JSON records + embeddings) is released under **CC BY 4.0** - free to use with attribution.
Pipeline code is released under **MIT License**.
Q&A pairs are not released publicly.
---
## Academic context
This work is a contribution to **literature-based discovery (LBD)** and the *undiscovered public knowledge* problem (Swanson, 1986).
**Related work:** Arsenyan et al. 2024 (BioNLP), iKraph 2023, PubMed KG 2.0 (Xu et al. 2024), Borchert et al. 2024, Sarol et al. 2024, BioStrataKG 2024.
**Affiliation:** Independent student research, Tilburg University, The Netherlands.
---
## Citation
```bibtex
@dataset{ockg2026,
  title     = {Open Cancer Knowledge Graph (OCKG) v1.0},
  author    = {Pocatilu Daniel Mihai},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/pdm95/open-cancer-kg}
}
```
---
*Built on a student's GPU. Costs nothing to run. Free for any researcher anywhere.*
Provider:
pdm95



