pdm95/open-cancer-kg

Name: pdm95/open-cancer-kg
Creator: pdm95
Published: 2026-04-09 09:28:26
License: no description provided

Hugging Face | Updated 2026-04-09 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/pdm95/open-cancer-kg


Resource description:

---
license: cc-by-4.0
language:
- en
tags:
- cancer
- biomedical
- knowledge-graph
- embeddings
- pubmed
- clinical-trials
- drug-discovery
- literature-based-discovery
size_categories:
- 10K<n<100K
task_categories:
- other
- feature-extraction
pretty_name: Open Cancer Knowledge Graph (OCKG)
---

# Open Cancer Knowledge Graph (OCKG)

> *The first open, locally-runnable pipeline combining LLM-based structured extraction, vector embeddings, and cross-database linking of PubMed, ClinicalTrials.gov, and PubChem for cancer research gap detection - requiring no budget, no institutional access, and no proprietary tools.*

**Pipeline code on GitHub →** [github.com/DaniMihai95/open-cancer-kg](https://github.com/DaniMihai95/open-cancer-kg)

---

## Dataset name

**OCKG - Open Cancer Knowledge Graph v1.0**

---

## The problem

Cancer research is fragmented across three major public databases that have never been systematically cross-referenced at the document level:

- **PubMed** - 35M+ paper abstracts, unstructured text
- **ClinicalTrials.gov** - 500k+ registered trials, siloed
- **PubChem** - 100M+ chemical compounds, disconnected from literature

A compound tested in a 1994 breast cancer paper may share a biological pathway with a 2021 lung trial that failed for an unrelated reason. Because vocabulary differs, journals differ, and no system links them semantically, that connection is never made.

This is the *undiscovered public knowledge* problem (Swanson, 1986). This pipeline solves it automatically, at scale, across all cancer types simultaneously.

---

## Dataset statistics (v1.0)

| Source | Documents | Status |
|--------|-----------|--------|
| PubMed | 22,338 | ✅ complete |
| ClinicalTrials.gov | 19,979 | ✅ complete |
| PubChem | 92 | ✅ complete |
| **Total** | **42,409** | ✅ |

Additional outputs (not released publicly):

- 200,000+ Q&A pairs for LLM fine-tuning (5 per document)
- 10,346 research gap hypotheses flagged by the pipeline
- 14,163 cross-source connections found between documents sharing biology but never citing each other
- 104 high-confidence connections where documents share compound + cancer type + pathway

Known limitations:

- 2 corrupted records excluded (pipeline interruption during writing)
- ~15% of records may have incomplete entity extraction (vague abstracts)
- `followed_up` field is an LLM judgment from abstract text alone, not citation-verified
- First 2,090 PubMed records processed with qwen2.5:14b, remainder with qwen2.5:7b

---

## How it differs from existing systems

| System | LLM extraction | Embeddings | Cross-DB | Gap detection | Open/free | Cancer-focused |
|--------|:-:|:-:|:-:|:-:|:-:|:-:|
| Open Targets | ❌ | ❌ | Partial | ❌ | Partial | Partial |
| SemMedDB | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| SPOKE | ❌ | ❌ | ✅ | ❌ | Partial | ❌ |
| BioGPT | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| iKraph | ✅ | ❌ | Partial | ❌ | ❌ | ❌ |
| PKG2.0 | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
| **OCKG (this work)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

No existing public system combines all six properties.

---

## Top entities in the corpus

**Top compounds:**

| Compound | Documents |
|----------|-----------|
| doxorubicin | 1,212 |
| paclitaxel | 578 |
| cisplatin | 542 |
| curcumin | 428 |
| chitosan | 330 |
| melatonin | 327 |
| hyaluronic acid | 263 |
| docetaxel | 253 |
| gemcitabine | 246 |
| PARP inhibitors | 240 |

**Top cancer types:**

| Cancer Type | Documents |
|-------------|-----------|
| breast cancer | 2,007 |
| breast neoplasms | 1,413 |
| colorectal cancer | 1,377 |
| prostate cancer | 773 |
| lung cancer | 686 |
| ovarian cancer | 659 |
| melanoma | 633 |
| hepatocellular carcinoma | 624 |
| lung neoplasms | 467 |
| non-small cell lung cancer | 453 |

---

## What each record contains

Every document - regardless of source - is structured into the same schema:

```json
{
  "doc_id": "pubmed_38291045",
  "source": "pubmed",
  "title": "...",
  "summary": "3-5 sentence plain-English summary",
  "document_type": "research_paper",
  "cancer_types": ["glioblastoma", "NSCLC"],
  "pathways_mentioned": ["PI3K/AKT/mTOR", "apoptosis"],
  "compounds": ["temozolomide", "bevacizumab"],
  "genes_proteins": ["EGFR", "p53", "KRAS"],
  "mechanism_of_action": "...",
  "experimental_result": {
    "effect": "inhibited tumor growth by 60%",
    "model": "xenograft mouse",
    "outcome": "positive",
    "followed_up": false
  },
  "potential_connections": [
    "Compound X blocks KRAS-G12C - never tested in pancreatic cancer"
  ],
  "similar_terms": ["kinase inhibitor", "targeted therapy"],
  "study_phase": "preclinical",
  "data_quality": "high",
  "embed_string": "...",
  "embedding": [0.021, -0.034, "..."]
}
```

The `followed_up: false` flag marks findings the LLM judged as never built upon - research gap candidates.

The `embedding` field is a 768-dimensional semantic fingerprint (nomic-embed-text) enabling cosine similarity search across the entire corpus regardless of vocabulary, journal, or decade.

Q&A pairs are not included in this public release.
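To make the two fields above concrete, here is a minimal sketch of how a reader might filter gap candidates and rank records by cosine similarity. It is not the pipeline's own code: the function names (`cosine_similarity`, `gap_candidates`, `nearest`) and the toy records are hypothetical, and 3-dimensional vectors stand in for the real 768-dimensional ones. It only assumes that records follow the schema shown above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def gap_candidates(records):
    """Keep records whose experimental_result.followed_up is explicitly False."""
    return [
        r for r in records
        if (r.get("experimental_result") or {}).get("followed_up") is False
    ]


def nearest(query_vec, records, top_k=5):
    """Return (score, doc_id) pairs for the top_k most similar records."""
    scored = [
        (cosine_similarity(query_vec, r["embedding"]), r["doc_id"])
        for r in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy records; real embeddings have 768 dimensions.
records = [
    {"doc_id": "pubmed_1", "embedding": [1.0, 0.0, 0.0],
     "experimental_result": {"followed_up": False}},
    {"doc_id": "trial_2", "embedding": [0.9, 0.1, 0.0],
     "experimental_result": {"followed_up": True}},
    {"doc_id": "pubchem_3", "embedding": [0.0, 1.0, 0.0],
     "experimental_result": None},
]

top = nearest([1.0, 0.0, 0.0], records, top_k=2)
gaps = [r["doc_id"] for r in gap_candidates(records)]
```

In practice the query vector would come from embedding a search string with the same model (nomic-embed-text), so that query and documents share one vector space.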

---

## Cross-source connections found

After processing all three sources, the pipeline identified **14,163 cross-source connections** - documents from different databases that share biology (some combination of compound, cancer type, and biological pathway) without citing each other.

Of these, **104 are high-confidence** connections that share all three: compound + cancer type + pathway.

Example finding:

```
Confidence: 0.75
pubmed → pubmed_37326467
trials → trial_NCT05372640
Shared compound: abemaciclib
Shared cancer: breast cancer
Shared pathway: CDK4/6 pathway
```

Another finding:

```
Confidence: 0.55
trial → NCT06328387
pubmed → 9 separate papers
Shared compound: chloroquine
Shared pathway: autophagy
```

Real-world example discovered by the pipeline:

> A completed clinical trial at MD Anderson (NCT00501410) tested cetuximab + dasatinib to overcome EGFR resistance in metastatic colorectal cancer. A separate PubMed paper (PMID 27636997) discovered that combining cetuximab with MEK1/2 inhibition creates a synthetic lethal effect in NRAS-mutant colorectal cancer - up to 1,300x more effective against resistant cells. Same cancer. Same drug. Same clinical problem. Different resistance mechanism. Neither cited the other.

---

## Setup

```bash
pip install requests tqdm
ollama pull qwen2.5:7b
ollama pull nomic-embed-text
```

Full pipeline code at: [github.com/DaniMihai95/open-cancer-kg](https://github.com/DaniMihai95/open-cancer-kg)

Optional - free NCBI API key for higher rate limits (10 req/sec vs 3):

1. Register at https://www.ncbi.nlm.nih.gov/account/
2. Account Settings → API Key Management → Generate
3. Use: `NCBI_API_KEY=your_key python pipeline.py ...`

---

## Run order

```bash
# Test first
python pipeline.py --source pubmed --limit 100 --workers 2

# Full runs - fully resumable if interrupted
python pipeline.py --source pubmed --limit 50000 --workers 3
python pipeline.py --source trials --limit 20000 --workers 3
python pipeline.py --source pubchem --limit 10000 --workers 3

# Find cross-source connections
python pipeline.py --crossref

# Statistics
python pipeline.py --stats
```

---

## Actual performance measured

| Source | Docs | Time (qwen2.5:7b, RTX 4060 Ti 16GB) |
|--------|------|--------------------------------------|
| PubMed | 22,338 | ~55 hours |
| ClinicalTrials | 19,979 | ~68 hours |
| PubChem | 92 | ~2 hours |

Workers=3, power-limited to 125W for sustained operation. Total GPU runtime: 125+ hours.

---

## Query your graph

```bash
# Find all documents mentioning a compound
python query.py --entity compound "sotorasib"

# Find cross-source connections
python query.py --connections "sotorasib"

# Semantic search
python query.py --search "KRAS mutation untested compound pancreatic"

# Export connections to CSV
python query.py --export-connections connections.csv
```

---

## Data sources and licensing

All source data is public domain:

| Source | Owner | License |
|--------|-------|---------|
| PubMed abstracts | US National Library of Medicine | Public domain |
| ClinicalTrials.gov | US federal government | Public domain |
| PubChem | NIH | Public domain |

This dataset (extracted JSON records + embeddings) is released under **CC BY 4.0** - free to use with attribution.

Pipeline code is released under **MIT License**.

Q&A pairs are not released publicly.

---

## Academic context

This work is a contribution to **literature-based discovery (LBD)** and the *undiscovered public knowledge* problem (Swanson, 1986).

**Related work:** Arsenyan et al. 2024 (BioNLP), iKraph 2023, PubMed KG 2.0 (Xu et al. 2024), Borchert et al. 2024, Sarol et al. 2024, BioStrataKG 2024.
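As a rough illustration of the literature-based discovery idea, the sketch below pairs records from different sources that share entities. It is an assumption-laden toy, not the pipeline's `--crossref` logic: the matching rule (a shared compound plus at least one other shared field), the `"trials"` source label, the helper names, and the toy records are all hypothetical. It does not reproduce the pipeline's confidence scores, and it cannot check whether documents cite each other, since the record schema carries no citation data.

```python
from itertools import combinations

# Fields compared between records; names follow the record schema.
FIELDS = ("compounds", "cancer_types", "pathways_mentioned")


def shared(a, b, field):
    """Case-insensitive overlap of a list-valued field between two records."""
    left = {v.lower() for v in a.get(field) or []}
    right = {v.lower() for v in b.get(field) or []}
    return left & right


def cross_source_links(records):
    """Pair records from different sources that share a compound and more.

    Assumed rule (not the pipeline's): a link needs a shared compound plus
    at least one other shared field; all three fields means high confidence.
    """
    links = []
    for a, b in combinations(records, 2):
        if a["source"] == b["source"]:
            continue  # only cross-source pairs are of interest
        matched = [f for f in FIELDS if shared(a, b, f)]
        if "compounds" in matched and len(matched) >= 2:
            links.append({
                "docs": (a["doc_id"], b["doc_id"]),
                "matched": matched,
                "high_confidence": len(matched) == len(FIELDS),
            })
    return links


# Hypothetical toy records shaped like the schema.
toy_records = [
    {"doc_id": "pubmed_A", "source": "pubmed",
     "compounds": ["Abemaciclib"], "cancer_types": ["breast cancer"],
     "pathways_mentioned": ["CDK4/6 pathway"]},
    {"doc_id": "trial_B", "source": "trials",
     "compounds": ["abemaciclib"], "cancer_types": ["breast cancer"],
     "pathways_mentioned": ["CDK4/6 pathway"]},
    {"doc_id": "trial_C", "source": "trials",
     "compounds": ["chloroquine"], "cancer_types": ["glioblastoma"],
     "pathways_mentioned": ["autophagy"]},
    {"doc_id": "pubmed_D", "source": "pubmed",
     "compounds": ["chloroquine"], "cancer_types": ["melanoma"],
     "pathways_mentioned": ["autophagy"]},
]

links = cross_source_links(toy_records)
```

Pairwise comparison is quadratic in the number of records, so at the scale of 42,409 documents an inverted index keyed by compound would be the more practical design.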

**Affiliation:** Independent student research, Tilburg University, The Netherlands.

---

## Citation

```bibtex
@dataset{ockg2026,
  title     = {Open Cancer Knowledge Graph (OCKG) v1.0},
  author    = {Pocatilu Daniel Mihai},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/pdm95/open-cancer-kg}
}
```

---

*Built on a student's GPU. Costs nothing to run. Free for any researcher anywhere.*

