
pdm95/open-cancer-kg

Hugging Face | Updated 2026-04-09 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/pdm95/open-cancer-kg
Resource description:

---
license: cc-by-4.0
language:
- en
tags:
- cancer
- biomedical
- knowledge-graph
- embeddings
- pubmed
- clinical-trials
- drug-discovery
- literature-based-discovery
size_categories:
- 10K<n<100K
task_categories:
- other
- feature-extraction
pretty_name: Open Cancer Knowledge Graph (OCKG)
---

# Open Cancer Knowledge Graph (OCKG)

> *The first open, locally-runnable pipeline combining LLM-based structured extraction, vector embeddings, and cross-database linking of PubMed, ClinicalTrials.gov, and PubChem for cancer research gap detection - requiring no budget, no institutional access, and no proprietary tools.*

**Pipeline code on GitHub →** [github.com/DaniMihai95/open-cancer-kg](https://github.com/DaniMihai95/open-cancer-kg)

---

## Dataset name

**OCKG - Open Cancer Knowledge Graph v1.0**

---

## The problem

Cancer research is fragmented across three major public databases that have never been systematically cross-referenced at the document level:

- **PubMed** - 35M+ paper abstracts, unstructured text
- **ClinicalTrials.gov** - 500k+ registered trials, siloed
- **PubChem** - 100M+ chemical compounds, disconnected from literature

A compound tested in a 1994 breast cancer paper may share a biological pathway with a 2021 lung trial that failed for an unrelated reason. Because vocabulary differs, journals differ, and no system links them semantically, that connection is never made. This is the *undiscovered public knowledge* problem (Swanson, 1986). This pipeline solves it automatically, at scale, across all cancer types simultaneously.


---

## Dataset statistics (v1.0)

| Source | Documents | Status |
|--------|-----------|--------|
| PubMed | 22,338 | ✅ complete |
| ClinicalTrials.gov | 19,979 | ✅ complete |
| PubChem | 92 | ✅ complete |
| **Total** | **42,409** | ✅ |

Additional outputs (not released publicly):

- 200,000+ Q&A pairs for LLM fine-tuning (5 per document)
- 10,346 research gap hypotheses flagged by the pipeline
- 14,163 cross-source connections found between documents sharing biology but never citing each other
- 104 high-confidence connections where documents share compound + cancer type + pathway

Known limitations:

- 2 corrupted records excluded (pipeline interruption during writing)
- ~15% of records may have incomplete entity extraction (vague abstracts)
- `followed_up` field is an LLM judgment from abstract text alone, not citation-verified
- First 2,090 PubMed records processed with qwen2.5:14b, remainder with qwen2.5:7b

---

## How it differs from existing systems

| System | LLM extraction | Embeddings | Cross-DB | Gap detection | Open/free | Cancer-focused |
|--------|:-:|:-:|:-:|:-:|:-:|:-:|
| Open Targets | ❌ | ❌ | Partial | ❌ | Partial | Partial |
| SemMedDB | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| SPOKE | ❌ | ❌ | ✅ | ❌ | Partial | ❌ |
| BioGPT | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| iKraph | ✅ | ❌ | Partial | ❌ | ❌ | ❌ |
| PKG2.0 | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
| **OCKG (this work)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

No existing public system combines all six properties.


---

## Top entities in the corpus

**Top compounds:**

| Compound | Documents |
|----------|-----------|
| doxorubicin | 1,212 |
| paclitaxel | 578 |
| cisplatin | 542 |
| curcumin | 428 |
| chitosan | 330 |
| melatonin | 327 |
| hyaluronic acid | 263 |
| docetaxel | 253 |
| gemcitabine | 246 |
| PARP inhibitors | 240 |

**Top cancer types:**

| Cancer Type | Documents |
|-------------|-----------|
| breast cancer | 2,007 |
| breast neoplasms | 1,413 |
| colorectal cancer | 1,377 |
| prostate cancer | 773 |
| lung cancer | 686 |
| ovarian cancer | 659 |
| melanoma | 633 |
| hepatocellular carcinoma | 624 |
| lung neoplasms | 467 |
| non-small cell lung cancer | 453 |

---

## What each record contains

Every document - regardless of source - is structured into the same schema:

```json
{
  "doc_id": "pubmed_38291045",
  "source": "pubmed",
  "title": "...",
  "summary": "3-5 sentence plain-English summary",
  "document_type": "research_paper",
  "cancer_types": ["glioblastoma", "NSCLC"],
  "pathways_mentioned": ["PI3K/AKT/mTOR", "apoptosis"],
  "compounds": ["temozolomide", "bevacizumab"],
  "genes_proteins": ["EGFR", "p53", "KRAS"],
  "mechanism_of_action": "...",
  "experimental_result": {
    "effect": "inhibited tumor growth by 60%",
    "model": "xenograft mouse",
    "outcome": "positive",
    "followed_up": false
  },
  "potential_connections": [
    "Compound X blocks KRAS-G12C - never tested in pancreatic cancer"
  ],
  "similar_terms": ["kinase inhibitor", "targeted therapy"],
  "study_phase": "preclinical",
  "data_quality": "high",
  "embed_string": "...",
  "embedding": [0.021, -0.034, "..."]
}
```

The `followed_up: false` flag marks findings the LLM judged as never built upon - research gap candidates. The `embedding` field is a 768-dimensional semantic fingerprint (nomic-embed-text) enabling cosine similarity search across the entire corpus regardless of vocabulary, journal, or decade. Q&A pairs are not included in this public release.

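
To make the record schema concrete, here is a minimal Python sketch of how the `embedding` and `followed_up` fields could be used once records are loaded into memory. It is not the pipeline's own code: the helper names (`cosine_similarity`, `gap_candidates`, `most_similar`), the toy `doc_id` values, and the 3-dimensional vectors are all assumptions made for illustration; real records carry 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def gap_candidates(records):
    """Keep records whose experimental_result.followed_up is exactly False."""
    return [
        r for r in records
        if (r.get("experimental_result") or {}).get("followed_up") is False
    ]


def most_similar(query_vec, records, top_k=3):
    """Rank records by cosine similarity to query_vec, highest first."""
    scored = [
        (cosine_similarity(query_vec, r["embedding"]), r["doc_id"])
        for r in records
        if r.get("embedding")
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy records (3-dim vectors instead of 768-dim).
records = [
    {"doc_id": "pubmed_A", "embedding": [1.0, 0.0, 0.0],
     "experimental_result": {"followed_up": False}},
    {"doc_id": "trial_B", "embedding": [0.9, 0.1, 0.0],
     "experimental_result": {"followed_up": True}},
    {"doc_id": "pubchem_C", "embedding": [0.0, 1.0, 0.0],
     "experimental_result": None},
]

top = most_similar([1.0, 0.0, 0.0], records, top_k=2)
gaps = [r["doc_id"] for r in gap_candidates(records)]
print(top)
print(gaps)
```

Plain Python lists keep the sketch dependency-free; for the full corpus, a vectorised library or a vector index would likely be a better fit than this linear scan.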

---

## Cross-source connections found

After processing all three sources, the pipeline identified **14,163 cross-source connections** - documents from different databases sharing a compound, cancer type, and/or biological pathway without citing each other. Of these, **104 are high-confidence** connections sharing compound + cancer type + pathway simultaneously.

Example finding:

```
Confidence: 0.75
pubmed → pubmed_37326467
trials → trial_NCT05372640
Shared compound: abemaciclib
Shared cancer: breast cancer
Shared pathway: CDK4/6 pathway
```

Another finding:

```
Confidence: 0.55
trial → NCT06328387
pubmed → 9 separate papers
Shared compound: chloroquine
Shared pathway: autophagy
```

Real-world example discovered by the pipeline:

> A completed clinical trial at MD Anderson (NCT00501410) tested cetuximab + dasatinib to overcome EGFR resistance in metastatic colorectal cancer. A separate PubMed paper (PMID 27636997) discovered that combining cetuximab with MEK1/2 inhibition creates a synthetic lethal effect in NRAS-mutant colorectal cancer - up to 1,300x more effective against resistant cells. Same cancer. Same drug. Same clinical problem. Different resistance mechanism. Neither cited the other.

---

## Setup

```bash
pip install requests tqdm
ollama pull qwen2.5:7b
ollama pull nomic-embed-text
```

Full pipeline code at: [github.com/DaniMihai95/open-cancer-kg](https://github.com/DaniMihai95/open-cancer-kg)

Optional - free NCBI API key for higher rate limits (10 req/sec vs 3):

1. Register at https://www.ncbi.nlm.nih.gov/account/
2. Account Settings → API Key Management → Generate
3. Use: `NCBI_API_KEY=your_key python pipeline.py ...`

---

## Run order

```bash
# Test first
python pipeline.py --source pubmed --limit 100 --workers 2

# Full runs - fully resumable if interrupted
python pipeline.py --source pubmed --limit 50000 --workers 3
python pipeline.py --source trials --limit 20000 --workers 3
python pipeline.py --source pubchem --limit 10000 --workers 3

# Find cross-source connections
python pipeline.py --crossref

# Statistics
python pipeline.py --stats
```

---

## Actual performance measured

| Source | Docs | Time (qwen2.5:7b, RTX 4060 Ti 16GB) |
|--------|------|--------------------------------------|
| PubMed | 22,338 | ~55 hours |
| ClinicalTrials | 19,979 | ~68 hours |
| PubChem | 92 | ~2 hours |

Workers=3, power-limited to 125W for sustained operation. Total GPU runtime: 125+ hours.

---

## Query your graph

```bash
# Find all documents mentioning a compound
python query.py --entity compound "sotorasib"

# Find cross-source connections
python query.py --connections "sotorasib"

# Semantic search
python query.py --search "KRAS mutation untested compound pancreatic"

# Export connections to CSV
python query.py --export-connections connections.csv
```

---

## Data sources and licensing

All source data is public domain:

| Source | Owner | License |
|--------|-------|---------|
| PubMed abstracts | US National Library of Medicine | Public domain |
| ClinicalTrials.gov | US federal government | Public domain |
| PubChem | NIH | Public domain |

This dataset (extracted JSON records + embeddings) is released under **CC BY 4.0** - free to use with attribution. Pipeline code is released under **MIT License**. Q&A pairs are not released publicly.

---

## Academic context

This work is a contribution to **literature-based discovery (LBD)** and the *undiscovered public knowledge* problem (Swanson, 1986).

**Related work:** Arsenyan et al. 2024 (BioNLP), iKraph 2023, PubMed KG 2.0 (Xu et al. 2024), Borchert et al. 2024, Sarol et al. 2024, BioStrataKG 2024.


**Affiliation:** Independent student research, Tilburg University, The Netherlands.

---

## Citation

```bibtex
@dataset{ockg2026,
  title     = {Open Cancer Knowledge Graph (OCKG) v1.0},
  author    = {Pocatilu Daniel Mihai},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/pdm95/open-cancer-kg}
}
```

---

*Built on a student's GPU. Costs nothing to run. Free for any researcher anywhere.*


Provider:
pdm95