jang1563/NegBioDB

Name: jang1563/NegBioDB
Creator: jang1563
Published: 2026-04-19 22:16:50
License: no description provided

Hugging Face · updated 2026-04-19 · indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/jang1563/NegBioDB

Description:

---
license: cc-by-sa-4.0
language:
- en
tags:
- biomedical
- negative-results
- benchmark
- drug-target-interaction
- clinical-trials
- protein-protein-interaction
- gene-essentiality
- variant-pathogenicity
- llm-evaluation
task_categories:
- text-classification
- question-answering
- text-generation
- tabular-classification
size_categories:
- 10M<n<100M
pretty_name: "NegBioDB: Negative Results Database & Dual ML/LLM Benchmark"
---

# NegBioDB

**A Negative-Results Database and Dual ML/LLM Benchmark for Biomedical Sciences**

[![GitHub](https://img.shields.io/badge/GitHub-jang1563%2FNegBioDB-181717?logo=github)](https://github.com/jang1563/NegBioDB) [![License: CC BY-SA 4.0](https://img.shields.io/badge/License-CC_BY--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/4.0/)

An estimated 90% of biomedical experiments produce null or inconclusive findings, yet the overwhelming majority remain unpublished. **NegBioDB** systematically aggregates experimentally confirmed *negative* results across five biomedical domains and pairs them with a dual-track benchmark (traditional ML prediction and modern LLM reasoning) that quantifies how publication bias propagates into AI systems.

This Hugging Face dataset mirrors the pre-built ML and LLM splits from the [main repository](https://github.com/jang1563/NegBioDB).

---

## Overview

| Domain | Negative results | Key entities | Sources | ML runs | LLM runs |
|--------|-----------------:|--------------|---------|--------:|---------:|
| **DTI**: Drug–Target Interaction | 30,459,583 | 919K compounds, 3.7K targets | ChEMBL, PubChem, BindingDB, DAVIS | 24 / 24 | 81 / 81 |
| **CT**: Clinical Trial Failure | 132,925 | 177K interventions, 56K conditions | AACT, CTO, Open Targets, Shi & Du | 108 / 108 | 80 / 80 |
| **PPI**: Protein–Protein Interaction | 2,229,670 | 18.4K proteins | IntAct, HuRI, hu.MAP 3.0, STRING | 54 / 54 | 80 / 80 |
| **GE**: Gene Essentiality (DepMap) | 28,759,256 | 19,554 genes, 2,132 cell lines | DepMap CRISPR + RNAi | 42 / 42 | 80 / 80 |
| **VP**: Variant Pathogenicity | 2,442,718 | 2.43M variants, 18.4K genes, 10K diseases | ClinVar, gnomAD, ClinGen, CADD/REVEL/AlphaMissense | 72 / 72 | 20 / 20 |
| **Total** | **~64.0M** | - | **17 sources** | **300** | **341** |

PPI export rows after split filtering: 2,220,786. VP M1 balanced export: 1,255,150 rows.

---

## Why NegBioDB?

Most biomedical ML benchmarks rely on **synthetic negatives** (random non-edges in a graph, decoy compounds, or unobserved pairs), which trivially leak through degree statistics or prior frequency. NegBioDB instead provides **experimentally confirmed negatives**: failed assays, failed clinical trials, validated non-interactions, non-essential genes in specific contexts, and benign variants.

This lets you:

- Quantify the gap between random-split AUROC and *real* generalization (cold-entity / temporal / scaffold splits).
- Stress-test LLMs on the L4 task (*tested vs. untested*), a discriminator that separates memorization from reasoning.
- Compare your method on a publication-bias-corrected baseline, not a degree-matched proxy.
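
The gap between random and cold-entity evaluation mentioned above comes from how rows are assigned to train and test. The following is a minimal sketch of a cold-entity split on toy data. The function `cold_entity_split` and the column names `compound_id`, `target_id`, and `label` are illustrative assumptions, not the schema of the released parquet files, which already ship with precomputed split columns.

```python
import numpy as np
import pandas as pd


def cold_entity_split(df, entity_col, test_frac=0.2, seed=0):
    """Hold out whole entities so that no test entity is seen in training.

    Hypothetical helper for illustration; not part of NegBioDB.
    """
    rng = np.random.default_rng(seed)
    entities = df[entity_col].drop_duplicates().tolist()
    n_test = max(1, int(round(len(entities) * test_frac)))
    # Pick entities (not rows) at random and send all of their rows to test.
    picked = rng.permutation(len(entities))[:n_test]
    test_entities = {entities[i] for i in picked}
    is_test = df[entity_col].isin(test_entities)
    return df[~is_test], df[is_test]


# Toy pairs table; column names are assumptions made for this example.
toy = pd.DataFrame({
    "compound_id": ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4", "c5", "c5"],
    "target_id":   ["t1", "t2", "t1", "t3", "t2", "t3", "t1", "t4", "t4", "t5"],
    "label":       [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

train, test = cold_entity_split(toy, "target_id", test_frac=0.4)
overlap = set(train["target_id"]) & set(test["target_id"])
print(len(train), len(test), overlap)
```

A random row-level split would let the same target appear on both sides, which is the leakage path that cold splits are meant to close.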

---

## File structure

### DTI: root level

| File | Size | Rows | Description |
|------|-----:|-----:|-------------|
| `negbiodb_dti_pairs.parquet` | 139 MB | ~25M | All negative DTI pairs with 5 split columns + provenance |
| `negbiodb_m1_balanced.parquet` | 270 MB | 1,725,446 | M1 balanced (1:1 active:inactive) |
| `negbiodb_m1_realistic.parquet` | 753 MB | 9,489,953 | M1 realistic (1:10) |
| `negbiodb_m1_balanced_ddb.parquet` | 1.0 GB | 1,725,446 | Degree-balanced split |
| `negbiodb_m1_uniform_random.parquet` | 467 MB | 1,767,380 | Control: uniform random |
| `negbiodb_m1_degree_matched.parquet` | 275 MB | 1,767,380 | Control: degree-matched |
| `chembl_positives_pchembl6.parquet` | - | 863K | ChEMBL actives (pChEMBL ≥ 6) |
| `compound_names.parquet` | - | 144K | Compound names for LLM tasks |

### CT: `ct/`

| File | Rows | Description |
|------|-----:|-------------|
| `ct/negbiodb_ct_pairs.parquet` | 102,850 | All failure pairs, 6 splits |
| `ct/negbiodb_ct_m1_balanced.parquet` | 11,222 | Binary (success / failure), 1:1 |
| `ct/negbiodb_ct_m1_realistic.parquet` | 36,957 | Binary, ~1:6 |
| `ct/negbiodb_ct_m1_smiles_only.parquet` | 3,878 | SMILES-resolved subset |
| `ct/negbiodb_ct_m2.parquet` | 112,298 | 7-way failure-mode classification |

### PPI: `ppi/`

| File | Rows | Description |
|------|-----:|-------------|
| `ppi/negbiodb_ppi_pairs.parquet` | 2,220,786 | All negative pairs, 4 splits |
| `ppi/ppi_m1_balanced.parquet` | 123,456 | M1 (1:1) |
| `ppi/ppi_m1_realistic.parquet` | 679,008 | M1 (1:10) |
| `ppi/ppi_m1_balanced_ddb.parquet` | - | Degree-balanced split |
| `ppi/ppi_m1_uniform_random.parquet` | - | Control |
| `ppi/ppi_m1_degree_matched.parquet` | - | Control |

### GE: `ge/`

| File | Description |
|------|-------------|
| `ge/negbiodb_ge_pairs.parquet` | 22.5M gene–cell-line pairs, 5 split columns (~770 MB) |
| `ge_gene_aggregates.parquet` | Per-gene aggregated essentiality features |

### VP: `vp_ml/`

| File | Rows | Description |
|------|-----:|-------------|
| `vp_ml/vp_m1_balanced.parquet` | 1,255,150 | M1 balanced (gold/silver positives, 1:1) |
| `vp_ml/vp_m1_realistic.parquet` | 2,442,718 | M1 realistic (full negative set) |

### LLM benchmarks: `llm_benchmarks/` and per-domain `*_llm/`

LLM datasets cover four reasoning levels (L1–L4) per domain:

| Level | Question | Evaluation |
|-------|----------|------------|
| **L1** | Multiple-choice: which is *not* a known interaction / failure / ... | accuracy, MCC |
| **L2** | Structured extraction into a typed schema | field-level F1, schema compliance |
| **L3** | Open-ended scientific reasoning on a negative finding | LLM-judge rubric (1–5 across 4–6 axes) |
| **L4** | Discrimination: *tested-as-negative* vs. *untested* | MCC; contamination-flag analysis |

Models evaluated under both zero-shot and 3-shot configurations: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash, Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct.

---

## Benchmark tasks

### ML

| Task | Domain | Type | Splits |
|------|--------|------|--------|
| **M1** | DTI | Binary (active / inactive) | random, cold_compound, cold_target, degree_balanced |
| **CT-M1** | CT | Binary (success / failure) | random, cold_drug, cold_condition, temporal, scaffold, cold_both |
| **CT-M2** | CT | 7-way failure category | same as CT-M1 |
| **PPI-M1** | PPI | Binary (interact / non-interact) | random, cold_protein, cold_both, degree_balanced |
| **GE-M1** | GE | Binary (essential / non-essential) | random, cold_gene, cold_cell_line, cold_both, degree_balanced |
| **VP-M1** | VP | Binary (pathogenic / benign) | random, cold_gene, cold_disease, temporal |

Metrics: **LogAUC[0.001,0.1]** (primary, early-enrichment), **BEDROC (α=20)**, **EF@1% / EF@5%**, **AUPRC**, **MCC**, **AUROC**.
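
Two of the metrics listed above have short closed forms, so a small worked example may help when reading the result tables. The sketch below uses the standard definitions of the enrichment factor (hit rate in the top-scored fraction divided by the overall hit rate) and the Matthews correlation coefficient. The function names are illustrative and this is not the benchmark's own evaluation code; LogAUC and BEDROC are not covered here.

```python
import math

import numpy as np


def enrichment_factor(y_true, scores, top_frac=0.01):
    """EF@top_frac: hit rate among the top-scored items / overall hit rate."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    base_rate = y_true.mean()
    if base_rate == 0:
        raise ValueError("enrichment factor is undefined without positives")
    n_top = max(1, int(round(len(y_true) * top_frac)))
    top_idx = np.argsort(-scores, kind="stable")[:n_top]
    return float(y_true[top_idx].mean() / base_rate)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(t & p))
    tn = int(np.sum(~t & ~p))
    fp = int(np.sum(~t & p))
    fn = int(np.sum(t & ~p))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventional fallback when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy ranking: 100 items, 10 actives, all ranked first by the scores.
y = np.zeros(100)
y[:10] = 1
s = np.linspace(1.0, 0.0, 100)

ef_5 = enrichment_factor(y, s, top_frac=0.05)
mcc_val = mcc(y, s >= 0.9)
print(ef_5, mcc_val)
```

With a 10% base rate the enrichment factor is capped at 10, which is why EF values are only comparable between datasets with the same class ratio (for example, the balanced versus realistic exports).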

---

## Headline findings

### ML: the choice of negatives shapes every metric

| DTI model | Random (NegBioDB) | Random (degree-matched) | Cold-target |
|-----------|------------------:|------------------------:|------------:|
| DeepDTA | 0.833 | **0.919** | 0.325 |
| GraphDTA | 0.843 | **0.967** | 0.241 |
| DrugBAN | 0.830 | **0.955** | 0.151 |

- **CT.** Confirmed-failure negatives are trivially separable for binary tasks (AUROC ≈ 1.0). The 7-way failure-mode classification (CT-M2) remains hard (best macro-F1 = 0.51).
- **PPI.** PIPR cold-both AUROC drops below random (0.409); MLP-on-features stays robust (0.950).
- **GE.** Cold-gene splits expose generalization gaps invisible under random splits.
- **VP.** Random splits saturate (AUROC 0.995 / MCC 0.932); cold-disease splits expose AUROC-vs-MCC calibration failures.

### LLM: L4 (tested vs. untested) is where models actually differ

| Domain | L4 MCC range | Memorization signal |
|--------|--------------|---------------------|
| DTI | ≤ 0.18 | Not detected |
| GE | ≤ 0.22 | Not detected |
| PPI | 0.33–0.44 | **Yes**: pre-2015 pairs identified at 59–79%; post-2020 at 7–25% |
| CT | 0.48–0.56 | Not detected |
| VP | n/a (single-class test) | n/a |

Across PPI / GE / DC / CP / VP, **L3** (open-ended reasoning, judge-graded) shows zero-shot ≫ few-shot for most models: providing exemplars *degrades* reasoning quality, a robust cross-domain pattern.
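
The PPI memorization signal in the L4 table is a comparison of how often truly tested pairs are recognized as tested, grouped by when the pair was reported. A minimal sketch of that kind of breakdown follows. The column names `first_reported_year`, `is_tested`, and `pred_tested`, the helper `recall_by_period`, and the middle 2015–2020 bucket are all assumptions made for illustration; the actual contamination-flag analysis lives in the main repository and may differ.

```python
import pandas as pd


def recall_by_period(df, year_col="first_reported_year",
                     label_col="is_tested", pred_col="pred_tested"):
    """Fraction of truly tested pairs the model flags as tested, per period.

    Hypothetical helper; column names are assumptions for this example.
    """
    tested = df[df[label_col] == 1].copy()
    # Bins are right-inclusive: <=2014, 2015-2020, >=2021.
    tested["period"] = pd.cut(
        tested[year_col],
        bins=[-float("inf"), 2014, 2020, float("inf")],
        labels=["pre-2015", "2015-2020", "post-2020"],
    )
    tested["correct"] = tested[pred_col] == 1
    return tested.groupby("period", observed=True)["correct"].mean()


# Toy model outputs: eight tested pairs plus two untested ones.
toy_l4 = pd.DataFrame({
    "first_reported_year": [2010, 2012, 2013, 2014, 2021, 2022, 2023, 2024, 2011, 2022],
    "is_tested":           [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "pred_tested":         [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
})

rates = recall_by_period(toy_l4)
print(rates)
```

A large drop in this rate from older to newer pairs, with task difficulty otherwise unchanged, is the pattern that suggests recall of training data more than reasoning.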

---

## Quickstart

```python
from huggingface_hub import hf_hub_download
import pandas as pd

# DTI: pull the M1 balanced split
path = hf_hub_download(
    repo_id="jang1563/NegBioDB",
    filename="negbiodb_m1_balanced.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
print(df.head())

# CT: subdirectory addressing works the same way
ct_path = hf_hub_download(
    repo_id="jang1563/NegBioDB",
    filename="ct/negbiodb_ct_m1_balanced.parquet",
    repo_type="dataset",
)
ct_df = pd.read_parquet(ct_path)
```

For end-to-end ETL (raw download → SQLite → split export → ML / LLM evaluation), use the [main repository](https://github.com/jang1563/NegBioDB), which provides per-domain CLI entry points and SLURM scripts.

---

## Data sources & licenses

| Domain | Source | License | Contribution |
|--------|--------|---------|--------------|
| **DTI** | [ChEMBL v36](https://www.ebi.ac.uk/chembl/) | CC BY-SA 3.0 | Curated bioactivity |
| | [PubChem BioAssay](https://pubchem.ncbi.nlm.nih.gov/) | Public Domain | HTS screening |
| | [BindingDB](https://www.bindingdb.org/) | CC BY 3.0 | Binding measurements |
| | [DAVIS](https://github.com/dingyan20/Davis-Dataset-for-DTA-Prediction) | Public | Kinase selectivity |
| **CT** | [AACT / ClinicalTrials.gov](https://aact.ctti-clinicaltrials.org/) | Public Domain | Trial metadata |
| | [CTO](https://github.com/fairnessforensics/CTO) | MIT | Trial outcomes |
| | [Open Targets](https://www.opentargets.org/) | Apache 2.0 | Drug–target mappings |
| | [Shi & Du 2024](https://doi.org/10.1038/s41597-024-03399-2) | CC BY 4.0 | Safety / efficacy |
| **PPI** | [IntAct](https://www.ebi.ac.uk/intact/) | CC BY 4.0 | Curated non-interactions |
| | [HuRI](http://www.interactome-atlas.org/) | CC BY 4.0 | Y2H systematic negatives |
| | [hu.MAP 3.0](https://humap3.proteincomplexes.org/) | MIT | Complex-derived |
| | [STRING v12.0](https://string-db.org/) | CC BY 4.0 | Zero-evidence pairs |
| **GE** | [DepMap CRISPR (Chronos)](https://depmap.org/) | CC BY 4.0 | Gene essentiality |
| | [DepMap RNAi (DEMETER2)](https://depmap.org/) | CC BY 4.0 | RNAi screens |
| **VP** | [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) | Public Domain | Clinical variants |
| | [gnomAD v4.1](https://gnomad.broadinstitute.org/) | CC0 | Population variants |
| | [ClinGen](https://clinicalgenome.org/) | CC0 | Gene–disease validity |
| | [CADD](https://cadd.gs.washington.edu/) | Free non-commercial | Functional scores |
| | [REVEL](https://sites.google.com/site/revelgenomics/) | Free | Missense pathogenicity |
| | [AlphaMissense](https://alphamissense.broadinstitute.org/) | CC BY-NC-SA 4.0 | Missense pathogenicity |

Per-source attribution (versions, download dates, normalization steps) is in [`docs/methodology_notes.md`](https://github.com/jang1563/NegBioDB/blob/main/docs/methodology_notes.md).

---

## Citation

```bibtex
@misc{negbiodb2026,
  title  = {NegBioDB: A Negative-Results Database and Dual ML/LLM Benchmark for Biomedical Sciences},
  author = {Kim, JangKeun},
  year   = {2026},
  url    = {https://github.com/jang1563/NegBioDB}
}
```

---

## License

This dataset is released under **CC BY-SA 4.0**, as required by the share-alike clause of ChEMBL's CC BY-SA 3.0. All redistributed source data retain their original licenses (table above). AlphaMissense scores are **non-commercial only**; commercial users should remove the AlphaMissense columns before downstream use.
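
For commercial users, removing the AlphaMissense columns can be done right after loading a VP file. The sketch below matches columns by a case-insensitive substring because the exact column names are not documented in this card. The helper `drop_noncommercial_columns`, the toy column names, and the `"alphamissense"` pattern are assumptions; check `df.columns` on the real file before relying on it.

```python
import pandas as pd


def drop_noncommercial_columns(df, patterns=("alphamissense",)):
    """Drop columns whose name contains any of the given substrings.

    Hypothetical helper; the substring match is an assumption about naming.
    """
    to_drop = [
        col for col in df.columns
        if any(pattern in str(col).lower() for pattern in patterns)
    ]
    return df.drop(columns=to_drop), to_drop


# Toy frame standing in for a VP export; real column names may differ.
toy_vp = pd.DataFrame({
    "variant_id": ["v1", "v2"],
    "cadd_phred": [12.3, 25.1],
    "alphamissense_score": [0.1, 0.9],
    "label": [0, 1],
})

clean, removed = drop_noncommercial_columns(toy_vp)
print(removed)
print(list(clean.columns))
```

`patterns` is a tuple so that other restricted sources from the license table can be added if your use case requires it.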

Provider:

jang1563
