jang1563/NegBioDB
Hugging Face · Updated 2026-04-19 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/jang1563/NegBioDB
Resource description:
---
license: cc-by-sa-4.0
language:
- en
tags:
- biomedical
- negative-results
- benchmark
- drug-target-interaction
- clinical-trials
- protein-protein-interaction
- gene-essentiality
- variant-pathogenicity
- llm-evaluation
task_categories:
- text-classification
- question-answering
- text-generation
- tabular-classification
size_categories:
- 10M<n<100M
pretty_name: "NegBioDB: Negative Results Database & Dual ML/LLM Benchmark"
---
# NegBioDB
**A Negative-Results Database and Dual ML/LLM Benchmark for Biomedical Sciences**
[GitHub repository](https://github.com/jang1563/NegBioDB)
[License: CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
An estimated 90% of biomedical experiments produce null or inconclusive findings, yet the overwhelming majority remain unpublished. **NegBioDB** systematically aggregates experimentally confirmed *negative* results across five biomedical domains and pairs them with a dual-track benchmark — traditional ML prediction and modern LLM reasoning — that quantifies how publication bias propagates into AI systems.
This Hugging Face dataset mirrors the pre-built ML and LLM splits from the [main repository](https://github.com/jang1563/NegBioDB).
---
## Overview
| Domain | Negative results | Key entities | Sources | ML runs | LLM runs |
|--------|-----------------:|--------------|---------|--------:|---------:|
| **DTI** — Drug–Target Interaction | 30,459,583 | 919K compounds, 3.7K targets | ChEMBL, PubChem, BindingDB, DAVIS | 24 / 24 | 81 / 81 |
| **CT** — Clinical Trial Failure | 132,925 | 177K interventions, 56K conditions | AACT, CTO, Open Targets, Shi & Du | 108 / 108 | 80 / 80 |
| **PPI** — Protein–Protein Interaction | 2,229,670 | 18.4K proteins | IntAct, HuRI, hu.MAP 3.0, STRING | 54 / 54 | 80 / 80 |
| **GE** — Gene Essentiality (DepMap) | 28,759,256 | 19,554 genes, 2,132 cell lines | DepMap CRISPR + RNAi | 42 / 42 | 80 / 80 |
| **VP** — Variant Pathogenicity | 2,442,718 | 2.43M variants, 18.4K genes, 10K diseases | ClinVar, gnomAD, ClinGen, CADD/REVEL/AlphaMissense | 72 / 72 | 20 / 20 |
| **Total** | **~64.0M** | — | **17 sources** | **300** | **341** |
PPI export rows after split filtering: 2,220,786. VP M1 balanced export: 1,255,150 rows.
---
## Why NegBioDB?
Most biomedical ML benchmarks rely on **synthetic negatives** — random non-edges in a graph, decoy compounds, or unobserved pairs — which trivially leak through degree statistics or prior frequency. NegBioDB instead provides **experimentally confirmed negatives**: failed assays, failed clinical trials, validated non-interactions, non-essential genes in specific contexts, and benign variants. This lets you:
- Quantify the gap between random-split AUROC and *real* generalization (cold-entity / temporal / scaffold splits).
- Stress-test LLMs on the L4 task (*tested vs. untested*) — a discriminator that exposes memorization vs. reasoning.
- Compare your method on a publication-bias-corrected baseline, not a degree-matched proxy.
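To make the first point concrete, the sketch below builds a cold-entity split on toy data: whole targets are held out, so no test target is ever seen during training. This is a minimal illustration, not the repository's split procedure; the `cold_split` helper, the tuple layout, and the 0.2 test fraction are all assumptions.

```python
import random


def cold_split(pairs, entity_index, test_frac=0.2, seed=0):
    """Hold out whole entities (hypothetical helper, not the NegBioDB code).

    `pairs` is a list of tuples; `entity_index` selects the field that
    identifies the entity to keep cold, e.g. 1 for the target in
    (compound, target, label).
    """
    entities = sorted({p[entity_index] for p in pairs})
    rng = random.Random(seed)
    rng.shuffle(entities)
    n_test = max(1, int(len(entities) * test_frac))
    held_out = set(entities[:n_test])
    train = [p for p in pairs if p[entity_index] not in held_out]
    test = [p for p in pairs if p[entity_index] in held_out]
    return train, test


# Toy (compound, target, label) records; label 0 = confirmed inactive.
pairs = [(f"C{i % 7}", f"T{i % 5}", i % 2) for i in range(35)]
train, test = cold_split(pairs, entity_index=1)

train_targets = {t for _, t, _ in train}
test_targets = {t for _, t, _ in test}
print("shared targets:", train_targets & test_targets)  # empty by construction
```

Scoring the same model on a random split and on a split like this one gives the generalization gap described above. The pre-built files already ship split columns, so a helper like this is only needed for custom splits.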
---
## File structure
### DTI — root level
| File | Size | Rows | Description |
|------|-----:|-----:|-------------|
| `negbiodb_dti_pairs.parquet` | 139 MB | ~25M | All negative DTI pairs with 5 split columns + provenance |
| `negbiodb_m1_balanced.parquet` | 270 MB | 1,725,446 | M1 balanced (1:1 active:inactive) |
| `negbiodb_m1_realistic.parquet` | 753 MB | 9,489,953 | M1 realistic (1:10) |
| `negbiodb_m1_balanced_ddb.parquet` | 1.0 GB | 1,725,446 | Degree-balanced split |
| `negbiodb_m1_uniform_random.parquet` | 467 MB | 1,767,380 | Control: uniform random |
| `negbiodb_m1_degree_matched.parquet` | 275 MB | 1,767,380 | Control: degree-matched |
| `chembl_positives_pchembl6.parquet` | — | 863K | ChEMBL actives (pChEMBL ≥ 6) |
| `compound_names.parquet` | — | 144K | Compound names for LLM tasks |
### CT — `ct/`
| File | Rows | Description |
|------|-----:|-------------|
| `ct/negbiodb_ct_pairs.parquet` | 102,850 | All failure pairs, 6 splits |
| `ct/negbiodb_ct_m1_balanced.parquet` | 11,222 | Binary (success / failure), 1:1 |
| `ct/negbiodb_ct_m1_realistic.parquet` | 36,957 | Binary, ~1:6 |
| `ct/negbiodb_ct_m1_smiles_only.parquet` | 3,878 | SMILES-resolved subset |
| `ct/negbiodb_ct_m2.parquet` | 112,298 | 7-way failure-mode classification |
### PPI — `ppi/`
| File | Rows | Description |
|------|-----:|-------------|
| `ppi/negbiodb_ppi_pairs.parquet` | 2,220,786 | All negative pairs, 4 splits |
| `ppi/ppi_m1_balanced.parquet` | 123,456 | M1 (1:1) |
| `ppi/ppi_m1_realistic.parquet` | 679,008 | M1 (1:10) |
| `ppi/ppi_m1_balanced_ddb.parquet` | — | Degree-balanced split |
| `ppi/ppi_m1_uniform_random.parquet` | — | Control |
| `ppi/ppi_m1_degree_matched.parquet` | — | Control |
### GE — `ge/`
| File | Description |
|------|-------------|
| `ge/negbiodb_ge_pairs.parquet` | 22.5M gene–cell-line pairs, 5 split columns (~770 MB) |
| `ge_gene_aggregates.parquet` | Per-gene aggregated essentiality features |
### VP — `vp_ml/`
| File | Rows | Description |
|------|-----:|-------------|
| `vp_ml/vp_m1_balanced.parquet` | 1,255,150 | M1 balanced (gold/silver positives, 1:1) |
| `vp_ml/vp_m1_realistic.parquet` | 2,442,718 | M1 realistic (full negative set) |
### LLM benchmarks — `llm_benchmarks/` and per-domain `*_llm/`
LLM datasets cover four reasoning levels (L1–L4) per domain:
| Level | Question | Evaluation |
|-------|----------|------------|
| **L1** | Multiple-choice: which is *not* a known interaction / failure / ... | accuracy, MCC |
| **L2** | Structured extraction into a typed schema | field-level F1, schema compliance |
| **L3** | Open-ended scientific reasoning on a negative finding | LLM-judge rubric (1–5 across 4–6 axes) |
| **L4** | Discrimination: *tested-as-negative* vs. *untested* | MCC; contamination-flag analysis |
Five models are evaluated under both zero-shot and 3-shot configurations: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash, Qwen2.5-7B-Instruct, and Llama-3.1-8B-Instruct.
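L4 is scored with MCC rather than accuracy because a model that answers "tested" for every pair can still reach 50% accuracy on a balanced set while carrying no signal. The sketch below computes MCC from binary predictions. The label encoding (1 = tested-as-negative, 0 = untested) and the function name are assumptions for illustration, not the benchmark's evaluation code.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels.

    Returns 0.0 when a row or column of the confusion matrix is empty,
    which is a common convention for the otherwise undefined case.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
always_tested = [1] * 8                # 50% accuracy, no discrimination
mixed = [1, 1, 1, 0, 0, 0, 0, 1]       # one error in each class

print(mcc(y_true, always_tested))  # 0.0
print(mcc(y_true, mixed))          # 0.5
```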
---
## Benchmark tasks
### ML
| Task | Domain | Type | Splits |
|------|--------|------|--------|
| **M1** | DTI | Binary (active / inactive) | random, cold_compound, cold_target, degree_balanced |
| **CT-M1** | CT | Binary (success / failure) | random, cold_drug, cold_condition, temporal, scaffold, cold_both |
| **CT-M2** | CT | 7-way failure category | same as CT-M1 |
| **PPI-M1** | PPI | Binary (interact / non-interact) | random, cold_protein, cold_both, degree_balanced |
| **GE-M1** | GE | Binary (essential / non-essential) | random, cold_gene, cold_cell_line, cold_both, degree_balanced |
| **VP-M1** | VP | Binary (pathogenic / benign) | random, cold_gene, cold_disease, temporal |
Metrics: **LogAUC[0.001,0.1]** (primary, early-enrichment), **BEDROC (α=20)**, **EF@1% / EF@5%**, **AUPRC**, **MCC**, **AUROC**.
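Of the metrics listed, the enrichment factor is the simplest to state: the active rate among the top-ranked fraction divided by the active rate overall. The sketch below follows that textbook definition. The function name, the rounding of the cutoff, and the tie handling (stable sort on score) are assumptions and may differ from the repository's implementation.

```python
def enrichment_factor(y_true, scores, top_frac=0.01):
    """EF@top_frac = (active rate in top-ranked fraction) / (overall active rate)."""
    n = len(y_true)
    total_actives = sum(y_true)
    if total_actives == 0:
        raise ValueError("EF is undefined when there are no actives")
    n_top = max(1, int(round(n * top_frac)))
    # Rank by descending score; ties keep their input order.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# Toy ranking: 1,000 pairs, 50 actives (5%), scores strictly decreasing by index.
n = 1000
y_true = [1] * 8 + [0] * 2 + [1] * 42 + [0] * 948
scores = [n - i for i in range(n)]

ef1 = enrichment_factor(y_true, scores, top_frac=0.01)  # 8 of top 10 are active
ef5 = enrichment_factor(y_true, scores, top_frac=0.05)  # 48 of top 50 are active
print(ef1, ef5)
```

With a 5% base rate the maximum attainable EF is 20, so the toy ranking is close to ideal at the 5% cutoff and less so at 1%.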
---
## Headline findings
### ML — the choice of negatives shapes every metric
| DTI model | Random (NegBioDB) | Random (degree-matched) | Cold-target |
|-----------|------------------:|------------------------:|------------:|
| DeepDTA | 0.833 | **0.919** | 0.325 |
| GraphDTA | 0.843 | **0.967** | 0.241 |
| DrugBAN | 0.830 | **0.955** | 0.151 |
- **CT.** Confirmed-failure negatives are trivially separable for binary tasks (AUROC ≈ 1.0). The 7-way failure-mode classification (CT-M2) remains hard (best macro-F1 = 0.51).
- **PPI.** PIPR cold-both AUROC drops below random (0.409); MLP-on-features stays robust (0.950).
- **GE.** Cold-gene splits expose generalization gaps invisible under random splits.
- **VP.** Random splits saturate (AUROC 0.995 / MCC 0.932); cold-disease splits expose AUROC-vs-MCC calibration failures.
### LLM — L4 (tested vs. untested) is where models actually differ
| Domain | L4 MCC range | Memorization signal |
|--------|--------------|---------------------|
| DTI | ≤ 0.18 | Not detected |
| GE | ≤ 0.22 | Not detected |
| PPI | 0.33–0.44 | **Yes** — pre-2015 pairs identified at 59–79%; post-2020 at 7–25% |
| CT | 0.48–0.56 | Not detected |
| VP | n/a (single-class test) | n/a |
Across PPI / GE / DC / CP / VP, **L3** (open-ended reasoning, judge-graded) shows zero-shot ≫ few-shot for most models — providing exemplars *degrades* reasoning quality, a robust cross-domain pattern.
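The PPI memorization signal in the table is a stratified comparison: among pairs that really were tested, the share the model labels "tested", grouped by when the pair entered the literature. A minimal sketch of that calculation follows. The field names (`year`, `pred_tested`), the exact era boundaries, and the records themselves are hypothetical, not benchmark data.

```python
from collections import defaultdict


def era(year):
    """Bucket a publication year; boundaries are an assumption."""
    if year < 2015:
        return "pre-2015"
    if year > 2020:
        return "post-2020"
    return "2015-2020"


def identification_rate_by_era(records):
    """records: dicts for truly tested pairs with 'year' and 'pred_tested'.

    Returns {era: fraction of pairs the model labelled as tested}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = era(r["year"])
        totals[key] += 1
        hits[key] += bool(r["pred_tested"])
    return {k: hits[k] / totals[k] for k in totals}


toy = [
    {"year": 2009, "pred_tested": True},
    {"year": 2012, "pred_tested": True},
    {"year": 2013, "pred_tested": False},
    {"year": 2014, "pred_tested": True},
    {"year": 2021, "pred_tested": False},
    {"year": 2022, "pred_tested": False},
    {"year": 2023, "pred_tested": True},
    {"year": 2024, "pred_tested": False},
]
rates = identification_rate_by_era(toy)
print(rates)  # {'pre-2015': 0.75, 'post-2020': 0.25}
```

A large gap between the early and late buckets, as in the PPI row, suggests the model is recalling older literature rather than reasoning about the pair.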
---
## Quickstart
```python
from huggingface_hub import hf_hub_download
import pandas as pd

# DTI: pull the M1 balanced split
path = hf_hub_download(
    repo_id="jang1563/NegBioDB",
    filename="negbiodb_m1_balanced.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
print(df.head())

# CT: subdirectory addressing works the same way
ct_path = hf_hub_download(
    repo_id="jang1563/NegBioDB",
    filename="ct/negbiodb_ct_m1_balanced.parquet",
    repo_type="dataset",
)
ct_df = pd.read_parquet(ct_path)
```
For end-to-end ETL (raw download → SQLite → split export → ML / LLM evaluation), use the [main repository](https://github.com/jang1563/NegBioDB), which provides per-domain CLI entry points and SLURM scripts.
---
## Data sources & licenses
| Domain | Source | License | Contribution |
|--------|--------|---------|--------------|
| **DTI** | [ChEMBL v36](https://www.ebi.ac.uk/chembl/) | CC BY-SA 3.0 | Curated bioactivity |
| | [PubChem BioAssay](https://pubchem.ncbi.nlm.nih.gov/) | Public Domain | HTS screening |
| | [BindingDB](https://www.bindingdb.org/) | CC BY 3.0 | Binding measurements |
| | [DAVIS](https://github.com/dingyan20/Davis-Dataset-for-DTA-Prediction) | Public | Kinase selectivity |
| **CT** | [AACT / ClinicalTrials.gov](https://aact.ctti-clinicaltrials.org/) | Public Domain | Trial metadata |
| | [CTO](https://github.com/fairnessforensics/CTO) | MIT | Trial outcomes |
| | [Open Targets](https://www.opentargets.org/) | Apache 2.0 | Drug–target mappings |
| | [Shi & Du 2024](https://doi.org/10.1038/s41597-024-03399-2) | CC BY 4.0 | Safety / efficacy |
| **PPI** | [IntAct](https://www.ebi.ac.uk/intact/) | CC BY 4.0 | Curated non-interactions |
| | [HuRI](http://www.interactome-atlas.org/) | CC BY 4.0 | Y2H systematic negatives |
| | [hu.MAP 3.0](https://humap3.proteincomplexes.org/) | MIT | Complex-derived |
| | [STRING v12.0](https://string-db.org/) | CC BY 4.0 | Zero-evidence pairs |
| **GE** | [DepMap CRISPR (Chronos)](https://depmap.org/) | CC BY 4.0 | Gene essentiality |
| | [DepMap RNAi (DEMETER2)](https://depmap.org/) | CC BY 4.0 | RNAi screens |
| **VP** | [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) | Public Domain | Clinical variants |
| | [gnomAD v4.1](https://gnomad.broadinstitute.org/) | CC0 | Population variants |
| | [ClinGen](https://clinicalgenome.org/) | CC0 | Gene–disease validity |
| | [CADD](https://cadd.gs.washington.edu/) | Free non-commercial | Functional scores |
| | [REVEL](https://sites.google.com/site/revelgenomics/) | Free | Missense pathogenicity |
| | [AlphaMissense](https://alphamissense.broadinstitute.org/) | CC BY-NC-SA 4.0 | Missense pathogenicity |
Per-source attribution (versions, download dates, normalization steps) is in [`docs/methodology_notes.md`](https://github.com/jang1563/NegBioDB/blob/main/docs/methodology_notes.md).
---
## Citation
```bibtex
@misc{negbiodb2026,
  title  = {NegBioDB: A Negative-Results Database and Dual ML/LLM Benchmark for Biomedical Sciences},
  author = {Kim, JangKeun},
  year   = {2026},
  url    = {https://github.com/jang1563/NegBioDB}
}
```
---
## License
This dataset is released under **CC BY-SA 4.0**, as required by the share-alike clause of ChEMBL's CC BY-SA 3.0 license. All redistributed source data retain their original licenses (table above). AlphaMissense scores are **non-commercial only**; commercial users should remove the AlphaMissense columns before downstream use.
Provider:
jang1563



