everycure/matrix-scores
Hugging Face · Updated 2026-04-09 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/everycure/matrix-scores
Resource description:
---
dataset_info:
  features:
  - name: source
    dtype: string
  - name: target
    dtype: string
  - name: source_ec_id
    dtype: string
  - name: is_known_positive
    dtype: bool
  - name: is_known_negative
    dtype: bool
  - name: trial_sig_better
    dtype: bool
  - name: trial_non_sig_better
    dtype: bool
  - name: trial_sig_worse
    dtype: bool
  - name: trial_non_sig_worse
    dtype: bool
  - name: off_label
    dtype: bool
  - name: ec_indications_list_off_label
    dtype: bool
  - name: ec_indications_list_on_label
    dtype: bool
  - name: ec_indications_list
    dtype: bool
  - name: not treat score
    dtype: float64
  - name: untransformed_treat_score
    dtype: float64
  - name: unknown score
    dtype: float64
  - name: untransformed_rank
    dtype: int64
  - name: quantile_rank
    dtype: float64
  - name: rank_drug
    dtype: int32
  - name: quantile_drug
    dtype: float64
  - name: rank_disease
    dtype: int32
  - name: quantile_disease
    dtype: float64
  - name: transformed_treat_score
    dtype: float64
  - name: rank
    dtype: int32
  - name: y
    dtype: int32
  splits:
  - name: train
    num_bytes: 5000341679
    num_examples: 39508102
  download_size: 3812410227
  dataset_size: 5000341679
configs:
- config_name: default
  data_files:
  - split: train
    path: data/scores/train-*
---
## Dataset Description
This dataset contains the output of the MATRIX pipeline — Every Cure's computational drug repurposing scoring system. It provides ML-generated treatment probability scores for ~39.5 million drug-disease pairs, covering ~1,800 drugs × ~22,000 diseases.
> ⚠️ **Research use only.** These scores are the output of a computational research pipeline and do not constitute medical advice, clinical recommendations, or endorsement of any drug for any use. All findings require independent scientific and clinical validation before any clinical application.
## Column Definitions
### Identifiers
| Column | Type | Description |
|--------|------|-------------|
| `source` | string | Drug identifier in a standard ontology format (e.g. `CHEBI:28304`, `PUBCHEM:5284616`, `DRUGBANK:DB00945`, `UNII:...`) |
| `target` | string | Disease identifier in MONDO format (e.g. `MONDO:0003307`) |
| `source_ec_id` | string | Every Cure internal drug identifier (e.g. `EC:00780`) |
### Ground Truth Labels
These flags indicate whether a drug-disease pair appears in curated reference datasets used during model training and evaluation.
| Column | Type | Description |
|--------|------|-------------|
| `y` | int | Training label: `1` = known positive (treats), `0` = known negative (does not treat), `2` = unknown |
| `is_known_positive` | boolean | `true` if this pair is a confirmed drug-disease treatment association in the training data |
| `is_known_negative` | boolean | `true` if this pair is a confirmed non-treatment association in the training data |
| `trial_sig_better` | boolean | `true` if the drug showed statistically significant improvement over control in clinical trials for this disease |
| `trial_non_sig_better` | boolean | `true` if the drug showed non-significant improvement in clinical trials for this disease |
| `trial_sig_worse` | boolean | `true` if the drug showed statistically significant worsening in clinical trials for this disease |
| `trial_non_sig_worse` | boolean | `true` if the drug showed non-significant worsening in clinical trials for this disease |
| `off_label` | boolean | `true` if this pair is documented as an off-label drug use |
| `ec_indications_list` | boolean | `true` if the pair appears in the Every Cure indications list (on-label or off-label) |
| `ec_indications_list_on_label` | boolean | `true` if the pair appears as an on-label indication in the Every Cure indications list |
| `ec_indications_list_off_label` | boolean | `true` if the pair appears as an off-label indication in the Every Cure indications list |
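
As an illustration of how these flags might be combined, the sketch below decodes `y` and picks out pairs with no curated evidence. The helper names and the notion of a "novel candidate" (unknown label and no evidence flag set) are assumptions made for this example, not definitions taken from the dataset, and the rows are made up.

```python
# Column names come from the tables in this card; the helper names are hypothetical.
Y_LABELS = {1: "known positive", 0: "known negative", 2: "unknown"}

EVIDENCE_FLAGS = (
    "is_known_positive",
    "is_known_negative",
    "trial_sig_better",
    "trial_non_sig_better",
    "trial_sig_worse",
    "trial_non_sig_worse",
    "off_label",
    "ec_indications_list",
    "ec_indications_list_on_label",
    "ec_indications_list_off_label",
)


def label_name(y: int) -> str:
    """Translate the integer training label into a readable string."""
    return Y_LABELS[y]


def is_novel_candidate(row: dict) -> bool:
    """True when the pair is unlabeled and no curated evidence flag is set.

    This definition is an assumption for illustration only.
    """
    return row["y"] == 2 and not any(row.get(flag, False) for flag in EVIDENCE_FLAGS)


# Two made-up rows (not real dataset values).
rows = [
    {"source": "EXAMPLE:drug_a", "target": "EXAMPLE:disease_x", "y": 1, "is_known_positive": True},
    {"source": "EXAMPLE:drug_a", "target": "EXAMPLE:disease_y", "y": 2},
]
novel = [r for r in rows if is_novel_candidate(r)]
print([(r["source"], r["target"], label_name(r["y"])) for r in novel])
```

The same predicate could be applied to streamed rows or rewritten as a DataFrame filter; the plain-dict form is used here only to keep the example free of dependencies.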
### ML Prediction Scores
The model outputs three probability scores that sum to approximately 1.0 for each drug-disease pair.
| Column | Type | Description |
|--------|------|-------------|
| `untransformed_treat_score` | float | Raw model probability that the drug treats the disease (class 1). Higher = more likely to treat. |
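| `not_treat_score` | float | Raw model probability that the drug does **not** treat the disease (class 0). Listed as `not treat score` (with spaces) in the `dataset_info` schema. |
| `unknown_score` | float | Raw model probability that the relationship is unknown or uncertain (class 2). Listed as `unknown score` (with spaces) in the `dataset_info` schema. |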
| `transformed_treat_score` | float | Final treatment score after combining the raw treatment probability with drug-specific and disease-specific ranking signals (see below) |
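
The sum-to-one property can be checked row by row. The sketch below runs on made-up values. Because the `dataset_info` schema lists `not treat score` and `unknown score` with spaces while this table uses underscores, the hypothetical `get_score` helper accepts either spelling; the tolerance is also an assumption, since the card only says "approximately 1.0".

```python
import math


def get_score(row: dict, name: str) -> float:
    """Fetch a score column, tolerating 'a_b' or 'a b' spellings of the name."""
    for key in (name, name.replace("_", " ")):
        if key in row:
            return row[key]
    raise KeyError(name)


def scores_sum_to_one(row: dict, tol: float = 1e-6) -> bool:
    """Check that the three class probabilities add up to 1 within `tol`."""
    total = (
        get_score(row, "untransformed_treat_score")
        + get_score(row, "not_treat_score")
        + get_score(row, "unknown_score")
    )
    return math.isclose(total, 1.0, abs_tol=tol)


# Made-up example row using the space-separated spelling from the schema.
row = {"untransformed_treat_score": 0.02, "not treat score": 0.90, "unknown score": 0.08}
print(scores_sum_to_one(row))
```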
### Rankings
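Global ranks (`untransformed_rank`, `rank`) are computed across all ~39.5M pairs; `rank_drug` and `rank_disease` are computed within a single drug or disease. In every case, a lower rank means a higher score.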
| Column | Type | Description |
|--------|------|-------------|
| `untransformed_rank` | int | Global rank of the pair by `untransformed_treat_score` (1 = highest raw score) |
| `quantile_rank` | float | `untransformed_rank` normalized to [0, 1]; values near 0 indicate top-ranked pairs |
| `rank` | int | Global rank of the pair by `transformed_treat_score` (1 = highest final score) |
| `rank_drug` | int | Rank of this pair among all diseases for the same drug (1 = top disease for this drug) |
| `quantile_drug` | float | `rank_drug` normalized to [0, 1] |
| `rank_disease` | int | Rank of this pair among all drugs for the same disease (1 = top drug for this disease) |
| `quantile_disease` | float | `rank_disease` normalized to [0, 1] |
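
To make the relationship between scores, ranks, and quantiles concrete, the sketch below recomputes them on a toy matrix. It assumes ties are broken by input order, that a quantile is `rank / N`, and that the per-drug and per-disease ranks are based on `untransformed_treat_score`; the card does not state any of these conventions, so treat them as assumptions.

```python
from collections import defaultdict


def add_ranks(rows: list[dict], score_key: str = "untransformed_treat_score") -> None:
    """Attach global, per-drug, and per-disease ranks (1 = highest score) in place."""
    ordered = sorted(rows, key=lambda r: r[score_key], reverse=True)
    n = len(ordered)
    for i, row in enumerate(ordered, start=1):
        row["untransformed_rank"] = i
        row["quantile_rank"] = i / n  # assumed normalization

    # Group by drug ("source") and by disease ("target"), then rank within each group.
    for group_key, rank_key, quantile_key in (
        ("source", "rank_drug", "quantile_drug"),
        ("target", "rank_disease", "quantile_disease"),
    ):
        groups = defaultdict(list)
        for row in ordered:  # already sorted by descending score
            groups[row[group_key]].append(row)
        for members in groups.values():
            for i, row in enumerate(members, start=1):
                row[rank_key] = i
                row[quantile_key] = i / len(members)  # assumed normalization


# Toy 2 drugs x 2 diseases matrix with made-up scores.
toy = [
    {"source": "D1", "target": "X", "untransformed_treat_score": 0.9},
    {"source": "D1", "target": "Y", "untransformed_treat_score": 0.1},
    {"source": "D2", "target": "X", "untransformed_treat_score": 0.4},
    {"source": "D2", "target": "Y", "untransformed_treat_score": 0.7},
]
add_ranks(toy)
for row in toy:
    print(row["source"], row["target"], row["untransformed_rank"], row["rank_drug"], row["rank_disease"])
```

In this toy example the pair (D2, Y) is only second globally but first both for its drug and for its disease, which is the kind of context the per-drug and per-disease columns are meant to expose.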
### Score Transformation
The `transformed_treat_score` combines the raw model output with relative rankings to surface candidates that rank highly both globally and within their drug or disease context:
```
transformed_treat_score = untransformed_treat_score
                        + drug_weight × rank_drug^(−decay_drug)
                        + disease_weight × rank_disease^(−decay_disease)
```
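
A minimal sketch of this formula follows. The card does not give the weight or decay values, so the defaults below are placeholders chosen only to make the arithmetic easy to follow, not the pipeline's settings.

```python
def transformed_treat_score(
    untransformed_treat_score: float,
    rank_drug: int,
    rank_disease: int,
    drug_weight: float = 0.1,     # placeholder, not the pipeline's value
    disease_weight: float = 0.2,  # placeholder, not the pipeline's value
    decay_drug: float = 0.5,      # placeholder, not the pipeline's value
    decay_disease: float = 0.5,   # placeholder, not the pipeline's value
) -> float:
    """Raw score plus rank-based bonuses that shrink as the rank number grows."""
    drug_bonus = drug_weight * rank_drug ** (-decay_drug)
    disease_bonus = disease_weight * rank_disease ** (-decay_disease)
    return untransformed_treat_score + drug_bonus + disease_bonus


# A pair that is the top disease for its drug and the 4th-best drug for its disease:
# 0.5 + 0.1 * 1**-0.5 + 0.2 * 4**-0.5 = 0.5 + 0.1 + 0.1
print(transformed_treat_score(0.5, rank_drug=1, rank_disease=4))
```

With positive weights and decays, both bonus terms are largest at rank 1 and fall off as a power law, so a pair gains the most when it sits at the top of its drug's or disease's list.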
## Loading the Dataset
```python
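from datasets import load_dataset

# Fetches the full train split (~3.8 GB download, ~5.0 GB dataset size per the card metadata).
ds = load_dataset("everycure/matrix-scores", split="train")
# Converts all ~39.5M rows to an in-memory pandas DataFrame.
df = ds.to_pandas()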
```
For large-scale use, load as a streaming dataset or use the Parquet files directly:
```python
import polars as pl

# The glob follows the `data_files` path in the card metadata (data/scores/train-*).
df = pl.read_parquet("hf://datasets/everycure/matrix-scores/data/scores/train-*.parquet")
```
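
The card mentions streaming but does not show it; one way it might look is sketched below. `streaming=True` is a standard `load_dataset` argument that yields rows lazily instead of downloading everything first. The helper names `stream_rows` and `top_k` are made up for this example, and the demo at the bottom runs on made-up rows so that it needs no network access.

```python
import heapq
from typing import Iterable, Iterator


def stream_rows() -> Iterator[dict]:
    """Yield dataset rows one at a time without materializing the full table."""
    # Imported lazily so `top_k` can be used without the `datasets` package.
    from datasets import load_dataset

    ds = load_dataset("everycure/matrix-scores", split="train", streaming=True)
    yield from ds


def top_k(rows: Iterable[dict], k: int, key: str = "transformed_treat_score") -> list[dict]:
    """Return the k highest-scoring rows while holding only k rows in memory."""
    return heapq.nlargest(k, rows, key=lambda r: r[key])


# Usage against the real dataset (requires network access and `datasets`);
# itertools.islice limits the scan to the first 1,000 streamed rows:
#   from itertools import islice
#   best = top_k(islice(stream_rows(), 1000), k=5)

# Offline demo on made-up rows.
fake = (
    {"source": f"EXAMPLE:{i}", "transformed_treat_score": s}
    for i, s in enumerate([0.2, 0.9, 0.5])
)
best = top_k(fake, k=2)
print([r["source"] for r in best])
```

A full pass over the stream still reads all ~39.5M rows, so for repeated queries the Parquet files are likely the better starting point.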
## Related Resources
- [Interactive heatmap explorer](https://prototypes.everycure.org/matrix-heatmap-public)
- [Open source pipeline](https://github.com/everycure-org/matrix)
- [Drug list](https://huggingface.co/datasets/everycure/drug-list) · [Disease list](https://huggingface.co/datasets/everycure/disease-list) · [KG nodes](https://huggingface.co/datasets/everycure/kg-nodes) · [KG edges](https://huggingface.co/datasets/everycure/kg-edges)
## Disclaimer
The scores and predictions in this dataset are the output of a computational research pipeline and are intended for research purposes only. They do not constitute medical advice, clinical recommendations, or endorsement of any drug for any use. Every Cure makes no warranties, express or implied, regarding the accuracy, completeness, or fitness for any particular purpose of the information presented. Drug repurposing opportunities identified here require rigorous independent scientific and clinical validation before any clinical application.
Provider: everycure



