
everycure/matrix-scores

Source: Hugging Face · Updated 2026-04-09 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/everycure/matrix-scores
Description:
---
dataset_info:
  features:
  - name: source
    dtype: string
  - name: target
    dtype: string
  - name: source_ec_id
    dtype: string
  - name: is_known_positive
    dtype: bool
  - name: is_known_negative
    dtype: bool
  - name: trial_sig_better
    dtype: bool
  - name: trial_non_sig_better
    dtype: bool
  - name: trial_sig_worse
    dtype: bool
  - name: trial_non_sig_worse
    dtype: bool
  - name: off_label
    dtype: bool
  - name: ec_indications_list_off_label
    dtype: bool
  - name: ec_indications_list_on_label
    dtype: bool
  - name: ec_indications_list
    dtype: bool
  - name: not treat score
    dtype: float64
  - name: untransformed_treat_score
    dtype: float64
  - name: unknown score
    dtype: float64
  - name: untransformed_rank
    dtype: int64
  - name: quantile_rank
    dtype: float64
  - name: rank_drug
    dtype: int32
  - name: quantile_drug
    dtype: float64
  - name: rank_disease
    dtype: int32
  - name: quantile_disease
    dtype: float64
  - name: transformed_treat_score
    dtype: float64
  - name: rank
    dtype: int32
  - name: y
    dtype: int32
  splits:
  - name: train
    num_bytes: 5000341679
    num_examples: 39508102
  download_size: 3812410227
  dataset_size: 5000341679
configs:
- config_name: default
  data_files:
  - split: train
    path: data/scores/train-*
---

## Dataset Description

This dataset contains the output of the MATRIX pipeline, Every Cure's computational drug repurposing scoring system. It provides ML-generated treatment probability scores for ~39.5 million drug-disease pairs, covering ~1,800 drugs × ~22,000 diseases.

> ⚠️ **Research use only.** These scores are the output of a computational research pipeline and do not constitute medical advice, clinical recommendations, or endorsement of any drug for any use. All findings require independent scientific and clinical validation before any clinical application.

## Column Definitions

### Identifiers

| Column | Type | Description |
|--------|------|-------------|
| `source` | string | Drug identifier in a standard ontology format (e.g. `CHEBI:28304`, `PUBCHEM:5284616`, `DRUGBANK:DB00945`, `UNII:...`) |
| `target` | string | Disease identifier in MONDO format (e.g. `MONDO:0003307`) |
| `source_ec_id` | string | Every Cure internal drug identifier (e.g. `EC:00780`) |

### Ground Truth Labels

These flags indicate whether a drug-disease pair appears in curated reference datasets used during model training and evaluation.

| Column | Type | Description |
|--------|------|-------------|
| `y` | int | Training label: `1` = known positive (treats), `0` = known negative (does not treat), `2` = unknown |
| `is_known_positive` | boolean | `true` if this pair is a confirmed drug-disease treatment association in the training data |
| `is_known_negative` | boolean | `true` if this pair is a confirmed non-treatment association in the training data |
| `trial_sig_better` | boolean | `true` if the drug showed statistically significant improvement over control in clinical trials for this disease |
| `trial_non_sig_better` | boolean | `true` if the drug showed non-significant improvement in clinical trials for this disease |
| `trial_sig_worse` | boolean | `true` if the drug showed statistically significant worsening in clinical trials for this disease |
| `trial_non_sig_worse` | boolean | `true` if the drug showed non-significant worsening in clinical trials for this disease |
| `off_label` | boolean | `true` if this pair is documented as an off-label drug use |
| `ec_indications_list` | boolean | `true` if the pair appears in the Every Cure indications list (on-label or off-label) |
| `ec_indications_list_on_label` | boolean | `true` if the pair appears as an on-label indication in the Every Cure indications list |
| `ec_indications_list_off_label` | boolean | `true` if the pair appears as an off-label indication in the Every Cure indications list |

### ML Prediction Scores

The model outputs three probability scores that sum to approximately 1.0 for each drug-disease pair.

| Column | Type | Description |
|--------|------|-------------|
| `untransformed_treat_score` | float | Raw model probability that the drug treats the disease (class 1). Higher = more likely to treat. |
| `not_treat_score` | float | Raw model probability that the drug does **not** treat the disease (class 0) |
| `unknown_score` | float | Raw model probability that the relationship is unknown or uncertain (class 2) |
| `transformed_treat_score` | float | Final treatment score after combining the raw treatment probability with drug-specific and disease-specific ranking signals (see below) |

### Rankings

Global rankings (`untransformed_rank`, `rank`) are computed across all ~39.5M pairs; `rank_drug` and `rank_disease` are computed within a single drug or a single disease. In every case, a lower rank means a higher predicted treatment probability.

| Column | Type | Description |
|--------|------|-------------|
| `untransformed_rank` | int | Global rank of the pair by `untransformed_treat_score` (1 = highest raw score) |
| `quantile_rank` | float | `untransformed_rank` normalized to [0, 1]; values near 0 indicate top-ranked pairs |
| `rank` | int | Global rank of the pair by `transformed_treat_score` (1 = highest final score) |
| `rank_drug` | int | Rank of this pair among all diseases for the same drug (1 = top disease for this drug) |
| `quantile_drug` | float | `rank_drug` normalized to [0, 1] |
| `rank_disease` | int | Rank of this pair among all drugs for the same disease (1 = top drug for this disease) |
| `quantile_disease` | float | `rank_disease` normalized to [0, 1] |

### Score Transformation

The `transformed_treat_score` combines the raw model output with relative rankings to surface candidates that rank highly both globally and within their drug or disease context:

```
transformed_treat_score = untransformed_treat_score
                        + drug_weight × rank_drug^(−decay_drug)
                        + disease_weight × rank_disease^(−decay_disease)
```

## Loading the Dataset

```python
from datasets import load_dataset

# Downloads the full train split (~3.8 GB) and materializes it as a pandas DataFrame.
ds = load_dataset("everycure/matrix-scores", split="train")
df = ds.to_pandas()
```

For large-scale use, load as a streaming dataset or use the Parquet files directly:

```python
import polars as pl

# The glob follows the `data_files` path declared in the dataset config (data/scores/train-*).
df = pl.read_parquet("hf://datasets/everycure/matrix-scores/data/scores/train-*.parquet")
```

## Related Resources

- [Interactive heatmap explorer](https://prototypes.everycure.org/matrix-heatmap-public)
- [Open source pipeline](https://github.com/everycure-org/matrix)
- [Drug list](https://huggingface.co/datasets/everycure/drug-list) · [Disease list](https://huggingface.co/datasets/everycure/disease-list) · [KG nodes](https://huggingface.co/datasets/everycure/kg-nodes) · [KG edges](https://huggingface.co/datasets/everycure/kg-edges)

## Disclaimer

The scores and predictions shown in this tool are the output of a computational research pipeline and are intended for research purposes only. They do not constitute medical advice, clinical recommendations, or endorsement of any drug for any use. Every Cure makes no warranties, express or implied, regarding the accuracy, completeness, or fitness for any particular purpose of the information presented. Drug repurposing opportunities identified here require rigorous independent scientific and clinical validation before any clinical application.
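
To make the formula in the Score Transformation section concrete, here is a minimal sketch in plain Python. The dataset card does not state the values of `drug_weight`, `decay_drug`, `disease_weight`, or `decay_disease`, so the numbers below are hypothetical and chosen only to keep the arithmetic easy to follow; the function name is likewise an illustration, not part of the MATRIX pipeline.

```python
def transformed_treat_score(
    untransformed_treat_score: float,
    rank_drug: int,
    rank_disease: int,
    drug_weight: float,
    decay_drug: float,
    disease_weight: float,
    decay_disease: float,
) -> float:
    """Apply the score transformation formula from the dataset card.

    Ranks start at 1, so rank ** (-decay) is at most 1 and shrinks as the
    rank gets worse: the bonus is largest for a drug's top disease or a
    disease's top drug.
    """
    drug_bonus = drug_weight * rank_drug ** (-decay_drug)
    disease_bonus = disease_weight * rank_disease ** (-decay_disease)
    return untransformed_treat_score + drug_bonus + disease_bonus


# Hypothetical parameters (NOT the values used by the MATRIX pipeline).
params = dict(drug_weight=0.5, decay_drug=1.0, disease_weight=0.5, decay_disease=0.5)

# Pair A: modest raw score, but the top disease for its drug (rank_drug = 1).
score_a = transformed_treat_score(0.2, rank_drug=1, rank_disease=4, **params)

# Pair B: higher raw score, but far down both per-drug and per-disease lists.
score_b = transformed_treat_score(0.4, rank_drug=100, rank_disease=100, **params)

print(score_a, score_b)
```

With these illustrative parameters, pair A ends up with the higher transformed score even though its raw probability is lower, which is the re-ranking effect the section describes.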
Provider:
everycure