metric-shift/metric-shift-benchmark
Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link: https://hf-mirror.com/datasets/metric-shift/metric-shift-benchmark
Resource description:
---
license:
- cc-by-4.0
- mit
- other
task_categories:
- tabular-regression
tags:
- metric-shift
- benchmark
- scientific-ml
- cross-domain
- cross-metric-prediction
size_categories:
- 100K<n<1M
configs:
- config_name: zinc250k
  data_files:
  - split: features
    path: zinc250k/features.csv
  - split: labels
    path: zinc250k/labels.csv
- config_name: air_quality
  data_files:
  - split: features
    path: air_quality/features.csv
  - split: labels
    path: air_quality/labels.csv
- config_name: jarvis_materials
  data_files:
  - split: features
    path: jarvis_materials/features.csv
  - split: labels
    path: jarvis_materials/labels.csv
- config_name: protein_fitness_expanded
  data_files:
  - split: features
    path: protein_fitness_expanded/features.csv
  - split: labels
    path: protein_fitness_expanded/labels.csv
- config_name: drug_admet
  data_files:
  - split: features
    path: drug_admet/features.csv
  - split: labels
    path: drug_admet/labels.csv
- config_name: climate_stations
  data_files:
  - split: features
    path: climate_stations/features.csv
  - split: labels
    path: climate_stations/labels.csv
---
# Metric Shift Benchmark
A cross-domain benchmark for predicting expensive scientific measurements from
cheap surrogates, spanning **6 scientific fields** and **134 valid
(y1, y2) pairs** with a standardized evaluation protocol.
**Paper:** *Metric Shift: A Benchmark for Predicting Expensive Scientific
Measurements from Cheap Surrogates* (NeurIPS 2026 Evaluations & Datasets Track, under review)
## Benchmark Overview
| Dataset | Domain | Samples | Feat. dim | Labels | Valid pairs | License |
|---------|--------|---------|-----------|--------|-------------|---------|
| `zinc250k` | Drug Chemistry | 249,455 | 14 | 3 | 6 | ZINC academic-use, free redistribution with attribution |
| `air_quality` | Environmental Science | 382,168 | 7 | 6 | 28 | CC-BY-4.0 (UCI ML Repository) |
| `jarvis_materials` | Materials Science | 10,800 | 14 | 6 | 30 | Public domain / NIST (17 USC §105) |
| `protein_fitness_expanded` | Protein Biology | 61,704 | 22 | 24 | 38 | MIT (ProteinGym aggregation) |
| `drug_admet` | Pharmacology | 1,523 | 14 | 4 | 12 | CC-BY-4.0 (Polaris Hub) |
| `climate_stations` | Climate Science | 28,488 | 5 | 5 | 20 | CC-BY-4.0, dual attribution to Open-Meteo and Copernicus C3S/ERA5 |
| **Total** | --- | **734,138** | --- | --- | **134** | --- |
## Problem: Metric Shift
Given a shared entity x (molecule, material, protein variant), a cheap source
metric y1, and an expensive target metric y2: can we use universally available
y1 to improve prediction of the sparsely labeled y2?
Key properties:
- y1 is **always available at test time** (cheap to measure for any new candidate)
- The input distribution p(x) is fixed; only the prediction target changes
- This differs from domain adaptation, where p(x) shifts, and from multi-task learning, where y1 and y2 are predicted jointly
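
One way to read this setup in code, as a minimal sketch rather than the benchmark's own implementation: because y1 is available for every candidate at test time, it can be appended to x as an extra input column when predicting y2. The helper name `build_inputs` and the toy numbers below are illustrative assumptions.

```python
import numpy as np

def build_inputs(X, y1, use_source=True):
    """Return the model inputs, optionally augmented with the source metric y1."""
    X = np.asarray(X, dtype=float)
    if not use_source:
        return X  # baseline: predict y2 from x alone
    y1 = np.asarray(y1, dtype=float).reshape(-1, 1)
    return np.hstack([X, y1])  # y1 becomes one extra input column

# Toy example with made-up numbers: 4 entities, 3 features each.
X = np.arange(12, dtype=float).reshape(4, 3)
y1 = np.array([0.1, 0.2, 0.3, 0.4])

X_baseline = build_inputs(X, y1, use_source=False)  # shape (4, 3)
X_augmented = build_inputs(X, y1, use_source=True)  # shape (4, 4)
```

Comparing a model trained on `X_baseline` with one trained on `X_augmented` is one simple way to ask whether y1 helps; other ways of using y1 are equally compatible with the problem statement.
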
## Evaluation Protocol
- **Split:** 60% train / 20% val / 20% test at `split_seed=42`
- **Labeled ratio:** 20% of train (main setting); 1% and 5% for ablation
- **Seeds:** 5 model seeds per pair
- **Metrics:** R-squared and Spearman rho
- **Significance:** Paired t-test across seeds + Benjamini-Hochberg FDR at q=0.05
- **Aggregation:** Macro-median (per-dataset median, then cross-dataset median)
- **Standardization:** `StandardScaler` is fit on the labeled training subset only
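
The protocol steps above can be sketched as follows. This is a hedged illustration, not the benchmark's reference code: the function names are hypothetical, the random generator is an assumption (so the indices will not match the official splits), and the standardization is written out in NumPy to mirror what `StandardScaler` computes.

```python
import statistics
import numpy as np

def make_split(n_samples, split_seed=42, labeled_ratio=0.20):
    """Shuffle indices once, cut 60/20/20, then mark part of train as labeled."""
    order = np.random.default_rng(split_seed).permutation(n_samples)
    n_train = n_samples * 3 // 5
    n_val = n_samples // 5
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    # Assumption: the labeled subset is the first slice of the shuffled train set.
    n_labeled = max(1, round(labeled_ratio * n_train))
    return {"train": train, "val": val, "test": test, "labeled": train[:n_labeled]}

def standardize(X, labeled_idx):
    """Scale all rows of X with mean/std estimated on the labeled train rows only."""
    mean = X[labeled_idx].mean(axis=0)
    std = X[labeled_idx].std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns are centered, not rescaled
    return (X - mean) / std

def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

def macro_median(scores_by_dataset):
    """Median within each dataset first, then median across datasets."""
    per_dataset = [statistics.median(scores) for scores in scores_by_dataset.values()]
    return statistics.median(per_dataset)
```

The p-values passed to `benjamini_hochberg` could come, for example, from `scipy.stats.ttest_rel` applied to the five per-seed scores of two methods on the same pair. Taking the median per dataset before the cross-dataset median keeps datasets with many pairs from dominating the summary.
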
## Usage
```python
import pandas as pd
# Load one sub-dataset
features = pd.read_csv("zinc250k/features.csv")
labels = pd.read_csv("zinc250k/labels.csv")
# Each (source, target) column pair in labels defines a Metric Shift task
# See metadata.json for the list of valid pairs with Spearman correlations
```
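
As a small sketch of how label columns turn into tasks, the snippet below enumerates every ordered (source, target) pair of distinct columns. It assumes pairs are ordered, which is consistent with the counts in the overview table (3 labels give 6 pairs for `zinc250k`). It only lists candidates: `metadata.json` remains the authority on which pairs are valid. For instance, `air_quality` has 6 labels and therefore 30 ordered candidates, of which the table lists 28 as valid. The helper name `candidate_pairs` is hypothetical.

```python
from itertools import permutations

def candidate_pairs(label_columns):
    """All ordered (source, target) pairs of distinct label columns."""
    return list(permutations(label_columns, 2))

zinc_labels = ["logP", "QED", "SAS"]  # label columns of zinc250k
pairs = candidate_pairs(zinc_labels)
for source, target in pairs:
    print(f"{source} -> {target}")
```
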
One-command reproduction of all tables and figures:
```bash
pip install metric-shift-benchmark
python -m metric_shift.run_all
```
## Dataset Details
### `zinc250k` — Drug Chemistry
249,455 drug-like molecules, 14 RDKit descriptors, 3 labels (logP, QED, SAS), 6 pairs
- **Source:** ZINC database (Irwin & Shoichet 2005; Sterling & Irwin 2015)
- **License:** ZINC academic-use, free redistribution with attribution
- **Features (14d):** `MolWt, HeavyAtomCount, NumHeteroatoms, NumValenceElectrons, TPSA, MolMR, HBA, HBD, NumRotatableBonds, RingCount, NumAromaticRings, FractionCSP3, BalabanJ, BertzCT`
- **Labels (3col):** `logP, QED, SAS`
### `air_quality` — Environmental Science
382,168 hourly records, 7 meteo features, 6 pollutants, 28 pairs
- **Source:** Beijing Multi-Site Air-Quality Dataset (Zhang et al. 2017)
- **License:** CC-BY-4.0 (UCI ML Repository)
- **Features (7d):** `TEMP, PRES, DEWP, RAIN, WSPM, wd_sin, wd_cos`
- **Labels (6col):** `PM25, PM10, SO2, NO2, CO, O3`
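
The `wd_sin` and `wd_cos` columns suggest a cyclical encoding of wind direction, which keeps neighbouring directions close in feature space. How the benchmark derived these columns is not stated in this card; the sketch below assumes the raw direction is a 16-point compass label, and the function name is hypothetical.

```python
import math

# Assumption: raw wind direction is one of the 16 compass points, clockwise from north.
COMPASS_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
              "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def encode_wind_direction(label):
    """Map a compass label to (sin, cos) of its angle measured clockwise from north."""
    angle = 2 * math.pi * COMPASS_16.index(label) / len(COMPASS_16)
    return math.sin(angle), math.cos(angle)

wd_sin, wd_cos = encode_wind_direction("NNW")
```

With this encoding, `NNW` and `N` end up close together even though they sit at opposite ends of the list, which a plain integer code would not achieve.
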
### `jarvis_materials` — Materials Science
10,800 inorganic crystals, 14 composition descriptors, 6 labels, 30 pairs
- **Source:** JARVIS-DFT 3D (Choudhary et al. 2020)
- **License:** Public domain / NIST (17 USC §105)
- **Features (14d):** `mean_Z, std_Z, mean_X, std_X, mean_row, std_row, mean_group, std_group, mean_atomic_mass, std_atomic_mass, density, volume_per_atom, n_sites, packing_fraction`
- **Labels (6col):** `formation_energy_peratom, optb88vdw_bandgap, bulk_modulus_kv, shear_modulus_gv, n_seebeck, p_seebeck`
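
Descriptors such as `mean_Z` and `std_Z` are summary statistics of an elemental property over a crystal's composition. The exact featurization is not specified in this card, so the following is only a sketch under two assumptions: elements are weighted by atom count, and the standard deviation is the population form. The lookup table and function name are illustrative.

```python
import math

ATOMIC_NUMBER = {"O": 8, "Si": 14, "Ti": 22}  # small lookup for this example only

def composition_stats(composition, property_table):
    """Atom-count-weighted mean and population std of an elemental property."""
    total = sum(composition.values())
    mean = sum(n * property_table[el] for el, n in composition.items()) / total
    var = sum(n * (property_table[el] - mean) ** 2 for el, n in composition.items()) / total
    return mean, math.sqrt(var)

# SiO2: one Si (Z=14) and two O (Z=8) per formula unit.
mean_Z, std_Z = composition_stats({"Si": 1, "O": 2}, ATOMIC_NUMBER)
```

The same helper would apply to electronegativity, row, group, or atomic mass by swapping in a different property table.
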
### `protein_fitness_expanded` — Protein Biology
61,704 variants, 22-d mutation features, 24 DMS assays, 38 within-protein pairs
- **Source:** ProteinGym substitution benchmark (Notin et al. 2023)
- **License:** MIT (ProteinGym aggregation)
- **Features (22d):** `protein_id, n_mutations, AA_A_diff, AA_C_diff, AA_D_diff, AA_E_diff, AA_F_diff, AA_G_diff, AA_H_diff, AA_I_diff, AA_K_diff, AA_L_diff, AA_M_diff, AA_N_diff, AA_P_diff, AA_Q_diff, AA_R_diff, AA_S_diff, AA_T_diff, AA_V_diff, AA_W_diff, AA_Y_diff`
- **Labels (24col):** `p53_null_etoposide, p53_null_nutlin, p53_wt_nutlin, blat_deng_2012, blat_firnberg_2014, blat_jacquier_2013, blat_stiffler_2015, pten_matreyek_2021, pten_mighell_2018, cp2c9_amorosi_abundance_2021, cp2c9_amorosi_activity_2021, hsp82_flynn_2019, hsp82_mishra_2016, spike_starr_bind_2020, spike_starr_expr_2020, a0a2z5u3z0_doud_2016, a0a2z5u3z0_wu_2014, rl401_mavor_2016, rl401_roscoe_2013, rl401_roscoe_2014, ccdb_adkar_2012, ccdb_tripathi_2016, vkor1_chiasson_abundance_2020, vkor1_chiasson_activity_2020`
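
The count of 38 within-protein pairs can be reproduced from the label names alone, under the assumptions that the text before the first underscore identifies the protein and that pairs are ordered. This is a consistency check on the numbers above, not the benchmark's pair-construction code.

```python
from collections import defaultdict

ASSAYS = [
    "p53_null_etoposide", "p53_null_nutlin", "p53_wt_nutlin",
    "blat_deng_2012", "blat_firnberg_2014", "blat_jacquier_2013", "blat_stiffler_2015",
    "pten_matreyek_2021", "pten_mighell_2018",
    "cp2c9_amorosi_abundance_2021", "cp2c9_amorosi_activity_2021",
    "hsp82_flynn_2019", "hsp82_mishra_2016",
    "spike_starr_bind_2020", "spike_starr_expr_2020",
    "a0a2z5u3z0_doud_2016", "a0a2z5u3z0_wu_2014",
    "rl401_mavor_2016", "rl401_roscoe_2013", "rl401_roscoe_2014",
    "ccdb_adkar_2012", "ccdb_tripathi_2016",
    "vkor1_chiasson_abundance_2020", "vkor1_chiasson_activity_2020",
]

def within_protein_pairs(assay_names):
    """Ordered (source, target) pairs of assays that share a protein prefix."""
    groups = defaultdict(list)
    for name in assay_names:
        groups[name.split("_", 1)[0]].append(name)  # prefix before first underscore
    pairs = []
    for members in groups.values():
        pairs += [(s, t) for s in members for t in members if s != t]
    return pairs

pairs = within_protein_pairs(ASSAYS)
print(len(pairs))  # 38
```

The 24 assays fall into 10 proteins: two with three or four assays each (`p53`, `rl401`, `blat`) contribute 6, 6, and 12 pairs, and the seven proteins with two assays contribute 2 each.
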
### `drug_admet` — Pharmacology
1,523 compounds, 14 RDKit descriptors, 4 ADME endpoints, 12 pairs
- **Source:** Biogen ADME-Fang v1 (Fang et al. 2023)
- **License:** CC-BY-4.0 (Polaris Hub)
- **Features (14d):** `MolWt, HeavyAtomCount, NumHBD, NumHBA, TPSA, MolLogP, NumRotatableBonds, RingCount, NumAromaticRings, FractionCSP3, MolMR, BertzCT, BalabanJ, NumHeteroatoms`
- **Labels (4col):** `LOG_HLM_CLint, LOG_RLM_CLint, LOG_SOLUBILITY, LOG_MDR1-MDCK_ER`
### `climate_stations` — Climate Science
28,488 daily records, 5 context features, 5 climate variables, 20 pairs
- **Source:** Open-Meteo Historical Weather API / ERA5 reanalysis
- **License:** CC-BY-4.0, dual attribution to Open-Meteo and Copernicus C3S/ERA5
- **Features (5d):** `lat, lon, day_sin, day_cos, year_norm`
- **Labels (5col):** `temp_max, temp_min, precip, windspeed, solar_radiation`
## Responsible AI
- **Personal / sensitive data:** None. All datasets contain scientific measurements
on molecules, materials, proteins, pollutants, or climate variables. No human
subjects, no personally identifiable information.
- **Intended use:** Benchmarking ML methods for the Metric Shift problem. Not
intended for direct clinical, regulatory, or safety-critical deployment.
- **Known limitations:** (1) All six datasets are re-curations of existing public
sources; our contribution is the pair construction, the validity filter, and the protocol.
(2) Domain coverage spans chemistry, biology, materials, environment, and
climate, but not yet high-energy physics, astronomy, or social science.
(3) Feature spaces are intentionally low-dimensional (5 to 22 dimensions) to isolate the
contribution of y1; higher-dimensional encoders may change relative method
rankings.
- **Potential misuse:** `drug_admet` contains ADME measurements that could
in theory inform the design of compounds with adverse properties; however, the
1,523-compound dataset is far too small and coarse for that purpose, and all of
its data is already public.
## Maintenance
The authors commit to maintaining this repository for at least 2 years
post-publication, with semantic versioning (v1.0, v1.1, ...) and a CHANGELOG
for every split, filter, or protocol change.
## Citation
```bibtex
@inproceedings{metric_shift_2026,
  title={Metric Shift: A Benchmark for Predicting Expensive Scientific Measurements from Cheap Surrogates},
  author={Anonymous},
  booktitle={NeurIPS 2026 Evaluations and Datasets Track},
  year={2026},
  note={Under review}
}
```
Provider: metric-shift



