metric-shift/metric-shift-benchmark

Name: metric-shift/metric-shift-benchmark
Creator: metric-shift
Published: 2026-04-16 10:40:23
License: (no description provided)

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/metric-shift/metric-shift-benchmark


Description:

---
license:
- cc-by-4.0
- mit
- other
task_categories:
- tabular-regression
tags:
- metric-shift
- benchmark
- scientific-ml
- cross-domain
- cross-metric-prediction
size_categories:
- 100K<n<1M
configs:
- config_name: zinc250k
  data_files:
  - split: features
    path: zinc250k/features.csv
  - split: labels
    path: zinc250k/labels.csv
- config_name: air_quality
  data_files:
  - split: features
    path: air_quality/features.csv
  - split: labels
    path: air_quality/labels.csv
- config_name: jarvis_materials
  data_files:
  - split: features
    path: jarvis_materials/features.csv
  - split: labels
    path: jarvis_materials/labels.csv
- config_name: protein_fitness_expanded
  data_files:
  - split: features
    path: protein_fitness_expanded/features.csv
  - split: labels
    path: protein_fitness_expanded/labels.csv
- config_name: drug_admet
  data_files:
  - split: features
    path: drug_admet/features.csv
  - split: labels
    path: drug_admet/labels.csv
- config_name: climate_stations
  data_files:
  - split: features
    path: climate_stations/features.csv
  - split: labels
    path: climate_stations/labels.csv
---

# Metric Shift Benchmark

A cross-domain benchmark for predicting expensive scientific measurements from cheap surrogates, spanning **6 scientific fields** and **134 valid (y1, y2) pairs** with a standardized evaluation protocol.

**Paper:** *Metric Shift: A Benchmark for Predicting Expensive Scientific Measurements from Cheap Surrogates* (NeurIPS 2026 Evaluations & Datasets Track, under review)

## Benchmark Overview

| Dataset | Domain | Samples | Feat. dim | Labels | Valid pairs | License |
|---------|--------|---------|-----------|--------|-------------|---------|
| `zinc250k` | Drug Chemistry | 249,455 | 14 | 3 | 6 | ZINC academic-use, f... |
| `air_quality` | Environmental Science | 382,168 | 7 | 6 | 28 | CC-BY-4.0 (UCI ML Re... |
| `jarvis_materials` | Materials Science | 10,800 | 14 | 6 | 30 | Public domain / NIST... |
| `protein_fitness_expanded` | Protein Biology | 61,704 | 22 | 24 | 38 | MIT (ProteinGym aggr... |
| `drug_admet` | Pharmacology | 1,523 | 14 | 4 | 12 | CC-BY-4.0 (Polaris H... |
| `climate_stations` | Climate Science | 28,488 | 5 | 5 | 20 | CC-BY-4.0, dual attr... |
| **Total** | - | **734,138** | - | - | **134** | - |

## Problem: Metric Shift

Given a shared entity x (molecule, material, protein variant), a cheap source metric y1, and an expensive target metric y2: can we use universally available y1 to improve prediction of the sparsely labeled y2?

Key properties:

- y1 is **always available at test time** (cheap to measure for any new candidate)
- The input distribution p(x) is fixed; only the prediction target changes
- Unlike domain adaptation (shifts p(x)) or multi-task learning (co-predicts)

## Evaluation Protocol

- **Split:** 60% train / 20% val / 20% test at `split_seed=42`
- **Labeled ratio:** 20% of train (main setting); 1% and 5% for ablation
- **Seeds:** 5 model seeds per pair
- **Metrics:** R-squared and Spearman rho
- **Significance:** Paired t-test across seeds + Benjamini-Hochberg FDR at q=0.05
- **Aggregation:** Macro-median (per-dataset median, then cross-dataset median)
- **StandardScaler:** fit on labeled train only

## Usage

```python
import pandas as pd

# Load one sub-dataset
features = pd.read_csv("zinc250k/features.csv")
labels = pd.read_csv("zinc250k/labels.csv")

# Each (source, target) column pair in labels defines a Metric Shift task
# See metadata.json for the list of valid pairs with Spearman correlations
```

One-command reproduction of all tables and figures:

```bash
pip install metric-shift-benchmark
python -m metric_shift.run_all
```

## Dataset Details

### `zinc250k`: Drug Chemistry

249,455 drug-like molecules, 14 RDKit descriptors, 3 labels (logP, QED, SAS), 6 pairs

- **Source:** ZINC database (Irwin & Shoichet 2005; Sterling & Irwin 2015)
- **License:** ZINC academic-use, free redistribution with attribution
- **Features (14d):** `MolWt, HeavyAtomCount, NumHeteroatoms, NumValenceElectrons, TPSA, MolMR, HBA, HBD, NumRotatableBonds, RingCount, NumAromaticRings, FractionCSP3, BalabanJ, BertzCT`
- **Labels (3col):** `logP, QED, SAS`

### `air_quality`: Environmental Science

382,168 hourly records, 7 meteo features, 6 pollutants, 28 pairs

- **Source:** Beijing Multi-Site Air-Quality Dataset (Zhang et al. 2017)
- **License:** CC-BY-4.0 (UCI ML Repository)
- **Features (7d):** `TEMP, PRES, DEWP, RAIN, WSPM, wd_sin, wd_cos`
- **Labels (6col):** `PM25, PM10, SO2, NO2, CO, O3`

### `jarvis_materials`: Materials Science

10,800 inorganic crystals, 14 composition descriptors, 6 labels, 30 pairs

- **Source:** JARVIS-DFT 3D (Choudhary et al. 2020)
- **License:** Public domain / NIST (17 USC §105)
- **Features (14d):** `mean_Z, std_Z, mean_X, std_X, mean_row, std_row, mean_group, std_group, mean_atomic_mass, std_atomic_mass, density, volume_per_atom, n_sites, packing_fraction`
- **Labels (6col):** `formation_energy_peratom, optb88vdw_bandgap, bulk_modulus_kv, shear_modulus_gv, n_seebeck, p_seebeck`

### `protein_fitness_expanded`: Protein Biology

61,704 variants, 22-d mutation features, 24 DMS assays, 38 within-protein pairs

- **Source:** ProteinGym substitution benchmark (Notin et al. 2023)
- **License:** MIT (ProteinGym aggregation)
- **Features (22d):** `protein_id, n_mutations, AA_A_diff, AA_C_diff, AA_D_diff, AA_E_diff, AA_F_diff, AA_G_diff, AA_H_diff, AA_I_diff, AA_K_diff, AA_L_diff, AA_M_diff, AA_N_diff, AA_P_diff, AA_Q_diff, AA_R_diff, AA_S_diff, AA_T_diff, AA_V_diff, AA_W_diff, AA_Y_diff`
- **Labels (24col):** `p53_null_etoposide, p53_null_nutlin, p53_wt_nutlin, blat_deng_2012, blat_firnberg_2014, blat_jacquier_2013, blat_stiffler_2015, pten_matreyek_2021, pten_mighell_2018, cp2c9_amorosi_abundance_2021, cp2c9_amorosi_activity_2021, hsp82_flynn_2019, hsp82_mishra_2016, spike_starr_bind_2020, spike_starr_expr_2020, a0a2z5u3z0_doud_2016, a0a2z5u3z0_wu_2014, rl401_mavor_2016, rl401_roscoe_2013, rl401_roscoe_2014, ccdb_adkar_2012, ccdb_tripathi_2016, vkor1_chiasson_abundance_2020, vkor1_chiasson_activity_2020`

### `drug_admet`: Pharmacology

1,523 compounds, 14 RDKit descriptors, 4 ADME endpoints, 12 pairs

- **Source:** Biogen ADME-Fang v1 (Fang et al. 2023)
- **License:** CC-BY-4.0 (Polaris Hub)
- **Features (14d):** `MolWt, HeavyAtomCount, NumHBD, NumHBA, TPSA, MolLogP, NumRotatableBonds, RingCount, NumAromaticRings, FractionCSP3, MolMR, BertzCT, BalabanJ, NumHeteroatoms`
- **Labels (4col):** `LOG_HLM_CLint, LOG_RLM_CLint, LOG_SOLUBILITY, LOG_MDR1-MDCK_ER`

### `climate_stations`: Climate Science

28,488 daily records, 5 context features, 5 climate variables, 20 pairs

- **Source:** Open-Meteo Historical Weather API / ERA5 reanalysis
- **License:** CC-BY-4.0, dual attribution to Open-Meteo and Copernicus C3S/ERA5
- **Features (5d):** `lat, lon, day_sin, day_cos, year_norm`
- **Labels (5col):** `temp_max, temp_min, precip, windspeed, solar_radiation`

## Responsible AI

- **Personal / sensitive data:** None. All datasets contain scientific measurements on molecules, materials, proteins, pollutants, or climate variables. No human subjects, no personally identifiable information.
- **Intended use:** Benchmarking ML methods for the Metric Shift problem. Not intended for direct clinical, regulatory, or safety-critical deployment.
- **Known limitations:** (1) All six datasets are re-curations of existing public sources; our contribution is pair construction, validity filter, and protocol. (2) Domain coverage spans chemistry, biology, materials, environment, and climate, but not yet high-energy physics, astronomy, or social science. (3) Feature spaces are intentionally low-dimensional (5-22d) to isolate the contribution of y1; higher-dimensional encoders may change relative method rankings.
- **Potential misuse:** `drug_admet` contains ADME measurements that could theoretically inform adverse drug design; however, the 1,523-compound dataset is far too small and coarse for such purposes, and all data is already public.

## Maintenance

The authors commit to maintaining this repository for at least 2 years post-publication, with semantic versioning (v1.0, v1.1, ...) and a CHANGELOG for every split, filter, or protocol change.

## Citation

```bibtex
@inproceedings{metric_shift_2026,
  title={Metric Shift: A Benchmark for Predicting Expensive Scientific Measurements from Cheap Surrogates},
  author={Anonymous},
  booktitle={NeurIPS 2026 Evaluations and Datasets Track},
  year={2026},
  note={Under review}
}
```
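The split and labeled-ratio rules in the Evaluation Protocol can be sketched with the standard library alone. This is an illustration under stated assumptions, not the benchmark's own code: the README does not show the splitting implementation or which random number generator it uses, so `make_splits` is a hypothetical helper and its index lists will not match the official splits. It also assumes the labeled subset is drawn with the same seed as the split.

```python
import random


def make_splits(n_samples, split_seed=42, labeled_ratio=0.20):
    """Hypothetical helper: 60/20/20 index split plus a labeled subset of train."""
    rng = random.Random(split_seed)  # assumption: the real RNG is not documented
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_train = int(0.6 * n_samples)
    n_val = int(0.2 * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, roughly 20%

    # Only this fraction of train keeps its y2 label; y1 stays available
    # for every row, including val and test.
    n_labeled = max(1, int(labeled_ratio * len(train)))
    labeled_train = rng.sample(train, n_labeled)

    return {
        "train": train,
        "val": val,
        "test": test,
        "labeled_train": labeled_train,
    }


# 1,523 is the sample count listed for drug_admet.
splits = make_splits(1523)
print({name: len(idx) for name, idx in splits.items()})
```

Any scaler statistics would then be computed from the `labeled_train` rows only, in line with the protocol's "fit on labeled train only" rule. Passing `labeled_ratio=0.01` or `labeled_ratio=0.05` gives the ablation settings.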

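The macro-median aggregation named in the Evaluation Protocol (per-dataset median first, then the median across datasets) can be illustrated as follows. This is a sketch: `macro_median` is a hypothetical name, and the scores are made-up toy values chosen only to show the arithmetic, not benchmark results.

```python
from statistics import median


def macro_median(scores_by_dataset):
    """Median of per-dataset medians, so large datasets do not dominate."""
    per_dataset = {
        name: median(scores) for name, scores in scores_by_dataset.items()
    }
    return median(per_dataset.values()), per_dataset


# Toy per-pair scores (illustrative only, NOT real benchmark numbers).
toy_scores = {
    "zinc250k": [0.10, 0.30, 0.20],
    "drug_admet": [0.50, 0.70],
    "climate_stations": [0.40],
}

overall, per_dataset = macro_median(toy_scores)
pooled = median(s for scores in toy_scores.values() for s in scores)
print(per_dataset, overall, pooled)
```

With these toy values the macro-median and the pooled median over all pairs differ, which is the point of aggregating per dataset first: a dataset with many pairs (for example the 38 in `protein_fitness_expanded`) counts once rather than 38 times.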

Provider:

metric-shift
