
metric-shift/metric-shift-benchmark

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/metric-shift/metric-shift-benchmark
Description:
---
license:
- cc-by-4.0
- mit
- other
task_categories:
- tabular-regression
tags:
- metric-shift
- benchmark
- scientific-ml
- cross-domain
- cross-metric-prediction
size_categories:
- 100K<n<1M
configs:
- config_name: zinc250k
  data_files:
  - split: features
    path: zinc250k/features.csv
  - split: labels
    path: zinc250k/labels.csv
- config_name: air_quality
  data_files:
  - split: features
    path: air_quality/features.csv
  - split: labels
    path: air_quality/labels.csv
- config_name: jarvis_materials
  data_files:
  - split: features
    path: jarvis_materials/features.csv
  - split: labels
    path: jarvis_materials/labels.csv
- config_name: protein_fitness_expanded
  data_files:
  - split: features
    path: protein_fitness_expanded/features.csv
  - split: labels
    path: protein_fitness_expanded/labels.csv
- config_name: drug_admet
  data_files:
  - split: features
    path: drug_admet/features.csv
  - split: labels
    path: drug_admet/labels.csv
- config_name: climate_stations
  data_files:
  - split: features
    path: climate_stations/features.csv
  - split: labels
    path: climate_stations/labels.csv
---

# Metric Shift Benchmark

A cross-domain benchmark for predicting expensive scientific measurements from cheap surrogates, spanning **6 scientific fields** and **134 valid (y1, y2) pairs** with a standardized evaluation protocol.

**Paper:** *Metric Shift: A Benchmark for Predicting Expensive Scientific Measurements from Cheap Surrogates* (NeurIPS 2026 Evaluations & Datasets Track, under review)

## Benchmark Overview

| Dataset | Domain | Samples | Feat. dim | Labels | Valid pairs | License |
|---------|--------|---------|-----------|--------|-------------|---------|
| `zinc250k` | Drug Chemistry | 249,455 | 14 | 3 | 6 | ZINC academic-use, f... |
| `air_quality` | Environmental Science | 382,168 | 7 | 6 | 28 | CC-BY-4.0 (UCI ML Re... |
| `jarvis_materials` | Materials Science | 10,800 | 14 | 6 | 30 | Public domain / NIST... |
| `protein_fitness_expanded` | Protein Biology | 61,704 | 22 | 24 | 38 | MIT (ProteinGym aggr... |
| `drug_admet` | Pharmacology | 1,523 | 14 | 4 | 12 | CC-BY-4.0 (Polaris H... |
| `climate_stations` | Climate Science | 28,488 | 5 | 5 | 20 | CC-BY-4.0, dual attr... |
| **Total** | --- | **734,138** | --- | --- | **134** | --- |

## Problem: Metric Shift

Given a shared entity x (molecule, material, protein variant), a cheap source metric y1, and an expensive target metric y2: can we use the universally available y1 to improve prediction of the sparsely labeled y2?

Key properties:

- y1 is **always available at test time** (cheap to measure for any new candidate)
- The input distribution p(x) is fixed; only the prediction target changes
- This differs from domain adaptation (which shifts p(x)) and from multi-task learning (which co-predicts both metrics)

## Evaluation Protocol

- **Split:** 60% train / 20% val / 20% test at `split_seed=42`
- **Labeled ratio:** 20% of train (main setting); 1% and 5% for ablation
- **Seeds:** 5 model seeds per pair
- **Metrics:** R-squared and Spearman rho
- **Significance:** Paired t-test across seeds + Benjamini-Hochberg FDR at q=0.05
- **Aggregation:** Macro-median (per-dataset median, then cross-dataset median)
- **StandardScaler:** fit on labeled train only

## Usage

```python
import pandas as pd

# Load one sub-dataset
features = pd.read_csv("zinc250k/features.csv")
labels = pd.read_csv("zinc250k/labels.csv")

# Each (source, target) column pair in labels defines a Metric Shift task
# See metadata.json for the list of valid pairs with Spearman correlations
```

One-command reproduction of all tables and figures:

```bash
pip install metric-shift-benchmark
python -m metric_shift.run_all
```

## Dataset Details

### `zinc250k`: Drug Chemistry

249,455 drug-like molecules, 14 RDKit descriptors, 3 labels (logP, QED, SAS), 6 pairs

- **Source:** ZINC database (Irwin & Shoichet 2005; Sterling & Irwin 2015)
- **License:** ZINC academic-use, free redistribution with attribution
- **Features (14d):** `MolWt, HeavyAtomCount, NumHeteroatoms, NumValenceElectrons, TPSA, MolMR, HBA, HBD, NumRotatableBonds, RingCount, NumAromaticRings, FractionCSP3, BalabanJ, BertzCT`
- **Labels (3col):** `logP, QED, SAS`

### `air_quality`: Environmental Science

382,168 hourly records, 7 meteo features, 6 pollutants, 28 pairs

- **Source:** Beijing Multi-Site Air-Quality Dataset (Zhang et al. 2017)
- **License:** CC-BY-4.0 (UCI ML Repository)
- **Features (7d):** `TEMP, PRES, DEWP, RAIN, WSPM, wd_sin, wd_cos`
- **Labels (6col):** `PM25, PM10, SO2, NO2, CO, O3`

### `jarvis_materials`: Materials Science

10,800 inorganic crystals, 14 composition descriptors, 6 labels, 30 pairs

- **Source:** JARVIS-DFT 3D (Choudhary et al. 2020)
- **License:** Public domain / NIST (17 USC §105)
- **Features (14d):** `mean_Z, std_Z, mean_X, std_X, mean_row, std_row, mean_group, std_group, mean_atomic_mass, std_atomic_mass, density, volume_per_atom, n_sites, packing_fraction`
- **Labels (6col):** `formation_energy_peratom, optb88vdw_bandgap, bulk_modulus_kv, shear_modulus_gv, n_seebeck, p_seebeck`

### `protein_fitness_expanded`: Protein Biology

61,704 variants, 22-d mutation features, 24 DMS assays, 38 within-protein pairs

- **Source:** ProteinGym substitution benchmark (Notin et al. 2023)
- **License:** MIT (ProteinGym aggregation)
- **Features (22d):** `protein_id, n_mutations, AA_A_diff, AA_C_diff, AA_D_diff, AA_E_diff, AA_F_diff, AA_G_diff, AA_H_diff, AA_I_diff, AA_K_diff, AA_L_diff, AA_M_diff, AA_N_diff, AA_P_diff, AA_Q_diff, AA_R_diff, AA_S_diff, AA_T_diff, AA_V_diff, AA_W_diff, AA_Y_diff`
- **Labels (24col):** `p53_null_etoposide, p53_null_nutlin, p53_wt_nutlin, blat_deng_2012, blat_firnberg_2014, blat_jacquier_2013, blat_stiffler_2015, pten_matreyek_2021, pten_mighell_2018, cp2c9_amorosi_abundance_2021, cp2c9_amorosi_activity_2021, hsp82_flynn_2019, hsp82_mishra_2016, spike_starr_bind_2020, spike_starr_expr_2020, a0a2z5u3z0_doud_2016, a0a2z5u3z0_wu_2014, rl401_mavor_2016, rl401_roscoe_2013, rl401_roscoe_2014, ccdb_adkar_2012, ccdb_tripathi_2016, vkor1_chiasson_abundance_2020, vkor1_chiasson_activity_2020`

### `drug_admet`: Pharmacology

1,523 compounds, 14 RDKit descriptors, 4 ADME endpoints, 12 pairs

- **Source:** Biogen ADME-Fang v1 (Fang et al. 2023)
- **License:** CC-BY-4.0 (Polaris Hub)
- **Features (14d):** `MolWt, HeavyAtomCount, NumHBD, NumHBA, TPSA, MolLogP, NumRotatableBonds, RingCount, NumAromaticRings, FractionCSP3, MolMR, BertzCT, BalabanJ, NumHeteroatoms`
- **Labels (4col):** `LOG_HLM_CLint, LOG_RLM_CLint, LOG_SOLUBILITY, LOG_MDR1-MDCK_ER`

### `climate_stations`: Climate Science

28,488 daily records, 5 context features, 5 climate variables, 20 pairs

- **Source:** Open-Meteo Historical Weather API / ERA5 reanalysis
- **License:** CC-BY-4.0, dual attribution to Open-Meteo and Copernicus C3S/ERA5
- **Features (5d):** `lat, lon, day_sin, day_cos, year_norm`
- **Labels (5col):** `temp_max, temp_min, precip, windspeed, solar_radiation`

## Responsible AI

- **Personal / sensitive data:** None. All datasets contain scientific measurements on molecules, materials, proteins, pollutants, or climate variables. No human subjects, no personally identifiable information.
- **Intended use:** Benchmarking ML methods for the Metric Shift problem. Not intended for direct clinical, regulatory, or safety-critical deployment.
- **Known limitations:** (1) All six datasets are re-curations of existing public sources; our contribution is pair construction, validity filter, and protocol. (2) Domain coverage spans chemistry, biology, materials, environment, and climate, but not yet high-energy physics, astronomy, or social science. (3) Feature spaces are intentionally low-dimensional (5 to 22 dimensions) to isolate the contribution of y1; higher-dimensional encoders may change relative method rankings.
- **Potential misuse:** `drug_admet` contains ADME measurements that could theoretically inform adverse drug design; however, the 1,523-compound dataset is far too small and coarse for such purposes, and all data is already public.

## Maintenance

The authors commit to maintaining this repository for at least 2 years post-publication, with semantic versioning (v1.0, v1.1, ...) and a CHANGELOG for every split, filter, or protocol change.

## Citation

```bibtex
@inproceedings{metric_shift_2026,
  title={Metric Shift: A Benchmark for Predicting Expensive Scientific Measurements from Cheap Surrogates},
  author={Anonymous},
  booktitle={NeurIPS 2026 Evaluations and Datasets Track},
  year={2026},
  note={Under review}
}
```
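The Usage section states that each (source, target) column pair in the labels file defines a task. A minimal sketch of enumerating candidate pairs follows, assuming pairs are ordered so that (y1, y2) and (y2, y1) count separately; that assumption is an inference from the `zinc250k` counts (3 labels, 6 pairs), not something the README states. The benchmark's validity filter is not reproduced here, so `metadata.json` remains the authoritative list of valid pairs.

```python
from itertools import permutations

# Label columns of zinc250k as listed in the Dataset Details section.
label_columns = ["logP", "QED", "SAS"]

# Assumption: a task is an ordered (source y1, target y2) pair of distinct labels.
candidate_pairs = list(permutations(label_columns, 2))

for source, target in candidate_pairs:
    print(f"y1={source} -> y2={target}")
```

For datasets where the number of valid pairs is smaller than the number of ordered candidates (for example `air_quality`), some candidates were presumably removed by the validity filter.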
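The split, labeled-ratio, and scaling rules in the Evaluation Protocol section can be sketched without any dependencies as below. This is an illustration under stated assumptions, not the benchmark's own code: the helper names `make_splits`, `fit_scaler`, and `transform` are hypothetical, the toy data stands in for `features.csv`, and the exact shuffling the authors use at `split_seed=42` is unknown, so the indices produced here will not match the official splits. In practice scikit-learn's `train_test_split` and `StandardScaler` would do the same jobs.

```python
import random
from statistics import fmean, pstdev


def make_splits(n_rows, split_seed=42, labeled_ratio=0.20):
    """Return index lists for labeled-train, unlabeled-train, val, and test."""
    rng = random.Random(split_seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)

    # 60% train / 20% val / remaining rows as test.
    n_train = round(0.6 * n_rows)
    n_val = round(0.2 * n_rows)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]

    # Only a fraction of the training rows keep their y2 label.
    n_labeled = max(1, round(labeled_ratio * len(train)))
    return {
        "train_labeled": train[:n_labeled],
        "train_unlabeled": train[n_labeled:],
        "val": val,
        "test": test,
    }


def fit_scaler(rows):
    """Per-column mean and population std, computed from the given rows only."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data standing in for features.csv (hypothetical values).
data_rng = random.Random(0)
X = [[data_rng.gauss(5.0, 2.0), data_rng.gauss(-1.0, 0.5)] for _ in range(1000)]

splits = make_splits(len(X))

# Fit on labeled train only, then apply the same statistics to the test rows.
means, stds = fit_scaler([X[i] for i in splits["train_labeled"]])
X_test_scaled = transform([X[i] for i in splits["test"]], means, stds)
```

Fitting the scaler on the labeled training rows alone keeps validation, test, and unlabeled-train statistics out of preprocessing, which is the point of the last protocol bullet. Passing `labeled_ratio=0.01` or `0.05` gives the ablation settings.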
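The aggregation and multiple-testing steps named in the Evaluation Protocol section can be illustrated as follows. The sketch assumes one score per (y1, y2) pair and one p-value per pair already obtained from the paired t-test across seeds; the function names and all numbers are hypothetical, and the t-test itself is not shown.

```python
from collections import defaultdict
from statistics import median


def macro_median(scores):
    """scores: iterable of (dataset_name, value), one entry per (y1, y2) pair."""
    per_dataset = defaultdict(list)
    for dataset, value in scores:
        per_dataset[dataset].append(value)
    # First the median within each dataset, then the median across datasets.
    dataset_medians = {d: median(v) for d, v in per_dataset.items()}
    return median(dataset_medians.values()), dataset_medians


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k / m * q,
    # then reject every hypothesis ranked at or below k.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical per-pair scores (not benchmark results).
scores = [
    ("zinc250k", 0.10), ("zinc250k", 0.30), ("zinc250k", 0.20),
    ("drug_admet", 0.02), ("drug_admet", 0.04),
    ("climate_stations", 0.50),
]
overall, per_dataset = macro_median(scores)

# Hypothetical p-values, one per pair.
significant = benjamini_hochberg([0.041, 0.001, 0.205, 0.008, 0.039], q=0.05)
```

Taking the median within each dataset first prevents datasets with many pairs (38 for `protein_fitness_expanded`) from dominating datasets with few (6 for `zinc250k`).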

Provided by:
metric-shift