anon12-neurips-2026/CrudeOilMix
Source: Hugging Face · Updated 2026-04-30 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/anon12-neurips-2026/CrudeOilMix
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-regression
language:
- en
tags:
- crude-oil
- petroleum
- blend-prediction
- property-imputation
- multimodal
- tabular
- refinery
- energy
- simulation
size_categories:
- 1M<n<10M
pretty_name: CrudeOilMix
---
# CrudeOilMix: A Million-Scale Multimodal Benchmark for Crude Oil Characterization
**Paper:** NeurIPS 2026 Evaluations & Datasets Track (under review)
## Dataset Summary
CrudeOilMix is a multimodal benchmark comprising **1,141,933** crude oil samples generated by blending 9,061 real crude oil assays through H/CAMS, an industry-standard refinery simulation platform.
Each sample provides **three aligned modalities** (`wc_ent`, `wc_calc`, `cuts`), together with TBP curves (`tbp`) and blend metadata (`info`):
- **`wc_ent`** — whole-crude entered (laboratory-equivalent) properties: 187 attributes, highly sparse
- **`wc_calc`** — whole-crude calculated (H/CAMS-derived) properties: 187 attributes, 64 fully observed
- **`cuts`** — nine distillation-cut fractions, each sharing the same 187-attribute schema
- **`tbp`** — four true-boiling-point curves (up to 41 points each)
- **`info`** — blend recipe and crude metadata
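Blends combine two to four crudes, mirroring real-world refinery operations. The three blend counts below sum to 1,132,872; the remaining 9,061 samples match the number of source assays, so they are presumably the unblended crudes (compare `is_blend` in the manifest):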
- 2-crude blends: 407,031 samples
- 3-crude blends: 510,977 samples
- 4-crude blends: 214,864 samples
## Benchmark Tasks
**Task 1 — Blend Property Prediction:** Given properties of up to 4 component crudes and their mixing ratios, predict 15 whole-crude properties of the resulting blend.
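As a point of reference for Task 1, a ratio-weighted average of the component properties is a common naive baseline. The sketch below is an illustration under stated assumptions, not the benchmark's reference method: `linear_blend_baseline` is a hypothetical helper, the padded array layout (four component slots, NaN properties and zero ratios for unused slots) is assumed, and many crude properties, viscosity for example, do not blend linearly.

```python
import numpy as np

def linear_blend_baseline(component_props, ratios):
    """Ratio-weighted average of component properties.

    component_props: (n_blends, 4, n_props) array, NaN for unused slots (assumed layout).
    ratios:          (n_blends, 4) array of mixing ratios, 0 for unused slots.
    """
    props = np.asarray(component_props, dtype=float)
    w = np.asarray(ratios, dtype=float)
    w = w / w.sum(axis=1, keepdims=True)           # normalise ratios to sum to 1
    # Zero out padded slots so NaN * 0 does not propagate into the sum.
    props = np.where(w[:, :, None] > 0, props, 0.0)
    return (props * w[:, :, None]).sum(axis=1)     # shape (n_blends, n_props)

# Toy example: one 2-crude blend at 25/75 with two properties per crude.
props = np.full((1, 4, 2), np.nan)
props[0, 0] = [0.85, 1.0]
props[0, 1] = [0.89, 3.0]
ratios = np.array([[25.0, 75.0, 0.0, 0.0]])
pred = linear_blend_baseline(props, ratios)
```

A baseline of this kind mainly shows how much of each target is explained by linear mixing; the gap to a learned model indicates the nonlinear part.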
**Task 2 — Property Imputation:** Given a partially observed property vector (30–70% masked), predict values at masked positions.
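One way to build Task 2 inputs is sketched below. It is an assumption about the protocol, not a description of it: `mask_observed` is a hypothetical helper that draws one mask rate per row from [0.3, 0.7] and hides observed (non-NaN) entries independently at that rate, so the realised masked fraction is only approximately the drawn rate.

```python
import numpy as np

def mask_observed(x, low=0.3, high=0.7, seed=0):
    """Hide roughly 30-70% of the observed entries in each row.

    Returns (x_masked, target_mask); target_mask marks the positions to predict.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    rate = rng.uniform(low, high, size=(x.shape[0], 1))   # one mask rate per row
    target_mask = observed & (rng.random(x.shape) < rate)
    x_masked = np.where(target_mask, np.nan, x)
    return x_masked, target_mask

x = np.arange(20, dtype=float).reshape(2, 10)
x[0, 0] = np.nan                      # an entry that was never observed
x_masked, target_mask = mask_observed(x)
```

Entries that were missing to begin with are never selected as targets, so the evaluation only scores positions with a known ground truth.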
Two split protocols are provided:
- **Split-R (global):** Random 80/10/10 split (train/val/test)
- **Split-C (composition):** Held-out crude combinations to test generalization to unseen blends
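A composition split is only meaningful if no crude combination leaks from train into test. The check below is a minimal sketch: `combination_overlap` is a hypothetical helper, and representing each recipe as a collection of crude IDs is an assumption (the format of `mix_json` is not documented here). The dataset's own Split-C may be defined differently, for example at the level of individual crudes.

```python
def combination_overlap(train_recipes, test_recipes):
    """Return the crude combinations that occur in both train and test.

    Each recipe is an iterable of crude IDs; order and ratios are ignored.
    """
    train_combos = {frozenset(r) for r in train_recipes}
    test_combos = {frozenset(r) for r in test_recipes}
    return train_combos & test_combos

train = [("A", "B"), ("A", "C", "D")]
test = [("B", "C"), ("B", "A")]       # ("B", "A") repeats a train combination
leaked = combination_overlap(train, test)
```

An empty result means every test blend uses a set of crudes that never appears together in training.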
## Repository Structure
```
data/                            Full dataset (14 GB)
  wc_ent_part00000.parquet       Whole-crude entered properties (58 parts)
  wc_calc_part00000.parquet      Whole-crude calculated properties (58 parts)
  cuts_part00000.parquet         Distillation-cut fractions (58 parts)
  tbp_part00000.parquet          True-boiling-point curves (58 parts)
  info_part00000.parquet         Crude metadata (58 parts)
  blend_mix_manifest.parquet     Blend recipes and n_components for all samples
  split_global.parquet           Split-R assignments (train/val/test)
  split_c_components.parquet     Split-C crude-level assignments
  split_c_task1_blends.parquet   Split-C blend-level assignments
sample/                          Representative sample (10.9 MB, 800 samples)
  *_sample.parquet               One file per modality, 200 per blend complexity
```
## Loading the Data
```python
import glob

import pandas as pd

# Load the full wc_calc modality (58 part files)
parts = sorted(glob.glob("data/wc_calc_part*.parquet"))
wc_calc = pd.concat([pd.read_parquet(p) for p in parts], ignore_index=True)

# Or load the small sample (800 rows, ~11 MB total)
sample = pd.read_parquet("sample/wc_calc_sample.parquet")

# Load blend recipes
# Columns: oilid, n_components, is_blend, mix_json
manifest = pd.read_parquet("data/blend_mix_manifest.parquet")

# Load train/val/test splits and select the training IDs
splits = pd.read_parquet("data/split_global.parquet")
train_ids = splits.loc[splits["split"] == "train", "oilid"]
```
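Once the split table is loaded, restricting a modality table to one split is a filter on `oilid`. The sketch below assumes that every modality table carries an `oilid` column matching the one in `split_global.parquet`; `select_split` is a hypothetical helper and the `API` column in the toy frame is made up for illustration.

```python
import pandas as pd

def select_split(features, splits, name):
    """Return the rows of `features` whose `oilid` is assigned to split `name`."""
    ids = splits.loc[splits["split"] == name, "oilid"]
    return features[features["oilid"].isin(ids)].reset_index(drop=True)

# Toy frames standing in for a modality table and split_global.parquet.
features = pd.DataFrame({"oilid": [1, 2, 3, 4], "API": [30.1, 28.4, 35.0, 22.7]})
splits = pd.DataFrame({"oilid": [1, 2, 3, 4], "split": ["train", "val", "train", "test"]})
train = select_split(features, splits, "train")
```

Filtering with `isin` keeps the feature table's row order and avoids duplicating rows if a split file were to list an ID more than once.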
## Physical Consistency
The dataset satisfies six petroleum-science invariants verified across all 1,141,933 samples:
- ASTM API–SG conversion: zero error (API = 141.5/SG − 131.5)
- Distillation temperature ordering: T10 < T50 < T90 (100% compliance)
- Mass balance conservation across distillation cuts
- Viscosity–temperature monotonicity (SV15 ≤ SV20)
- Density ordering (DN15 ≤ DN20)
- Yield–cut ordering
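Two of the invariants above can be re-checked with a few lines of NumPy. This is a sketch under stated assumptions: `api_from_sg` and `check_invariants` are hypothetical helpers, the inputs are plain arrays rather than named dataset columns (the column names are not listed here), and the tolerance `atol` is an arbitrary choice.

```python
import numpy as np

def api_from_sg(sg):
    """API gravity from specific gravity: API = 141.5 / SG - 131.5."""
    return 141.5 / np.asarray(sg, dtype=float) - 131.5

def check_invariants(sg, api, t10, t50, t90, atol=1e-6):
    """Return per-sample booleans for the API-SG and T10 < T50 < T90 checks."""
    api_ok = np.isclose(api_from_sg(sg), api, atol=atol)
    t10, t50, t90 = (np.asarray(t, dtype=float) for t in (t10, t50, t90))
    order_ok = (t10 < t50) & (t50 < t90)
    return api_ok, order_ok

sg = np.array([0.85, 0.90])
api = api_from_sg(sg)
# The second sample violates the ordering on purpose (T50 below T10).
api_ok, order_ok = check_invariants(
    sg, api, [120.0, 150.0], [300.0, 140.0], [480.0, 500.0]
)
```

Returning boolean arrays rather than a single flag makes it easy to count or inspect the violating samples.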
## Citation
```bibtex
@inproceedings{crudeolimix2026,
  title     = {{CrudeOilMix}: A Million-Scale Multimodal Benchmark for Crude Oil Blend Evaluation},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2026},
}
```
## License
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
The dataset is generated from H/CAMS industrial simulation software using anonymized crude oil assay data. It contains no personally identifiable information and no proprietary assay values.
Provider:
anon12-neurips-2026
Collected Summary
Dataset Overview

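Construction
The CrudeOilMix dataset is generated by blending 9,061 real crude oil assays through H/CAMS, an industry-standard refinery simulation platform, for a total of 1,141,933 samples. It models real refinery operating scenarios with 2-, 3-, and 4-crude blend recipes, which account for 407,031, 510,977, and 214,864 samples respectively. Each sample provides three aligned modalities (whole-crude laboratory-equivalent properties, H/CAMS-calculated properties, and the properties of nine distillation-cut fractions), together with true-boiling-point curves and blend-recipe metadata, so each record describes the blend's physicochemical properties at both whole-crude and cut level.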
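Features
The dataset's core features are its aligned multimodal structure and its verified physical consistency. Each sample contains a 187-dimensional attribute vector; in the whole-crude calculated modality 64 attributes are fully observed, while the laboratory (entered) modality is highly sparse. The dataset satisfies six petroleum-science invariants, including zero-error API–SG conversion, distillation temperature ordering, and mass balance conservation, which supports its reliability in engineering applications. Two split schemes are provided, a global random split and a composition hold-out split, for evaluating interpolation and extrapolation (generalization to unseen blends) respectively.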
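Usage
Users can load the Parquet files with pandas. The full dataset is about 14 GB and is stored as 58 part files per modality, which can be read in bulk with the `glob` module and concatenated. A small demonstration sample of 800 samples is also provided for quick checks. Blend recipes are stored in `blend_mix_manifest.parquet`, and the train, validation, and test assignments are provided in `split_global.parquet` and the `split_c_*.parquet` files. The dataset targets two benchmark tasks, blend property prediction and property imputation, and supports training and evaluating multimodal-fusion and tabular-regression models.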
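Background and Challenges
Background Overview
CrudeOilMix is a million-scale multimodal benchmark dataset submitted in 2026 to the NeurIPS Evaluations & Datasets Track (under review). It was generated with the H/CAMS industrial refinery simulation platform from 9,061 real crude oil assays, yielding 1,141,933 samples that cover blends of two to four crudes. Its core research problems are property prediction and property imputation for crude oil blends, with the aim of bridging machine learning and petroleum engineering. As a large-scale multimodal benchmark for crude oil characterization, it is intended as a reference for data-driven optimization in the refining industry.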
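Current Challenges
CrudeOilMix targets the challenges of sparse attributes, heterogeneous modalities, and physical-consistency constraints in predicting crude blend properties. At the domain level, traditional empirical models struggle to capture the nonlinear property changes that occur when several crudes are blended, while machine learning methods must reach high accuracy and still respect physical constraints such as distillation temperature ordering and mass conservation. On the construction side, the team had to synthesize about 1.14 million blend samples from roughly 9,000 industrial crude assays, rely on H/CAMS simulation for physical plausibility, and design both random and composition split protocols to evaluate generalization to unseen blend recipes. This requires balancing data scale, modality alignment, and physical validity.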
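Common Scenarios
Typical Use Cases
In petrochemical and refinery process optimization, the core use of CrudeOilMix is multi-target prediction of crude blend properties. The dataset contains more than 1.14 million samples generated with the industry-standard H/CAMS simulation platform, covering two-, three-, and four-component blends. The typical workflow takes the known properties of the component crudes and their mixing ratios and predicts 15 key physicochemical properties of the blended crude, such as density, sulfur content, and distillation characteristics. The dataset also defines a property imputation task, in which part of a sample's property vector is masked (30–70%) and the model recovers the masked values from the remaining multimodal information. This design reflects the sparse and noisy data found in real refineries and gives prediction and imputation models a large-scale benchmark.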
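Derived and Related Work
The dataset lends itself to several lines of follow-up work. On the modeling side, its sparse multimodal structure suits graph neural networks and Transformer variants for end-to-end blend property prediction, with partial least squares and random forests as traditional baselines. The imputation task suits autoencoders that incorporate parameterized physical constraints, along with interpretability analyses. For benchmarking, the out-of-distribution Split-C protocol raises the question of how to measure compositional extrapolation and motivates stricter evaluation protocols. The simulated samples could also support virtual sampling design for automated laboratories, supplementing scarce real measurements in digital-twin refinery research.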
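Recent Research
Latest Research Directions
The release of CrudeOilMix, a million-scale multimodal benchmark, gives crude oil evaluation and refining research a new shared resource. Its submission to NeurIPS 2026 reflects growing interest in data-driven methods for the traditional petroleum industry. Current work focuses on two core challenges: industrial-scale blend property prediction and property imputation. By simulating realistic blends of two to four crudes under petroleum-physics consistency constraints, CrudeOilMix offers a standardized platform for testing the robustness of end-to-end deep learning models and links traditional refinery simulation (H/CAMS) with modern machine learning. It thereby supports the energy sector's shift from experience-driven to data- and intelligence-driven practice, particularly in sustainable refining and intelligent decision-making.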
The content above was collected and summarized by 遇见数据集.



