xpertsystems/oil019-sample

Name: xpertsystems/oil019-sample
Creator: xpertsystems
Published: 2026-05-22 14:04:33
License: no description available

Hugging Face · Updated 2026-05-22 · Indexed 2026-05-31

Download link:

https://hf-mirror.com/datasets/xpertsystems/oil019-sample

Resource description:

---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
- tabular-regression
language:
- en
tags:
- synthetic
- oil-and-gas
- downstream
- refining
- distillation
- fcc
- catalytic-cracking
- process-control
- heat-exchanger
- alarm-management
- xpertsystems
pretty_name: "OIL-019 — Synthetic Refinery Process Dataset (Sample)"
size_categories:
- 100K<n<1M
---

# OIL-019 — Synthetic Refinery Process Dataset (Sample)

**SKU:** `OIL019-SAMPLE` · **Vertical:** Oil & Gas / Downstream Refining
**License:** CC-BY-NC-4.0 (sample) · **Schema version:** `oil019.v1`
**Sample version:** `1.0.0` · **Default seed:** `42`

A free, schema-identical preview of XpertSystems.ai's enterprise refinery process dataset for distillation column ML, FCC conversion modeling, PID control loop analytics, heat exchanger fouling prediction, blending optimization, and alarm management ML.

The sample covers **30 refineries** with **360 process units** across **7 unit types**, with **210,820 rows** linked across **8 tables**.

**This is the first downstream (refining) SKU in the XpertSystems Oil & Gas catalog** — complementing the upstream (drilling/production/EOR) and midstream (pipeline) SKUs already in the catalog.

---

## What's in the box

| File | Rows | Cols | Description |
|---|---:|---:|---|
| `refinery_units.csv` | 360 | 5 | Process unit catalog: refinery_id, unit_id, unit_type (CDU/VDU/FCC/Hydrocracker/Coker/Reformer/Hydrotreater), throughput, ONLINE/MAINTENANCE status |
| `distillation_columns.csv` | 25,500 | 6 | CDU+VDU tray-level snapshots: tray number, temperature, pressure, reflux ratio, timestamp |
| `cracking_operations.csv` | 18,400 | 6 | FCC+Hydrocracker reactor metrics: reactor temperature, catalyst activity, conversion percentage, coke deposition |
| `process_control_loops.csv` | 108,000 | 6 | Per-unit PID control snapshots: PV/SP tracking, controller output, AUTO/MANUAL mode |
| `heat_exchanger_network.csv` | 43,200 | 5 | Per-unit shell-and-tube exchanger network: inlet/outlet temperature, fouling factor, heat duty |
| `refinery_alarm_events.csv` | 5,000 | 6 | 6-class ISA-18.2 alarm events (High P/T, Low Flow, Pump Failure, Compressor Surge, Sensor Fault) + priority + duration |
| `blending_operations.csv` | 10,000 | 5 | 6-class product blends (Gasoline/Diesel/Jet/LPG/Naphtha/Fuel Oil) + ASTM D2699 octane + sulfur ppm + volume |
| `refinery_labels.csv` | 360 | 4 | Per-unit ML labels: optimization score + anomaly flag + shutdown risk |

Total: **210,820 rows** across 8 CSVs, ~14.1 MB on disk.
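
The per-file counts above can be sanity-checked against the stated total before any modeling. The following is a minimal sketch, assuming the row counts listed in the table; `EXPECTED_ROWS` and `check_row_counts` are illustrative names introduced here, not part of any XpertSystems tooling.

```python
# Expected row counts, copied from the "What's in the box" table.
EXPECTED_ROWS = {
    "refinery_units.csv": 360,
    "distillation_columns.csv": 25_500,
    "cracking_operations.csv": 18_400,
    "process_control_loops.csv": 108_000,
    "heat_exchanger_network.csv": 43_200,
    "refinery_alarm_events.csv": 5_000,
    "blending_operations.csv": 10_000,
    "refinery_labels.csv": 360,
}


def check_row_counts(actual_rows, expected=EXPECTED_ROWS):
    """Return {file: (expected, actual)} for files whose count differs or is missing.

    `actual_rows` maps file name -> observed row count (hypothetical input,
    e.g. built from len(DataFrame) after loading each CSV).
    """
    return {
        name: (want, actual_rows.get(name))
        for name, want in expected.items()
        if actual_rows.get(name) != want
    }


total = sum(EXPECTED_ROWS.values())
print(f"{total:,} rows across {len(EXPECTED_ROWS)} files")  # 210,820 rows across 8 files
```

An empty dictionary from `check_row_counts` means every file matched; any entry points at a truncated or partially downloaded CSV.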

---

## Calibration: industry-anchored, honestly reported

Validation uses a **10-metric scorecard** with targets sourced exclusively from **named industry standards**: **UOP / Mobil FCC handbook** (FCC operating benchmarks), **API 660** (Shell-and-Tube Heat Exchangers), **TEMA Standards** (heat exchanger design), **ASTM D2699** (Research Octane Number Standard Test Method), ASTM D2622 (sulfur in gasoline), **API 521** (Pressure-relieving and Depressuring Systems), **ISA-18.2** (Management of Alarm Systems for the Process Industries), **ANSI/ISA-95** (Manufacturing Operations Management), EEMUA 191 (alarm management performance), **EIA Refinery Capacity Report**, AFPM (American Fuel & Petrochemical Manufacturers) annual statistics, NPRA Q&A and Technology Forum.

**Sample run** (seed `42`, n_refineries=30, units_per_refinery=12):

| # | Metric | Observed | Target | Tolerance | Status | Source |
|---|---|---:|---:|---:|---|---|
| 1 | avg throughput bpd | 225041.3370 | 225000.0 | ±30000.0 | ✓ PASS | EIA Refinery Capacity Report + AFPM annual statistics — mean throughput for large US refineries (100K-500K BPD range; 225K is the median US refinery capacity per EIA-820 data) |
| 2 | avg distillation temp f | 649.9976 | 650.0 | ±80.0 | ✓ PASS | UOP / Honeywell refining process handbook + AFPM operations data — mean column temperature for atmospheric distillation (CDU bottoms ~750°F, mid-column ~600°F, vacuum distillation ~550°F; portfolio mean ~650°F) |
| 3 | avg distillation pressure psi | 34.9751 | 35.0 | ±15.0 | ✓ PASS | UOP refining process handbook + API 560 fired heaters — mean operating pressure for atmospheric CDU (20-50 psi typical) and VDU (vacuum, 1-2 psi). Portfolio mean ~35 psi for mixed CDU/VDU operation |
| 4 | avg cracking reactor temp f | 980.1136 | 980.0 | ±50.0 | ✓ PASS | UOP / Mobil FCC handbook + ExxonMobil RT process design — mean FCC reactor riser temperature for gasoline-mode operation (950-1010°F typical; 980°F is the optimal octane-conversion trade-off per Mobil/UOP) |
| 5 | avg fcc conversion pct | 74.0454 | 74.0 | ±10.0 | ✓ PASS | UOP / Mobil FCC handbook — mean conversion percentage for FCC operation (65-85% typical; 74% reflects moderate-severity gasoline-mode operation with balanced LCO/HCO production) |
| 6 | control tracking error std | 1.9998 | 2.0 | ±0.5 | ✓ PASS | ISA-95 Manufacturing Operations Management + ISA-18.2 alarm management — typical PID control loop PV-SP tracking error standard deviation for well-tuned process control (1.5-3.0 typical for production-grade loops) |
| 7 | hx inlet outlet physical consistency | 1.0000 | 1.0 | ±0.005 | ✓ PASS | API 660 (Shell-and-Tube Heat Exchangers) + TEMA Standards — inlet temperature must exceed outlet temperature for cooling/condensing exchangers (process stream being cooled). Validates generator's HX physical realism. |
| 8 | avg hx delta t f | 72.5364 | 72.5 | ±20.0 | ✓ PASS | API 660 + TEMA Standards for shell-and-tube heat exchangers — typical operating ΔT for refinery HX service (25-120°F typical; 72.5°F median for mixed preheat/cooler/condenser service) |
| 9 | avg blend octane rating | 90.0209 | 90.0 | ±5.0 | ✓ PASS | ASTM D2699 (Research Octane Number Standard Test Method) — mean octane rating for gasoline blend portfolio (82-98 RON range covering regular 87, midgrade 89, premium 91-93, and aviation 100LL) |
| 10 | anomaly flag rate | 0.0417 | 0.04 | ±0.02 | ✓ PASS | ISA-18.2 Management of Alarm Systems for the Process Industries — typical anomaly/upset rate for production-grade refinery units (2-6% of operating periods exhibit detectable upsets per EEMUA 191 / NAMUR NA-102 operational statistics) |

**Overall: 100.0/100 — Grade A+** (10 PASS · 0 MARGINAL · 0 FAIL of 10 metrics)

---

## Schema highlights

**`refinery_units.csv`** — process unit catalog with **7 unit types** per UOP/AFPM refining nomenclature:

| Unit type | Function | Detail table |
|---|---|---|
| CDU | Crude Distillation Unit (atmospheric) | `distillation_columns.csv` ✓ |
| VDU | Vacuum Distillation Unit | `distillation_columns.csv` ✓ |
| FCC | Fluid Catalytic Cracker | `cracking_operations.csv` ✓ |
| Hydrocracker | High-pressure hydrogen cracker | `cracking_operations.csv` ✓ |
| Coker | Delayed coker | (units_master only — see Honest Disclosure §1) |
| Reformer | Catalytic reformer | (units_master only) |
| Hydrotreater | Hydrodesulfurization unit | (units_master only) |

**`distillation_columns.csv`** — tray-level snapshots for atmospheric and vacuum distillation:

> tray_number = randint(1, 65) # 1-64 trays (typical column)
> temperature_f = N(650, 40) # ~650°F mean per UOP CDU benchmarks
> pressure_psi = N(35, 5) # ~35 psi atmospheric CDU
> reflux_ratio = U(1.1, 4.8) # typical industry range

**`cracking_operations.csv`** — FCC and hydrocracker reactor operations per **UOP / Mobil FCC handbook**:

> reactor_temp_f = N(980, 25) # FCC riser temp per Mobil FCC
> catalyst_activity = N(82, 4) % # MAT activity per ASTM D5757
> conversion = N(74, 6) % # gasoline-mode conversion
> coke_deposition = U(0.1, 6.5) % # catalyst coke per UOP

**`process_control_loops.csv`** — PID PV/SP tracking per **ISA-95** with **2.0 standard deviation tracking error**:

> PV = SP + N(0, 2.0)
> tracking_std observed ≈ 2.0 in sample (bullseye for declared cfg)

**`heat_exchanger_network.csv`** — shell-and-tube HX per **API 660**:

> inlet_temp_f = N(550, 35)
> outlet_temp_f = inlet − U(25, 120) # heat removed (cooling)
> # inlet > outlet enforced for 100% of rows

**`blending_operations.csv`** — product blending per **ASTM D2699 RON**:

> octane_rating = U(82, 98) # full gasoline grade range
> sulfur_ppm = U(5, 500) # pre-Tier 3 to ULSD range

---

## Suggested use cases

1. **FCC conversion regression** — predict `conversion_pct` from reactor_temp + catalyst_activity + coke_deposition features. Strong physics signal: independent Gaussian distributions allow clean regression learning.
2. **Distillation column anomaly detection** — multi-variate anomaly detection on tray-level T/P/reflux features for column instability ML.
3. **PID control loop tuning** — regression on tracking error (`pv_value − sp_value`) from controller_output + mode features for adaptive control ML.
4. **Heat exchanger fouling prediction** — regression on `fouling_factor` from inlet/outlet temp + heat duty features. Useful as cleaning-schedule optimization label.
5. **Heat exchanger heat duty estimation** — regression on `heat_duty_mmbtu_hr` from temp differential features. Anchored to API 660 / TEMA design conventions.
6. **6-class alarm priority classification** — multi-class classifier on `priority` × `alarm_type` features per ISA-18.2 alarm management.
7. **6-class product grade classification** — multi-class classifier on `product_grade` from octane + sulfur + volume features per ASTM D2699.
8. **2-class unit operating status classification** — binary classifier on `operating_status` (ONLINE/MAINTENANCE) from unit characteristics; see Honest Disclosure §5 for the 24% maintenance rate caveat.
9. **Anomaly flag binary classification** — binary classifier on `anomaly_flag` per ISA-18.2 — useful as label-only reference; see Honest Disclosure §3 for the feature-coupling caveat.
10. **Multi-table relational ML** — entity-resolution across the 7 joinable tables via `refinery_id` + `unit_id`.

---

## Loading

```python
from datasets import load_dataset

ds = load_dataset("xpertsystems/oil019-sample", data_files="distillation_columns.csv")
print(ds["train"][0])
```

Or with pandas:

```python
import pandas as pd

units = pd.read_csv("hf://datasets/xpertsystems/oil019-sample/refinery_units.csv")
dist = pd.read_csv("hf://datasets/xpertsystems/oil019-sample/distillation_columns.csv")
crack = pd.read_csv("hf://datasets/xpertsystems/oil019-sample/cracking_operations.csv")
ctrl = pd.read_csv("hf://datasets/xpertsystems/oil019-sample/process_control_loops.csv")

# Join distillation rows to unit metadata
dist_joined = dist.merge(units, left_on="column_id", right_on="unit_id")
# Now you have refinery_id + unit_type + throughput alongside column operating data
```

---

## Reproducibility

All generation is deterministic via the integer `seed` parameter (driving both `random.seed` and `np.random.seed`). A seed sweep across `[42, 7, 123, 2024, 99, 1]` confirms Grade A+ on every seed in this sample.

---

## Honest disclosure of sample-scale limitations

This is a **sample** product calibrated for refinery process ML research, not for live operational decisions. **The OIL-019 generator uses predominantly marginal Gaussian/uniform sampling without feature-coupled physics** — this gives clean training signal for marginal-property ML but limits cross-feature coupling. Several important notes:

1. **3 of 7 unit types have no detail tables.** Coker, Reformer, and Hydrotreater units appear in `refinery_units.csv` but **do not generate any detail-table rows** (only CDU+VDU → distillation_columns and FCC+Hydrocracker → cracking_operations are populated). The generator's docstring lists `catalyst_performance.csv`, `hydrotreating_operations.csv`, `furnace_operations.csv`, and `compressor_pump_telemetry.csv` as outputs but **these are not produced by the current generator**. For ML on Coker/Reformer/Hydrotreater units, use only the unit-level features (throughput, status); full product v1.1 will add the missing detail tables.
2. **`blending_operations.csv` is NOT joinable to refinery_units.csv.** The blending table has no `unit_id` or `refinery_id` column — blends are decoupled from any specific refinery or unit. Treat the blending table as a **standalone product-property ML reference** rather than as a refinery-output supply chain table. For refinery-to-blend traceability, the full product v1.1 will add refinery + unit linkages.
3. **`refinery_labels.csv` has NO feature coupling.** All three label columns (`optimization_score`, `anomaly_flag`, `shutdown_risk`) are sampled from independent uniform/Bernoulli distributions without any relationship to upstream features in distillation, cracking, controls, heat exchanger, alarm, or blending tables. **Models trained to predict any label from upstream features will not learn meaningful patterns** because the label is not a function of the features. The labels table is best used as a **reference distribution** for production label calibration, not as a supervised ML target. To build feature-coupled labels, derive them yourself from weighted combinations of upstream features (e.g., `optimization_score = f(catalyst_activity, conversion, fouling)`).
4. **Distillation column has no tray-to-tray temperature gradient.** Real CDU columns have a steep temperature gradient (~700°F at the bottom tray vs ~250°F at the top tray; ~450°F differential). The generator samples `temperature_f = N(650, 40)` independently of `tray_number`, so top-tray and bottom-tray temperatures are identical on average. **Tray-by-tray distillation profile ML on this sample will learn marginals, not physics.** For proper tray-profile ML, post-process the data with a McCabe-Thiele or Fenske-Underwood-Gilliland tray-gradient calculation, or wait for v1.1 which will introduce gradient-conditioned tray temperatures.
5. **Maintenance fraction is ~24% at sample scale.** The generator samples `random.choice(["ONLINE","ONLINE","ONLINE","MAINTENANCE"])` = 25% MAINTENANCE. Real US refinery utilization is 90%+ per EIA-820 Refinery Capacity Report, so MAINTENANCE should be ~5-10% of unit-periods. The sample's high maintenance rate is a generator quirk; for utilization-realistic ML, downsample MAINTENANCE rows to ~10% or filter them out.
6. **Process control loops are per-unit panels, not multi-loop networks.** Each unit has 300 rows in process_control_loops.csv indexed `LOOP_000` through `LOOP_299`, but these are **timesteps of a single loop, not distinct control loops**. The `loop_id` naming is misleading. Treat the column as a timestep index rather than as a loop identifier; for true multi-loop ML, sample `loop_id` per-unit and group by loop.
7. **Heat exchanger fouling is uniform-sampled, not time-varying.** The `fouling_factor` is `U(0.001, 0.04)` independent of operating hours, inlet temperature, or process service. Real HX fouling grows monotonically over runtime per TEMA RGP-T-2.4. For fouling-progression ML, this sample is not suitable; v1.1 will add runtime-conditioned fouling growth.
8. **Alarm types are uniformly sampled** (~17% each across 6 classes). Real refinery alarm distributions are heavily skewed per ISA-18.2 / EEMUA 191 statistics (sensor faults dominate ~40-60%, high-T/P trips less common). Treat `alarm_type` as label-only for classifier training; full product v1.1 will add feature-conditioned alarm priors.

---

## Cross-references to other XpertSystems OIL SKUs

This SKU is the **first downstream (refining) SKU** in the XpertSystems catalog. It complements the upstream and midstream SKUs already published:

| SKU | Layer | Focus |
|---|---|---|
| OIL-001 to OIL-014, OIL-016 to OIL-018 | Upstream | Drilling, production, lift, decline, multiphase flow |
| OIL-015 | Midstream | Pipeline flow assurance |
| OIL-017 | Upstream EOR | Waterflood / water injection |
| **OIL-019** | **Downstream** | **Refinery process operations** *(this SKU — new sub-vertical)* |

This SKU opens a **new buyer persona** for the XpertSystems catalog: process engineers, refinery operations specialists, and process control engineers at refining operators (Marathon, Valero, Phillips 66, ExxonMobil Refining, Shell Downstream, BP Refining, Chinese/Indian state refiners) and refining EPC contractors (UOP/Honeywell, Axens, Shaw E&C, Wood Group) who need synthetic data for digital twin training, advanced process control ML, and operations optimization.

---

## Full product

The **full OIL-019 dataset** (in development) will ship at significantly larger scale with **all 4 missing detail tables** (catalyst_performance, hydrotreating_operations, furnace_operations, compressor_pump_telemetry), **feature-coupled labels** derived from upstream operations features, **tray-gradient distillation profiles** with McCabe-Thiele consistency, **runtime-conditioned heat exchanger fouling**, **utilization-realistic maintenance rates**, and **blending-to-refinery supply chain linkage** — licensed commercially. Contact XpertSystems.ai for licensing terms.

📧 **pradeep@xpertsystems.ai**
🌐 **https://xpertsystems.ai**

---

## Citation

```bibtex
@dataset{xpertsystems_oil019_sample_2026,
  title  = {OIL-019: Synthetic Refinery Process Dataset (Sample)},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/datasets/xpertsystems/oil019-sample}
}
```

## Generation details

- Sample version: 1.0.0
- Random seed: 42
- Generated: 2026-05-22 14:00:14 UTC
- Refineries: 30
- Units per refinery: 12 (360 total units)
- Unit types: 7 (CDU, VDU, FCC, Hydrocracker, Coker, Reformer, Hydrotreater)
- Product grades: 6 (Gasoline, Diesel, Jet Fuel, LPG, Naphtha, Fuel Oil)
- Alarm types: 6 (High P, High T, Low Flow, Pump Failure, Compressor Surge, Sensor Fault)
- Calibration basis: UOP / Mobil FCC handbook, API 660, TEMA Standards, ASTM D2699, ASTM D2622, API 521, ISA-18.2, ANSI/ISA-95, EEMUA 191, EIA-820 Refinery Capacity, AFPM annual statistics, NPRA Q&A
- Overall validation: 100.0/100 — Grade A+
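
Two of the workarounds suggested in the honest-disclosure list (items 4 and 5) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the linear tray profile is a deliberately crude stand-in for a McCabe-Thiele or Fenske-Underwood-Gilliland calculation, tray 1 is assumed to be the top tray, the 250°F/700°F endpoints are the approximate figures quoted in item 4, and the function names and the `U000`-style unit IDs are hypothetical. Only the `operating_status` column name comes from the dataset card.

```python
import random


def linear_tray_temperature(tray_number, n_trays=64, top_f=250.0, bottom_f=700.0):
    """Linearly interpolated tray temperature in degrees F.

    Assumption: tray 1 is the top tray and tray `n_trays` is the bottom tray.
    A linear profile is a placeholder, not a rigorous column model.
    """
    if not 1 <= tray_number <= n_trays:
        raise ValueError(f"tray_number must be in 1..{n_trays}")
    frac = (tray_number - 1) / (n_trays - 1)
    return top_f + frac * (bottom_f - top_f)


def downsample_maintenance(rows, target_fraction=0.10, seed=42):
    """Keep every ONLINE row and a random subset of MAINTENANCE rows.

    The subset size m solves m / (m + n_online) = target_fraction.
    """
    rng = random.Random(seed)
    online = [r for r in rows if r["operating_status"] == "ONLINE"]
    maint = [r for r in rows if r["operating_status"] == "MAINTENANCE"]
    n_keep = round(target_fraction * len(online) / (1.0 - target_fraction))
    n_keep = min(len(maint), n_keep)
    return online + rng.sample(maint, n_keep)


# Toy input mimicking the generator's 25% MAINTENANCE rate (hypothetical IDs).
units = [
    {"unit_id": f"U{i:03d}",
     "operating_status": "MAINTENANCE" if i % 4 == 0 else "ONLINE"}
    for i in range(100)
]
balanced = downsample_maintenance(units)
n_maint = sum(r["operating_status"] == "MAINTENANCE" for r in balanced)
print(f"{n_maint} of {len(balanced)} rows are MAINTENANCE")
print(linear_tray_temperature(1), linear_tray_temperature(64))
```

If the sample contains a tray numbered 65, pass `n_trays=65` rather than relying on the default. With pandas, the same functions could be applied row-wise after `DataFrame.to_dict("records")`.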
