[Data] Prediction of soft proton intensities in the near-Earth space using machine learning
Indexed by NIAID Data Ecosystem on 2026-03-12
Download link:
https://zenodo.org/record/4718560
Description:
1 Introduction
The dataset consists of four files:
File Name | Type
RAPID_OMNI_ML_023_raw.h5 | dataset
RAPID_OMNI_ML_023_traincut.h5 | dataset
RAPID_OMNI_ML_023_testcut.h5 | dataset
RAPID_OMNI_ML_023_robusttranscut.pkl | pre-fit scaler
1.1 Variables
The dataset files contain the following variables.
Variable Name | Type | Unit | Description
p1, p2, p3, p4, p5, p6, p7 | float | 1/(s⋅cm²⋅sr⋅keV) | proton intensities
x, y, z | float | RE | position in GSE coordinates
rdist | float | RE | radial distance from the Earth
AE_index | float | nT | Auroral Electrojet (AE) index
SYM-H_index | float | nT | symmetric disturbance field in the horizontal direction
F107 | float | sfu | solar radio flux at 10.7 cm
BimfxGSE, BimfyGSE, BimfzGSE | float | nT | x, y and z components of the Interplanetary Magnetic Field in GSE coordinates
VxSW_GSE, VySW_GSE, VzSW_GSE | float | km/s | x, y and z components of the solar wind velocity
NpSW | float | n/cc | solar wind density
Temp | float | K | solar wind temperature
Pdyn | float | nPa | solar wind dynamic pressure
DateTime | datetime | - | timestamp
1.2 Data Split Ranges
The datasets contain data from the following time ranges.
Dataset | Start | End | Count
RAPID_OMNI_ML_023_raw.h5 | 2001-01-09 15:21:00 | 2018-02-19 09:57:00 | 6,051,937
RAPID_OMNI_ML_023_traincut.h5 | 2001-01-09 15:21:00 | 2014-07-24 22:44:00 | 4,524,200
RAPID_OMNI_ML_023_testcut.h5 | 2014-07-24 22:45:00 | 2018-02-19 09:57:00 | 1,173,865
2 Raw Data Preparation
2.1 Data Source
OMNIWeb (NASA)
From NASA/GSFC's OMNI data set through OMNIWeb, we extracted the following variables from 2001 to 2019:
Variable | Original Name in OMNIWeb | Resolution
AE_index | AE Index, nT | 1-min
SYM-H_index | SYM/H, nT | 1-min
F107 | Solar index F10.7 | 1-hour *
BimfxGSE | Bx, GSE/GSM, nT | 1-min
BimfyGSE | By, GSE, nT | 1-min
BimfzGSE | Bz, GSE, nT | 1-min
VxSW_GSE | Vx Velocity, GSE, km/s | 1-min
VySW_GSE | Vy Velocity, GSE, km/s | 1-min
VzSW_GSE | Vz Velocity, GSE, km/s | 1-min
NpSW | Proton Density, n/cc | 1-min
Temp | Proton Temperature, K | 1-min
* Solar index F10.7 is not available at higher resolution.
Cluster Science Archive (ESA)
Through the Cluster Archive Inter-Operability Subsystem (CAIO), we have access to the proton intensities in 7 energy channels measured by the RAPID instrument onboard the Cluster satellite. We extracted the following variables between 2001 and 2019 from the CDF files:
Variables | Dataset ID in CAIO | Variable in CDF Files | Resolution
p1, p2, p3, p4, p5, p6, p7 | C4_CP_RAP_HSPCT | Proton_Dif_flux__C4_CP_RAP_HSPCT | 4127-ms
x_km, y_km, z_km * | C4_CP_AUX_POSGSE_1M | sc_r_xyz_gse__C4_CP_AUX_POSGSE_1M | 1-min
* Positions in km are not included in the final dataset.
2.2 Custom Features
OMNIWeb (NASA)
Variable | Source
Pdyn | NpSW * (VxSW_GSE^2 + VySW_GSE^2 + VzSW_GSE^2) * 1.67e-6
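The Pdyn formula can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the table: the function name `dynamic_pressure` and the sample solar wind values are assumptions, not part of the dataset tooling. The constant 1.67e-6 is taken from the table; it corresponds to the proton mass combined with the unit conversion from n/cc and km/s to nPa.

```python
# Constant from the Pdyn formula: proton mass (1.67e-27 kg) combined with the
# conversion of n/cc * (km/s)^2 to nPa.
PDYN_FACTOR = 1.67e-6


def dynamic_pressure(np_sw, vx, vy, vz):
    """Return the solar wind dynamic pressure in nPa.

    np_sw is the proton density in n/cc; vx, vy, vz are the GSE velocity
    components in km/s.
    """
    speed_squared = vx ** 2 + vy ** 2 + vz ** 2
    return np_sw * speed_squared * PDYN_FACTOR


# Made-up sample: 5 protons per cc moving at 400 km/s along -x.
pdyn_example = dynamic_pressure(5.0, -400.0, 0.0, 0.0)
print(round(pdyn_example, 3))  # 1.336
```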
Cluster Science Archive (ESA)
Variable | Source
x | x_km / 6371.1
y | y_km / 6371.1
z | z_km / 6371.1
rdist | sqrt(x^2 + y^2 + z^2)
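As a minimal sketch of the position features above, the conversion from kilometres to Earth radii and the radial distance could look as follows. The helper name `to_earth_radii` and the sample position are assumptions made for illustration; the divisor 6371.1 is the value given in the table.

```python
import math

EARTH_RADIUS_KM = 6371.1  # divisor used in the table above


def to_earth_radii(x_km, y_km, z_km):
    """Convert a GSE position from km to Earth radii and add the radial distance."""
    x = x_km / EARTH_RADIUS_KM
    y = y_km / EARTH_RADIUS_KM
    z = z_km / EARTH_RADIUS_KM
    rdist = math.sqrt(x ** 2 + y ** 2 + z ** 2)
    return x, y, z, rdist


# Made-up position: 3 RE along x and 4 RE along y, so rdist is about 5 RE.
x, y, z, rdist = to_earth_radii(3 * 6371.1, 4 * 6371.1, 0.0)
```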
2.3 Sampling and Interpolation
- F107 is only available at a 1-hour resolution, so we use each hourly value to represent all 1-minute records in that 1-hour bin.
- To integrate the sources, we resampled the proton intensities to a resolution of 1 minute. Specifically, we averaged the proton intensities of each of the seven channels over each minute and assigned the result to the first second of that minute; e.g., the values at 2001-01-09 15:22:00 are calculated from the data between 15:22:00 and 15:22:59.
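The two sampling steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' code: the function names `minute_averages` and `f107_for`, the input layout (lists of timestamp/value pairs) and all sample numbers are made up.

```python
from collections import defaultdict
from datetime import datetime


def minute_averages(samples):
    """Average (timestamp, value) samples within each minute.

    The result is keyed by the first second of the minute, e.g. samples from
    15:22:00 to 15:22:59 end up under 15:22:00.
    """
    bins = defaultdict(list)
    for ts, value in samples:
        bins[ts.replace(second=0, microsecond=0)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in bins.items()}


def f107_for(ts, hourly_f107):
    """Return the hourly F107 value of the 1-hour bin that contains ts."""
    return hourly_f107[ts.replace(minute=0, second=0, microsecond=0)]


# Made-up intensities of one channel at roughly the native cadence.
samples = [
    (datetime(2001, 1, 9, 15, 22, 0), 1.0),
    (datetime(2001, 1, 9, 15, 22, 4, 127000), 2.0),
    (datetime(2001, 1, 9, 15, 22, 59), 3.0),
    (datetime(2001, 1, 9, 15, 23, 1), 10.0),
]
averaged = minute_averages(samples)

# Made-up hourly F107 value, reused for every minute of that hour.
hourly_f107 = {datetime(2001, 1, 9, 15, 0): 170.0}
f107_example = f107_for(datetime(2001, 1, 9, 15, 22), hourly_f107)
```

This sketch does not handle missing samples; how gaps within a minute were treated is not stated in the description.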
2.4 Integration
Since the data from the different sources are now aligned on DateTime at a resolution of 1 minute, we can merge them.
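A merge on the shared timestamp might be sketched as below. The description does not say whether unmatched minutes were kept, so the inner join used here (only timestamps present in both sources) is an assumption, as are the function name `merge_on_datetime` and the sample records.

```python
from datetime import datetime


def merge_on_datetime(omni_rows, cluster_rows):
    """Inner-join two {timestamp: {variable: value}} mappings on the timestamp."""
    merged = {}
    for ts in sorted(omni_rows.keys() & cluster_rows.keys()):
        merged[ts] = {"DateTime": ts, **omni_rows[ts], **cluster_rows[ts]}
    return merged


t1 = datetime(2001, 1, 9, 15, 21)
t2 = datetime(2001, 1, 9, 15, 22)
t3 = datetime(2001, 1, 9, 15, 23)

# Made-up values; only t2 is present in both sources.
omni_rows = {t1: {"AE_index": 40.0}, t2: {"AE_index": 55.0}}
cluster_rows = {t2: {"p1": 12.0}, t3: {"p1": 9.0}}

merged = merge_on_datetime(omni_rows, cluster_rows)
```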
2.5 Cleaning
Finally, we dropped the rows that contain fill values (treated as outliers) in any OMNI variable. Please refer to the description from OMNIWeb for more information about the fill values.
The resulting raw data is available in the package under the name RAPID_OMNI_ML_023_raw.h5.
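The cleaning step can be illustrated as follows. The fill values in `EXAMPLE_FILL_VALUES` are placeholders chosen for this sketch and should be checked against the OMNIWeb description; the function name `drop_fill_rows` and the sample rows are likewise assumptions.

```python
# Placeholder fill values for illustration only; take the real ones from the
# OMNIWeb description.
EXAMPLE_FILL_VALUES = {"BimfxGSE": 9999.99, "NpSW": 999.99}


def drop_fill_rows(rows, fill_values):
    """Keep only the rows in which no listed variable equals its fill value."""
    return [
        row
        for row in rows
        if not any(row[name] == fill for name, fill in fill_values.items())
    ]


# Made-up rows: the second one carries a fill value and is dropped.
rows = [
    {"BimfxGSE": -2.1, "NpSW": 4.8},
    {"BimfxGSE": 9999.99, "NpSW": 5.2},
]
clean_rows = drop_fill_rows(rows, EXAMPLE_FILL_VALUES)
```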
3 Experiment-specific Pre-processing
In addition, we applied the following steps to the dataset for our experiments.
3.1 Splitting
The dataset is split chronologically into a training set and a test set with a ratio of 8:2; the resulting time ranges are listed in Section 1.2.
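A split of this kind could be sketched as below. The time ranges in Section 1.2 show that the training data precede the test data, so the sketch sorts by DateTime and cuts at 80% of the rows; how the authors located the boundary exactly is not stated, and the function name `chronological_split` and the sample rows are assumptions.

```python
from datetime import datetime, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split rows into an earlier training part and a later test part."""
    ordered = sorted(rows, key=lambda row: row["DateTime"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten made-up 1-minute records.
start = datetime(2001, 1, 9, 15, 21)
rows = [{"DateTime": start + timedelta(minutes=i), "p1": float(i)} for i in range(10)]

train_rows, test_rows = chronological_split(rows)
```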
3.2 Filtering
- We removed the rows with rdist less than or equal to 6 RE.
- We also replaced the proton intensities less than or equal to a per-channel threshold with NaN. The thresholds are 5, 1, 0.5, 0.1, 0.05, 0.005 and 0.001 for the channels p1 to p7, respectively.
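The two filtering rules above can be sketched as follows. The thresholds and the rdist limit come from the list above; the function name `filter_rows`, the dictionary-based row layout and the sample values are assumptions made for illustration.

```python
import math

# Per-channel thresholds from the list above, in the unit of the intensities.
THRESHOLDS = {
    "p1": 5, "p2": 1, "p3": 0.5, "p4": 0.1,
    "p5": 0.05, "p6": 0.005, "p7": 0.001,
}
MIN_RDIST = 6  # rows with rdist <= 6 RE are removed


def filter_rows(rows):
    """Drop rows close to the Earth and mask low intensities with NaN."""
    kept = []
    for row in rows:
        if row["rdist"] <= MIN_RDIST:
            continue
        cleaned = dict(row)
        for channel, threshold in THRESHOLDS.items():
            if cleaned[channel] <= threshold:
                cleaned[channel] = math.nan
        kept.append(cleaned)
    return kept


# Made-up rows: the first is too close to the Earth, the second has a low p1.
base = {"p2": 2.0, "p3": 1.0, "p4": 1.0, "p5": 1.0, "p6": 1.0, "p7": 1.0}
rows = [
    {"rdist": 6.0, "p1": 50.0, **base},
    {"rdist": 7.5, "p1": 5.0, **base},
]
filtered = filter_rows(rows)
```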
3.3 Transform
- We did not apply any transform or scaling directly to the data during pre-processing. Instead, a Robust Scaler fitted on the training data was saved to a file and used in the experiment.
The pre-processed data and the scaler are available under the names RAPID_OMNI_ML_023_traincut.h5, RAPID_OMNI_ML_023_testcut.h5 and RAPID_OMNI_ML_023_robusttranscut.pkl.
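To illustrate what fitting a scaler on the training data only means, here is a minimal sketch of robust scaling for a single variable. It assumes the common definition (subtract the median, divide by the interquartile range) and is not the pickled scaler itself; the description does not name the library or its settings. The class name `SimpleRobustScaler` and the sample numbers are hypothetical.

```python
import math
import statistics


class SimpleRobustScaler:
    """Scale one variable as (value - median) / IQR, ignoring NaNs when fitting."""

    def fit(self, values):
        finite = [v for v in values if not math.isnan(v)]
        q1, _, q3 = statistics.quantiles(finite, n=4, method="inclusive")
        self.center_ = statistics.median(finite)
        # Fall back to 1.0 so that a constant variable does not divide by zero.
        self.scale_ = (q3 - q1) or 1.0
        return self

    def transform(self, values):
        return [(v - self.center_) / self.scale_ for v in values]


# Fit on made-up training values only, then reuse the fitted statistics.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0, math.nan]
scaler = SimpleRobustScaler().fit(train_values)
scaled_test = scaler.transform([5.0, 7.0])
```

Fitting on the training set and reusing the same statistics for the test set keeps information from the test period out of the scaling.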
Created:
2021-08-05



