Dataset for "Classification of Seismic Events in the Mainland of China Based on Spectrograms and Model Interpretability"
Source: DataCite Commons. Updated 2026-02-09; indexed 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=35afda3bf71d4925a97a5f9057b26cc9
Resource description:
Dataset Description
This dataset contains seismic waveform data collected from the China Digital Seismograph Network for earthquake event classification and machine learning research. The dataset supports the ResWaveQuake model described in the associated publication on time-frequency based earthquake event classification in the mainland of China.

Data Overview
The dataset spans from January 2013 to May 2024, covering the mainland of China and adjacent regions. It includes 2,870 seismic events, containing approximately 100,000 associated waveform traces, categorized into:
- Natural earthquakes (eq): 1,158 events (926 training, 232 testing)
- Explosions (ep): 972 events (778 training, 194 testing)
- Collapses/subsidences (cl): 740 events (592 training, 148 testing)
The dataset is organized using event-level stratified sampling with approximately 80% allocated for training (2,296 events) and 20% for testing (574 events), while maintaining complete event structure.

Data Characteristics and Processing

Data Format and Storage
All waveform data are stored in HDF5 format for efficient access and AI model training. The original MiniSEED format data have been converted and organized by event category, with each HDF5 file containing all events of a specific category and split (e.g., train_eq.h5, test_ep.h5).

Waveform Specifications
- Sampling rate: 100 Hz
- Data length: 200 seconds per record (20,000 samples)
- Time window: 50 seconds before P-wave arrival plus 150 seconds after
- Components: Three-component data (BHE, BHN, BHZ) representing east-west, north-south, and vertical directions
- Component order: Standardized to ENZ order (East, North, Vertical) in HDF5 files
- Frequency range: Band-pass filtered to 0.1-25 Hz
- Signal-to-noise ratio: ≥ 2 for all records
- Normalization: Maximum amplitude normalization applied (data range: [-1, 1])

Quality Assurance
Quality assurance measures include:
- Retaining only events with at least one valid waveform record
- Systematic quality control using STA/LTA detection for noise samples
- Consistent preprocessing standards across all regional data
- Manual expert annotations for all P-wave arrivals
- Support for incomplete component data: Stations with missing components (BHE, BHN, or BHZ) are included with NaN-filled missing components, allowing maximum data retention while maintaining consistent data structure

Incomplete Component Handling
The dataset includes stations with incomplete three-component data:
- Missing components are filled with NaN values
- Each station dataset includes an `available_components` attribute indicating which components are present
- The `has_missing_components` attribute flags stations with incomplete data
- Missing component indices are stored in the `missing_component_indices` attribute
- This approach maximizes data retention while maintaining consistent data structure for machine learning applications

Data Collection
Data were collected from 632 unique seismic stations across the mainland of China, covering:
- Epicentral distances: 0-800 km
- Magnitude range: 0-5
- Geographic coverage: the mainland of China

HDF5 File Structure
The HDF5 files are organized hierarchically:
category_name/        # Category group (eq, ep, or cl)
  event_0/            # Event group (anonymized event ID)
    station_0/        # Station dataset (shape: [3, 20000])
      - Attributes:
        * station_id: station ID (e.g., NX.001)
        * network: Network code (e.g., NX)
        * shape: Data shape [3, 20000]
        * components: Component list ['BHE', 'BHN', 'BHZ']
        * component_order: Component order 'ENZ'
        * available_components: List of available components (e.g., ['BHE', 'BHN', 'BHZ'])
        * has_missing_components: Boolean flag indicating if any components are missing
        * missing_component_indices: List of indices for missing components (0=BHE, 1=BHN, 2=BHZ)
      - Data: numpy array of shape (3, 20000), missing components filled with NaN
- Group attributes:
  * num_events: Number of events in this category
  * total_waveforms: Total number of waveforms
  * category: Category name (eq, ep, or cl)
  * split: Dataset split (train or test)
  * component_order: Global component order 'ENZ'
  * num_unique_stations: Number of unique stations
  * num_complete_component_stations: Number of stations with all three components
  * num_incomplete_component_stations: Number of stations with missing components

Data Source and Acknowledgment
The seismic data are provided by the International Earthquake Science Data Center at the Institute of Geophysics, China Earthquake Administration (DOI: 10.11998/IESDC). The data were produced by the China Earthquake Networks Center and the AH, BJ, CQ, FJ, GD, GS, GX, GZ, HA, HB, HE, HI, HL, HN, JL, JS, JX, LN, NM, NX, QH, SC, SD, SH, SN, SX, TJ, XJ, XZ, YN, ZJ Seismic Networks, China Earthquake Administration.
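The description ties the dataset to time-frequency (spectrogram) based classification but does not say how the spectrograms are computed. As a rough illustration only, the sketch below builds a power spectrogram of one 100 Hz trace with a Hann-windowed short-time Fourier transform in NumPy. The function name, the window length (`nperseg`), and the overlap (`noverlap`) are assumptions chosen for the example, not values from the dataset or the associated publication.

```python
import numpy as np

FS = 100.0  # sampling rate in Hz, as stated in the dataset description


def spectrogram(trace, fs=FS, nperseg=256, noverlap=128):
    """Return (freqs, times, power) for a 1-D trace via a Hann-windowed STFT.

    nperseg and noverlap are illustrative choices, not values from the source.
    """
    trace = np.asarray(trace, dtype=float)
    step = nperseg - noverlap
    n_frames = 1 + (trace.size - nperseg) // step
    window = np.hanning(nperseg)
    # Index matrix of shape (n_frames, nperseg) selecting overlapping segments.
    idx = np.arange(nperseg)[None, :] + step * np.arange(n_frames)[:, None]
    frames = trace[idx] * window
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    times = (np.arange(n_frames) * step + nperseg / 2) / fs  # frame centres, s
    return freqs, times, power.T  # power has shape (n_freqs, n_frames)


# Synthetic stand-in for one component: a 10 Hz sine, 200 s long.
t = np.arange(20000) / FS
demo_trace = np.sin(2 * np.pi * 10.0 * t)
freqs, times, power = spectrogram(demo_trace)
peak_freq = freqs[np.argmax(power.mean(axis=1))]
```

A component that is entirely NaN (a missing component) would yield an all-NaN spectrogram, so such components need to be masked or filled before this step.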
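The per-class counts quoted in the overview can be cross-checked against the stated totals with a few lines of arithmetic. The dictionary below simply restates the numbers from the description; the variable names are illustrative.

```python
# (training, testing) event counts per class, as listed in the overview.
splits = {"eq": (926, 232), "ep": (778, 194), "cl": (592, 148)}

n_train = sum(train for train, _ in splits.values())
n_test = sum(test for _, test in splits.values())
n_total = n_train + n_test
train_fraction = n_train / n_total

# Per-class totals and training fractions, to confirm the split is stratified.
per_class = {
    name: (train + test, train / (train + test))
    for name, (train, test) in splits.items()
}
```

The sums reproduce the stated 2,296 training events, 574 testing events, and 2,870 events overall, and each class keeps a training share close to 80%, which is consistent with event-level stratified sampling.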
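The waveform specifications imply a fixed sample layout: 200 s at 100 Hz gives 20,000 samples, and a window starting 50 s before the P arrival places the pick near sample index 5000. The sketch below builds a time axis relative to P and shows one way maximum-amplitude normalization could be applied. It assumes the pick falls exactly on sample 5000 and that normalization is done per record across all components; the description does not state either detail, and the helper name is hypothetical.

```python
import numpy as np

FS = 100              # sampling rate in Hz
PRE_P_SECONDS = 50    # window starts this long before the P arrival
RECORD_SECONDS = 200  # total record length

n_samples = FS * RECORD_SECONDS   # 20000 samples per component
p_index = FS * PRE_P_SECONDS      # assumed sample index of the P arrival
time_rel_p = np.arange(n_samples) / FS - PRE_P_SECONDS  # seconds relative to P


def normalize_max_amplitude(record):
    """Scale a (3, n) record into [-1, 1] by its largest absolute finite value.

    Assumption: one scale factor per record. NaN-filled components stay NaN.
    """
    record = np.asarray(record, dtype=float)
    finite = np.isfinite(record)
    if not finite.any():
        return record.copy()
    peak = np.abs(record[finite]).max()
    return record / peak if peak > 0 else record.copy()
```

Since the published data are already normalized, the helper is mainly useful for verifying the stated [-1, 1] range or for re-normalizing after further filtering.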
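The quality assurance list mentions STA/LTA detection without giving window lengths or thresholds. For readers unfamiliar with the technique, here is a minimal sketch of the classic (non-recursive) STA/LTA ratio on signal energy. The window lengths and the synthetic test signal are assumptions for illustration, and this is not presented as the authors' implementation.

```python
import numpy as np


def classic_sta_lta(trace, n_sta, n_lta):
    """Short-term / long-term average energy ratio at each sample.

    Both windows end at the current sample. The first n_lta - 1 samples,
    where the long window does not yet fit, are left at 0.
    """
    if not 0 < n_sta <= n_lta:
        raise ValueError("require 0 < n_sta <= n_lta")
    energy = np.asarray(trace, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(energy)))
    ratio = np.zeros(energy.size)
    end = np.arange(n_lta, energy.size + 1)  # exclusive end index of each window
    sta = (csum[end] - csum[end - n_sta]) / n_sta
    lta = (csum[end] - csum[end - n_lta]) / n_lta
    ratio[n_lta - 1:] = np.divide(sta, lta, out=np.zeros_like(sta), where=lta > 0)
    return ratio


# Synthetic example: low-level noise with a 5 Hz burst starting at sample 3000.
rng = np.random.default_rng(0)
trace = rng.normal(0.0, 0.1, 6000)
trace[3000:3200] += 5.0 * np.sin(2 * np.pi * 5.0 * np.arange(200) / 100.0)
ratio = classic_sta_lta(trace, n_sta=50, n_lta=1000)  # 0.5 s and 10 s at 100 Hz
```

On pure noise the ratio stays near 1, while an impulsive arrival pushes it well above that, which is why a threshold on this ratio is commonly used to tell signal windows from noise windows.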
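Because missing components are stored as NaN, downstream code has to handle them explicitly before feeding records to a model. The sketch below derives the same information as the `available_components`, `has_missing_components`, and `missing_component_indices` attributes directly from an array, and shows zero-filling as one possible strategy. The helper names are hypothetical, and zero-filling is an assumption made for illustration; the description does not say how the authors treat NaN components during training.

```python
import numpy as np

COMPONENTS = ("BHE", "BHN", "BHZ")  # ENZ order: indices 0, 1, 2


def describe_components(record):
    """Recompute the missing-component metadata from a (3, n) NaN-padded record."""
    record = np.asarray(record, dtype=float)
    missing = [i for i in range(record.shape[0]) if np.isnan(record[i]).all()]
    available = [COMPONENTS[i] for i in range(record.shape[0]) if i not in missing]
    return {
        "available_components": available,
        "has_missing_components": bool(missing),
        "missing_component_indices": missing,
    }


def fill_missing_with_zeros(record):
    """Return a copy with NaN samples replaced by 0.0 (one possible strategy)."""
    record = np.asarray(record, dtype=float)
    return np.where(np.isnan(record), 0.0, record)


# Example: a record whose north component (index 1) is missing.
example = np.zeros((3, 20000))
example[1] = np.nan
info = describe_components(example)
filled = fill_missing_with_zeros(example)
```

In practice the stored attributes can be read instead of recomputed; the check above is mainly a way to confirm that the attributes and the array contents agree.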
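The category / event / station hierarchy described above can be traversed with two nested loops. The sketch below is one way to do that with h5py, assuming the layout is exactly as described, that the category name (eq, ep, or cl) is a top-level group, and that the group-level attributes sit on that category group. The function names are hypothetical, and h5py is imported lazily so that the iterator works with any object exposing the same mapping interface.

```python
def iter_station_records(category_group):
    """Yield (event_id, station_name, attrs, dataset) for every station dataset.

    Accepts an h5py.Group or any mapping with the same nested structure.
    """
    for event_id, event_group in category_group.items():
        for station_name, dataset in event_group.items():
            yield event_id, station_name, dict(dataset.attrs), dataset


def load_category(path, category):
    """Read all station records of one category from one file into memory.

    Assumes h5py is installed and the file follows the layout described above.
    Returns (group_attrs, records), where each record is a tuple of
    (event_id, station_id, array of shape (3, 20000)).
    """
    import h5py  # lazy import: only needed when reading a real file

    records = []
    with h5py.File(path, "r") as f:
        group = f[category]
        group_attrs = dict(group.attrs)  # e.g. num_events, split, category
        for event_id, _, attrs, dataset in iter_station_records(group):
            records.append((event_id, attrs.get("station_id"), dataset[...]))
    return group_attrs, records
```

A call such as `load_category("train_eq.h5", "eq")` would then return the group attributes together with one array per station record. With roughly 100,000 traces in total, loading a whole file at once may be memory-heavy, so iterating lazily with `iter_station_records` may be preferable for training pipelines.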
Provider:
Science Data Bank
Created:
2025-12-30



