Automating the interpretation of PM2.5 time-resolved measurements using a data-driven approach

NIAID Data Ecosystem · Indexed 2026-03-12

Download link:

http://datadryad.org/dataset/doi%253A10.7941%252FD1HG9J


Description:

The rapid development of automated measurement equipment enables researchers to collect ever larger quantities of time-resolved data from indoor and outdoor environments, and interpreting the resulting data can be time-consuming. This dataset contains R code and time-resolved indoor and outdoor PM2.5 data that illustrate a machine learning approach based on Random Forest (RF). The method is used to study a dataset of 836 emission events that occurred over a two-week monitoring period in each of 18 apartments in California. The resulting RF model is then applied to PM2.5 data from an entirely separate dataset collected in 65 new homes in California, where it identifies 442 indoor emission events, with a few misidentifications. In the accompanying paper, we present the RF model development and evaluate its performance as the sample size and source vary. We discuss which characteristics of the dataset tended to help source identification, and why. For example, we show that data from many events and from different apartments are essential for the model to be suitable for analyzing the new, separate dataset. We also show that longitudinal data appear to be more helpful than the time frequency of measurements in a given apartment.

Methods

The ‘Dataset’ directory contains two datasets of indoor and outdoor PM2.5 data that were previously collected in field studies conducted by our research group at the Lawrence Berkeley National Laboratory. Dataset 1 contains PM2.5 data collected by Noris et al. (2013) during two weeks of monitoring in 18 low-income apartments in California. Dataset 1 is used as the training dataset; its indoor PM emission events were previously analyzed by Chan et al. (2018) using a rule-based method. Dataset 2 contains PM2.5 data collected by Singer et al. (2020) from 65 new California single-family homes, each monitored for one week.
The 18 apartments in Dataset 1 are identified by building number (‘Bldg’ = 1, 2, or 3), apartment number (‘Apt’ = 1 to 6), and whether the data were collected before (‘Period’ = 1) or after (‘Period’ = 2) the retrofit. The 65 single-family homes in Dataset 2 are identified by building number (‘Bldg’). For Dataset 2, an adjustment factor of 1.23 was applied to the indoor PM2.5 concentration “data_value_raw” measured using a photometer; see Singer et al. (2020) for details. The PM2.5 concentrations in Dataset 1 already incorporate an adjustment factor; see Chan et al. (2018) for details.

Both datasets were processed to calculate the following “features”, some of which were used in the Random Forest model.

Indoor_value: the indoor PM2.5 concentration (ug/m3).
Back_diff_x, where x = 1, 2, 3, 4, 5, and 10: the backward difference in indoor PM2.5 (ug/m3) relative to the value x timesteps before the current one.
Front_diff_x, where x = 1, 2, 3, 4, 5, and 10: the forward difference in indoor PM2.5 (ug/m3) relative to the value x timesteps after the current one.
Variance_y_min, where y = 4, 8, 12, and 16: the standard deviation of indoor PM2.5 (ug/m3) over a y-minute window centered on the current timestep.
Outdoor_value: the outdoor PM2.5 concentration (ug/m3).
Outdoor_hourly: the 1-hour average outdoor PM2.5 (ug/m3), calculated from the data in the hour ending at the current timestep.
Extreme_point: a data flag; 1 = the indoor PM2.5 at the current timestep is a local minimum or maximum, 0 = it is not.
Extreme_forward: the indoor PM2.5 concentration (ug/m3) at the next local minimum or maximum datapoint.
Extreme_backward: the indoor PM2.5 concentration (ug/m3) at the previous local minimum or maximum datapoint.
Extreme_diff = Extreme_forward - Extreme_backward: the difference in indoor PM2.5 (ug/m3) between those two local minimum or maximum datapoints.
Extreme_forward_outdoor: the outdoor PM2.5 (ug/m3) at the next local minimum or maximum datapoint.
Extreme_backward_outdoor: the outdoor PM2.5 (ug/m3) at the previous local minimum or maximum datapoint.

In addition to the above, the training Dataset 1 also contains the following data flags, which were determined previously by Chan et al. (2018) using the rule-based method.

Emission: a data flag indicating whether the current datapoint was part of an indoor emission event (1 = yes, 0 = no).
Backward_E: a data flag indicating whether the previous local minimum or maximum datapoint was part of an indoor emission event (1 = yes, 0 = no).
Forward_E: a data flag indicating whether the next local minimum or maximum datapoint was part of an indoor emission event (1 = yes, 0 = no).
Decay: a data flag indicating whether the current datapoint was part of a decay period following an indoor emission (1 = yes, 0 = no).

The ‘Dataset’ directory contains a third input file, ‘Dataset2_Volume.csv’. The file provides the approximate well-mixed air volume of the 65 single-family homes, which is needed to compute indoor PM2.5 emission rates for Dataset 2. The well-mixed air volume (ft3) is computed as ‘FloorArea_sqft’ x ‘CeilingHgt_ft’ x ‘Factor’. ‘Factor’ is the percentage of the house air volume in the vicinity of the photometer used to measure indoor PM2.5, within which the PM2.5 concentration was assumed to be well mixed during the indoor emission event and decay period.
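The volume calculation above can be illustrated with a minimal Python sketch. It is not the authors' R code. The column names ‘FloorArea_sqft’, ‘CeilingHgt_ft’, and ‘Factor’ come from the description; the presence of a ‘Bldg’ column in ‘Dataset2_Volume.csv’, the storage of ‘Factor’ as a fraction between 0 and 1 rather than a percentage, and all sample values are assumptions made for illustration only.

```python
import csv
import io

# Cubic feet to cubic metres (1 ft = 0.3048 m exactly).
FT3_TO_M3 = 0.3048 ** 3

def well_mixed_volume_ft3(floor_area_sqft, ceiling_hgt_ft, factor):
    """Return FloorArea_sqft x CeilingHgt_ft x Factor in cubic feet.

    Assumption: factor is a fraction in (0, 1], not a percentage.
    """
    if not 0 < factor <= 1:
        raise ValueError("factor is assumed to be a fraction between 0 and 1")
    return floor_area_sqft * ceiling_hgt_ft * factor

# Hypothetical rows; these values are NOT taken from Dataset2_Volume.csv.
SAMPLE_CSV = """Bldg,FloorArea_sqft,CeilingHgt_ft,Factor
1,2000,9,0.5
2,1500,8,1.0
"""

def volumes_by_building(csv_text):
    """Map each building number to its well-mixed volume in ft3."""
    volumes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        volumes[int(row["Bldg"])] = well_mixed_volume_ft3(
            float(row["FloorArea_sqft"]),
            float(row["CeilingHgt_ft"]),
            float(row["Factor"]),
        )
    return volumes

volumes = volumes_by_building(SAMPLE_CSV)
for bldg, volume_ft3 in volumes.items():
    # Report both ft3 and the SI equivalent.
    print(bldg, volume_ft3, round(volume_ft3 * FT3_TO_M3, 1))
```

If the real file stores ‘Factor’ as a percentage, divide it by 100 before calling the function.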
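The difference, variability, and extreme-point features listed earlier can be sketched the same way, again in Python rather than the authors' R code. Several details are assumptions, because the description does not fix them: the sign convention of the forward difference, the use of the sample standard deviation (the default of R's sd()), a window expressed as a number of samples on each side of the current timestep, and a strict comparison with both neighbours to flag a local minimum or maximum. The input series is hypothetical.

```python
import statistics

def back_diff(values, x):
    """v[t] - v[t-x]; None where no sample exists x steps earlier."""
    return [values[t] - values[t - x] if t >= x else None
            for t in range(len(values))]

def front_diff(values, x):
    """v[t+x] - v[t] (assumed sign); None where no sample exists x steps later."""
    n = len(values)
    return [values[t + x] - values[t] if t + x < n else None
            for t in range(n)]

def centered_std(values, half_window):
    """Sample standard deviation over a window centered on each timestep.

    half_window is the number of samples on each side; incomplete
    windows at the edges of the series yield None.
    """
    if half_window < 1:
        raise ValueError("half_window must be at least 1")
    n = len(values)
    out = []
    for t in range(n):
        lo, hi = t - half_window, t + half_window
        if lo < 0 or hi >= n:
            out.append(None)
        else:
            out.append(statistics.stdev(values[lo:hi + 1]))
    return out

def extreme_points(values):
    """1 where a sample is strictly above or strictly below both neighbours."""
    flags = [0] * len(values)
    for t in range(1, len(values) - 1):
        prev, cur, nxt = values[t - 1], values[t], values[t + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            flags[t] = 1
    return flags

# Hypothetical indoor PM2.5 series (ug/m3), one value per timestep.
indoor = [5.0, 5.0, 9.0, 20.0, 14.0, 10.0, 8.0]

print(back_diff(indoor, 1))    # [None, 0.0, 4.0, 11.0, -6.0, -4.0, -2.0]
print(front_diff(indoor, 2))   # [4.0, 15.0, 5.0, -10.0, -6.0, None, None]
print(extreme_points(indoor))  # [0, 0, 0, 1, 0, 0, 0]
print(centered_std(indoor, 1))
```

With 1-minute data, a y-minute window would correspond to roughly y/2 samples on each side; the actual timestep and window handling should be checked against the R code in the dataset.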

Created:

2020-12-10
