Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates

Mendeley Data · Updated 2024-06-22 · Indexed 2024-06-28

Download link:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.m63xsj47s

Resource description:

# Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates

This dataset contains data used to train the models in [Greenhill et al. (2023)](https://doi.org/10.1126/science.adi3794). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to the full set of pre-processed inputs for model training via this repository. We are also providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper, in another Dryad repository: . All code written for the project is available at .

## Description of the data and file structure

The files here include:

* Trained models, saved in PyTorch Checkpoint format: `wotus_model.pth.tar`, `resource_type_model.pth.tar`, `cowardin_code_model.pth.tar`, `ajd_model.pth.tar`.
* Train test splits and inputs to their creation:
  * `train_test_split_naip_only.csv`: The split used for the Waters of the United States (WOTUS) model
  * `ajd_train_test_split.csv`: The split used for the Approved Jurisdictional Determination (AJD) model
  * `NAIP_tiles.zip`: a shapefile of the geographic footprints of the imagery tiles, which are used to group overlapping footprints
* Raw data: `raw_ajd_data.zip`, `raw_resourcetype_cowardincode_data.zip`
* Processed input data, including all the layers used to train and evaluate the model. These have been normalized and augmented, and are saved in a compressed `.npz` format: `train_val_data.zip` and `test_data.zip`.

## Description of file contents

Each set of files described in the bulleted list above has contents that are structured in similar ways and contain similar information. The contents for each category are described in more detail below, as appropriate. Unless otherwise noted, any N/A or null values represent missing values.

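As a minimal sketch of the missing-value convention just stated, the snippet below reads CSV rows with Python's standard library and maps missing-value tokens to `None`. The exact spellings in `MISSING_TOKENS` are an assumption, since the dataset description only says "N/A or null". The helper names `clean` and `read_rows` are hypothetical, and the two sample rows are made up for illustration; their column names are borrowed from the split files described below.

```python
import csv
import io

# Spellings treated as missing. This set is an assumption, not taken from the dataset.
MISSING_TOKENS = {"", "N/A", "NA", "null", "NULL"}


def clean(value):
    """Map a raw CSV field to None when it looks like a missing value."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip() in MISSING_TOKENS:
        return None
    return value


def read_rows(handle):
    """Yield each CSV row as a dict with missing-value tokens replaced by None."""
    for row in csv.DictReader(handle):
        yield {key: clean(value) for key, value in row.items()}


# Hypothetical two-row sample standing in for a real file from the repository.
sample = io.StringIO(
    "pointid,projectid,split_group\n"
    "101,N/A,train\n"
    "102,,test\n"
)
rows = list(read_rows(sample))
missing_projectid = sum(row["projectid"] is None for row in rows)
print(missing_projectid)
```

To apply the same helper to a real file, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.
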
### Trained models

These are saved PyTorch objects. To load the model, instantiate a ResNet-18, then load the checkpoint using `torch.load`. For details, see the [PyTorch documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html). For an example, see `4_dl_models/wotus/predict/predict_grid.py` in the [code repository](https://zenodo.org/records/10108709).

### Train test splits and inputs to their creation

* `train_test_split_naip_only.csv`:
  * `pointid`: the AJD point id
  * `jdid`: the id assigned by the Army Corps of Engineers (ACE)
  * `projectid`: project id assigned by ACE
  * `split_group`: the split assignment (train, test, or val)
  * `ace_district`: the ACE district of the pointid
  * `rule`: the WOTUS rule used to decide the AJD
  * `wotus`: the WOTUS decision (Yes or No)
* `ajd_train_test_split.csv`:
  * `id`: the point id. The format of this field is `ajd_XXXXXX` if the point is drawn from the AJD dataset and `pred_XXXXXX` if the point is drawn from the 4 million prediction points
  * `pointid`: the AJD point id if the point is from the AJD dataset, empty otherwise
  * `projectid`: the AJD project id if the point is from the AJD dataset, empty otherwise
  * `ajd`: 1 if the point is from the AJD dataset, 0 if it is from the 4 million prediction points
  * `prcss_r`: the grid number (1 through 50) if the point is from the 4 million prediction points, empty otherwise
  * `group_id`: the id of the group of overlapping points
  * `split_group`: the split assignment (train, test, or val)
* `NAIP_tiles.zip`: a shapefile containing the geographic footprints of the NAIP tiles

### Raw data files

* `raw_ajd_data.zip`: a zip archive containing AJD data.
  * `jds202205312309_clean.csv`; `jds202204211420_clean.csv`:
    * `JD ID`: the id assigned by ACE
    * `Agency`: the agency making the jurisdictional determination
    * `Project ID`: the project id assigned by ACE
    * `District or Region`: the ACE district
    * `JD Basis`: the WOTUS rule used as the basis for the determination
    * `PDF Link`: a link to a pdf of the determination, if available
    * `Finalized Date`: date the determination was finalized
    * `Closure Method`: Whether the determination required a field visit or not
    * `Waters Name`: the name of the water resource evaluated for the determination
    * `Resource Types`: the short code describing the resource type
    * `Resource Type Description`: a longer description of the resource type
    * `Water of the U.S.`: WOTUS decision (Yes or No)
    * `Cowardin Code`: the Cowardin code
    * `Cowardin Category`: The Cowardin category
    * `Cowardin Description`: the description of the Cowardin category
    * `Longitude`: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
    * `Latitude`: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
    * `State`: US state name
    * `County`: US county name
  * `jds_wet_dry_season_clean_v2.csv`:
    * `jdid`: the id assigned by ACE
    * `agency`: the agency making the jurisdictional determination
    * `projectid`: the project id assigned by ACE
    * `districtorregion`: the ACE district
    * `jdbasis`: the WOTUS rule used as the basis for the determination
    * `pdflink`: a link to a pdf of the determination, if available
    * `finalizeddate`: date the determination was finalized
    * `closuremethod`: Whether the determination required a field visit or not
    * `watname`: the name of the water resource evaluated for the determination
    * `resourcetypes`: the short code describing the resource type
    * `resourcetypedescription`: a longer description of the resource type
    * `wateroftheus`: WOTUS decision (Yes or No)
    * `cowardincode`: the Cowardin code
    * `cowardincategory`: The Cowardin category
    * `cowardindescription`: the description of the Cowardin category
    * `longitude`: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
    * `latitude`: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
    * `state`: US state name
    * `county`: US county name
* `raw_resourcetype_cowardincode_data.zip`: a zip archive containing Cowardin code and resource type data for the AJDs.
  * `ajds_with_resourceTypes_for_multiTaskLearning.csv`:
    * `ai_cowardin`: a 9-class categorization of Cowardin codes (see table S1)
    * `ai_resourceType`: a 9-class categorization of resource types (see table S2)
    * All other columns same as above.
  * `pointid_resourcetype_crosswalk.csv`:
    * `pointid`: the AJD pointid.
    * `ai_cowardin`: a 9-class categorization of Cowardin codes (see table S1)
    * `cowardin_numeric`: a numeric encoding of `ai_cowardin`
    * `cowardin_simple`: a 4-class categorization of Cowardin codes into wetland, stream, or other. Note this is not used in the paper.
    * `ai_resourcetype`: a 9-class categorization of resource types (see table S2)
    * `resource_numeric`: a numeric encoding of `ai_resourcetype`

### Processed input data

The files in this category are `train_val_data.zip` (training and validation data) and `test_data.zip` (test data). Each of these is a zipped directory containing input layers that are fed into WOTUS-ML. All files are NumPy arrays saved as `.npz` files. The file naming convention is `{pointid}.npz`, where `pointid` is the point id. In addition, there is a dictionary saved as a pickle file, `data_dict.p`, containing metadata about the files. This dictionary is populated automatically by the code that creates the input layers (see `3_src/data.py` in the [code repository](https://zenodo.org/records/10108709)).
The dictionary is reproduced below:

```
{'naip_ids': [0, 1, 2, 3],
 'nwi_id': [4],
 'nhd_ids': [5, 6, 7, 8, 9],
 'dem_id': [10],
 'ecoregion_id': [11],
 'nlcd_id': [12],
 'prism_ids': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
 'gnatsgo_ids': [23, 24, 25, 26, 27],
 'district_dummies_id': [28],
 'rule_dummies_id': [29, 30, 31],
 'hq_dist_id': [32],
 'state_id': [33],
 'augment': True,
 'normalize': True}
```

The keys are the names of the input layers, and the values are lists of the indices of the layers in the NumPy array. The `augment` and `normalize` keys indicate whether the data were augmented and normalized, respectively. Augmentation includes random rotation and flipping. Normalization is done using the mean and standard deviation of each layer in the training data. The values used for normalization are stored in the folder `4_dl_models/wotus/train/layer_mean_sd` in the [code repository](https://zenodo.org/records/10108709). More details about the layers are provided below. Also see SM Table S3.

* `naip_ids`: the Red, Green, Blue and Near Infrared channels from National Agriculture Imagery Program (NAIP) imagery
* `nwi_id`: wetland types from the National Wetlands Inventory (NWI). The mapping from wetland types to numbers is: estuarine and marine deepwater = 1; estuarine and marine wetland = 2; freshwater emergent wetland = 3; freshwater forested/shrub wetland = 4; freshwater pond = 5; lake = 6; riverine = 7; other = 8.
* `nhd_ids`: features from the National Hydrography Dataset (NHD), including Fcode (the water type feature, e.g. perennial stream or intermittent stream); Path length (the distance of the NHD flowline); stream order; high flow (the maximum flow value for this water segment); and low flow (the minimum flow value for this water segment).
* `dem_id`: elevation of the point above sea level, in meters, from the USGS 3-D Elevation Program's digital elevation model (DEM).
* `ecoregion_id`: information about the point's ecoregion from the US EPA Level IV Ecoregions.
* `nlcd_id`: the point's land cover class, taken from 20 land cover classes from the National Land Cover Database, including open water, ice/snow, four classes of developed land (open, low, medium, and high), barren, three forest classes (evergreen, deciduous, mixed), two scrub classes (dwarf, shrub), four herbaceous classes (grassland, sedge, moss, lichen), two agricultural classes (pasture/hay, cultivated), and two wetland classes (woody, emergent herbaceous).
* `prism_ids`: 30-year climate normals at the point from the Parameter-elevation Regressions on Independent Slopes Model (PRISM), including long-run averages of minimum, maximum, and mean temperature; mean dew point temperature; minimum and maximum vapor pressure deficit; clear sky and total solar radiation; cloudiness; and average annual total precipitation.
* `gnatsgo_ids`: soil information about the point from the gridded National Soil Survey Geographic Database (gNATSGO), including taxonomic class, hydric rating (whether the map unit is a "hydric soil"), flooding frequency (the annual probability of a flood event), ponding frequency (the number of times ponding occurs per year), and water table depth (the shallowest depth, in centimeters, to a wet soil layer).
* `district_dummies_id`: which ACE district each point is located in. These are encoded numerically; see `2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R` in the [code repository](https://zenodo.org/records/10108709) for details.
* `rule_dummies_id`: each WOTUS rule. These are stored in subdirectories for each rule: `CWR`, `NWPR`, and `Rapanos`, and consist of arrays of a single value, with a value of 1 for Rapanos, a value of 2 for the CWR, and a value of 3 for NWPR.
* `hq_dist_id`: the distance, in meters, from the point to the headquarters of the ACE district the point is in.
* `state_id`: which US state, as defined by the Topologically Integrated Geographic Encoding and Referencing System (TIGER)/Line State boundaries, each point is located in. These are encoded numerically; see `2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R` in the [code repository](https://zenodo.org/records/10108709) for details.
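
The index lists in the metadata dictionary can be used to pull one group of layers out of a point's array. The snippet below is a minimal sketch under stated assumptions: the index map is copied from the dictionary reproduced above, but the channel axis (`channel_axis=0`), the 8 x 8 tile size, and the helper name `select_layers` are assumptions for illustration. A synthetic array stands in for a real `{pointid}.npz` file.

```python
import numpy as np

# Layer index map copied from the metadata dictionary above
# (the `augment` and `normalize` flags are omitted).
DATA_DICT = {
    "naip_ids": [0, 1, 2, 3],
    "nwi_id": [4],
    "nhd_ids": [5, 6, 7, 8, 9],
    "dem_id": [10],
    "ecoregion_id": [11],
    "nlcd_id": [12],
    "prism_ids": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
    "gnatsgo_ids": [23, 24, 25, 26, 27],
    "district_dummies_id": [28],
    "rule_dummies_id": [29, 30, 31],
    "hq_dist_id": [32],
    "state_id": [33],
}


def select_layers(stack, key, channel_axis=0):
    """Return the sub-array holding the layers listed under `key`.

    `channel_axis` is an assumption; check the real arrays' shape first.
    """
    return np.take(stack, DATA_DICT[key], axis=channel_axis)


# Total number of layers implied by the index map.
n_layers = 1 + max(i for ids in DATA_DICT.values() for i in ids)

# Synthetic stand-in for one point's array. The 8 x 8 tile size is made up.
stack = np.zeros((n_layers, 8, 8), dtype=np.float32)

naip = select_layers(stack, "naip_ids")    # four NAIP bands
prism = select_layers(stack, "prism_ids")  # ten PRISM climate normals
print(n_layers, naip.shape, prism.shape)
```

For a real file, `np.load` on a `.npz` path returns an archive whose `files` attribute lists the stored array names; which name holds the layer stack is not stated here, so inspect it before indexing. The pickled `data_dict.p` can be read with `pickle.load`, which should only be used on files from a trusted source.
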

Created:

2023-12-14