Prediction data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates

Mendeley Data · Updated 2024-06-22 · Indexed 2024-06-28

Download link:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.z34tmpgm7

Resource description:

# Prediction data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates

This dataset contains data used to produce the predictions and other results reported in [Greenhill et al. (2023)](https://doi.org/10.1126/science.adi3794). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, this repository provides a subset of the data used for prediction, as well as all data needed to reproduce the results of the paper. All data used to train the models are available in another Dryad repository: [https://doi.org/10.5061/dryad.2rbnzs7v0](https://doi.org/10.5061/dryad.2rbnzs7v0). All code written for the project is available at [https://doi.org/10.5281/zenodo.10108709](https://doi.org/10.5281/zenodo.10108709).

## Description of the data and file structure

The files here include:

* Model predictions, including for grid points, PJDs, Sackett points, and navigable waters: `wotus_model_predictions.zip`, `resource_type_model_predictions.zip`, `cowardin_code_predictions.zip`, `ajd_model_predictions.zip`
* Input layers used for prediction:
  * `grid_1.zip`: a random sample of approximately 80,000 of the prediction grid points
  * `navigable_water_prediction_points.zip`
  * `Sackett_prediction_points.zip`
  * `PJD_prediction_points.zip`
* Auxiliary data:
  * `text_questions_data.zip`: Data used for producing in-text statistics
  * `ajd_point_intersections.csv`: Intersections between AJDs and various other geophysical layers such as NWI polygons, NHD flowlines, urban growth areas, etc.
  * `prediction_points_to_drop.zip`: `pickle` objects containing IDs of missing and/or corrupted prediction points. These can occur when a layer is missing data for the requested prediction point.
  * `prediction_point_metadata.csv`, `sackett_metadata.csv`: files containing metadata about the 4 million and Sackett prediction points, respectively, including state, ACE district, and HQ distance. This information is used to create ordinal layers. Note that `PJD_prediction_points.zip` and `navigable_water_prediction_points.zip` contain analogous files, `pjd_metadata.csv` and `nav_water_metadata`, for those sets of prediction points.
* Data for creating displays: `table*_data.zip`, `figure*_data.zip`

## Description of file contents

Each set of files described in the bulleted list above has contents that are structured in similar ways and contain similar information. The contents for each category are described in more detail below, as appropriate. Unless otherwise noted, any N/A or null values represent missing values.

### Model predictions

The files in this category are `wotus_model_predictions.zip`, `resource_type_model_predictions.zip`, `cowardin_code_predictions.zip`, `ajd_model_predictions.zip`. Each of these is a zipped directory containing model outputs (predictions) from each model. The directories and the files in each of them are:

* `wotus_model_predictions`: predictions from WOTUS-ML. There are files for each of the 50 grids (e.g., `grid1_predictions_Rapanos.csv`); the points on traditional navigable waters (e.g., `nav_water_predictions_Rapanos.csv`); the 3,000 points near the Sackett property (e.g., `Sackett_predictions_Rapanos.csv`); the 101,000 preliminary jurisdictional determinations (PJDs); and the training, validation, and test sets of Approved Jurisdictional Determinations (AJDs). See SM section A.3 for details on the AJD and PJD data and SM section A.5 for details on the other prediction points. For each set of prediction points, there are three files, corresponding to each of the three WOTUS rules we analyze.
We denote the rules using suffixes: `_Rapanos` for *Rapanos*, `_CWR` for the Clean Water Rule, and `_NWPR` for the Navigable Waters Protection Rule.
  * Each of the grid, navigable waters, and Sackett predictions files has the same two columns:
    * `pointid`: the point identifier
    * `probability_wotus`: the WOTUS-ML model score (a number between 0 and 1).
  * The training, validation, and test set predictions have the following columns:
    * `pointid`: same as above
    * `probability_wotus`: same as above
    * `predictions`: the rounded WOTUS-ML model score (0 or 1)
    * `labels`: the WOTUS decision from the AJD (0 or 1)
    * `preds_batch`: the batch number (training and validation predictions only)
    * `epoch`: the epoch (training and validation predictions only).
  * The file `all_preds.shp`, along with its auxiliary files, is a shapefile containing the combined grid predictions for all rules and grids. The columns of the shapefile's attribute table are:
    * `grid_cell`: the grid cell number
    * `process_or`: the grid number (1 through 50)
    * `lon`: longitude
    * `lat`: latitude
    * `prediction_id`; `pointid`: the prediction pointid
    * `Rapanos_pr`, `CWR_prob`, `NWPR_prob`: the WOTUS-ML score for Rapanos, CWR, and NWPR, respectively.
    * `district`: the Army Corps of Engineers (ACE) district, using the district abbreviation.
  * In addition, the training and validation predictions have columns `ajd_predictions`, the rounded WOTUS-ML model score (0 or 1) and
* `ajd_model_predictions`: predictions from AJD-ML. Files include `train_preds.csv`, `val_preds.csv`, and `test_set_predictions.csv`. These contain predictions for the training, validation, and test set, respectively. Variables in each file include:
  * `pointid`: the point identifier.
  * `probability_ajd`: the AJD-ML model score (number between 0 and 1).
  * `ajd_predictions`: the rounded AJD-ML model score (0 or 1).
  * `ajd_labels`: the label (0 or 1).
  * `epoch`: the epoch for which the other values were calculated.
Note that `test_set_predictions.csv` does not have this column, as prediction on the test set was done only after model training, using the best model.

### Input layers used for prediction

The files in this category are `grid_1.zip`, `navigable_water_prediction_points.zip`, `Sackett_prediction_points.zip`, and `PJD_prediction_points.zip`. Each of these is a zipped directory containing input layers that are fed into WOTUS-ML to produce predictions. All files are formatted as Nx512x512 arrays centered at the point of interest and saved as TIFF files. Some files contain a single layer (N=1), but others contain up to 9 layers (N=9). The files are named according to their prediction ids, which are described in the metadata files (see the "Metadata" section below). For more details on the variables, see Table S2 in the Supplementary Materials. The subfolders are:

* `NAIP`: 4x512x512 arrays containing the Red, Green, Blue and Near Infrared channels from National Agricultural Imagery Program (NAIP) imagery
* `NWI`: 1x512x512 arrays containing wetland types from the National Wetlands Inventory (NWI). The mapping from wetland types to numbers is: estuarine and marine deepwater = 1; estuarine and marine wetland = 2; freshwater emergent wetland = 3; freshwater forested/shrub wetland = 4; freshwater pond = 5; lake = 6; riverine = 7; other = 8.
* `NHD`: 5x512x512 arrays containing features from the National Hydrography Dataset (NHD), including Fcode (the water type feature, e.g., perennial stream or intermittent stream); Path length (the distance of the NHD flowline); stream order; high flow (the maximum flow value for this water segment); and low flow (the minimum flow value for this water segment).
* `DEM`: 1x512x512 arrays containing elevation of the point above sea level, in meters, from the USGS 3-D Elevation Program's digital elevation model (DEM).
* `PRISM`: 9x512x512 arrays containing 30-year climate normals at the point from the Parameter-elevation Regressions on Independent Slopes Model (PRISM), including long-run averages of minimum, maximum, and mean temperature; mean dew point temperature; minimum and maximum vapor pressure deficit; clear sky and total solar radiation; and cloudiness.
* `PPT`: 1x512x512 arrays containing average annual total precipitation at the point, from PRISM.
* `Ecoregions`: 1x512x512 arrays containing information about the point's ecoregion from the US EPA Level IV Ecoregions.
* `gNATSGO`: 5x512x512 arrays containing soil information about the point from the gridded National Soil Survey Geographic Database (gNATSGO), including taxonomic class, hydric rating (whether the map unit is a "hydric soil"), flooding frequency (the annual probability of a flood event), ponding frequency (the number of times ponding occurs per year), and water table depth (the shallowest depth, in centimeters, to a wet soil layer).
* `NLCD`: 1x512x512 arrays containing the point's land cover class, taken from 20 land cover classes from the National Land Cover Database, including open water, ice/snow, four classes of developed land (open, low, medium, and high), barren, three forest classes (evergreen, deciduous, mixed), two scrub classes (dwarf, shrub), four herbaceous classes (grassland, sedge, moss, lichen), two agricultural classes (pasture/hay, cultivated), and two wetland classes (woody, emergent herbaceous).
* `ACE_districts`: 1x512x512 arrays corresponding to which ACE district each point is located in. These are encoded numerically; see `2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R` in the [code repository](https://zenodo.org/records/10108709) for details.
* `states`: 1x512x512 arrays corresponding to which US state, as defined by the Topologically Integrated Geographic Encoding and Referencing System (TIGER)/Line State boundaries, each point is located in.
These are encoded numerically; see `2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R` in the [code repository](https://zenodo.org/records/10108709) for details.
* `rules`: 1x512x512 arrays corresponding to each WOTUS rule. These are stored in subdirectories for each rule: `CWR`, `NWPR`, and `Rapanos`, and consist of arrays of a single value, with a value of 1 for Rapanos, a value of 2 for the CWR, and a value of 3 for NWPR.
* `dist_HQ`: 1x512x512 arrays containing the distance, in meters, from the point to the headquarters of the ACE district the point is in.

In addition, `grid_1.zip`, `navigable_water_prediction_points.zip`, and `PJD_prediction_points.zip` contain information about points that should be dropped from the metadata because they are missing from the input layers or are corrupted files, which can occur if the requested area is in an ocean, Great Lake, or restricted military area. These files contain numpy arrays of the points to be dropped. For more details, see the prediction code under `4_dl_models` in the [code repository](https://zenodo.org/records/10108709).

### Auxiliary data

The contents of the auxiliary files include:

* `text_questions_data.zip`:
  * `DeregulatedPointsRapanosToNWPR.csv`: prediction points from the 4 million grid points which are deregulated between Rapanos and NWPR. Column names are the same as column names in `all_preds.shp`.
  * `prediction_point_metadata.csv`: state, district, and distance to headquarters information for the 4 million grid points. Columns are:
    * `prediction_id`: the point identifier
    * `state`: the state FIPS code
    * `district`: the ACE district, using the district abbreviation.
    * `distHQ`: the distance, in meters, from the point to the headquarters of the ACE district the point is in.
  * `sample_grid_points.csv`: the 4 million grid points.
Columns are:
    * `grid_cell`: the grid cell number
    * `process_or`: the grid number (1 through 50)
    * `lon`: longitude
    * `lat`: latitude
    * `prediction_id`: the prediction pointid
  * `testSetWithAJDinfo.csv`: the AJD test set with additional information merged in from the AJD database. Key columns are:
    * `jdid`: the id assigned by ACE
    * `agency`: the agency making the jurisdictional determination
    * `projectid`: the project id assigned by ACE
    * `districtorregion`: the ACE district
    * `jdbasis`: the WOTUS rule used as the basis for the determination
    * `pdflink`: a link to a pdf of the determination, if available
    * `finalizeddate`: date the determination was finalized
    * `closuremethod`: whether the determination required a field visit or not
    * `watname`: the name of the water resource evaluated for the determination
    * `resourcetypes`: the short code describing the resource type
    * `resourcetypedescription`: a longer description of the resource type
    * `wateroftheus`: WOTUS decision (Yes or No)
    * `cowardincode`: the Cowardin code
    * `cowardincategory`: the Cowardin category
    * `cowardindescription`: the description of the Cowardin category
    * `longitude`: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
    * `latitude`: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
    * `state`: US state name
    * `county`: US county name
    * `pointid`: the point identifier
    * `prob_cnn`: the WOTUS-ML score
    * `predictions`: the rounded WOTUS-ML model score (0 or 1)
    * `labels`: the WOTUS decision (0 or 1)
    * `group_all`: indicator for an AJD decided under any rule
    * `group_rapanos`: indicator for an AJD decided under Rapanos
    * `group_nwpr`: indicator for an AJD decided under NWPR
    * `group_cwr`: indicator for an AJD decided under CWR
    * `ajd_decision`: the WOTUS decision (0 or 1)
    * `accuracy`: share of AJDs with rounded WOTUS-ML model score (0 or 1) equal to WOTUS decision (0 or 1)
    * `sh_above_score_cutoffs`: share of validation AJDs
with WOTUS-ML score above each cutoff in `score_cutoffs_hi`.
    * `accuracy_above_score_cutoffs`: accuracy of WOTUS-ML for validation AJDs with WOTUS-ML score above each cutoff in `score_cutoffs_hi`. Used to graph accuracy curves
    * `score_cutoffs_hi`: cutoffs from 0.5-1.0 used by `accuracy_above_score_cutoffs`
    * `sh_below_score_cutoffs`: share of validation AJDs with WOTUS-ML score below each cutoff in `score_cutoffs_lo`.
    * `accuracy_below_score_cutoffs`: accuracy of WOTUS-ML for validation AJDs with WOTUS-ML score below each cutoff in `score_cutoffs_lo`. Used to graph accuracy curves
    * `score_cutoffs_lo`: cutoffs from 0.0-0.50 used by `accuracy_below_score_cutoffs`
    * `xval`: score cutoffs on x axis for accuracy curve
    * `yval`: share of validation AJDs with at least the accuracy in `xval`. Used to graph accuracy curve
* `ajd_point_intersections.csv`:
  * `pointid`: the AJD point identifier
  * `nwi`: boolean; true if the point intersects any NWI polygon; false otherwise
  * `nwi_wetland_type`: NA if `nwi == False`, otherwise a string describing the wetland type in NWI (estuarine and marine deepwater, estuarine and marine wetland, freshwater emergent wetland, freshwater forested/shrub wetland, freshwater pond, lake, riverine, and other)
  * `nhd`: boolean; true if the point intersects any NHD polygon; false otherwise
  * `nhd_fcode`: NA if `nhd == False`, otherwise the fcode from NHD
  * `navigable_water`: boolean; true if the point is a navigable water as defined in SM section A.5; false otherwise
  * `navigable_water_and_nwi`: boolean; true if `navigable_water == True` and `nwi == True`; false otherwise
  * `iclus_growth`: boolean; true if the point is in an area defined by ICLUS to move from undeveloped to semi-developed, semi-developed to developed, or undeveloped to developed.
See SM section A.4 for details.

#### Metadata

* `sackett_metadata.csv`:
  * `id`: the Sackett prediction id
  * `state`: the state FIPS code
  * `district`: the 3-letter ACE district abbreviation
  * `distHQ`: the distance, in meters, of the point to the ACE district headquarters
* `prediction_point_metadata.csv`:
  * `prediction_id`: the prediction point id
  * `state`: the state FIPS code
  * `district`: the 3-letter ACE district abbreviation
  * `distHQ`: the distance, in meters, of the point to the ACE district headquarters

### Data for displays

The zip files `table*_data.zip`, `figure*_data.zip` contain data necessary to replicate each of the figures and tables in the main text and SM, and do not appear elsewhere in this replication package. In some cases, files are duplicated from elsewhere in this repository, and in some cases the contents of the zip files are identical. Note there are no zip files corresponding to figures that do not require any data, e.g., figure 1.

* `fig2A_data`, `figS6_data`, `figS9_data`:
  * `test_set_predictions.csv`: See description of `wotus_model_predictions.zip` above.
  * `AJD_jds_wet_dry_season_clean_v2.csv`, `jds202205312309.csv`: See description of `testSetWithAJDInfo.csv` above.
  * Note that this figure can also be replicated using `testSetWithAJDInfo.csv` only. Also note that all three zip files contain the same data.
* `fig2B_data`:
  * `prediction_point_metadata.csv`, `sample_grid_points.csv`, `testSetWithAJDinfo.csv`: See file descriptions above.
* `fig3_data`:
  * `prediction_point_metadata.csv`, `sample_grid_points.csv`: See file descriptions above.
  * `preds.csv`: The 4 million predictions, combined into a single file. See the description of `all_preds.shp` above.
* `fig4_data`:
  * `ID_shapefile_wetlands`: a directory containing the shapefile and auxiliary files for Idaho wetlands from NWI.
  * `zoom_areas`:
    * `zoom_areas_naip`: `.tiff` files containing NAIP imagery for each of the zoom areas in figure 4.
    * `zoom_areas.shp`: a shapefile and auxiliary files containing the polygons defining the zoom areas in figure 4.
  * `Sackett_sample_NAIP_tiles.csv`: a file describing the geographic information for the Sackett prediction points. Key fields include:
    * `prediction_id`: the Sackett prediction ID
    * `lat`: the latitude coordinate of the point
    * `lon`: the longitude coordinate of the point
* `figS3_data`:
  * `prediction_point_metadata.csv`, `sample_grid_points.csv`, `preds.csv`: See file descriptions above.
* `figS4_data`:
  * `nhdPlusRegionsCombined.shp`: combination of all NHDPlusV2 regions from EPA's NHD data
    * `streamleve`: stream level in NHD
  * `PRISM_ppt_30yr_normal_4kmM4_annual_asc.tif`: tif of PRISM precipitation. Downloaded from
  * `nlcd_2019_land_cover_l48_20210604.img`: NLCD 2019 Land Cover. Downloaded from
  * `USGSNAIPImagery.tif`: NAIP Imagery
  * `NAIPmapping.qgz`, `NLCDmapping.qgz`, `PRISMmapping.qgz`: QGIS projects used to create their respective maps
* `table1_data`:
  * `sample_grid_points.csv`, `test_set_predictions_*.csv`, `testSetWithAJDinfo.csv`: See file descriptions above.
  * `navigable_comids_wlatlon.txt`:
    * `comid`: COMID (stream segment identifier from NHD)
    * `latitude`: latitude coordinate
    * `longitude`: longitude coordinate
    * `gnis_name`: stream name from the USGS Geographic Name Information System
* `tableS4_data`, `tableS6_data`:
  * `AJD_jds202205312309_clean.csv`, `ajd_point_intersections.csv`, `AJD_jds_wet_dry_season_clean_v2.csv`: See file descriptions above.
  * `pointid_resourcetype_crosswalk.csv`:
    * `pointid`: the AJD pointid.
    * `ai_cowardin`: a 9-class categorization of Cowardin codes (see table S1)
    * `cowardin_numeric`: a numeric encoding of `ai_cowardin`
    * `cowardin_simple`: a 4-class categorization of Cowardin codes into wetland, stream, or other. Note this is not used in the paper.
    * `ai_resourcetype`: a 9-class categorization of resource types (see table S2)
    * `resource_numeric`: a numeric encoding of `ai_resourcetype`
* `tableS5_data`:
  * `prediction_point_metadata.csv`, `sample_grid_points.csv`, `testSetWithAJDinfo.csv`: See file descriptions above.
* `tableS7_data`:
  * `prediction_point_metadata.csv`, `sample_grid_points.csv`, `testSetWithAJDinfo.csv`: See file descriptions above.
  * `nhd_stats_AI_state.csv`:
    * `comid`: COMID
    * `long_comid`: the COMID's longitude
    * `lat_comid`: the COMID's latitude
    * `ftype`: the NHD feature type
    * `fcode`: the NHD feature code
    * `intephem`: 1 if ephemeral, 0 otherwise
    * `streamorder`: Stream Order
    * `lengthkm`: Path length in km
    * `STUSPS`: FIPS state postal code
  * `nhd_stream_miles_by_state.csv`:
    * `STUSPS`: 2-character USPS state code
    * `lengthmi`: Total stream length, in miles
  * `nwi_acres_by_state.csv`:
    * `NAME`: State name
    * `STUSPS`: 2-character USPS state code
    * `STATEFP`: State FIPS code
    * `nwi_all_acres`: Total NWI acres
    * `nwi_wetland_acres`: Total NWI acres in one of the wetland types
* `tableS8_data`:
  * `PWS_Locations_HUC12_2022Q2.xlsx`: list of all public water systems served by water sources within each HUC12
    * `HUC_12`: HUC12 region
    * `PWSID`: public water system id of systems served by the `huc12`
  * `WBD_HUC12.shp`: shapefile and auxiliary files for the HUC 12 watershed boundary dataset
    * `huc12`: HUC12 region
  * `PredictionPointsByHuc12PWSIDNhdNwiPopulationServed.csv`: spatial join of WOTUS-ML prediction points to HUC12 polygons (from `WBD_HUC12`), the public water systems served by said HUC12 (from `PWS_Locations_HUC12_2022Q2`) and the population served by each public water system (from `sdwis_active_years`).
    * `prediction_id`: the prediction point id
    * `pwsid`: public water system id
    * `population_served`: population served by the `pwsid`
    * `dereg`: indicator if WOTUS-ML predicts the prediction point is regulated under Rapanos, but not regulated under NWPR
    * `Rapanos_prob`, `CWR_prob`, `NWPR_prob`: the WOTUS-ML score for Rapanos, CWR, and NWPR, respectively.
    * `Rapanos_prediction`, `CWR_prediction`, `NWPR_prediction`: the rounded WOTUS-ML score for Rapanos, CWR, and NWPR, respectively.
    * `nwi`: boolean; true if the point intersects any NWI polygon; false otherwise
    * `nwi_wetland_type`: NA if `nwi == False`, otherwise a string describing the wetland type in NWI (estuarine and marine deepwater, estuarine and marine wetland, freshwater emergent wetland, freshwater forested/shrub wetland, freshwater pond, lake, riverine, and other)
    * `nhd`: boolean; true if the point intersects any NHD polygon; false otherwise
    * `nhd_fcode`: NA if `nhd == False`, otherwise the fcode from NHD
  * `sdwis_active_years.dta`: list of public water systems active in the Environmental Protection Agency's SDWIS database in each year.
    * `pwsid`: public water system id
    * `pws_type_code`: public water system type (community water system - CWS; non-transient non-community water system - NTNCWS; transient non-community water system - TNCWS)
    * `active`: indicator; 1 if this `pwsid` was active in this `year`
    * `year`: calendar year
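
The per-rule prediction files share the columns `pointid` and `probability_wotus`, so a deregulation flag like `dereg` can in principle be rebuilt by comparing rounded scores across two rules. The following is a minimal sketch of that comparison, not the authors' code. It assumes that "rounded" means a cutoff at 0.5; the point ids, the scores, and the helper names `read_scores` and `flag_deregulated` are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical stand-ins for two per-rule prediction files, e.g.
# grid1_predictions_Rapanos.csv and grid1_predictions_NWPR.csv.
# Point ids and scores are invented for illustration only.
RAPANOS_CSV = """pointid,probability_wotus
a1,0.91
a2,0.62
a3,0.08
"""
NWPR_CSV = """pointid,probability_wotus
a1,0.77
a2,0.31
a3,0.05
"""

THRESHOLD = 0.5  # assumption: a score >= 0.5 rounds to "regulated" (1)


def read_scores(text):
    """Return {pointid: probability_wotus} parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["pointid"]: float(row["probability_wotus"]) for row in reader}


def flag_deregulated(rapanos, nwpr, threshold=THRESHOLD):
    """Return {pointid: 0 or 1} for points that have a score under both rules.

    The flag is 1 when the point is predicted regulated under Rapanos
    but not regulated under NWPR, and 0 otherwise.
    """
    flags = {}
    for pointid, p_rapanos in rapanos.items():
        if pointid not in nwpr:
            continue  # skip points without a score under both rules
        regulated_before = p_rapanos >= threshold
        regulated_after = nwpr[pointid] >= threshold
        flags[pointid] = int(regulated_before and not regulated_after)
    return flags


rapanos = read_scores(RAPANOS_CSV)
nwpr = read_scores(NWPR_CSV)
dereg = flag_deregulated(rapanos, nwpr)
print(dereg)
```

With real data, the inline strings would be replaced by the contents of the unzipped CSV files, and any ids listed in `prediction_points_to_drop.zip` would be excluded first.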

Created:

2023-12-12
