food-ai-nexus/fecal-indicator-water-nys
Hugging Face, updated 2026-04-06, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/food-ai-nexus/fecal-indicator-water-nys
Resource description:
---
license: mit
task_categories:
- tabular-classification
tags:
- food-safety
- agriculture
- microbiology
- water-quality
- tabular-classification
language:
- en
size_categories:
- n<1K
pretty_name: Predicting Agricultural Water Quality (PAWQ) Project Datasets
---
# Predicting Agricultural Water Quality (PAWQ) Project Datasets
**Predicting Agricultural Water Quality (PAWQ) Project Datasets** is a tabular dataset containing water quality, weather, land use, livestock, and microbial source tracking (MST) markers for predicting foodborne bacterial contamination in New York streams used to source water for produce production.
With this dataset, researchers can train machine learning models to predict the presence of *E. coli* and specific microbial source tracking markers (human, ruminant, avian, canid) in surface waters based on environmental, geographical, and weather predictors.
This dataset accompanies the publications:
- Green H, Wilder M, Wiedmann M and Weller D (2021). Integrative Survey of 68 Non-overlapping Upstate New York Watersheds Reveals Stream Features Associated With Aquatic Fecal Contamination. *Front. Microbiol.* 12:684533.
- Weller D, Belias A, Green H, Roof S and Wiedmann M (2020). Landscape, Water Quality, and Weather Factors Associated With an Increased Likelihood of Foodborne Pathogen Contamination of New York Streams Used to Source Water for Produce Production. *Front. Sustain. Food Syst.* 3:124.
## Content
- The dataset contains **196** unique records collected across 68 non-overlapping Upstate New York watersheds.
- It spans a variety of features including water quality metrics, weather conditions, land cover, and proximity to livestock operations and infrastructure.
- The MST marker targets are imbalanced for binary classification: 49 samples are positive for HF183 (human) and 34 are positive for Rum2Bac (ruminant).
- The target variables are derived from the `ecoli`, `HF183_pa`, `Rum2Bac_pa`, `DG3_pa`, and `GFD_pa` columns.
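The imbalance described above can be quantified with a small sketch. It uses only the counts stated in this card and assumes every one of the 196 records has a non-missing marker result, which the card does not confirm. The weighting rule is the common "balanced" heuristic (total samples divided by the number of classes times the class count), shown here as one possible choice rather than the method used in the publications.

```python
# Counts taken from the dataset card; assumes no missing marker results.
N_RECORDS = 196
POSITIVE_COUNTS = {"HF183_pa": 49, "Rum2Bac_pa": 34}


def balanced_weights(n_pos, n_total):
    """Weights inversely proportional to class frequency: n_total / (2 * class_count)."""
    n_neg = n_total - n_pos
    return {0: n_total / (2 * n_neg), 1: n_total / (2 * n_pos)}


summary = {}
for marker, n_pos in POSITIVE_COUNTS.items():
    summary[marker] = {
        "prevalence": n_pos / N_RECORDS,
        "weights": balanced_weights(n_pos, N_RECORDS),
    }
    print(marker, summary[marker])
```

Under these assumptions, a quarter of the samples are HF183-positive, so a model that always predicts "absent" would already reach 75% accuracy; metrics such as balanced accuracy or precision/recall are usually more informative here.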
## Data Fields
The dataset contains 95 columns.
**Key Features**
| Column | Type | Description |
| :--- | :--- | :--- |
| `ecoli` | float64 | *E. coli* concentration in the waterway (log10 MPN/100 mL) |
| `ph` | float64 | pH |
| `cond` | float64 | Conductivity (log10 µS/cm) |
| `do` | float64 | Dissolved oxygen levels (mg/L) |
| `flow` | float64 | Flow rate measured 3-6 inches below the surface (m/s) |
| `a_t` | float64 | Air temperature measured at the sampling site at the time of sample collection (°C) |
| `w_t` | float64 | Water temperature (°C) |
| `turb` | float64 | Turbidity (log10 NTU) |
| `perc_SAV` | float64 | Description not available |
| `SAV_pa` | object | Description not available |
| `precip_1d` | float64 | Description not available |
| `avg_solar_1d` | float64 | Description not available |
| `area_10km` | float64 | Total area of upstream watershed (10-km2) |
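Because `ecoli`, `cond`, and `turb` are listed on a log10 scale, it can help to convert them back to their original units when inspecting values. The sketch below assumes a plain base-10 logarithm with no offset was applied (the card does not say whether a constant was added before transforming); the row values are made up for illustration.

```python
def from_log10(value):
    """Convert a log10-scaled measurement back to its original units."""
    return 10.0 ** value


# Columns the table above describes as log10-transformed.
LOG10_COLUMNS = ("ecoli", "cond", "turb")

# Hypothetical row; these numbers are illustrative, not taken from the dataset.
row = {"ecoli": 2.0, "cond": 2.5, "turb": 0.5}

raw = {name: from_log10(row[name]) for name in LOG10_COLUMNS}
print(raw)
```

For a real DataFrame, the same transform could be applied column-wise, for example with `10 ** df["ecoli"]`.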
**Target Variables (Microbial Source Tracking Markers)**
| Column | Type | Description |
| :--- | :--- | :--- |
| `HF183_pa` | object | Microbial source tracking (MST) marker that indicates human fecal contamination |
| `Rum2Bac_pa` | object | MST marker that indicates ruminant fecal contamination |
| `DG3_pa` | object | MST marker that indicates canid fecal contamination |
| `GFD_pa` | object | MST marker that indicates avian fecal contamination |
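The `_pa` columns are stored as `object` (string) values, so they need to be encoded before most classifiers can use them. The following is a minimal sketch: `encode_presence` is a hypothetical helper, and the `"P"` (present) label is an assumed coding that should be checked against the actual unique values in each column.

```python
def encode_presence(values, positive_label="P"):
    """Map presence/absence strings to 1/0, keeping missing values as None."""
    encoded = []
    for value in values:
        # Treat None and NaN (the only value not equal to itself) as missing.
        if value is None or value != value:
            encoded.append(None)
        else:
            encoded.append(1 if str(value).strip() == positive_label else 0)
    return encoded


# Hypothetical raw values; "P"/"A" is an assumed coding, not confirmed by the card.
sample = ["P", "A", "A", None, "P "]
print(encode_presence(sample))  # [1, 0, 0, None, 1]
```

Keeping missing values as `None` rather than coding them as 0 avoids silently treating untested samples as negatives.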
## Uses
The dataset was originally used to train machine learning models to predict *E. coli* levels and the presence of specific microbial source tracking markers in agricultural water. It can also be used in research areas such as environmental microbiology, agricultural safety, and spatial ecology.
Use the **"Use this dataset"** button at the top of the page to load the dataset into your preferred library. To load and prepare the data:
```python
import pandas as pd
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub and convert the train split to pandas
ds = load_dataset("food-ai-nexus/fecal-indicator-water-nys")
df = ds["train"].to_pandas()

# Example: create a binary label for human fecal contamination.
# Inspect df["HF183_pa"].unique() first to confirm how presence is coded.
# df["target_present"] = (df["HF183_pa"] == "P").astype(int)
```
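With fewer than 200 records and imbalanced labels, a random split can leave very few positives in the held-out set. One way to guard against that is a stratified split, sketched below in plain Python. The `stratified_split` helper, the 20% test fraction, and the seed are illustrative assumptions, and the toy labels only mirror the HF183 counts stated in this card (49 positives out of 196); this is not the evaluation protocol from the publications.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the test share of that class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels mirroring the stated HF183 balance: 49 positives, 147 negatives.
labels = [1] * 49 + [0] * 147
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

The returned index lists can be passed to `df.iloc[...]` to select the corresponding rows.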
## License
This dataset is licensed under the MIT License. It is intended for research and educational use.
## Citation
```bibtex
@article{green2021integrative,
title={Integrative Survey of 68 Non-overlapping Upstate New York Watersheds Reveals Stream Features Associated With Aquatic Fecal Contamination},
author={Green, H. and Wilder, M. and Wiedmann, M. and Weller, D.},
journal={Frontiers in Microbiology},
volume={12},
pages={684533},
year={2021},
doi={10.3389/fmicb.2021.684533}
}
@article{weller2020landscape,
title={Landscape, water quality, and weather factors associated with an increased likelihood of foodborne pathogen contamination of New York streams used to source water for produce production},
author={Weller, D. and Belias, A. and Green, H. and Roof, S. and Wiedmann, M.},
journal={Frontiers in Sustainable Food Systems},
volume={3},
pages={124},
year={2020},
doi={10.3389/fsufs.2019.00124}
}
```
## Source
Original dataset: [Zenodo 10.5281/zenodo.18500867](https://doi.org/10.5281/zenodo.18500867)
Code repository: [wellerd2/Green-et-al.-2021-Datasets](https://github.com/wellerd2/Green-et-al.-2021-Datasets)
Provider:
food-ai-nexus



