Global soil pollution by toxic metals threatens agriculture and human health
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.83bk3jb2z
Description:
This dataset contains the attachments for the study by Hou et al., "Global soil pollution by toxic metals threatens agriculture and human health". Our study compiled a global dataset on soil contamination by arsenic, cadmium, cobalt, chromium, copper, nickel, and lead, drawn from investigations whose sampling locations span a range of climate zones, geological settings, and land use types. Machine learning methods were then used to identify regions where concentrations exceed agricultural and human health thresholds. The research underscores the impact of soil pollution on global food security and the need to align with the Sustainable Development Goals.
Methods
In this study, we synthesized a global soil concentration dataset for seven toxic metals: arsenic (As), cadmium (Cd), cobalt (Co), chromium (Cr), copper (Cu), nickel (Ni), and lead (Pb). The compilation draws on 1,493 regional studies comprising 796,084 sampling points across a wide range of climate zones, geological settings, and land use types. A series of covariates associated with soil contamination was then used to construct predictive models for the distribution of toxic metal exceedance. These covariates include geological variables, climate variables, soil texture and basic physico-chemical properties, topography, and socioeconomic variables. All variables were resampled and reprojected to match the 10 km resolution grid of the toxic metal distribution.
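The resampling step can be illustrated with a minimal sketch. The description above does not say which resampling method was used, so the block averaging shown here is an assumption, and the function name `block_mean` and the small example raster are hypothetical. Reprojection between coordinate systems is not covered.

```python
from statistics import fmean


def block_mean(grid, factor):
    """Aggregate a 2-D grid to a coarser grid by averaging factor x factor blocks.

    None marks a missing cell and is ignored in the average; a block with no
    valid cells yields None. Trailing rows or columns that do not fill a
    whole block are dropped.
    """
    n_rows = len(grid) // factor
    n_cols = len(grid[0]) // factor
    coarse = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Collect every fine-resolution cell that falls in this coarse cell.
            cells = [
                grid[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            valid = [v for v in cells if v is not None]
            row.append(fmean(valid) if valid else None)
        coarse.append(row)
    return coarse


# Hypothetical 4 x 4 covariate raster (illustrative values only),
# aggregated by a factor of 2 in each direction.
fine = [
    [1.0, 3.0, 10.0, 10.0],
    [5.0, 7.0, None, 10.0],
    [2.0, 2.0, None, None],
    [2.0, 2.0, None, None],
]
coarse = block_mean(fine, 2)
print(coarse)  # [[4.0, 10.0], [2.0, None]]
```

Averaging suits continuous covariates such as climate or topographic variables. A categorical layer would more plausibly be aggregated by majority vote.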
The dataset was randomly divided into training and test sets in an 8:2 ratio. To avoid overfitting and multicollinearity, and to improve model performance and interpretability, feature selection was conducted using a combination of recursive feature elimination and Pearson correlation analysis. After a preliminary comparison of ten machine learning algorithms, extremely randomized trees (ERT) emerged as the top-performing model and was selected for further refinement. Grid search was used to optimize the key hyperparameters of ERT: the number of trees, the maximum depth of each tree, the minimum number of samples required at a leaf node, the minimum number of samples required to split an internal node, and the function used to measure the quality of a split. The optimized model was then evaluated with a range of metrics: balanced accuracy (BA), sensitivity, specificity, F1 score, average precision (AP), the area under the Receiver Operating Characteristic curve (AUC), and Cohen's kappa coefficient (KIA). Finally, the optimized models were used to predict pollution probabilities for 2,000,000 pixels, enabling the generation of global pollution probability maps for each toxic metal. Areas covered by desert and permafrost were excluded, leaving a final dataset of 1,290,000 pixels.
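The threshold-based evaluation metrics named above can be sketched from a binary confusion matrix. This is an illustration using the standard textbook definitions, not the study's own evaluation code. The function name `binary_metrics` and the example labels are hypothetical, and the sketch assumes both classes occur in the true and the predicted labels, so no denominator is zero. AP and AUC are omitted because they are computed from predicted probabilities, not from hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion-matrix metrics for binary labels (1 = exceedance, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical test-set labels (illustrative only): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy and kappa are useful here because exceedance is typically the minority class, where plain accuracy can look high even for a model that rarely predicts pollution.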
Created:
2025-04-17



