Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.8pk0p2np4

Description:

Digital point-occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time-consuming and often unfeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. This study examined the extent to which cleaned data from six pipelines using data cleaning tools (e.g., the GBIF web application, different R packages) affect downstream species distribution models. In addition, we assessed how the pipeline data differ from expert data. From 13,889 North American Ephedra observations in GBIF, the pipelines removed 31.7% to 62.7% false-positives, invalid coordinates, and duplicates, leading to data sets that included between 9,484 (GBIF application) and 5,196 records (manual-guided filtering). The expert data consisted of 703 thoroughly handpicked records, comparable to data from field studies. Although differences in the record numbers were relatively large, stacked species distribution models (sSDM) from the pipelines and the expert data were strongly related (mean Pearson's r across the pipelines: 0.9986, versus the expert data: 0.9173). The ever-stronger correlations resulted from occurrence information that became increasingly condensed in the course of the workflow (from individual occurrences to collectivized occurrences in grid cells to predicted probabilities in the sSDMs). In sum, our results suggest that the R package-based pipelines reliably identified invalid coordinates. In contrast, the GBIF-filtered data still contained both spatial and taxonomic errors. However, major drawbacks emerge from the fact that no pipeline fully discovered misidentified specimens without the assistance of expert taxonomic knowledge. 
We conclude that application-filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving high-quality taxonomic data will require extra effort, probably by thoroughly analyzing the data for misidentified taxa, supported by experts.

Methods:

The North American Ephedra records were selected from a New World Ephedra data set, courtesy of Professor Ickert-Bond. Reviewed herbarium vouchers and observations were the basis of the overall data set.
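The description above mentions three steps: removing invalid coordinates and duplicates, aggregating occurrences into grid cells, and comparing results with Pearson's r. The following is a minimal Python sketch of those steps, not the study's pipelines (which used the GBIF web application and R packages). The validity rules, the 1-degree cell size, the function names, and the toy records with invented coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter


def is_valid(lat, lon):
    """Basic coordinate sanity check (assumed rules, not the study's)."""
    if lat is None or lon is None:
        return False
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    # (0, 0) is a common placeholder for missing coordinates.
    return not (lat == 0 and lon == 0)


def clean(records, drop_duplicates=True):
    """Drop invalid coordinates and, optionally, exact duplicates."""
    seen = set()
    kept = []
    for species, lat, lon in records:
        key = (species, lat, lon)
        if not is_valid(lat, lon):
            continue
        if drop_duplicates and key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


def grid_counts(records, cell_size=1.0):
    """Count records per grid cell of `cell_size` degrees."""
    return Counter(
        (math.floor(lat / cell_size), math.floor(lon / cell_size))
        for _, lat, lon in records
    )


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy records (species, latitude, longitude); coordinates are invented.
raw = [
    ("Ephedra viridis", 36.1, -112.1),
    ("Ephedra viridis", 36.1, -112.1),     # exact duplicate
    ("Ephedra viridis", 36.5, -112.4),
    ("Ephedra viridis", 0.0, 0.0),         # placeholder coordinates
    ("Ephedra nevadensis", 95.0, -115.0),  # latitude out of range
    ("Ephedra nevadensis", 36.4, -115.2),
    ("Ephedra trifurca", 32.3, -106.8),
]

strict = clean(raw)                          # validity filter + de-duplication
lenient = clean(raw, drop_duplicates=False)  # validity filter only
removed_share = 1 - len(strict) / len(raw)

# Compare the two cleaning variants cell by cell.
counts_strict = grid_counts(strict)
counts_lenient = grid_counts(lenient)
cells = sorted(set(counts_strict) | set(counts_lenient))
r = pearson_r(
    [counts_strict[c] for c in cells],
    [counts_lenient[c] for c in cells],
)
print(f"removed {removed_share:.1%} of records; grid-level r = {r:.4f}")
```

In this toy case the two variants keep different record numbers but produce closely related per-cell counts, which mirrors the abstract's point that differences between data sets shrink as occurrences are aggregated.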

Created:

2022-07-14
