
Phenological patterns of tropical mountain forest trees across the neotropics: Evidence from herbarium specimens

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.08kprr59w
Description:
The flowering phenology of many Tropical Mountain Forest tree species remains poorly understood, including flowering synchrony and its drivers across Neotropical ecosystems. We obtained herbarium records for 427 tree species from a long-term monitoring transect in the north-western Ecuadorian Andes, sourced from the Global Biodiversity Information Facility (GBIF) and the Herbario Nacional del Ecuador (QCNE). Using machine learning algorithms, we identified flowering phenophases from digitized specimen labels and applied circular statistics to build phenological calendars across six climatic regions within the Neotropics. We found 47,939 herbarium records, of which 14,938 were classified as flowering by random forest models. We constructed phenological calendars for the six regions and for the 86 species that had at least 20 flowering records across those regions. Phenological patterns varied considerably across regions, among species within regions, and within species across regions. There was limited interannual synchronicity in flowering patterns within regions, primarily driven by bimodal species whose flowering peaks coincided with irradiance peaks. The predominantly high variability of phenological patterns among species and within species likely confers adaptive advantages by reducing interspecific competition during reproductive periods and promoting species coexistence in highly diverse regions with little or no seasonality.

Methods

Species selection and retrieval of herbarium records

We selected species from tropical mountain forests on the north-western slope of the Ecuadorian Andes, using tree inventory data from 16 permanent plots of the ‘Pichincha long-term forest dynamics and carbon monitoring transect’. The transect covers forests between 600 and 3,500 m asl at the equator (latitude 0°11.32’ N to 0°7.6’ S), which are characterized by high tree alpha and beta diversity.
The initial flora list comprised 516 unique taxa unequivocally identified to the level of subspecies, species, or genus, plus 82 taxa with ambiguous identification at the species level (conferatur or affinis). From these 598 taxa, we eliminated duplicates (n=35 entries identified as conferatur or affinis that were already in the list of 516 taxa) and entries identified only to genus level (n=123). Finally, for 8 taxa that were identified to the level of subspecies or variety, we added 8 entries identified only to the species level, to increase the chance of finding suitable herbarium specimens (for instance, for the entry “Aegiphila lopez-palacii var. pubescens” we added an entry “Aegiphila lopez-palacii”). The final species list included 444 species from 80 families (see Supplementary material S1). All species names were validated against the Checklist of the Vascular Plants of the Americas. For each species, we retrieved its synonyms from the TROPICOS database (https://www.tropicos.org/home, accessed on 07/30/2022) using the taxize R package. The final list of 2,908 entries, comprising the original species names and their synonyms, was used to search for herbarium specimens.

We searched for herbarium specimens in the Global Biodiversity Information Facility (GBIF) database using the GBIF API (https://api.gbif.org/v1/). The search parameters matched our species list to names in the GBIF backbone. We applied filters to retrieve only specimens that: (1) had complete geographical coordinates and no GBIF-identified geospatial issues, (2) had complete dates or dates with at least the month and year, (3) corresponded to locations within the Neotropics (latitude between -23° and 23° and longitude between -160° and -20°, in decimal degrees), and (4) had information in at least one of the columns with field notes (i.e., “fieldNotes”, “occurrenceRemarks”, and “dynamicProperties” according to the Darwin Core Standard from GBIF).
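The four retrieval filters can be sketched in Python as follows. This is a minimal illustration and not the authors' script: the `occurrence/search` endpoint and its query parameters are part of the public GBIF API, but the choice of `basisOfRecord`, the page size, and the helper names `build_query` and `keep_record` are assumptions made here. Filters (1) and (3) are expressed as query parameters, while filters (2) and (4) are checked on each returned record.

```python
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"
NOTE_FIELDS = ("fieldNotes", "occurrenceRemarks", "dynamicProperties")

def build_query(scientific_name, limit=300, offset=0):
    """Build a GBIF occurrence-search URL for one name (hypothetical helper)."""
    params = {
        "scientificName": scientific_name,
        "basisOfRecord": "PRESERVED_SPECIMEN",  # assumption: herbarium sheets only
        "hasCoordinate": "true",                # filter (1): coordinates present
        "hasGeospatialIssue": "false",          # filter (1): no geospatial issues
        "decimalLatitude": "-23,23",            # filter (3): range given as "min,max"
        "decimalLongitude": "-160,-20",         # filter (3)
        "limit": limit,
        "offset": offset,
    }
    return f"{GBIF_SEARCH}?{urlencode(params)}"

def keep_record(rec):
    """Apply filters (2) and (4) to one occurrence record (a dict)."""
    has_date = rec.get("year") is not None and rec.get("month") is not None
    has_notes = any(str(rec.get(f) or "").strip() for f in NOTE_FIELDS)
    return has_date and has_notes

# Example with invented records (no network request is made here).
url = build_query("Aegiphila lopez-palacii")
records = [
    {"year": 1995, "month": 3, "occurrenceRemarks": "Flores blancas"},
    {"year": 1995, "month": None, "fieldNotes": "Tree 10 m"},
    {"year": 2001, "month": 7, "fieldNotes": "   "},
]
kept = [r for r in records if keep_record(r)]
```

Paging through results would repeat the request with an increasing `offset` until the response reports the end of the records.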
The original GBIF dataset had 54,146 specimen records. We removed duplicates from this initial dataset: we found 5,886 actual duplicates using the search fields scientific name, collector name, year, latitude, and longitude. We also removed records corresponding to GBIF-added subspecies and varieties that did not have the species name as a synonym in TROPICOS, and records that listed field notes only in the “dynamicProperties” column. The total number of records in the final GBIF dataset was 41,004. We also retrieved data from the Herbario Nacional del Ecuador (QCNE) - Instituto Nacional de Biodiversidad (INABIO, https://bndb.sisbioecuador.bio/bndb/collections), yielding an initial dataset of 10,881 records. The cleaning and filtering protocol described above for the GBIF records was also applied to the QCNE dataset, resulting in a dataset of 6,935 records. We merged the GBIF and QCNE datasets into one and made a final check for duplicates and for potential errors or incomplete data in collection dates and species names. Lastly, we merged the three columns with field-notes information into one column, which was used as input to the machine learning models (see below). The final dataset included 47,939 unique records corresponding to 427 species (from 80 families) across the Neotropics (Supplementary material S2). The dataset covered the period 1821 to 2022, but most records (89.5%) were gathered from 1980 onwards (Supplementary material S3).

Machine learning approaches to determine phenological status

We used natural language processing (NLP), a machine learning approach, to determine the phenological status of each specimen based on the information in the field notes, as these commonly contain words related to phenology (‘flowers’, ‘buds’, among others). First, we created a training and evaluation dataset of 3,000 specimen records to compare the performance of different machine learning models and select the best one.
We selected the records for this dataset from our final dataset by stratified sampling on the year, latitude, and longitude of all specimens with links to images. We visually checked the agreement between images and field-note labels to assess whether labels that included flowering information corresponded to a flowering specimen. Only 1,913 specimens had valid links to images, of which 97% contained information about flowering on the label and had a good correspondence to images (80% of the flowering labels). Next, we cleaned the field notes by removing punctuation, numbers, special symbols, and certain repetitive expressions that were not informative (e.g., “na”, “ca”, “PORT US”). Then, we used the Natural Language Toolkit (NLTK) Python package [30] to delete stop words in several languages, including Spanish, English, Portuguese, and French. Since machine learning algorithms usually require numerical matrices as input, we converted the text of the field notes into a numerical matrix using the “bag of words” method, which describes the occurrence of words within a text. This vectorization method consists of splitting the text into single words and counting the frequency of each word in a piece of text. The “bag of words” output is a numerical matrix in which columns are words from the training dataset, rows are observations from the training dataset, and each cell is the number of times a word appears in a particular observation. We applied the “bag of words” method to our training and evaluation dataset using “CountVectorizer” from the scikit-learn Python package. Finally, we evaluated three different approaches to predict from the field notes whether a specimen was flowering. First, we created a baseline model for flowering using the scikit-learn “DummyClassifier”, a simple classifier that always predicts the most frequent class in the data.
Then, we applied the naïve Bayes and random forest (RFM) algorithms for the same purpose, using “GaussianNB” and “RandomForestClassifier” from scikit-learn. We estimated the performance of the models using 5-fold cross-validation and evaluated them using five metrics: accuracy, the total proportion of ‘flowering’ and ‘not flowering’ predictions that were correct; precision, the proportion of ‘flowering’ predictions that were correct; recall, the proportion of true ‘flowering’ records correctly predicted; ROC-AUC, which quantifies the ability of a binary classifier to distinguish between the flowering and non-flowering classes; and F1, the harmonic mean of precision and recall. Once we had determined the best-performing model, we retrained it on the entire training and evaluation dataset and used it to predict flowering for all records in the final dataset (n = 47,939). We applied the same cleaning procedure to the whole dataset as to the training and evaluation dataset. For all subsequent analyses, we considered only records predicted to be flowering (n = 14,938).
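To make the vectorization and evaluation steps concrete, the following pure-Python sketch builds a small “bag of words” matrix, applies a most-frequent-class baseline, and computes accuracy, precision, recall, and F1. It is an illustration only and not the study's code, which used NLTK and scikit-learn (“CountVectorizer”, “DummyClassifier”). The stop-word list, the example field notes, and the labels are invented, and the function names are hypothetical. Cross-validation, ROC-AUC, and the naïve Bayes and random forest classifiers are omitted.

```python
import re
from collections import Counter

# Invented mini stop-word list; the study used NLTK lists for four languages.
STOP_WORDS = {"the", "a", "with", "and", "de", "con", "y", "na", "ca"}

def clean(note):
    """Lowercase a note, keep only letter runs, and drop stop words."""
    words = re.findall(r"[^\W\d_]+", note.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(notes):
    """Return (vocabulary, matrix): one row per note, one column per word."""
    tokenized = [clean(n) for n in notes]
    vocab = sorted({w for words in tokenized for w in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

def majority_baseline(y_train, n_predictions):
    """Predict the most frequent training label for every record."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n_predictions

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with 1 = flowering, 0 = not."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Invented field notes in two languages.
notes = [
    "Tree 12 m, flowers white.",
    "Árbol de 8 m; flores amarillas, flores abundantes",
    "Sterile shrub, 3 m.",
]
vocab, matrix = bag_of_words(notes)

# Invented labels: the baseline predicts the majority class (0) everywhere.
y_true = [1, 0, 0, 1, 0]
baseline_scores = binary_metrics(y_true, majority_baseline(y_true, len(y_true)))
```

On these invented labels the baseline reaches an accuracy of 0.6 but a recall of zero, because it never predicts flowering. This is why precision, recall, and F1 are reported alongside accuracy when comparing classifiers against such a baseline.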
Created:
2025-01-27