Online Appendix and Cetacean Datasets for: The Occurrence Birth-Death Process for combined-evidence analysis in macroevolution and epidemiology

Source: NIAID Data Ecosystem (indexed 2026-03-13)

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.p8cz8w9rq

Description:

Phylodynamic models generally aim at jointly inferring phylogenetic relationships, model parameters, and more recently, the number of lineages through time, based on molecular sequence data. In the fields of epidemiology and macroevolution these models can be used to estimate, respectively, the past number of infected individuals (prevalence) or the past number of species (paleodiversity) through time. Recent years have seen the development of “total-evidence” analyses, which combine molecular and morphological data from extant and past sampled individuals in a unified Bayesian inference framework. Even sampled individuals characterized only by their sampling time, i.e. lacking morphological and molecular data, which we call occurrences, provide invaluable information to reconstruct the past number of lineages. Here, we present new methodological developments around the Fossilized Birth-Death Process enabling us to (i) incorporate occurrence data in the likelihood function; (ii) consider piecewise-constant birth, death and sampling rates; and (iii) reconstruct the past number of lineages, with or without knowledge of the underlying tree. We implement our method in the RevBayes software environment, enabling its use along with a large set of models of molecular and morphological evolution, and validate the inference workflow using simulations under a wide range of conditions. We finally illustrate our new implementation using two empirical datasets stemming from the fields of epidemiology and macroevolution. In epidemiology, we infer the prevalence of the COVID-19 outbreak on the Diamond Princess ship, by taking into account jointly the case count record (occurrences) along with viral sequences for a fraction of infected individuals. In macroevolution, we infer the diversity trajectory of cetaceans using molecular and morphological data from extant taxa, morphological data from fossils, as well as numerous fossil occurrences. 
The joint modeling of occurrences and trees holds the promise to further bridge the gap between traditional epidemiology and pathogen genomics, as well as paleontology and molecular phylogenetics.

Methods

Online Appendix: available in the `Related Works` section.

Cetacean Datasets: [copied from the subsection *Material and methods* > *Cetacean data analysis* > *Molecular, morphological and occurrence datasets* of the main paper]

The data can be subdivided into three parts: molecular, morphological, and occurrences. Datasets were collected and analysed separately and are stored on the Open Science Framework (https://osf.io) ([dataset] Aguirre-Fernández et al., 2020).

Molecular data come from Steeman et al. (2009) and comprise 6 mitochondrial and 9 nuclear genes for 87 of the 89 accepted extant cetacean species.

Morphological data were obtained from Churchill et al. (2018), the most recent version of a widely used dataset first produced by Geisler and Sanders (2003). After merging 2 taxa that are now considered synonyms on the Paleobiology Database (PBDB) and removing 3 outgroups that would have violated our model’s assumptions, it contains 327 variable morphological characters for 27 extant and 90 fossil taxa (mostly identified at the species level, but 21 remain undescribed). In order to speed up the analysis, we further excluded the undescribed specimens and reduced this dataset to the generic level by selecting the most complete specimen in each genus. Indeed, the computing cost increases quadratically with the maximum number of hidden lineages N, to the point of becoming the bottleneck in our MCMC when N > 100. Given that a mid-Miocene peak diversity between 100 and 220 species is expected (Quental and Marshall, 2010), with fewer than 100 observed lineages in our inferred tree at that time, N should therefore be about 150. Inferring instead the tree of cetacean genera allows us to reduce N to 70 hidden lineages.
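As a rough illustration of the quadratic scaling mentioned above, the sketch below compares the relative cost of two choices of N. It assumes the cost is exactly proportional to N squared and ignores constant factors and lower-order terms, so it is only an order-of-magnitude guide; the function name `relative_cost` is hypothetical and not part of RevBayes.

```python
def relative_cost(n_from: int, n_to: int) -> float:
    """Ratio of computing costs for two maximum numbers of hidden lineages.

    Assumption: cost is proportional to N**2 (constants and lower-order
    terms are ignored).
    """
    if n_from <= 0 or n_to <= 0:
        raise ValueError("N must be a positive integer")
    return (n_from / n_to) ** 2


if __name__ == "__main__":
    # Species-level analysis (N about 150) versus genus-level analysis (N = 70).
    print(f"{relative_cost(150, 70):.2f}")
```

Under this simplified assumption, moving from N = 150 to N = 70 reduces the cost by a factor of about 4.6.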
The final dataset thus contains 41 extant and 62 extinct genera.

Occurrences were downloaded from the PBDB (data archive 9, M. D. Uhen) on May 11, 2020. The dataset initially consisted of all 4678 cetacean occurrences, but the cetacean fossil record is known to be subject to several biases (Uhen and Pyenson, 2007; Marx et al., 2016; Dominici et al., 2020). A detailed exploration of this occurrence dataset (see Online Appendix E) revealed several notable biases. First, an artefactual cluster of occurrences in very recent times, combined with other expected Pleistocene biases (Dominici et al., 2020), led us to remove all Late Pleistocene and Holocene occurrences. Second, we detected substantial variations in fossil recovery per time unit across lineages (see Fig. S10), resulting from oversampling of some species and localities, possibly due to greater abundance or spatio-temporal biases (Dominici et al., 2020). This observation violates our assumption of identical fossil sampling rates among taxa during a given interval. In order to reduce this bias, we retained occurrences identified at the genus level and further aggregated all occurrences belonging to the same genus and found in the same geological formation. For occurrences whose geological formation was not specified, we used geoplate data combined with stratigraphic interval as a proxy for geological formation. This resulted in a total of 968 occurrences retained for the analysis.
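The aggregation rule described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the record field names (`genus`, `formation`, `geoplate`, `interval`) are hypothetical and do not necessarily match PBDB column names, the matching of formation names (case-insensitive, whitespace-trimmed) is an assumption, and the example records are made up.

```python
from typing import Iterable


def locality_key(rec: dict) -> tuple:
    """Key identifying where an occurrence was found.

    Uses the geological formation when present; otherwise falls back to
    geoplate plus stratigraphic interval as a proxy (hypothetical fields).
    """
    formation = (rec.get("formation") or "").strip()
    if formation:
        return ("formation", formation.lower())
    return ("geoplate+interval", rec.get("geoplate"), rec.get("interval"))


def aggregate_occurrences(records: Iterable[dict]) -> list:
    """Keep one occurrence per (genus, locality key) pair.

    Records without a genus-level identification are dropped.
    """
    kept = {}
    for rec in records:
        genus = (rec.get("genus") or "").strip()
        if not genus:
            continue  # not identified at the genus level
        # setdefault keeps the first record seen for each pair
        kept.setdefault((genus, locality_key(rec)), rec)
    return list(kept.values())


# Made-up records, for illustration only.
example = [
    {"genus": "GenusA", "formation": "Formation X", "geoplate": "101", "interval": "Interval 1"},
    {"genus": "GenusA", "formation": "formation x ", "geoplate": "101", "interval": "Interval 1"},
    {"genus": "GenusA", "formation": "", "geoplate": "101", "interval": "Interval 1"},
    {"genus": "GenusA", "formation": None, "geoplate": "101", "interval": "Interval 1"},
    {"genus": "", "formation": "Formation X", "geoplate": "101", "interval": "Interval 1"},
]

if __name__ == "__main__":
    print(len(aggregate_occurrences(example)))
```

In this sketch the first two records collapse into one (same genus and formation), the next two collapse into one through the geoplate and interval proxy, and the last is dropped for lacking a genus. Keeping the first record seen for each pair is an arbitrary choice made here for simplicity.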

Created:

2022-04-25