Fern Tree of Life (FTOL) input data
Source: DataCite Commons. Updated: 2023-01-18. Indexed: 2024-07-29.
Download link:
https://figshare.com/articles/dataset/Fern_Tree_of_Life_FTOL_input_data/19474316/3
Description:
The data included here are inputs to a pipeline that (mostly) automatically generates a maximally sampled fern phylogenetic tree based on plastid sequences in GenBank (https://github.com/fernphy/ftol).

The first step is to download the latest release of GenBank data from the NCBI GenBank FTP site (https://ftp.ncbi.nlm.nih.gov/genbank/) and use it to create a local database of fern sequences. This is done with custom R scripts contained in https://github.com/fernphy/ftol, in particular setup_gb.R (https://github.com/fernphy/ftol/blob/main/R/setup_gb.R).

Next, a set of reference FASTA files for 79 target loci (one file per locus; ref_aln.tar.gz) is generated. The loci comprise 77 protein-coding genes, drawn from a list of 83 genes (Wei et al. 2017) that was filtered to keep only genes showing no evidence of duplication, plus two spacer regions (trnL-trnF and rps4-trnS). Each FASTA file in ref_aln.tar.gz includes one representative (longest) sequence per available fern genus. This is done with prep_ref_seqs_plan.R (https://github.com/fernphy/ftol/blob/main/prep_ref_seqs_plan.R).

Sequences matching the target loci are then extracted from each accession in the local database with the "Reference_Blast_Extract.py" script of superCRUNCH (Portik and Wiens 2020), using the FASTA files in ref_aln.tar.gz as references.

The extracted sequences are aligned with MAFFT (Katoh et al. 2002), phylogenetic analysis is done with IQ-TREE (Nguyen et al. 2015), and divergence times are estimated with treePL (Smith and O'Meara 2012).

For additional methodological details, see:

Nitta JH, Schuettpelz E, Ramírez-Barahona S, Iwasaki W. 2022. An open and continuously updated fern tree of life. Frontiers in Plant Science 13. https://doi.org/10.3389/fpls.2022.909768
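The rule of keeping one representative (longest) sequence per genus in each reference FASTA file can be illustrated with a short sketch. This is not the authors' implementation, which lives in the R scripts linked above. It is a minimal Python illustration that assumes a hypothetical header layout of ">accession Genus species", measures length as the raw character count, and keeps the first sequence encountered when lengths tie. The function names and the example records are invented for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def longest_per_genus(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, Tuple[str, str]]:
    """Keep the longest sequence for each genus.

    Assumption (hypothetical header layout): 'accession Genus species',
    so the genus is the second whitespace-separated token.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        tokens = header.split()
        if len(tokens) < 2:
            continue  # header does not match the assumed layout
        genus = tokens[1]
        # Strictly greater: on a tie, the first sequence seen is kept.
        if genus not in best or len(seq) > len(best[genus][1]):
            best[genus] = (header, seq)
    return best


# Invented example records, for illustration only.
EXAMPLE = """\
>AB000001 Asplenium nidus
ACGT
ACG
>AB000002 Asplenium trichomanes
ACGTACGTAC
>AB000003 Pteris vittata
ACG
"""

representatives = longest_per_genus(parse_fasta(EXAMPLE.splitlines()))
for genus, (header, seq) in sorted(representatives.items()):
    print(genus, header.split()[0], len(seq))
```

Running the selection once per locus would yield one file per locus with at most one sequence per genus, which matches the layout described for ref_aln.tar.gz.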
Provider:
figshare
Created:
2022-11-10



