Fern Tree of Life (FTOL) input data

Name: Fern Tree of Life (FTOL) input data
Creator: figshare
Published: 2023-01-18 05:05:15
License: No description available

DataCite Commons | Updated: 2023-01-18 | Indexed: 2024-07-29

Download link:

https://figshare.com/articles/dataset/Fern_Tree_of_Life_FTOL_input_data/19474316/3

Description:

The data included here are used in a pipeline that (mostly) automatically generates a maximally sampled fern phylogenetic tree based on plastid sequences in GenBank (https://github.com/fernphy/ftol). The first step is to download the latest release of GenBank data from the NCBI GenBank FTP site (https://ftp.ncbi.nlm.nih.gov/genbank/) and use it to create a local database of fern sequences. This is done with custom R scripts contained in https://github.com/fernphy/ftol, in particular setup_gb.R (https://github.com/fernphy/ftol/blob/main/R/setup_gb.R). Next, a set of reference FASTA files for 79 target loci (one per locus; ref_aln.tar.gz) is generated. These include 77 protein-coding genes, drawn from a list of 83 genes (Wei et al. 2017) that was filtered to retain only genes showing no evidence of duplication, plus two spacer regions (trnL-trnF and rps4-trnS). Each FASTA file in ref_aln.tar.gz includes one representative (longest) sequence per available fern genus. This is done with prep_ref_seqs_plan.R (https://github.com/fernphy/ftol/blob/main/prep_ref_seqs_plan.R). Sequences matching the target loci are then extracted from each accession in the local database with the “Reference_Blast_Extract.py” script of superCRUNCH (Portik and Wiens 2020), using the FASTA files contained in ref_aln.tar.gz as references. The extracted sequences are aligned with MAFFT (Katoh et al. 2002), phylogenetic analysis is done using IQ-TREE (Nguyen et al. 2015), and divergence times are estimated with treePL (Smith and O’Meara 2012). For additional methodological details, see: Nitta JH, Schuettpelz E, Ramírez-Barahona S, Iwasaki W. 2022. An open and continuously updated fern tree of life. Frontiers in Plant Science 13. https://doi.org/10.3389/fpls.2022.909768.
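
The rule of keeping one representative (longest) sequence per genus can be illustrated with a minimal Python sketch. This is not the pipeline's own code (the pipeline does this step in R, in prep_ref_seqs_plan.R). The header layout "<accession> <Genus> <species> ...", the accession labels ACC001 to ACC003, and the function names are hypothetical, chosen only for illustration. Length is taken here as the raw character count of each sequence, which is also an assumption.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def genus_of(header):
    # ASSUMPTION: headers look like "<accession> <Genus> <species> ...".
    # Real reference files may use a different layout.
    parts = header.split()
    return parts[1] if len(parts) > 1 else header


def longest_per_genus(records):
    """Keep the longest sequence seen for each genus (first one wins ties)."""
    best = {}
    for header, seq in records:
        genus = genus_of(header)
        if genus not in best or len(seq) > len(best[genus][1]):
            best[genus] = (header, seq)
    return best


# Made-up records: two for Pteris, one for Azolla.
example = """>ACC001 Pteris vittata rbcL
ATGTCACCAC
>ACC002 Pteris cretica rbcL
ATGTCACCACAAACG
>ACC003 Azolla filiculoides rbcL
ATGTCA
"""

reps = longest_per_genus(parse_fasta(example))
for genus, (header, seq) in sorted(reps.items()):
    print(genus, header.split()[0], len(seq))
```

With the made-up records above, the 15-base Pteris record is kept over the 10-base one, and Azolla keeps its only record.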

Provider:

figshare

Created:

2022-11-10
