
Fern Tree of Life (FTOL) input data

Source: figshare.com. Updated 2024-10-30; indexed 2025-03-26.
Download link:
https://figshare.com/articles/dataset/Fern_Tree_of_Life_FTOL_input_data/19474316/9
Description:
The data included here are used in a pipeline (https://github.com/fernphy/ftol) that (mostly) automatically generates a maximally sampled fern phylogenetic tree based on plastid sequences in GenBank. The first step is to download the latest release of GenBank data from the NCBI GenBank FTP site (https://ftp.ncbi.nlm.nih.gov/genbank/) and use it to create a local database of fern sequences. This is done with custom R scripts contained in https://github.com/fernphy/ftol, in particular setup_gb.R (https://github.com/fernphy/ftol/blob/main/R/setup_gb.R). Next, a set of reference FASTA files for 79 target loci (one per locus; ref_aln.tar.gz) is generated with prep_ref_seqs_plan.R (https://github.com/fernphy/ftol/blob/main/prep_ref_seqs_plan.R). These loci comprise 77 protein-coding genes, taken from a list of 83 genes (Wei et al. 2017) that was filtered to retain only genes showing no evidence of duplication, plus two spacer regions (trnL-trnF and rps4-trnS). Each FASTA file in ref_aln.tar.gz includes one representative (longest) sequence per available fern genus. Sequences matching the target loci are then extracted from each accession in the local database with the “Reference_Blast_Extract.py” script of superCRUNCH (Portik and Wiens 2020), using the FASTA files in ref_aln.tar.gz as references. The extracted sequences are aligned with MAFFT (Katoh et al. 2002), phylogenetic analysis is done with IQ-TREE (Nguyen et al. 2015), and divergence times are estimated with treePL (Smith and O’Meara 2012). For additional methodological details, see: Nitta JH, Schuettpelz E, Ramírez-Barahona S, Iwasaki W. 2022. An open and continuously updated fern tree of life. Frontiers in Plant Science 13. https://doi.org/10.3389/fpls.2022.909768.
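The reference-building step keeps one representative (longest) sequence per genus for each locus. The sketch below illustrates that selection rule in Python; it is not the code used by the FTOL pipeline, which is written in R. The function names (parse_fasta, genus_of, longest_per_genus) are hypothetical, and the header layout ">accession Genus species" is an assumption made only for this example, so the actual files in ref_aln.tar.gz may be labeled differently.

```python
from typing import Dict, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def genus_of(header: str) -> str:
    """Return the genus from a header.

    ASSUMPTION (not from the source): headers look like
    "<accession> <Genus> <species> ...".
    """
    fields = header.split()
    if len(fields) < 2:
        raise ValueError(f"cannot find a genus in header: {header!r}")
    return fields[1]


def longest_per_genus(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each genus to the (header, sequence) of its longest sequence.

    Ties are resolved in favor of the record that appears first.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in parse_fasta(text):
        genus = genus_of(header)
        if genus not in best or len(seq) > len(best[genus][1]):
            best[genus] = (header, seq)
    return best


if __name__ == "__main__":
    # Made-up toy records, for illustration only.
    example = (
        ">A1 Pteris vittata\nATGC\nAT\n"
        ">A2 Pteris cretica\nATG\n"
        ">A3 Adiantum capillus-veneris\nATGCATGC\n"
    )
    for genus, (header, seq) in longest_per_genus(example).items():
        print(genus, header.split()[0], len(seq))
```

Applied to one locus at a time, a rule like this yields one FASTA record per genus, which is the shape of reference file the description says is passed to Reference_Blast_Extract.py.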
Provider:
figshare.com