five

Dataset_S1 Syntenic Halophilic Tribes matrix

收藏
DataONE2012-08-08 更新2024-06-27 收录
下载链接:
https://search.dataone.org/view/null
下载链接
链接失效反馈
官方服务:
资源简介:
In order to determine phylogenetic distribution of haloarchaeal genes, a gene presence/absence matrix was constructed by the following process. Independent multi-genome alignments were made for the Haloferax and Haloarcula genera using the whole genome alignment method progressiveMauve [64]. The contigs for each alignment were reordered to match the published genomes of Haloferax volcanii [19] and Haloarcula marismortui [18], respectively, using Mauve’s built-in contig reordering program (Figures S3 and S4). Sets of functionally homologous genes (orthologs), referred to hereafter as Syntenic Halophile Tribes (SHTs), were determined from alignments and joined by the following process. The proteins in each SHT from the Haloferax alignment were searched against all proteins in each SHT from the Haloarcula genomes using BLAST [37] and a bit score for each pair of SHTs was calculated by averaging the bit scores from each BLAST hit. A traditional reciprocal best hit (RBH) BLAST approach was used to produce one-to-one mappings between SHTs in the two genera. Each joined SHT was assigned a function using the most commonly occurring functional annotation of the protein products of the genes in the SHT. This resulted in a set of 398 SHTs present in all nine genomes. Hidden Markov Models (HMMs) were generated for each SHT using HMMER 3, resulting in 13,276 HMMs. The 1,303 completed archaeal and bacterial genomes available from NCBI as of March 15, 2011 were downloaded and a single genome from each genus selected at random, resulting in 396 genomes. Each SHT HMM was searched against these 396 genomes and the eight halophile genomes generated for this study using HMMER 3. Each gene was counted as belonging to the HMM if it had an E-value below 0.0001 and the hit covered greater than 80% of the length of both the gene and the HMM. If a gene hit more than one HMM it was counted only for the HMM with the best E-value. These hits were then used to generate a 13,276 x 405 presence/absence matrix. The genomes and HMMs were clustered using the ‘ctc’ library in R [65] with manhattan distance and complete linkage clustering. The clustering was viewed with the Java Treeview program [66]. Cluster file can be accessed at our website [60] and as Dataset S1 and Figure S5.
创建时间:
2012-08-08
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作