Additional file 4 of Single nucleotide polymorphism discovery from expressed sequence tags in the waterflea Daphnia magna
Source: DataCite Commons. Updated: 2020-08-27. Indexed: 2024-07-28.
Download link:
https://springernature.figshare.com/articles/dataset/Additional_file_4_of_Single_nucleotide_polymorphism_discovery_from_expressed_sequence_tags_in_the_waterflea_Daphnia_magna/12873026
Description:
Additional file 4: Summary of the gene annotation of the EST sequences. This file reports the gene annotation for three sets of sequences, based on BLAST searches in NCBI and in the Daphnia portal (http://wfleabase.org/, called wfleabase in the remaining text):

1) ESTs generated for this study by exposing animals to three key environmental stressors and using suppressive subtractive hybridization. The results for this set of sequences are summarized in the spreadsheets EST_1070_NCBI and EST_1070_wfleabase_aa. EST_1070_NCBI summarizes the gene annotation results obtained from BLAST searches in the NCBI non-redundant protein database using the program tblastx. EST_1070_wfleabase_aa summarizes the results obtained from BLAST searches in the non-redundant protein database of the Daphnia portal (wfleabase) using the program tblastx.

2) Contigs obtained by assembling EST sequences produced in this study (see point 1 above) and sequences of Daphnia magna downloaded from NCBI GenBank at the time of the analysis. The results for this set of sequences are summarized in the spreadsheets Contigs_NCBI_1812, Contigs_wfleabase_aa_1812, and Contigs_wfleabase_na_1812. Contigs_NCBI_1812 summarizes the gene annotation results obtained from BLAST searches in the NCBI non-redundant protein database using the program tblastx. Contigs_wfleabase_aa_1812 and Contigs_wfleabase_na_1812 summarize the results obtained from BLAST searches in the non-redundant protein database and in the nucleotide database of the Daphnia portal (wfleabase) using the programs tblastx and tblastn, respectively.

3) Contigs obtained from clusters of sequences mined for SNP markers. The number of contigs mined for SNPs is lower than the total number of contigs including our sequences and sequences from GenBank (point 2 above) because several stringent criteria were adopted to select them (see Methods). The results for this set of sequences are summarized in the spreadsheets Contigs_NCBI_574, Contigs_wfleabase_aa_574, and Contigs_wfleabase_na_574. Results from BLAST searches were obtained as in point 2 of this table legend.

The column IDs in the spreadsheets EST_1070_NCBI, Contigs_NCBI_1812, and Contigs_NCBI_574 are as follows:
1) SID - sequence identity
2) GOID - Gene ontology term identity
3) PID - Protein identity as from BLAST searches
4) P_desc - Gene description as from BLAST searches and indication of the species where it was identified
5) e-value - significance of the homology between the query sequence and the hit in NCBI
6) Paralog - the paralog group identity (several members may be shown)
7) Start-End: FrameFS - open reading frame predictor results, with the start and end coordinates and the frame
8) DomainID:desc - protein site scan domain identity and description of the protein domain
9) length - length of the EST
10) OG_ID - group identity of the ortholog group of protein sequences. This analysis is based on searches for orthologs in several genomes
11) E-value - significance of the homology to the ortholog group of protein sequences
12) Score - score for the ortholog group of protein sequences analysis

In the remaining spreadsheets the following column IDs are present:
1) query id - query identity
2) database sequence (subject) id - sequence identity in wfleabase
3) gene id - gene identity in wfleabase
4) percent identity - percentage of identity between the query and the gene in wfleabase
5) alignment length - match in bp between the query and the gene in wfleabase
6) number of mismatches - number of mismatches between the query and the gene in wfleabase
7) number of gap openings - gap openings between the query and the gene in wfleabase
8) query start
9) query end
10) subject start - database sequence (subject) start
11) subject end - database sequence (subject) end
12) Expect value - E-value of the match between the query and the subject
13) HSP bit score - blastp e-value score
14) Gene_ID - gene identity in wfleabase
15) Gname - gene name
16) Gnomon - gene prediction in NCBI
17) Paralog
18) Paralog,# - number of paralogs identified
19) OrthoID - ortholog identity
20) ArpGene - homology to the arthropod genes list
21) ArpDE - arthropod genes description
22) Scaffold - scaffold number where the query was annotated
23) Begin - query start on the scaffold
24) End - query end on the scaffold
25) Or - orphan gene
26) KOG_JGI - ortholog and paralog protein identities provided for a JGI-sequenced organism
27) KOG_EMBL - ortholog and paralog protein identities provided in the EMBL database
28) meNOG_EMBL - evolutionary genealogy of genes
29) Enzyme_JGI - protein identity reported in JGI
30) Enzyme_JGI - protein identity reported in EMBL
31) Description_JGI - protein description based on JGI database
32) GeneOntology_JGI - Gene ontology as described in the JGI database
33) Tandem_ID - identity of tandem gene arrangements

The column IDs are listed in the column_IDs spreadsheet. (XLS 3 MB)
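To illustrate how the first 13 columns of the wfleabase spreadsheets might be used, the following is a minimal Python sketch, not part of the original dataset. It assumes a sheet has been exported to tab-separated text without a header row, which the source does not state. The sample rows, the names `parse_hits` and `best_hit_per_query`, and the E-value cutoff of 1e-5 are all hypothetical choices made for illustration.

```python
import csv
import io

# Columns 1-13 as listed in the legend for the wfleabase spreadsheets.
WFLEABASE_COLUMNS = [
    "query id", "subject id", "gene id", "percent identity",
    "alignment length", "mismatches", "gap openings",
    "query start", "query end", "subject start", "subject end",
    "evalue", "bit score",
]
INT_FIELDS = [
    "alignment length", "mismatches", "gap openings",
    "query start", "query end", "subject start", "subject end",
]
FLOAT_FIELDS = ["percent identity", "evalue", "bit score"]

# Hypothetical rows, invented for illustration; NOT taken from the dataset.
SAMPLE_TSV = (
    "contig_0001\tscaffold_12\tGENE_A\t98.5\t210\t3\t0\t1\t210\t5000\t5209\t1e-50\t200.0\n"
    "contig_0001\tscaffold_40\tGENE_B\t91.0\t180\t16\t1\t5\t184\t700\t879\t1e-30\t150.0\n"
    "contig_0002\tscaffold_07\tGENE_C\t60.0\t45\t18\t2\t10\t54\t90\t134\t0.5\t30.0\n"
)


def parse_hits(text):
    """Parse tab-separated hit rows into dicts with numeric fields converted."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(WFLEABASE_COLUMNS, row))
        for key in INT_FIELDS:
            record[key] = int(record[key])
        for key in FLOAT_FIELDS:
            record[key] = float(record[key])
        hits.append(record)
    return hits


def best_hit_per_query(hits, max_evalue=1e-5):
    """Keep, for each query, the hit with the highest bit score among
    hits whose E-value does not exceed max_evalue (an assumed cutoff)."""
    best = {}
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["query id"])
        if current is None or hit["bit score"] > current["bit score"]:
            best[hit["query id"]] = hit
    return best


if __name__ == "__main__":
    for query, hit in best_hit_per_query(parse_hits(SAMPLE_TSV)).items():
        print(query, hit["gene id"], hit["evalue"], hit["bit score"])
```

Selecting the top hit by bit score after an E-value filter is one common convention for summarizing BLAST output; the authors' own selection criteria are described in their Methods and may differ.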
Provider:
figshare
Created:
2020-08-27



