
Additional file 2: of A method for identification of highly conserved elements and evolutionary analysis of superphylum Alveolata

Source: DataCite Commons. Updated: 2024-12-17. Indexed: 2024-07-27.
Download link:
https://springernature.figshare.com/articles/dataset/Additional_file_2_of_A_method_for_identification_of_highly_conserved_elements_and_evolutionary_analysis_of_superphylum_Alveolata/4352090/1
Description:
Presents in detail all clusters found in the final graph by our algorithm. The clusters are ordered by their numbers in column A; the first cluster is a giant one and is shown only partially. The first line of each cluster is marked with fixed numbers in columns C–D and gives the number of vertices (words) of that cluster in column E. Each subsequent line corresponds to one word and contains the following data in columns A–J: the cluster number (A), the number of species in the cluster (B), the vertex degree (C), the vertex density, i.e., the number of graph parts this vertex is connected to (D), the species name (E), the sequence name (F), the start position of the word in the sequence (G), the word length (H), the DNA strand indicator (I), and the word itself (J). The part of the word shown in capital letters corresponds to the intersection of all words merged at this vertex (a group); lowercase letters correspond to the union of those words. If the word overlaps both a gene and its coding sequence (CDS) according to the genome annotation available in GenBank, the word corresponds to a protein. In that case, the gene data, including the protein description, are shown in columns K–O, and the CDS data in columns P–R. If the word overlaps only the gene and not its CDS, it belongs to an untranslated region of the gene, such as an intron, and only the gene data are shown. If a word is a fragment of a known non-protein-coding RNA according to the Rfam database, columns S–AB contain the RNA name and other data. Clusters that correspond to untranslated regions or unknown HCEs are highlighted in column A in gold or blue, respectively. (XLSX 10204 kb)
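The column layout described above can be sketched as a small parser. This is only an illustration under stated assumptions, not the authors' tooling: the names `Word`, `is_header`, `group_by_cluster`, and `sample_rows` are hypothetical, the sample values are invented placeholders, and cluster header lines are recognized by the heuristic that column E holds a number (the vertex count) rather than a species name.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence


@dataclass
class Word:
    """One word line, columns A-J as listed in the description."""
    cluster: int    # A: cluster number
    n_species: int  # B: number of species in the cluster
    degree: int     # C: vertex degree
    density: int    # D: number of graph parts the vertex is connected to
    species: str    # E: species name
    sequence: str   # F: sequence name
    start: int      # G: start position of the word in the sequence
    length: int     # H: word length
    strand: str     # I: DNA strand indicator (format not specified, kept as text)
    word: str       # J: the word; capitals = intersection, lowercase = union

    @property
    def core(self) -> str:
        """Capital letters only, i.e. the part shared by all merged words."""
        return "".join(ch for ch in self.word if ch.isupper())


def is_header(row: Sequence) -> bool:
    """Assumed heuristic: a cluster header line has a number in column E."""
    cell = row[4]
    return isinstance(cell, (int, float)) or str(cell).strip().isdigit()


def group_by_cluster(rows: Iterable[Sequence]) -> Dict[int, List[Word]]:
    """Skip header lines and collect the word lines of each cluster."""
    clusters: Dict[int, List[Word]] = {}
    for row in rows:
        if is_header(row):
            continue
        w = Word(int(row[0]), int(row[1]), int(row[2]), int(row[3]),
                 str(row[4]), str(row[5]), int(row[6]), int(row[7]),
                 str(row[8]), str(row[9]))
        clusters.setdefault(w.cluster, []).append(w)
    return clusters


# Invented placeholder rows; the real marker values in C-D are not given here.
sample_rows = [
    [2, None, 0, 0, 2, None, None, None, None, None],            # header line
    [2, 2, 1, 1, "species_1", "seq_1", 100, 8, "+", "ttACGTga"],  # word line
    [2, 2, 1, 1, "species_2", "seq_7", 540, 6, "-", "tACGTg"],    # word line
]

clusters = group_by_cluster(sample_rows)
```

With the real file, the rows could be read from the spreadsheet first, for example with openpyxl's `iter_rows(values_only=True)`, and any title row at the top of the sheet would need to be skipped before grouping.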

Provider:
Figshare
Created:
2017-12-19