Arabidopsis thaliana CNSs verified in at least 2 CNS lists
Source: Figshare. Updated 2016-01-19; indexed 2026-04-08.
Download link:
https://figshare.com/articles/dataset/Arabidopsis_thaliana_CNSs_verified_in_at_least_2_CNS_lists/1422166/1
Description:
This dataset is a list of Arabidopsis thaliana conserved noncoding sequences (CNSs) present in at least two of the following three CNS lists:
1) Haudry et al. (2013) An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45:891-898.
2) PL3.0 (TAIR 10 version): Turco et al. (2013) Automated conserved noncoding sequence (CNS) discovery reveals differences in gene content and promoter evolution among grasses. Frontiers in Plant Genetics and Genomics 4:170-180.
3) Van de Velde et al. (2014) Inferences of transcriptional networks in Arabidopsis through conserved noncoding sequence analysis. Plant Cell 26:2729-2745.
CNSs found in at least 2 of the 3 lists were identified using multiIntersectBed from the BEDTools suite (Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842).
CNSs from the verified2 list were assigned to an Arabidopsis thaliana gene based on their PL3.0 component. PL3.0 CNSs are defined as syntenic conserved noncoding regions between Arabidopsis thaliana and the early-branching Brassicaceae species Aethionema arabicum. Orthologous Arabidopsis thaliana-Aethionema arabicum genes were identified using a combination of CoGe: Synfind (Tang et al. (2011) BMC Bioinformatics 12:102) and the PL3.0 CNS pipeline (Turco et al. 2013). closestBed (BEDTools) was then used to map each PL3.0 CNS to the closest Arabidopsis thaliana gene with an Aethionema arabicum ortholog. The distance to the nearest gene is included in the closestBed output. Proximal regions were defined as the 1000 bp upstream of the transcription start site (5' proximal) or the 1000 bp downstream of the gene (3' proximal).
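The "at least 2 of 3 lists" filter described above can be illustrated with a small pure-Python sketch. It is a stand-in for the idea behind multiIntersectBed, not the BEDTools implementation and not the authors' actual command. The function names, the toy coordinates, and the choice to report the overlapping region itself (rather than the full CNS from any one list) are assumptions made for illustration.

```python
from collections import defaultdict


def merge(intervals):
    """Merge overlapping or touching (start, end) intervals from one list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def regions_in_at_least(lists, k=2):
    """Return (chrom, start, end) regions covered by at least k input lists.

    Each list holds (chrom, start, end) tuples with BED-style half-open
    coordinates. Intervals are merged within a list first, so one list can
    contribute at most 1 to the depth at any position.
    """
    events = defaultdict(list)
    for intervals in lists:
        by_chrom = defaultdict(list)
        for chrom, start, end in intervals:
            by_chrom[chrom].append((start, end))
        for chrom, ivs in by_chrom.items():
            for start, end in merge(ivs):
                events[chrom].append((start, 1))   # a list starts covering
                events[chrom].append((end, -1))    # a list stops covering

    out = []
    for chrom in sorted(events):
        depth = 0
        region_start = None
        # At equal positions, process starts before ends so that a region
        # supported by k lists is not split where support changes hands.
        for pos, delta in sorted(events[chrom], key=lambda e: (e[0], -e[1])):
            new_depth = depth + delta
            if depth < k <= new_depth:
                region_start = pos
            elif new_depth < k <= depth and pos > region_start:
                out.append((chrom, region_start, pos))
            depth = new_depth
    return out


# Toy coordinates, invented for this example only.
haudry = [("Chr1", 100, 200), ("Chr1", 500, 600)]
pl3 = [("Chr1", 150, 250)]
van_de_velde = [("Chr1", 180, 300), ("Chr1", 550, 650), ("Chr2", 10, 20)]

verified2 = regions_in_at_least([haudry, pl3, van_de_velde], k=2)
print(verified2)
```

In this toy input, only the stretches supported by two or more lists are kept; the Chr2 interval appears in a single list and is dropped.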
CNSs without a PL3.0 component were also assigned to an Arabidopsis thaliana gene if they were intragenic or if they fell within the genespace of an Arabidopsis gene, with the genespace defined as the region between and encompassing the 5'-most PL3.0 CNS and the 3'-most PL3.0 CNS. For intragenic CNSs, a custom Perl script was used to determine whether the CNS lies in an intron or a UTR. Overlap with UTRs and CDS regions was calculated using intersectBed (BEDTools) with BED files created from GFF "UTR", "gene", and "CDS" features. CNSs overlapping CDSs by 50% or more were given "CDS" designations. CNSs overlapping UTRs by 50% or more were given 5' or 3' UTR designations.
Note: CNS assignments to Arabidopsis thaliana genes are best-guess computational assignments; an individual PL3.0 CNS may actually regulate a gene other than the closest Arabidopsis thaliana gene with an Aethionema arabicum ortholog. This is particularly true for genes with complex regulation. In the GEvo links included in this spreadsheet, such cases can often be seen as clusters of CNSs extending beyond the midpoint between two Arabidopsis thaliana genes. By adding additional orthologous genes to GEvo panels, it is often possible to assign a CNS to an Arabidopsis thaliana gene with greater confidence if only one of the two Arabidopsis thaliana genes is retained in all genomes along with the CNS.
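The 50% overlap rule for CDS and UTR designations can be sketched as follows. This is an illustration under stated assumptions, not the authors' intersectBed invocation or Perl script: the overlap fraction is taken relative to the CNS length, CDS is checked before the UTRs, features of one type are assumed not to overlap each other, and the fallback label "unassigned" and all coordinates are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def covered_fraction(cns, features):
    """Fraction of the CNS covered by the given (start, end) features.

    Assumes the features do not overlap one another, so summing the
    per-feature overlaps does not double count any base.
    """
    start, end = cns
    covered = sum(overlap_bp(start, end, fs, fe) for fs, fe in features)
    return covered / (end - start)


def designate(cns, features_by_type, min_fraction=0.5):
    """Return the first feature type covering at least min_fraction of the CNS.

    The check order (CDS, then 5' UTR, then 3' UTR) is an assumption; the
    source does not state how ties between feature types were resolved.
    """
    for label in ("CDS", "5' UTR", "3' UTR"):
        features = features_by_type.get(label, [])
        if covered_fraction(cns, features) >= min_fraction:
            return label
    return "unassigned"  # hypothetical label for CNSs below the threshold


# Toy gene model on one chromosome, invented for this example only.
features = {
    "5' UTR": [(1000, 1100)],
    "CDS": [(1100, 1400)],
    "3' UTR": [(1900, 2000)],
}

for cns in [(1050, 1110), (1090, 1130), (1500, 1540)]:
    print(cns, designate(cns, features))
```

A CNS that straddles a UTR/CDS boundary is labelled by whichever feature covers at least half of it, so the first toy CNS is a 5' UTR case and the second a CDS case, while the third overlaps neither.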
Created:
2015-05-21



