
Arabidopsis thaliana CNSs verified in at least 2 CNS lists

DataCite Commons. Updated 2025-05-01. Indexed 2024-07-25.
Download link:
https://figshare.com/articles/dataset/Arabidopsis_thaliana_CNSs_verified_in_at_least_2_CNS_lists/1422166/1
Description:
This dataset is a list of Arabidopsis thaliana CNS sequences present in at least two of the following three CNS lists:
1) Haudry et al. (2013) An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45:891-898.
2) PL3.0 (TAIR 10 version): Turco et al. (2013) Automated conserved noncoding sequence (CNS) discovery reveals differences in gene content and promoter evolution among grasses. Frontiers in Plant Genetics and Genomics 4:170-180.
3) Van de Velde et al. (2014) Inferences of transcriptional networks in Arabidopsis through conserved noncoding sequence analysis. Plant Cell 26:2729-2745.
CNS sequences found in at least 2 of the 3 CNS lists were identified using multiIntersectBed from the BEDTools suite (Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842).
CNSs from the verified2 list were assigned to an Arabidopsis thaliana gene based on their PL3.0 component. PL3.0 CNSs are defined as syntenic conserved noncoding regions between Arabidopsis thaliana and the early-branching Brassicaceae species Aethionema arabicum. Orthologous Arabidopsis thaliana-Aethionema arabicum genes were identified using a combination of CoGe: Synfind (Tang et al. (2011) BMC Bioinformatics 12:102) and the PL3.0 CNS pipeline (Turco et al. 2013). closestBed (BEDTools) was then used to map each PL3.0 CNS to the closest Arabidopsis thaliana gene with an Aethionema arabicum ortholog. Distance to the nearest gene is included in the closestBed output. Proximal regions were defined as the 1000 bp upstream of the transcription start site (5' proximal) or the 1000 bp downstream of the gene (3' proximal).
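The "at least 2 of 3 lists" criterion can be illustrated with a small pure-Python sweep over interval endpoints. This is only a sketch of the idea, not the authors' command line and not a reimplementation of multiIntersectBed: the function name `regions_in_at_least`, the example coordinates, and the choice to merge adjacent sub-intervals into one region are all assumptions made for illustration. Intervals are assumed to be BED-style (0-based, half-open).

```python
from collections import defaultdict

def regions_in_at_least(lists, min_lists=2):
    """Return merged (chrom, start, end) regions covered by >= min_lists inputs.

    `lists` is a sequence of interval lists; each interval is a
    (chrom, start, end) tuple in BED-style 0-based half-open coordinates.
    Hypothetical helper for illustration only.
    """
    events = defaultdict(list)
    for idx, intervals in enumerate(lists):
        for chrom, start, end in intervals:
            if end > start:
                events[chrom].append((start, 1, idx))   # interval opens
                events[chrom].append((end, -1, idx))    # interval closes

    result = []
    for chrom in sorted(events):
        depth = [0] * len(lists)   # per-list coverage depth at the sweep position
        region_start = None
        evs = sorted(events[chrom])
        i = 0
        while i < len(evs):
            pos = evs[i][0]
            # Apply every event at this position before counting lists.
            while i < len(evs) and evs[i][0] == pos:
                _, delta, idx = evs[i]
                depth[idx] += delta
                i += 1
            n_lists = sum(1 for d in depth if d > 0)
            if n_lists >= min_lists and region_start is None:
                region_start = pos
            elif n_lists < min_lists and region_start is not None:
                result.append((chrom, region_start, pos))
                region_start = None
    return result

# Made-up coordinates standing in for the three CNS lists.
list_a = [("Chr1", 100, 200), ("Chr1", 300, 350)]
list_b = [("Chr1", 150, 250)]
list_c = [("Chr1", 180, 320), ("Chr2", 10, 20)]
verified2 = regions_in_at_least([list_a, list_b, list_c], min_lists=2)
```

Counting depth per list, rather than total depth, keeps overlapping intervals within a single list from being mistaken for support from two lists.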
CNSs without a PL3.0 component were also assigned to an Arabidopsis thaliana gene if they were intragenic or if they were in the genespace of an Arabidopsis gene, with the genespace defined as the region between and encompassing the 5'-most PL3.0 CNS and the 3'-most PL3.0 CNS. For intragenic CNSs, a custom Perl script was used to identify the position of the CNS in introns vs UTRs. Overlap with UTRs and CDS regions was calculated using intersectBed (BEDTools) with BED files created from GFF "UTR", "gene", and "CDS" features. CNS sequences overlapping CDSs by 50% or more were given "CDS" designations. CNSs overlapping UTRs by 50% or more were given 5' or 3' UTR designations.
Note: CNS assignments to Arabidopsis thaliana genes are best-guess computational assignments; an individual PL3.0 CNS may in fact regulate a gene other than the closest Arabidopsis thaliana gene with an Aethionema arabicum ortholog. This is particularly true for genes with complex regulation. In the GEvo links included in this spreadsheet these cases can often be seen as clusters of CNSs extending beyond the midpoint between two Arabidopsis thaliana genes. By adding orthologous genes to GEvo panels, it is often possible to assign a CNS to an Arabidopsis thaliana gene with greater confidence if only one of the two Arabidopsis thaliana genes is retained in all genomes along with the CNS.
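The 50% overlap rule for CDS and UTR designations can be sketched as follows. This is a hedged illustration under stated assumptions, not the intersectBed invocation or Perl script used for the dataset: the helper names `covered_bases` and `designate`, the feature labels used as dictionary keys, and the order in which labels are tested (CDS first, then 5' UTR, then 3' UTR) are all hypothetical. Coordinates are assumed to be 0-based, half-open, and on a single chromosome.

```python
def covered_bases(start, end, features):
    """Count bases of [start, end) covered by any (start, end) feature.

    Overlapping features are merged so that no base is counted twice.
    """
    clipped = sorted(
        (max(start, s), min(end, e))
        for s, e in features
        if s < end and e > start
    )
    total = 0
    cursor = start  # right edge of what has already been counted
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            total += e - s
            cursor = e
    return total

def designate(start, end, features_by_label, threshold=0.5,
              order=("CDS", "5' UTR", "3' UTR")):
    """Return the first label whose features cover >= threshold of the CNS.

    Returns None when no label reaches the threshold. The test order is an
    assumption; the source does not state how ties are resolved.
    """
    length = end - start
    if length <= 0:
        raise ValueError("CNS interval must have positive length")
    for label in order:
        covered = covered_bases(start, end, features_by_label.get(label, []))
        if covered / length >= threshold:
            return label
    return None

# Made-up example: a 100 bp CNS with 50 bp inside a CDS feature.
example = designate(100, 200, {"CDS": [(150, 260)]})
```

A CNS that reaches the threshold for none of the labels is left undesignated here; in the dataset such intragenic cases were resolved separately (intron vs UTR position).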

Provider:
figshare
Created:
2016-01-19