Advances in understanding cis regulation of the plant gene with an emphasis on comparative genomics | Plant gene regulation dataset | Comparative genomics dataset

Source: DataCite Commons | Updated 2025-06-01 | Indexed 2024-07-27
Plant gene regulation
Comparative genomics
Download link:
https://figshare.com/articles/dataset/Advances_in_understanding_cis_regulation_of_the_plant_gene_with_an_emphasis_on_comparative_genomics/1397563/1
Description:
This dataset is a list of <em>Arabidopsis thaliana</em> conserved noncoding sequences (CNSs) concatenated from the following CNS lists:
1) Haudry et al. (2013) An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45:891-898.
2) PL3.0 (TAIR 10 version): Turco et al. (2013) Automated conserved noncoding sequence (CNS) discovery reveals differences in gene content and promoter evolution among grasses. Frontiers in Plant Genetics and Genomics 4:170-180.
3) Van de Velde et al. (2014) Inferences of transcriptional networks in Arabidopsis through conserved noncoding sequence analysis. Plant Cell 26:2729-2745.

PL3.0 CNSs, which are syntenic conserved noncoding regions between <em>Arabidopsis thaliana</em> and the early-branching Brassicaceae <em>Aethionema arabicum</em>, were assigned to the closest <em>Arabidopsis thaliana</em> gene with an <em>Aethionema arabicum</em> ortholog. Orthologous <em>Arabidopsis thaliana</em>-<em>Aethionema arabicum</em> genes were identified using a combination of CoGe SynFind (Tang et al. (2011) BMC Bioinformatics 12:102) and the PL3.0 CNS pipeline output (Turco et al. 2013). closestBed (BEDTools) was then used to map each PL3.0 CNS to the closest <em>Arabidopsis thaliana</em> gene that had an <em>Aethionema arabicum</em> ortholog. The distance to the nearest gene is included in the closestBed output.

Proximal regions were defined as the 1000 bp upstream of the transcription start site (5' proximal) or the 1000 bp downstream of the gene (3' proximal). For intragenic CNSs, a custom Perl script was used to identify whether the CNS lies in an intron or a UTR. Overlap with UTRs and CDS regions was calculated with intersectBed (BEDTools), using BED files created from GFF "UTR" and "CDS" features. CNSs overlapping CDSs by 50% or more were given "CDS" designations. CNSs overlapping UTRs by 50% or more were given 5' or 3' UTR designations.
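The position rules above (50% overlap for CDS and UTR designations, 1000 bp proximal windows) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used closestBed, intersectBed, and a custom Perl script. The function names, the half-open coordinate convention, the "distal" label for CNSs beyond 1000 bp, the treatment of any remaining intragenic CNS as intronic, and the use of the gene start as a stand-in for the transcription start site are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates, as in BED


def overlap_fraction(cns: Interval, features: List[Interval]) -> float:
    """Fraction of the CNS covered by features (assumed not to overlap each other)."""
    start, end = cns
    covered = sum(
        max(0, min(end, f_end) - max(start, f_start)) for f_start, f_end in features
    )
    return covered / (end - start)


def classify_cns(cns, gene, strand, cds, utr5, utr3, proximal=1000, min_frac=0.5):
    """Hypothetical helper: label a CNS relative to one gene on the same chromosome."""
    # 50%-or-more overlap rule from the description, CDS checked first.
    if overlap_fraction(cns, cds) >= min_frac:
        return "CDS"
    if overlap_fraction(cns, utr5) >= min_frac:
        return "5' UTR"
    if overlap_fraction(cns, utr3) >= min_frac:
        return "3' UTR"

    cns_start, cns_end = cns
    gene_start, gene_end = gene
    if cns_end <= gene_start:
        # CNS lies left of the gene: upstream only for a plus-strand gene.
        gap, upstream = gene_start - cns_end, strand == "+"
    elif cns_start >= gene_end:
        # CNS lies right of the gene: upstream only for a minus-strand gene.
        gap, upstream = cns_start - gene_end, strand == "-"
    else:
        # Overlaps the gene but is not mostly CDS or UTR (simplification).
        return "intron"

    side = "5'" if upstream else "3'"
    zone = "proximal" if gap <= proximal else "distal"  # "distal" is our label
    return f"{side} {zone}"


# Made-up coordinates for a single plus-strand gene.
gene = (5000, 7000)
cds = [(5200, 5600), (6000, 6500)]
utr5 = [(5000, 5200)]
utr3 = [(6500, 7000)]
print(classify_cns((4500, 4600), gene, "+", cds, utr5, utr3))
```

The gap computed here is the number of bases between the CNS and the gene, which may differ from the distance value closestBed reports for the same pair.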
CNSs from the Haudry and Van de Velde CNS lists were then assigned to an <em>Arabidopsis thaliana</em> gene if they were present in the genespace of that gene, with the genespace defined as the region between and encompassing the 5'-most PL3.0 CNS and the 3'-most PL3.0 CNS. Once a CNS was assigned to a gene, the distance to that gene was calculated using closestBed (BEDTools), and intersectBed was used, as above, to identify the position of intragenic CNSs.

An <em>Arabidopsis thaliana</em> genome has been made available on CoGe (dsgid 25725), decorated with two sets of CNSs: 1) the PL3.0 CNSs from this datasheet and 2) a merged set of CNSs from the PL3.0, Haudry, and Van de Velde CNS lists. To see the CNSs, set "Show preannotated CNSs?" to "Yes" under Results Visualization Options.

Note: CNS assignments to <em>Arabidopsis thaliana</em> genes are best-guess computational assignments. An individual PL3.0 CNS may actually regulate a gene other than the closest <em>Arabidopsis thaliana</em> gene with an <em>Aethionema arabicum</em> ortholog. This is particularly true for genes with complex regulation. In the GEvo links included in this spreadsheet, such cases often appear as clusters of CNSs extending beyond the midpoint between two <em>Arabidopsis thaliana</em> genes. Adding further orthologous genes to the GEvo panels often makes it possible to assign a CNS to an <em>Arabidopsis thaliana</em> gene with greater confidence, provided that only one of the two <em>Arabidopsis thaliana</em> genes is retained in all genomes along with the CNS.
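The genespace rule used for the Haudry and Van de Velde CNSs can be sketched in Python as follows. This is an illustration under stated assumptions and not the authors' code: the function names and gene labels are hypothetical, all intervals are taken to lie on one chromosome, and "present in the genespace" is interpreted as fully contained within it.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on one chromosome


def build_genespaces(pl3_cns: List[Tuple[str, int, int]]) -> Dict[str, Interval]:
    """Span from the leftmost to the rightmost PL3.0 CNS assigned to each gene."""
    spaces: Dict[str, Interval] = {}
    for gene, start, end in pl3_cns:
        lo, hi = spaces.get(gene, (start, end))
        spaces[gene] = (min(lo, start), max(hi, end))
    return spaces


def assign_by_genespace(cns: Interval, spaces: Dict[str, Interval]) -> List[str]:
    """Genes whose genespace fully contains the CNS (containment is an assumption)."""
    start, end = cns
    return [gene for gene, (lo, hi) in spaces.items() if lo <= start and end <= hi]


# Made-up PL3.0 assignments: (gene label, CNS start, CNS end).
pl3 = [("geneA", 1000, 1050), ("geneA", 3000, 3100), ("geneB", 5000, 5040)]
spaces = build_genespaces(pl3)
print(assign_by_genespace((2000, 2050), spaces))
```

Returning a list keeps the sketch honest about ambiguity: if two genespaces overlap, a CNS can fall inside both, and the description does not say how such cases were resolved.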
Provider:
figshare
Created:
2016-01-19