
MPRA data of synthetic enhancers in hematopoiesis

Source: DataCite Commons. Updated 2025-03-15. Indexed 2025-09-08.
Download link:
https://figshare.com/articles/dataset/MPRA_data_of_synthetic_enhancers_in_hematopoiesis/25713519/2
Description:
Overview

R Data Archive file containing MPRA data measuring the activity of synthetic DNA constructs in 7 cell states of murine primary hematopoietic stem and progenitor cells (HSPCs), and in K562 cells. See our manuscript: https://www.biorxiv.org/content/10.1101/2024.08.26.609645v1

This file contains a main data object, mpra.data, a list over the different experiments:

HSPC.libA : Library A (38 factors, one TFBS per enhancer), HSPC experiment
HSPC.libB : Library B (10 factors, TFBS pairs), HSPC experiment
HSPC.libC : Library C (42 factors, TFBS pairs), HSPC experiment
HSPC.libC.aggregate : Library C, HSPC experiment, aggregated across cell states
HSPC.libD : Library D (automated enhancer design), HSPC experiment
HSPC.libF : Library F (Fli1-Spi1 and Gata2-Cebpa combinations), HSPC experiment
HSPC.libG : Library G (genomic sequences), HSPC experiment
HSPC.libH : Library H (complex synthetic sequences with 3-12 TFBS)
K562.libA.minP.tra : Library A, K562 cell experiment
K562.libB.minP.tra : Library B, K562 cell experiment
K562.libC.minP.tra : Library C, K562 cell experiment
K562.libB.minCMV.tra : Library B, K562 cell experiment, measured in a vector with the minimal CMV promoter instead of the default minimal promoter
K562.libB.minP.int : Library B, K562 cell experiment, measured 7 days after infection instead of 4 days after infection

Each list entry is itself a list of data frames, whose exact format is explained below:

DATA : Data of main constructs
CONTROLS.GENERAL : Various controls, including random DNA measurements obtained as part of the same experiment
CONTROLS.TP53 : An identical set of sequences from library A that was included in each experiment
BACKGROUND : 90% confidence intervals of activity achieved by random DNA

Detailed explanation of main data frames (DATA)

Each row corresponds to a single gene regulatory element, measured in a single cell state.
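To make the nested layout concrete (a list of experiments, each holding data frames such as DATA with one row per element and cell state), here is a minimal sketch. It is not code from the archive: it builds a tiny stand-in for mpra.data with invented values, saves it to a temporary RDA file, and loads it back with load(), which restores objects under their saved names. The file name of the real archive is not stated in this description, so the path below is a placeholder.

```r
# Toy stand-in for the real object; all values are invented for illustration.
toy.data <- data.frame(
  clusterID = c(1, 2),
  CRS = c("crs_0001", "crs_0001"),
  mean.norm.adj = c(0.8, -0.1),
  stringsAsFactors = FALSE
)
mpra.data <- list(
  HSPC.libA = list(
    DATA = toy.data,
    CONTROLS.GENERAL = toy.data[0, ],
    CONTROLS.TP53 = toy.data[0, ],
    BACKGROUND = data.frame(clusterID = c(1, 2))
  )
)

# Round-trip through an RDA file (temporary, hypothetical path).
rda.path <- tempfile(fileext = ".rda")
save(mpra.data, file = rda.path)
rm(mpra.data)

# load() restores objects under their saved names and returns those names.
loaded.names <- load(rda.path)

# Navigate: experiment -> data frame -> rows.
experiments <- names(mpra.data)
frames <- names(mpra.data$HSPC.libA)
main <- mpra.data$HSPC.libA$DATA

print(loaded.names)
print(frames)
print(nrow(main))
```

With the downloaded file, only the load() call and the navigation lines would be needed, pointed at the actual path of the archive.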
The following columns are present for all libraries:

clusterID : The cell state where the measurement was performed. To map the entries to labels, use the vector cellstate.map
CRS : The unique ID of the gene regulatory element
Library : The library (A, B or C)
Seq : The DNA sequence. Capital letters correspond to placed motifs, small letters correspond to background DNA.
RNA.1, RNA.2, DNA.1, DNA.2 : Molecule counts at the RNA and DNA level in replicates 1 and 2
RNA.norm.1, RNA.norm.2, DNA.norm.1, DNA.norm.2 : Library-size normalized molecule counts (???)
norm.1.raw, norm.2.raw : Raw log2 of RNA/DNA counts in replicates 1 and 2
norm.1.adj, norm.2.adj : log2 of RNA/DNA counts in replicates 1 and 2, after subtracting the median activity of random DNA
mean.norm.raw : Mean raw activity across replicates (log2 scale RNA/DNA)
mean.norm.adj : Mean activity across replicates, using a scale where 0 is the median activity of random DNA in that cell state
mean.scaled.final : Scaled activity measurement for visualization only, using a scale where 0 is the median activity of random DNA and 1 is the maximal activity achieved by Tp53 (not computed in K562 screens). The purpose of the scaling is to account for different baselines and offsets achieved in different screens. Use mean.norm.adj for model training, statistics etc.

The following columns regarding sequence design are only present in the single-factor library A:

TF : The transcription factor placed on the DNA
nrepeats : Number of placed motifs
affinitynum : Affinity quantile (on a scale where 1 is the most likely sequence given the PWM, and 0 is an extremely unlikely sequence)
sum.biophys.affinity : Sum of actual affinities across the sequence, computed using considerations from statistical thermodynamics
orientation : Orientation of the motif: forward (fwd), reverse (rev) or alternating fwd-rev (tandem)
spacer : Number of base pairs between motifs

The following columns are only present in the dual-factor libraries B and C:

TF1.name : Name of the transcription factor whose motif appears first, counting from the 5' end
TF1.affinity : Corresponding affinity (on a scale from 0 to 1)
TF1.orientation : Corresponding orientation
TF2.name : Name of the transcription factor whose motif appears second, counting from the 5' end
TF2.affinity : Corresponding affinity (on a scale from 0 to 1)
TF2.orientation : Corresponding orientation
spacer : Spacing between sites
TFnumber : Number of sites for each factor
TForder : Arrangement of sites (Alternate or Block)

The following columns are only present in the automated design library D:

SubLibrary : Whether the goal was to design enhancers with specific activation or repression
Task_MegEry, Task_Basophil, Task_Eosinophil, Task_Monocyte, Task_Neutrophil, Task_Immature : Task definition in the different cell states: -3 for repression, -0.2 / 0.6 for inactivity (depending on whether the task was activation or repression), 1 for activity
design_strategy : Whether the design was initialized with a random sequence (random) or a random forest model was used to identify an optimal TFBS combination (model-guided)
design_search : Whether optimization was done with a local or global search

The following columns are only present in the dual-factor library F:

spacer : Spacing between sites
nFli1, nSpi1, nCebpa, nGata2 : Number of Fli1/Spi1/Cebpa/Gata2 sites
Fli1_affinities_sum, Spi1_affinities_sum, Cebpa_affinities_sum, Gata2_affinities_sum : Sum of motif scores for Fli1, Spi1, Cebpa, Gata2

The following columns are only present in the genomic library G:

chromosome, start_coordinate, end_coordinate : Genomic coordinates (mm10)

Example code for working with main DATA frames

Subsetting LibB/LibC data frames

To extract all data belonging to a given pair of transcription factors from library B or C in a format that ignores the order of the sites (i.e. which TF comes first and which comes second), the RDA file contains a function getsubset.libBC. This function takes as arguments two transcription factors and a DATA frame, e.g.

    getsubset.libBC("Spi1", "Fli1", mpra.data$HSPC.libC$DATA)

It returns a similar data frame, except that TF1.name is now always Spi1 (no matter whether Spi1 or Fli1 is placed first), and TF2.name is always Fli1. It also adds some convenient columns:

oricomb : Orientation of both factors
affnum : Number of weak and strong binding sites for both factors (e.g. 3W-3S means that the first factor has 3 weak binding sites, and the second factor has 3 strong binding sites)

Casting data frames

To convert the data frame into a format where one row is one sequence and the columns are measurements in different cell states, you can use:

    require(reshape2)
    casted.dataframe <- dcast(mpra.data$HSPC.libA$DATA, CRS + Seq + nrepeats + affinitynum + spacer + orientation ~ clusterID, value.var = "mean.scaled.final")
    casted.array <- acast(mpra.data$HSPC.libA$DATA, Seq ~ clusterID, value.var = "mean.scaled.final")

Detailed explanation of control data frames (CONTROLS.GENERAL, CONTROLS.TP53)

These data frames contain the same general columns as the main DATA frames. Additionally, they contain the following columns:

ControlType : The type of control, containing the following entries:
    inactiveTFBS : Inactive DNA controls for libA
    ScrambleTFBS : Inactive DNA controls for libB and libC. Here, sequences from the main library were taken and their TFBSs were scrambled.
    ScrambleBackground : Sequences from the main library were taken and their background DNA was scrambled.
    ScrambleFilling : Sequences from the main library were taken and the filling DNA between motifs was scrambled.
    ScrambleWhole : Sequences from the main library were taken and all DNA was scrambled.
    RevComp : Sequences from the main library were taken and reverse complemented.
ControlFor : The ID of the sequence from which the control was created

CONTROLS.TP53 contains a subset of library A that is identical in every experiment. Additional columns are the number (nrepeats) and affinity (affinitynum) of Tp53 motifs.

Detailed explanation of background data frame (BACKGROUND)

These data frames were computed from the inactive DNA controls. The 5th and 95th percentiles of activity are given for each cell state (clusterID), on the scales mean.norm.adj and mean.scaled.final.
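As a sketch of how the BACKGROUND intervals might be used, the example below classifies constructs by whether their mean.norm.adj lies above, within, or below the random-DNA interval of the same cell state. All values are invented, and the percentile column names q05 and q95 are hypothetical, since this description does not state how those columns are named in the archive.

```r
# Toy DATA frame: one row per element and cell state (values invented).
dat <- data.frame(
  clusterID = c(1, 1, 2, 2),
  CRS = c("crs_a", "crs_b", "crs_a", "crs_b"),
  mean.norm.adj = c(1.5, 0.1, -0.9, 0.2),
  stringsAsFactors = FALSE
)

# Toy BACKGROUND frame; q05 and q95 are hypothetical column names.
background <- data.frame(
  clusterID = c(1, 2),
  q05 = c(-0.4, -0.5),
  q95 = c(0.4, 0.6)
)

# Attach each cell state's background interval to its measurements.
merged <- merge(dat, background, by = "clusterID")

# Classify each measurement relative to the interval of its own cell state.
merged$status <- ifelse(
  merged$mean.norm.adj > merged$q95, "above",
  ifelse(merged$mean.norm.adj < merged$q05, "below", "within")
)

print(merged[, c("clusterID", "CRS", "mean.norm.adj", "status")])
```

With the real data, the toy frames would be replaced by, for example, mpra.data$HSPC.libA$DATA and mpra.data$HSPC.libA$BACKGROUND, with the column names adapted to those found in the archive.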
Provider:
figshare
Created:
2025-03-10