MPRA data of synthetic enhancers in hematopoiesis

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/MPRA_data_of_synthetic_enhancers_in_hematopoiesis/25713519
Description:
Overview

R Data Archive file containing MPRA data measuring the activity of synthetic DNA constructs in 7 cell states of murine primary hematopoietic stem and progenitor cells (HSPCs), and in K562 cells. See our manuscript: https://www.biorxiv.org/content/10.1101/2024.08.26.609645v1

This file contains a main data object, mpra.data, a list over the different experiments:

HSPC.libA : Library A (38 factors, one TFBS per enhancer), HSPC experiment
HSPC.libB : Library B (10 factors, TFBS pairs), HSPC experiment
HSPC.libC : Library C (42 factors, TFBS pairs), HSPC experiment
HSPC.libC.aggregate : Library C, HSPC experiment, aggregated across cell states
HSPC.libD : Library D (automated enhancer design), HSPC experiment
HSPC.libF : Library F (Fli1-Spi1 and Gata2-Cebpa combinations), HSPC experiment
HSPC.libG : Library G (genomic sequences), HSPC experiment
HSPC.libH : Library H (complex synthetic sequences with 3-12 TFBS)
K562.libA.minP.tra : Library A, K562 cell experiment
K562.libB.minP.tra : Library B, K562 cell experiment
K562.libC.minP.tra : Library C, K562 cell experiment
K562.libB.minCMV.tra : Library B, K562 cell experiment, measured in a vector with the minimal CMV promoter instead of the default minimal promoter
K562.libB.minP.int : Library B, K562 cell experiment, measured 7 days after infection instead of 4 days after infection

Each list entry is a list of data frames, the exact format of which is explained below:

DATA : Data of main constructs
CONTROLS.GENERAL : Various controls, including random DNA measurements obtained as part of the same experiment
CONTROLS.TP53 : An identical set of sequences from library A that was included in each experiment
BACKGROUND : 90% confidence intervals of activity achieved by random DNA

Detailed explanation of main data frames (DATA)

Each row corresponds to a single gene regulatory element, measured in a single cell state. The following columns are present for all libraries:

clusterID : The cell state where the measurement was performed. To map the entries to labels, use the vector cellstate.map
CRS : The unique ID of the gene regulatory element
Library : The library (A, B or C)
Seq : The DNA sequence. Capital letters correspond to placed motifs, small letters correspond to background DNA.
RNA.1, RNA.2, DNA.1, DNA.2 : Molecule counts on DNA and RNA level in replicate 1 and 2
RNA.norm.1, RNA.norm.2, DNA.norm.1, DNA.norm.2 : Library-size normalized molecule counts (???)
norm.1.raw, norm.2.raw : Raw log2 of RNA/DNA counts in replicate 1 and 2
norm.1.adj, norm.2.adj : log2 of RNA/DNA counts in replicate 1 and 2, after subtracting the median activity of random DNA
mean.norm.raw : Mean raw activity across replicates (log2 scale RNA/DNA)
mean.norm.adj : Mean activity across replicates, using a scale where 0 is the median activity of random DNA in that cell state
mean.scaled.final : Scaled activity measurement for visualization only, using a scale where 0 is the median activity of random DNA and 1 is the maximal activity achieved by Tp53 (not computed in K562 screens). The scaling accounts for different baselines and offsets achieved in different screens. Use mean.norm.adj for model training, statistics etc.

The following columns regarding sequence design are only present in the single-factor library A:

TF : The transcription factor placed on the DNA
nrepeats : Number of placed motifs
affinitynum : Affinity quantile (on a scale where 1 is the most likely sequence given the PWM, and 0 is an extremely unlikely sequence)
sum.biophys.affinity : Sum of actual affinities across the sequence, computed using considerations from statistical thermodynamics
orientation : Orientation of the motif: forward (fwd), reverse (rev) or alternating fwd-rev (tandem)
spacer : Number of base pairs between motifs

The following columns are only present in the dual-factor libraries B and C:

TF1.name : Name of the transcription factor whose motif appears first, coming from 5'
TF1.affinity : Corresponding affinity (on a scale from 0 to 1)
TF1.orientation : Corresponding orientation
TF2.name : Name of the transcription factor whose motif appears second, coming from 5'
TF2.affinity : Corresponding affinity (on a scale from 0 to 1)
TF2.orientation : Corresponding orientation
spacer : Spacing between sites
TFnumber : Number of sites for each factor
TForder : Arrangement of sites (Alternate or Block)

The following columns are only present in the automated design library D:

SubLibrary : Whether the goal was to design enhancers with specific activation or repression
Task_MegEry, Task_Basophil, Task_Eosinophil, Task_Monocyte, Task_Neutrophil, Task_Immature : Task definition in the different cell states. -3 for repression, -0.2 / 0.6 for inactivity (depending on whether the task was activation or repression), 1 for activity.
design_strategy : Whether the design was initialized with a random sequence, or a random forest model was used to identify an optimal TFBS combination (model-guided)
design_search : Whether optimization was done with a local or global search

The following columns are only present in the dual-factor library F:

spacer : Spacing between sites
nFli1, nSpi1, nCebpa, nGata2 : Number of Fli1/Spi1/Cebpa/Gata2 sites
Fli1_affinities_sum, Spi1_affinities_sum, Cebpa_affinities_sum, Gata2_affinities_sum : Sum of motif scores for Fli1, Spi1, Cebpa, Gata2

The following columns are only present in the genomic library G:

chromosome, start_coordinate, end_coordinate : Genomic coordinates (mm10)

Example code for working with main DATA frames

Subsetting LibB/LibC data frames

To extract all data belonging to a given pair of transcription factors from library B and C in a format that ignores the order of the sites (i.e. which TF comes first and which TF comes second), the RDA file contains a function getsubset.libBC. This function takes as arguments two transcription factors and a DATA frame, e.g.

    getsubset.libBC("Spi1", "Fli1", mpra.data$HSPC.libC$DATA)

It returns a similar data frame, except that TF1.name is now always Spi1 (no matter whether Spi1 or Fli1 is placed first), and TF2.name is always Fli1. It also adds some convenient columns:

oricomb : Orientation of both factors
affnum : Number of weak and strong binding sites for both factors (e.g. 3W-3S means that the first factor has 3 weak binding sites, and the second factor has 3 strong binding sites)

Casting data frames

To convert the data frame into a format where one row is one sequence, and columns are measurements in different cell states, you can use:

    library(reshape2)
    # One row per sequence, one column per cell state (clusterID)
    casted.dataframe <- dcast(mpra.data$HSPC.libA$DATA,
                              CRS + Seq + nrepeats + affinitynum + spacer + orientation ~ clusterID,
                              value.var = "mean.scaled.final")
    # Same measurements as a matrix with sequences as row names
    casted.array <- acast(mpra.data$HSPC.libA$DATA, Seq ~ clusterID,
                          value.var = "mean.scaled.final")

Detailed explanation of control data frames (CONTROLS.GENERAL, CONTROLS.TP53)

These data frames contain the same general columns as the main DATA frames. Additionally, they contain the following columns:

ControlType : The type of control, containing the following entries:
    inactiveTFBS : Inactive DNA controls for libA
    ScrambleTFBS : Inactive DNA controls for libB and libC. Here, sequences from the main library were taken and their TFBS were scrambled.
    ScrambleBackground : Sequences from the main library were taken and their background DNA was scrambled
    ScrambleFilling : Sequences from the main library were taken and the filling DNA between motifs was scrambled
    ScrambleWhole : Sequences from the main library were taken and all DNA was scrambled
    RevComp : Sequences from the main library were taken and reverse complemented
ControlFor : The sequence ID based on which the control was created

CONTROLS.TP53 contains the same subset of library A in every experiment. Additional columns are the number (nrepeats) and affinity (affinitynum) of Tp53 motifs.

Detailed explanation of background data frame (BACKGROUND)

These data frames were computed from the inactive DNA controls. The 5th and 95th percentiles of activity are given for each cell state (clusterID), on the scales mean.norm.adj and mean.scaled.final.
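The relationship between the normalized counts, the raw and adjusted log2 ratios, and the background interval can be illustrated on toy data. The following is a minimal sketch in base R, not the authors' pipeline. It assumes that the adjustment subtracts the per-replicate, per-cell-state median of random DNA, that no pseudocount is used, and that random DNA rows can be identified by a flag. The column is.random, the helper med_random and all numbers are hypothetical and do not come from the archive.

```r
# Toy stand-in for a DATA frame: two cell states, three random-DNA rows
# and one designed element per state. All values are invented.
toy <- data.frame(
  clusterID  = rep(c("c1", "c2"), each = 4),
  CRS        = rep(c("rand1", "rand2", "rand3", "designed1"), times = 2),
  is.random  = rep(c(TRUE, TRUE, TRUE, FALSE), times = 2),  # hypothetical flag
  RNA.norm.1 = c(8, 16, 32, 128,  8, 16, 32, 32),
  DNA.norm.1 = c(8,  8,  8,   8,  4,  4,  4,  4),
  RNA.norm.2 = c(16, 16, 32, 64,  8, 16, 32, 64),
  DNA.norm.2 = c(8,  8,  8,   8,  4,  4,  4,  4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: median of x over random-DNA rows within each group,
# expanded back to one value per row.
med_random <- function(x, grp, is_rand) {
  meds <- tapply(x[is_rand], grp[is_rand], median)
  as.numeric(meds[as.character(grp)])
}

# Raw activity: log2 of RNA/DNA per replicate (assumption: no pseudocount).
toy$norm.1.raw <- log2(toy$RNA.norm.1 / toy$DNA.norm.1)
toy$norm.2.raw <- log2(toy$RNA.norm.2 / toy$DNA.norm.2)

# Adjusted activity: subtract the median random-DNA activity of the same
# cell state (assumption: done separately for each replicate).
toy$norm.1.adj <- toy$norm.1.raw -
  med_random(toy$norm.1.raw, toy$clusterID, toy$is.random)
toy$norm.2.adj <- toy$norm.2.raw -
  med_random(toy$norm.2.raw, toy$clusterID, toy$is.random)

# Means across the two replicates.
toy$mean.norm.raw <- rowMeans(toy[, c("norm.1.raw", "norm.2.raw")])
toy$mean.norm.adj <- rowMeans(toy[, c("norm.1.adj", "norm.2.adj")])

# Background interval: 5th and 95th percentile of random-DNA activity
# per cell state, as a matrix with one row per clusterID.
rand <- toy[toy$is.random, ]
background <- t(sapply(split(rand$mean.norm.adj, rand$clusterID),
                       quantile, probs = c(0.05, 0.95)))

# Flag elements whose activity exceeds the upper background bound
# of their own cell state.
toy$above.background <- toy$mean.norm.adj > background[toy$clusterID, "95%"]
```

With the real archive, the same comparison would presumably use the precomputed BACKGROUND data frame of each experiment instead of recomputing percentiles from the controls.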
Created:
2024-08-02