Quantification of the effects of chimerism: datasets

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/5877922


Resource description:

To aid in exploring the effects of chimerism on read mapping, differential expression analysis and de novo assembly, a base set of 26,680 transcripts was created from Ensembl release-100 (https://www.ensembl.org/info/data/ftp/index.html) [1]. It contains all sequences between 300 and 5000 nt in length present within the fruit fly cDNA library. These transcripts, along with the complete cDNA library from which they were compiled, are located within the BaseSetTranscripts.zip file. This base set was used as the reference for simulating reads as required within subsequent sections of our paper (titled: Quantification of the effects of chimerism on read mapping, differential expression and annotation following short-read de novo assembly). Owing to the nature of the study, such simulations, and the associated creation of modified base sets containing varying proportions of chimerism, often involved hundreds of replicate iterations. Parameter values used for read simulations and the modified reference sets used within iterations are described in detail within the paper (F1000 paper link to be provided when available).

To explore the effects of chimerism on the detection of differentially expressed transcripts, ten read datasets, each consisting of five million read-pairs, were simulated from the base set using CSReadGen [2], as described in Section 2.2 of the manuscript. These are located within the DEReads.zip file. Using these reads, differential expression analysis was repeated iteratively; during each iteration ChimSim [3] was used to create a modified base set to serve as the reference. Within each modified base set, a proportion of the transcripts was made chimeric. The proportion introduced ranged from 5% to 95% in steps of 5%. These modified base sets are located within the DEChimSimRefs.zip file.
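As a rough illustration of the 5% to 95% range, the sketch below enumerates the chimerism levels and the file names one might expect to find for each. The naming pattern is an assumption extrapolated from the single example given in this description (chimeric_refs_0.1_SEQS.fasta and chimeric_refs_0.1_TITLES.txt for 10%); the actual names inside DEChimSimRefs.zip may be formatted differently, and the function names are hypothetical.

```python
def chimerism_levels(start=5, stop=95, step=5):
    """Return the chimerism percentages from start to stop, inclusive."""
    return list(range(start, stop + 1, step))


def expected_files(pct):
    """Return the (sequences, titles) file names for one chimerism level.

    Assumption: the level is written as a fraction without trailing
    zeros, as in the example name chimeric_refs_0.1_SEQS.fasta.
    """
    frac = f"{pct / 100:g}"
    return (
        f"chimeric_refs_{frac}_SEQS.fasta",
        f"chimeric_refs_{frac}_TITLES.txt",
    )


if __name__ == "__main__":
    # Print one line per level: percentage, sequence file, titles file.
    for pct in chimerism_levels():
        seqs, titles = expected_files(pct)
        print(f"{pct:>2}%  {seqs}  {titles}")
```

Comparing such a list against the contents of the unpacked archive is a quick way to check that no level is missing before running an analysis over all of them.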
In each case a titles file has also been provided that indicates which transcripts within the base set were made chimeric (if any, e.g. at 0% chimerism this file is empty) and the manner in which chimeras were introduced, according to the three types discussed in the paper. For example, in the file titled chimeric_refs_0.1_SEQS.fasta 10% of the sequences are chimeric, and the file titled chimeric_refs_0.1_TITLES.txt indicates which these are and the type of chimerism introduced.

The base set was then used to simulate ten datasets, each consisting of ten million read-pairs, which were each assembled using CStone [4], Trinity [5] and rnaSPAdes [6]. Parameters for read simulations are once again described in detail within our paper (Section 2.3). The assemblies produced by each assembler are contained within the DeNovoAssemblies_SimulatedData.zip file. The two whole body read datasets from Pang et al. [7], following filtering by Trimmomatic [8] as described in our paper, are within the files Reads_RealData_WholeBody_1.zip and Reads_RealData_WholeBody_2.zip. The assemblies produced by each of the three assemblers when using these reads as input are within DeNovoAssemblies_RealData.zip.

Software related to this project:
1. CStone
2. CSReadGen
3. CView
4. ChimSim
5. TVScript

A related poster discussing the identification of chimerism during assembly is available (DOI: 10.5281/zenodo.6022493), as is one discussing the effects of chimerism (DOI: 10.5281/zenodo.6023170). General details of the project are available here.

References
1. Yates AD, Achuthan P, Akanni W, Allen J, Allen J, Alvarez-Jarreta J, et al. Ensembl 2020. Nucleic Acids Res. 2020;48: D682–D688. doi:10.1093/NAR/GKZ966
2. Archer J. CSReadGen website. 2020. Available: https://sourceforge.net/projects/csreadgen/
3. Linheiro R, Archer J. ChimSim website. 2021. Available: https://sourceforge.net/projects/chimsim/
4. Linheiro R, Archer J. CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure. Pertea M, editor. PLOS Comput Biol. 2021;17: e1009631. doi:10.1371/JOURNAL.PCBI.1009631
5. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29: 644–652. doi:10.1038/nbt.1883
6. Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. Gigascience. 2019;8: 1–13. doi:10.1093/GIGASCIENCE/GIZ100
7. Pang TL, Ding Z, Liang SB, Li L, Zhang B, Zhang Y, et al. Comprehensive Identification and Alternative Splicing of Microexons in Drosophila. Front Genet. 2021;12. doi:10.3389/fgene.2021.642602
8. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120. doi:10.1093/BIOINFORMATICS/BTU170

Created:

2022-02-11
