Test data set for comparative evaluation of transposable element curation in eukaryotic species

Name: Test data set for comparative evaluation of transposable element curation in eukaryotic species
Creator: DIGITAL.CSIC
Published: 2024-10-01 07:22:29
License: No description available

DataCite Commons. Updated 2024-10-01; indexed 2025-04-09.

Download link:

https://digital.csic.es/handle/10261/362092

Description:

[Description of methods used for collection/generation of data]

We generated raw TE libraries for the six species analyzed, or used libraries that were already available, with three de novo tools: RepeatModeler2 (RM2; Flynn et al., 2020), EDTA (Ou et al., 2019) and TEdenovo from REPET (Flutre et al., 2011). The libraries were generated from the following assemblies: GCA_002050065.1 for D. melanogaster, GCF_000738735.6 for C. cornix, GCA_000002035.4 for D. rerio, the RGAP assembly version 7 for O. sativa (available at https://phytozome-next.jgi.doe.gov/info/Osativa_v7_0; Ouyang et al., 2007), and the T2T assemblies GCA_022117705.1 for Z. mays and GCF_009914755 for H. sapiens.

For C. cornix, Z. mays and H. sapiens, we generated the RM2 raw libraries using the TETools Docker image available at https://github.com/Dfam-consortium/TETools. For D. melanogaster, O. sativa and D. rerio, we used the raw libraries available at https://github.com/jmf422/TE_annotation/tree/master/benchmark_libraries/RM2.

We generated the EDTA libraries for all six species using the EDTA pipeline v.2.0 with --anno 0 and all other parameters left at their defaults.

For the REPET libraries, we ran the TEdenovo pipeline on the D. melanogaster genome using default parameters, following the recommended steps in the user guide (https://urgi.versailles.inra.fr/Tools/REPET/TEdenovo-tuto). For the D. rerio genome, we created a 300 Mb subset using the PreProcess.py script available with the REPET package and then ran the TEdenovo pipeline with default parameters, as for D. melanogaster (Jamilloux et al., 2016). The REPET group kindly provided us with the output generated for the O. sativa genome, which is available in the RepetDB database (Amselem et al., 2019). They also provided us with the libraries for C. cornix and Z. mays.
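The 300 Mb subsetting step for D. rerio can be illustrated with a small sketch. This is not the algorithm of REPET's PreProcess.py, whose internals are not described here; it is a hypothetical greedy selection that keeps whole sequences in input order as long as the running total stays within a target size. The function names `parse_fasta` and `subset_by_size` are assumptions introduced for illustration only.

```python
from typing import Iterable, Iterator, List, Tuple

FastaRecord = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_by_size(records: Iterable[FastaRecord], target_bp: int) -> List[FastaRecord]:
    """Hypothetical greedy subset: keep whole sequences, in input order,
    whenever adding them does not push the total above target_bp."""
    kept: List[FastaRecord] = []
    total = 0
    for header, seq in records:
        if total + len(seq) <= target_bp:
            kept.append((header, seq))
            total += len(seq)
    return kept


# Toy input; a real run would read the assembly file and use target_bp=300_000_000.
toy = [">chr1", "AAAA", "CC", ">chr2", "GGGGGGGG", ">chr3", "TT"]
subset = subset_by_size(parse_fasta(toy), target_bp=8)
```

With the toy input, the second sequence is skipped because it would exceed the 8 bp target, while the shorter third sequence still fits.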
As reference libraries, we used the Berkeley Drosophila Genome Project (BDGP) dataset (Kaminker et al., 2002) and the Manual Curated TE library (MCTE) provided in Rech et al. (2022) for D. melanogaster; the library published by Ou et al. (2019) for O. sativa (referred to as the "standard library" by the authors); the MClibrary available in Weissensteiner et al. (2020) for C. cornix; the manually curated TE models available in Dfam release 3.7 (Storer et al., 2021) and in Repbase version 20181026 for D. rerio; MTEC, curated by Ou et al. (https://github.com/oushujun/MTEC), for Z. mays; and the curated TE models in Dfam 3.7 for H. sapiens. For the rice "standard library", Dfam and Repbase libraries, we joined each LTR sequence with its corresponding internal part using in-house scripts before any further analysis.

MCHelper v.1.7.0 was executed with default parameters (the customizable parameters are documented on GitHub: https://github.com/GonzalezLab/MCHelper). In the false positive filtering step, MCHelper requires a gene set to detect homology between the consensus sequences and multicopy gene families. We used the following BUSCO gene sets: the diptera set (diptera_odb10) for D. melanogaster, the viridiplantae set (viridiplantae_odb10) for O. sativa and Z. mays, the aves set (aves_odb10) for C. cornix, the actinopterygii set (actinopterygii_odb10) for D. rerio, and the mammalia set (mammalia_odb10) for H. sapiens. Because MCHelper expects a single file containing all the hidden Markov models (HMMs), we concatenated all the available HMM files into one file for each species.

For the annotation, we ran RepeatMasker v.4.1.2-p1 (Smit et al., 2015) on each genome with the raw, reference and MCHelper libraries, passing each library with the -lib parameter together with the following additional parameters: -gff -nolow -no_is -norna.
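The HMM concatenation step described above can be sketched as follows. This is a minimal illustration, not the script used by the authors; the function name `concatenate_hmms` and the example paths are assumptions, as is the idea that each BUSCO lineage folder keeps its profiles as individual `*.hmm` files in an `hmms` subfolder.

```python
from pathlib import Path
from typing import Iterable


def concatenate_hmms(hmm_files: Iterable[Path], output: Path) -> int:
    """Concatenate individual HMM files into one file.

    Returns the number of files written. Files are processed in sorted
    order so that the output is reproducible between runs.
    """
    # Materialize the list before creating the output file, and skip the
    # output itself in case it sits in the directory being scanned.
    paths = sorted(p for p in hmm_files if p.resolve() != output.resolve())
    with output.open("w") as out:
        for path in paths:
            text = path.read_text()
            out.write(text)
            # Make sure each profile ends with a newline so that the next
            # file starts on a fresh line.
            if not text.endswith("\n"):
                out.write("\n")
    return len(paths)
```

A call might look like `concatenate_hmms(Path("diptera_odb10/hmms").glob("*.hmm"), Path("diptera_odb10.hmm"))`, where both paths are hypothetical and depend on where the BUSCO set was unpacked.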
We then ran the OneCodeToFindThemAll script (Bailly-Bechet et al., 2014) on each genome annotation to defragment the copies, using the --strict parameter for D. melanogaster, D. rerio and C. cornix.

[File List]

This dataset is composed of six folders, one per species: D. melanogaster (fruit fly), O. sativa (rice), C. cornix (crow), D. rerio (zebrafish), Z. mays (corn) and H. sapiens (human). Each species folder contains the assembly used, the BUSCO gene dataset (except for Z. mays, which uses the same BUSCO dataset as O. sativa), one folder for each de novo program used to generate the TE library (RM2 (RepeatModeler2), EDTA and REPET), and a folder with the files for the reference library. Each de novo program folder contains the raw library produced by that program, a CSV file with the lengths of the TE consensus sequences by order, the output of the MCHelper execution (in the folder curation_ite16) and the annotation (in the folder RM_annot). The REPET folder has an extra file describing the classification and structural features (ending in denovoLubTEs_PC.classif). Note that REPET was not run for the H. sapiens genome.
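The folder layout described in the file list can be checked after download with a short sketch like the one below. The species folder names (for example `D_melanogaster`) are hypothetical placeholders, and the tool folder names `RM2`, `EDTA` and `REPET` are assumed to match the program names given above; only `curation_ite16` and `RM_annot` are taken directly from the description. Adjust the names to the actual archive contents.

```python
from pathlib import Path, PurePosixPath
from typing import Dict, List

# Hypothetical species folder names mapped to the de novo tools run for each.
# REPET was not run for H. sapiens, so it is omitted there.
TOOLS_BY_SPECIES: Dict[str, List[str]] = {
    "D_melanogaster": ["RM2", "EDTA", "REPET"],
    "O_sativa": ["RM2", "EDTA", "REPET"],
    "C_cornix": ["RM2", "EDTA", "REPET"],
    "D_rerio": ["RM2", "EDTA", "REPET"],
    "Z_mays": ["RM2", "EDTA", "REPET"],
    "H_sapiens": ["RM2", "EDTA"],
}

# Subfolders named in the dataset description for every tool folder.
PER_TOOL_SUBFOLDERS = ["curation_ite16", "RM_annot"]


def expected_folders(species: str) -> List[PurePosixPath]:
    """List the relative subfolders expected for one species."""
    return [
        PurePosixPath(species, tool, sub)
        for tool in TOOLS_BY_SPECIES[species]
        for sub in PER_TOOL_SUBFOLDERS
    ]


def missing_folders(root: Path) -> List[str]:
    """Return the expected relative folders that do not exist under root."""
    missing = []
    for species in TOOLS_BY_SPECIES:
        for rel in expected_folders(species):
            if not (root / rel).is_dir():
                missing.append(str(rel))
    return missing
```

Calling `missing_folders(Path("path/to/dataset"))` returns an empty list when every expected folder is present, which gives a quick sanity check before running any downstream comparison.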

Provider:

DIGITAL.CSIC

Created:

2024-06-29
