Simulated data set of chimeric transposable elements.

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/12065158

Description:

This dataset comprises 9,000 sequences of 20,000 bp, each containing transposable elements (TEs). Three cases of TEs with artifacts at the ends, involving either another chimeric TE or simple repeats, were considered when generating the sequences:

Case 1: DNA fragment + first TE + DNA fragment + second TE + DNA fragment.

Case 2: first TE + second TE + repeat of the first TE.

Case 3: a microsatellite repeated in tandem, with the extracted TE placed at position 10,000 (the middle of the sequence) and the repeats filling both sides out to the ends.

All TEs used in this dataset were taken from Dfam, for the species Drosophila melanogaster. The identifier of each sequence records the case, the TE Dfam identifier, the TE start position within the sequence, and the TE length, all separated by "_". For example: Caso1_DF000001548.2_6926_5126

File description:

dataset.zip: The FASTA file containing all the sequences.

features_data.npy.zip: A NumPy file containing the numerical representation of the four TE+Aid plots generated for the sequences in dataset.zip. The data is a NumPy array with dimensions 9000x256x256x3x4.

labels_data.numpy.zip: A NumPy file containing the start and end positions (normalized between 0 and 1) of each TE in dataset.zip.

The last two files were generated to train a neural network that automatically trims artifacts from DNA sequences.
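The identifier format described above can be split into its fields with a few lines of Python. This is a minimal sketch, not code shipped with the dataset: the names `TEHeader`, `parse_header`, and `normalized_span` are hypothetical, the parser assumes every identifier has exactly four "_"-separated fields as in the example, and the normalization assumes that positions are divided by the 20,000 bp sequence length, which the description does not state explicitly.

```python
from typing import NamedTuple, Tuple

SEQ_LEN = 20_000  # sequence length given in the dataset description


class TEHeader(NamedTuple):
    case: str      # e.g. "Caso1"
    dfam_id: str   # e.g. "DF000001548.2"
    start: int     # TE start position inside the sequence
    length: int    # TE length in bp


def parse_header(header: str) -> TEHeader:
    """Split an identifier such as 'Caso1_DF000001548.2_6926_5126'.

    Assumption: exactly four '_'-separated fields; anything else raises
    ValueError rather than being guessed at.
    """
    fields = header.lstrip(">").strip().split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {header!r}")
    case, dfam_id, start, length = fields
    return TEHeader(case, dfam_id, int(start), int(length))


def normalized_span(h: TEHeader, seq_len: int = SEQ_LEN) -> Tuple[float, float]:
    """Return (start, end) scaled to [0, 1].

    Assumption: normalization means dividing by the sequence length.
    """
    return h.start / seq_len, (h.start + h.length) / seq_len


if __name__ == "__main__":
    h = parse_header("Caso1_DF000001548.2_6926_5126")
    print(h)
    print(normalized_span(h))
```

Comparing the values from `normalized_span` against the entries in the labels file would be one way to check whether the division-by-length assumption matches how the labels were actually produced.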

Created:

2024-08-07
