Simulated data set of chimeric transposable elements.
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12065158
Description:
This dataset comprises 9,000 simulated sequences of 20,000 bp each, built from transposable elements (TEs). The sequences cover three cases of TEs with artifacts at their ends, where the artifact is either another TE (a chimeric TE) or a simple repeat. Sequences of the first case consist of a DNA fragment + first TE + DNA fragment + second TE + DNA fragment. Sequences of the second case consist of a first TE + second TE + a repeat of the first TE. Sequences of the third case place the extracted TE at position 10,000 (the middle of the sequence) and fill both sides, out to the sequence ends, with a microsatellite repeated in tandem. All TEs used in this dataset were taken from Dfam, for the species Drosophila melanogaster.
The identifier of each sequence encodes the case, the Dfam identifier of the TE, the start position of the TE within the sequence, and the TE length, all separated by "_". For example:
Caso1_DF000001548.2_6926_5126
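The identifier layout described above can be split into its four fields with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the names TEHeader and parse_header are hypothetical, and it assumes every identifier has exactly four "_"-separated fields, as in the example. Whether the start position is 0-based or 1-based is not stated in the description, so the sketch only returns the raw numbers.

```python
from typing import NamedTuple


class TEHeader(NamedTuple):
    """Fields encoded in a sequence identifier (hypothetical helper type)."""
    case: str      # e.g. "Caso1"
    dfam_id: str   # e.g. "DF000001548.2"
    start: int     # TE start position inside the sequence
    length: int    # TE length in bp


def parse_header(header: str) -> TEHeader:
    """Split an identifier such as 'Caso1_DF000001548.2_6926_5126'.

    Assumption: exactly four '_'-separated fields. A leading '>' from a
    FASTA header line is tolerated and removed.
    """
    fields = header.strip().lstrip(">").split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {header!r}")
    case, dfam_id, start, length = fields
    return TEHeader(case, dfam_id, int(start), int(length))


record = parse_header("Caso1_DF000001548.2_6926_5126")
print(record.case, record.dfam_id, record.start, record.length)
```

For the example identifier, the TE starts at position 6926 and is 5126 bp long.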
File description:
dataset.zip: The FASTA file containing all the sequences.
features_data.npy.zip: A NumPy file containing the numerical representation of the four TE+Aid plots generated for the sequences in dataset.zip. The data is a NumPy array of shape 9000 x 256 x 256 x 3 x 4.
labels_data.numpy.zip: A NumPy file containing the start and end positions (normalized between 0 and 1) of each TE in dataset.zip.
The last two files were generated to train a neural network that automatically trims artifacts from DNA sequences.
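Because the labels are stored as values between 0 and 1, they have to be scaled back to obtain base-pair coordinates. The sketch below shows one way to do that. It assumes the positions were normalized by dividing by the sequence length of 20,000 bp, which the description does not state explicitly; SEQ_LEN and to_bp are hypothetical names used only for illustration.

```python
from typing import Tuple

# Assumption: labels were normalized by the full sequence length (20,000 bp).
SEQ_LEN = 20_000


def to_bp(norm_start: float, norm_end: float, seq_len: int = SEQ_LEN) -> Tuple[int, int]:
    """Convert a normalized (start, end) pair back to base-pair positions."""
    if not 0.0 <= norm_start <= norm_end <= 1.0:
        raise ValueError("expected 0 <= start <= end <= 1")
    # round() absorbs small floating-point errors from the multiplication.
    return round(norm_start * seq_len), round(norm_end * seq_len)


start_bp, end_bp = to_bp(0.3463, 0.6026)
print(start_bp, end_bp)
```

If the normalization used a different denominator, only the seq_len argument needs to change.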
Created:
2024-08-07



