5000-het: Dataset of Nucleotide Sequences with a Form of Evolutionary Sequence Length Heterogeneity

Name: 5000-het: Dataset of Nucleotide Sequences with a Form of Evolutionary Sequence Length Heterogeneity
Creator: University of Illinois at Urbana-Champaign
Published: 2023-08-23 16:33:58
License: No description available

Source: DataCite Commons. Updated: 2023-08-23. Indexed: 2025-04-16.

Download link:

https://databank.illinois.edu/datasets/IDB-3974819

Description:

Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often does not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., in GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/). For more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.
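The mixture model of indel lengths described above can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration: the promotion probability `p_long`, the two mean lengths, and the use of a geometric distribution for each component are hypothetical stand-ins and are not the parameters or distributions used to generate this dataset (those are in README.txt and the INDELible control files). The function names are likewise invented for this example.

```python
import random


def sample_indel_length(rng, p_long=0.01, short_mean=2.0, long_mean=200.0):
    """Draw one indel length from a two-component mixture.

    With probability p_long the indel is "promoted" to the long component;
    otherwise it comes from the short component. All defaults are
    hypothetical illustration values, not the dataset's settings.
    """
    mean = long_mean if rng.random() < p_long else short_mean
    # Geometric distribution on {1, 2, ...} with the chosen mean:
    # the per-step stopping probability is 1 / mean.
    stop_prob = 1.0 / mean
    length = 1
    while rng.random() >= stop_prob:
        length += 1
    return length


def sample_many(n, seed=0, **kwargs):
    """Draw n indel lengths reproducibly from the mixture."""
    rng = random.Random(seed)
    return [sample_indel_length(rng, **kwargs) for _ in range(n)]


if __name__ == "__main__":
    lengths = sample_many(10000)
    print("mean length:", sum(lengths) / len(lengths))
    print("longest indel:", max(lengths))
```

Even a small promotion probability produces an occasional very long indel among many short ones, which is the kind of event that leads to sequence length heterogeneity across the simulated sequences.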

Provider:

University of Illinois at Urbana-Champaign

Created:

2022-08-05
