
Simulated pairs of nucleotide sequences for testing (alignment-free) genome distance estimate methods

Indexed from NIAID Data Ecosystem on 2026-03-12
Download link: https://zenodo.org/record/4034461

Description:
This repository contains 24,000 pairs of nucleotide sequences (and associated parameters) that have been simulated for testing alignment-free genome distance estimates. For each evolutionary distance d from 0.05 to 1.00 nucleotide substitutions per character (step = 0.05), the program INDELible was used to simulate the evolution of 200 nucleotide sequence pairs with d substitution events per character, for each combination of substitution model (GTR or GTR+Γ) and equilibrium frequencies. Each model was run with three different sets of equilibrium frequencies:

f1: equal frequencies, i.e. freq(A) = freq(C) = freq(G) = freq(T) = 0.25
f2: GC-rich, i.e. freq(A) = 0.1, freq(C) = 0.3, freq(G) = 0.4, freq(T) = 0.2
f3: AT-rich, i.e. freq(A) = freq(T) = 0.4, freq(C) = freq(G) = 0.1

For each simulated sequence pair, model parameters (i.e. GTR: six relative rates of nucleotide substitution; GTR+Γ: six rates and one Γ shape parameter) were randomly drawn from 142 sets of parameters derived from real-case data (see file GTR.params.trees.tsv at https://zenodo.org/record/4034261). The initial sequence length was 5 Mb, and the indel rate was set to 0.01, with indel lengths drawn from [1, 50000] according to a Zipf distribution with parameter 1.5 (see the INDELible manual).
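The simulation design described above can be sketched as a small enumeration, which also checks that the stated counts are consistent (20 distances × 2 models × 3 frequency sets × 200 replicates = 24,000 pairs). This is only an illustrative sketch: the names `DISTANCES`, `MODELS`, `FREQS` and `design` are hypothetical and do not come from the repository.

```python
from itertools import product

# Evolutionary distances d = 0.05, 0.10, ..., 1.00 (20 values).
# Built from integers to avoid accumulating floating-point error.
DISTANCES = [round(0.05 * k, 2) for k in range(1, 21)]

# Substitution models named in the description.
MODELS = ["GTR", "GTR+Γ"]

# Equilibrium frequencies, keyed by nucleotide (hypothetical layout).
FREQS = {
    "f1": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # equal
    "f2": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},      # GC-rich
    "f3": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},      # AT-rich
}

PAIRS_PER_CONDITION = 200


def design():
    """Return every (distance, model, frequency label) condition."""
    return list(product(DISTANCES, MODELS, FREQS))


conditions = design()
total_pairs = len(conditions) * PAIRS_PER_CONDITION

# Each frequency set must sum to 1 (up to rounding).
for label, freqs in FREQS.items():
    assert abs(sum(freqs.values()) - 1.0) < 1e-9, label

print(len(conditions), "conditions,", total_pairs, "sequence pairs")
# prints: 120 conditions, 24000 sequence pairs
```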
For each of the 20 evolutionary distances d = 0.05, 0.10, ..., 1.00, six XZ-compressed files are available, each containing the data of 200 simulations:

data-d-f1-nogam.tsv.xz   data simulated under the model GTR with equilibrium frequencies f1
data-d-f1-gamma.tsv.xz   data simulated under the model GTR+Γ with equilibrium frequencies f1
data-d-f2-nogam.tsv.xz   data simulated under the model GTR with equilibrium frequencies f2
data-d-f2-gamma.tsv.xz   data simulated under the model GTR+Γ with equilibrium frequencies f2
data-d-f3-nogam.tsv.xz   data simulated under the model GTR with equilibrium frequencies f3
data-d-f3-gamma.tsv.xz   data simulated under the model GTR+Γ with equilibrium frequencies f3

Each file is tab-delimited and contains the following 18 fields:

[1]      integer seed value specified to INDELible
[2-5]    frequencies of T, C, A, G, respectively, specified to INDELible
[6-10]   C-T, A-T, G-T, A-C, C-G rate parameters, respectively (normalized such that A-G rate = 1), specified to INDELible
[11]     Γ shape parameter alpha (= 0 in the nogam files, i.e. GTR substitution model without Γ) specified to INDELible
[12]     length lgt1 of the first sequence seq1 (i.e. no. A, C, G, T in seq1)
[13]     length lgt2 of the second sequence seq2 (i.e. no. A, C, G, T in seq2)
[14]     no. sites in aligned sequences seq1 and seq2 (i.e. no. A, C, G, T and gap character states in seq1 or seq2)
[15]     no. non-gapped sites (core sites) in aligned sequences seq1 and seq2
[16]     observed p-distance between aligned sequences seq1 and seq2 (i.e. no. nucleotide mismatches divided by no. core sites)
[17]     aligned seq1 (containing indel gaps)
[18]     aligned seq2 (containing indel gaps)

Of note, because seq1 and seq2 (fields [17-18]) are aligned, these two entries are strings with the same number of sites (field [14]). Gap character states (-) should be removed from seq1 and seq2 to obtain the unaligned sequences.
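As an illustration of the 18-field layout, the following Python sketch parses one record, removes gaps to recover the unaligned sequences, and recomputes the core-site count and p-distance of fields [15] and [16]. It assumes the files have no header line and that the gap character is `-`; the names `SimRecord`, `parse_record`, `read_records`, `ungap` and `core_and_mismatches` are hypothetical helpers, not part of the repository.

```python
import lzma
from typing import Dict, Iterator, NamedTuple, Tuple

GAP = "-"


class SimRecord(NamedTuple):
    seed: int                 # field [1]
    freqs: Dict[str, float]   # fields [2-5], keyed T, C, A, G
    rates: Dict[str, float]   # fields [6-10], plus A-G fixed to 1
    alpha: float              # field [11], 0 in the nogam files
    lgt1: int                 # field [12]
    lgt2: int                 # field [13]
    n_sites: int              # field [14]
    n_core: int               # field [15]
    p_dist: float             # field [16]
    seq1: str                 # field [17], aligned
    seq2: str                 # field [18], aligned


def parse_record(line: str) -> SimRecord:
    """Parse one tab-delimited line into a SimRecord."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) != 18:
        raise ValueError(f"expected 18 fields, found {len(f)}")
    freqs = dict(zip("TCAG", map(float, f[1:5])))
    rates = dict(zip(("C-T", "A-T", "G-T", "A-C", "C-G"), map(float, f[5:10])))
    rates["A-G"] = 1.0  # normalization stated in the description
    return SimRecord(int(f[0]), freqs, rates, float(f[10]), int(f[11]),
                     int(f[12]), int(f[13]), int(f[14]), float(f[15]),
                     f[16], f[17])


def read_records(path: str) -> Iterator[SimRecord]:
    """Stream records from an XZ-compressed file (assumes no header line)."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)


def ungap(seq: str) -> str:
    """Remove gap characters to obtain the unaligned sequence."""
    return seq.replace(GAP, "")


def core_and_mismatches(seq1: str, seq2: str) -> Tuple[int, int]:
    """Count non-gapped (core) sites and nucleotide mismatches among them."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    core = mismatches = 0
    for a, b in zip(seq1, seq2):
        if a == GAP or b == GAP:
            continue
        core += 1
        if a != b:
            mismatches += 1
    return core, mismatches


# Tiny synthetic record (made-up values, not taken from the dataset).
example = "\t".join(["42", "0.25", "0.25", "0.25", "0.25",
                     "1.0", "0.5", "0.5", "0.5", "0.5", "0",
                     "7", "7", "8", "6", "0.166667",
                     "ACGT-ACG", "ACCTAA-G"])
rec = parse_record(example)
core, mism = core_and_mismatches(rec.seq1, rec.seq2)
print(ungap(rec.seq1), ungap(rec.seq2), core, mism)
```

On a real file, `read_records` would be called with a path such as one of the `data-d-…tsv.xz` names listed above; how d is written in the actual file names is not specified here, so the path has to be taken from the downloaded archive. With sequences of about 5 Mb, the pure-Python loop in `core_and_mismatches` is slow, and a vectorized comparison may be preferable for bulk processing.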
Reference: Criscuolo A (2020) On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Research, 9:1309. doi:10.12688/f1000research.26930.1
Created: 2020-11-27