Data from: A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering
Source: DataCite Commons. Updated: 2025-04-01. Indexed: 2025-04-09.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.s3h9r6j
Description:
Sequencing of target-enriched libraries is an efficient and cost-effective
method for obtaining DNA sequence data from hundreds of nuclear loci for
phylogeny reconstruction. Much of the cost of developing targeted
sequencing approaches is associated with the generation of preliminary
data needed for the identification of orthologous loci for probe design.
In plants, identifying orthologous loci has proven difficult due to a
large number of whole-genome duplication events, especially in the
angiosperms (flowering plants). We used multiple sequence alignments from
over 600 angiosperms for 353 putatively single-copy protein-coding genes
identified by the One Thousand Plant Transcriptomes Initiative to design a
set of targeted sequencing probes for phylogenetic studies of any
angiosperm group. To maximize the phylogenetic potential of the probes
while minimizing the cost of production, we introduce a k-medoids
clustering approach to identify the minimum number of sequences necessary
to represent each coding sequence in the final probe set. Using this
method, 5 to 15 representative sequences were selected per orthologous
locus, representing the sequence diversity of angiosperms more efficiently
than if probes were designed using available sequenced genomes alone. To
test our approximately 80,000 probes, we hybridized libraries from 42
species spanning all higher-order groups of angiosperms, with a focus on
taxa not present in the sequence alignments used to design the probes. Out
of a possible 353 coding sequences, we recovered an average of 283 per
species and at least 100 in all species. Differences among taxa in
sequence recovery could not be explained by relatedness to the
representative taxa selected for probe design, suggesting that there is no
phylogenetic bias in the probe set. Our probe set, which targeted 260 kbp
of coding sequence, achieved a median recovery of 137 kbp per taxon in
coding regions, a maximum recovery of 250 kbp, and an additional median of
212 kbp per taxon in flanking non-coding regions across all species. These
results suggest that the Angiosperms353 probe set described here is
effective for any group of flowering plants and would be useful for
phylogenetic studies from the species level to higher-order groups,
including the entire angiosperm clade itself.
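
The k-medoids selection step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the distance measure (proportion of differing aligned sites), the `max_dist` cutoff, the farthest-first initialization, and all function names are assumptions chosen for illustration. The idea shown is to increase k until every sequence lies within the cutoff of its nearest medoid, and then to keep the medoids as the representative sequences.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix.

    Returns (medoid indices, list mapping each item to its medoid index).
    """
    n = len(dist)
    # Deterministic start (an assumption): the most central item first,
    # then repeatedly the item farthest from the medoids chosen so far.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(
            max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        )
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
        # Update step: each cluster's medoid becomes the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assign[i] == m]
            if not members:  # guard for duplicate sequences
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, assign


def select_representatives(seqs, max_dist):
    """Smallest set of medoids such that every sequence is within max_dist
    of its medoid. max_dist is a hypothetical cutoff, not a published value."""
    n = len(seqs)
    dist = [[p_distance(a, b) for b in seqs] for a in seqs]
    for k in range(1, n + 1):
        medoids, assign = k_medoids(dist, k)
        if max(dist[i][assign[i]] for i in range(n)) <= max_dist:
            return sorted(medoids)
    return list(range(n))


# Toy alignment for one locus: two divergent groups of three sequences each.
toy = [
    "AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT",
    "CCCCCCCCCC", "CCCCCCCCCG", "CCCCCCCCGG",
]
reps = select_representatives(toy, max_dist=0.15)
print(reps)  # [1, 4]
```

In this toy case one central sequence per group is enough to cover all six sequences, which mirrors how a small number of representatives per locus can stand in for a much larger alignment.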
Provider:
Dryad
Created:
2018-12-04



