Taxonomic classification based on k-mers

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/7040144

Description:

DNA sequencing makes it possible to obtain complete genomic DNA from environmental samples without laboratory microbiological cultures. Metagenomics, the direct sequencing of DNA from microbial communities, has radically changed the field of microbiology by uncovering a broad space of the planet’s microbial diversity, much of which remains unknown. Metagenomic approaches have become standard methods for identifying the biodiversity and the gene or metabolomic functions of bacterial and archaeal communities, with applications not only in microbial ecology but also in public health, such as clinical diagnostics and pathogen detection.

Yet the falling cost of high-throughput sequencing and the large amount of microbial data produced every day highlight one of the main biological questions: the taxonomic classification of metagenomic short reads. Which organisms are contained in a sample? Are there any features that can be used to identify them?

To this end, many algorithms have been developed that achieve high speed by counting k-mers, short sequence substrings of fixed length k. In this way, a list of features can be computed for each input sequence that describes it. The question then becomes how the resulting k-mers can be used for the taxonomic classification of the input sequences.

Sample processing, sequencing, and core amplicon data analysis were performed by the Earth Microbiome Project (www.earthmicrobiome.org), and all amplicon sequence data and metadata have been made public through the EMP data portal (qiita.microbio.me/emp): Thompson, L. R., Sanders, J. G., McDonald, D., Amir, A., …, Jansson, J. K., Gilbert, J. A., Knight, R., & The Earth Microbiome Project Consortium. (2017). A communal catalogue reveals Earth’s multiscale microbial diversity. Nature, 551:457-463. doi:10.1038/nature24621.
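The k-mer counting mentioned in the description can be illustrated with a minimal sketch. It is not the method used in this dataset; the function names `count_kmers` and `classify_read`, the toy reference sequences, and the rule "pick the reference sharing the most distinct k-mers" are all assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, Optional


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in a DNA sequence."""
    seq = sequence.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows that contain ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def classify_read(read: str, references: Dict[str, str], k: int) -> Optional[str]:
    """Hypothetical toy classifier: return the label of the reference that
    shares the most distinct k-mers with the read, or None if none are shared."""
    read_kmers = set(count_kmers(read, k))
    best_label, best_shared = None, 0
    for label, ref_seq in references.items():
        shared = len(read_kmers & set(count_kmers(ref_seq, k)))
        if shared > best_shared:
            best_label, best_shared = label, shared
    return best_label


# Made-up reference sequences, for illustration only.
toy_references = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "TTTTGGGGCC",
}
print(count_kmers("ACGTAC", 3))
print(classify_read("CGTACG", toy_references, k=3))
```

The k-mer counts serve as the feature list for each sequence, and the second function shows one simple way such features could drive a taxonomic assignment. A practical classifier would need a far larger reference collection and a more careful scoring scheme than this sketch.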

Created:

2022-09-01
