MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library

Indexed by NIAID Data Ecosystem on 2026-05-01

Download link:

https://zenodo.org/record/10586949

Description:

Purpose

Reference sequence libraries for annotating marine metatranscriptomes are publicly available but differ notably in content, quality control, taxonomic specificity and domain-level organismal focus[1]. This data repository combines the products of the MarFERReT v1.1 marine microbial eukaryote sequence library[1][2] and the MARMICRODB v1.0 prokaryote-focused marine genome database[3][4]. Its purpose is to supplement the curated, eukaryote-focused MarFERReT library with marine prokaryote sequence representation from MARMICRODB. The repository contains a handful of shell, R, and Python processing scripts described in the Methods, in addition to a combined FASTA sequence file and a .tab file linking sequences to their NCBI taxonomy ID. The scripts are also hosted on the MarFERReT code repository: https://github.com/armbrustlab/marferret/tree/main/scripts/marferret_marmicrodb

Methods and scripts

marmicrodb_processing.sh
Shell script that describes how to download the MARMICRODB v1.0 files and run the accessory scripts:
1. Download the MARMICRODB database from the Zenodo repository
2. Run [filter_marmicrodb_entries.R]: Filters out MAGs, eukaryotes, and entries without NCBI taxIDs
3. Run [process_marmicrodb_fasta.py]: Generates a FASTA file and taxonomy table to combine with MarFERReT
4. Run [marmicrodb_remove_numeric_seqs.py]: Removes a subset of sequences with numerical values in sequence fields

merge_marferret_marmicrodb.sh
Combines the FASTA file and taxonomy tab file from above with the equivalent files from the MarFERReT v1.1 eukaryote sequence library, and contains code for creating an indexed binary database of the combined files for annotation using the DIAMOND[6] fast read alignment program:
1. Concatenation of Gzip-compressed FASTA and taxonomy table files available in this repository: MarFERReT.MARMICRODB.v1.1.combined.faa.gz and MarFERReT.MARMICRODB.v1.1.combined.uid2tax.tab.gz
2. Code to run DIAMOND makedb using the above files within a Singularity container (the indexed DIAMOND database is not included here).

filter_marmicrodb_entries.R
For our goal of integrating MARMICRODB with the eukaryotic MarFERReT library, we remove some of the entries from MARMICRODB:
1. Metagenome-assembled genomes (MAGs), for stringent reference organism identity
2. MMETSP marine eukaryotes, which are redundant with MarFERReT sequence data
3. Entries without a valid NCBI taxID
An entry in the MARMICRODB FASTA file is therefore kept only if (1) it is not a MAG, (2) it is not a eukaryote, and (3) it has a valid NCBI taxID. This R script identifies the entries from the downloaded MARMICRODB files that satisfy all three criteria.

process_marmicrodb_fasta.py
A custom Python script that keeps the sequences for the entry IDs found in the filter_marmicrodb_entries.R output files, generating a filtered FASTA file and a 'UID2TAX' file for use downstream with the DIAMOND fast protein aligner.

marmicrodb_remove_numeric_seqs.py
A small number of MARMICRODB.faa sequences contain unexpected numeric values in the amino acid sequence strings. This script removes a total of 323,835 sequences with numeric residues (1.2% of total sequences).

Data files

MarFERReT.MARMICRODB.v1.1.combined.faa.gz
Gzip-compressed FASTA file containing the concatenated protein sequences from MarFERReT v1.1 and a subset of sequences from MARMICRODB v1.0, totalling 55,517,966 sequences. Generated as described in merge_marferret_marmicrodb.sh.

MarFERReT.MARMICRODB.v1.1.combined.uid2tax.tab.gz
This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data.
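As a rough illustration of such a table, the Python sketch below writes sequence-to-taxonomy pairs in a four-column, tab-separated layout with unused columns filled with NA. It assumes the header names of NCBI's prot.accession2taxid format (accession, accession.version, taxid, gi). The helper name write_uid2tax, the sequence identifiers and the taxIDs are hypothetical placeholders, and this is not the code used in the repository's scripts.

```python
import csv
import io

# Assumed header, following NCBI's prot.accession2taxid column names.
HEADER = ["accession", "accession.version", "taxid", "gi"]


def write_uid2tax(pairs, handle):
    """Write (sequence_id, taxid) pairs as a four-column tab-separated table.

    Only 'accession.version' and 'taxid' carry data; the other two
    columns are filled with the placeholder string 'NA'.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for seq_id, taxid in pairs:
        writer.writerow(["NA", seq_id, int(taxid), "NA"])


# Hypothetical identifiers and taxIDs, for illustration only.
pairs = [("mft1", 2864), ("mmdb1", 1219)]
buffer = io.StringIO()
write_uid2tax(pairs, buffer)
table_text = buffer.getvalue()
print(table_text)
```

To produce a Gzip-compressed file instead of an in-memory string, the same function can be given a handle opened with gzip.open(path, "wt").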
Each protein sequence in MarFERReT.MARMICRODB.v1.1.combined.faa.gz is listed in this file together with its NCBI Taxonomy identifier. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis. Generated as described in merge_marferret_marmicrodb.sh.

The columns in this file contain the following information:
accession: (NA)
accession.version: The unique sequence identifier ('mftX' or 'mmdbX').
taxid: The NCBI Taxonomy ID associated with this reference sequence.
gi: (NA)

References

If you use these data, please cite the original publication and data source for the MarFERReT and MARMICRODB reference databases as well as the specific DOI for this combined data product:

Groussman, R. D., Blaskowski, S., Coesel, S. N., & Armbrust, E. V. (2023). MarFERReT, an open-source, version-controlled reference library of marine microbial eukaryote functional genes. Scientific Data, 10(1), 926. https://doi.org/10.1038/s41597-023-02842-4

Groussman, R. D., Blaskowski, S., Coesel, S., & Armbrust, E. V. (2023). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10170983

Becker, J. W., Hogle, S. L., Rosendo, K., & Chisholm, S. W. (2019). Co-culture and biogeography of Prochlorococcus and SAR11. The ISME Journal, 13(6), 1506-1519. https://doi.org/10.1038/s41396-019-0365-4

Hogle, S. L. (2019). MARMICRODB database for taxonomic classification of (marine) metagenomes (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3520509

Groussman, R. D. (2024). MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10586950
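As an illustration of the filtering step described for marmicrodb_remove_numeric_seqs.py under Methods and scripts, the following Python sketch drops FASTA records whose sequence strings contain digits. It is a minimal reconstruction under stated assumptions, not the repository's script: the function names read_fasta, has_numeric_residue and remove_numeric_seqs are hypothetical, as are the example records.

```python
import io

DIGITS = set("0123456789")


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_numeric_residue(sequence):
    """Return True if any character of the sequence is an ASCII digit."""
    return any(ch in DIGITS for ch in sequence)


def remove_numeric_seqs(records):
    """Split records into those kept and a count of those removed."""
    kept, removed = [], 0
    for header, sequence in records:
        if has_numeric_residue(sequence):
            removed += 1
        else:
            kept.append((header, sequence))
    return kept, removed


# Hypothetical records: the second one contains digits and is dropped.
example = io.StringIO(
    ">mmdb1\nMKTAYIAK\nQRQISFVK\n>mmdb2\nMKT12AYI\n>mmdb3\nMSSH\n"
)
kept, removed = remove_numeric_seqs(read_fasta(example))
print(kept, removed)
```

For a Gzip-compressed FASTA file, a text handle from gzip.open(path, "rt") can be passed to read_fasta in place of the in-memory example.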

Created:

2024-01-31
