
Mash Sketch of RefSeq Bacterial Reference Genomes

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/13901152
Description:
The mash reference that can be downloaded from the mash documentation is for RefSeq version 70. I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now. RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist. The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

The update frequency is intended to mirror that of RefSeq (i.e. 4 times a year), but... is likely to be less frequent than that. Don't hesitate to submit an issue if this needs to get updated.

I do have some prior zenodo repositories (https://zenodo.org/records/10519852 , https://zenodo.org/records/7887021 , and https://zenodo.org/records/7348463 ) which hold the same mash sketch reference, but the refseq version is in the title. I'd rather have one repository that gets updated rather than create new repositories each time.

This is how the mash reference file was created:

    # Step 1. Download Datasets and Dataformat
    wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
    wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
    chmod +x datasets dataformat

    # Step 2. Download Mash
    wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
    tar -xvf mash-Linux64-v2.3.tar

    # Step 3. Get a list of all the genomes
    # Note: this also changes how some of the names are represented
    # (brackets and quotes are removed, "endosymbiont of " becomes one token)
    datasets summary genome taxon bacteria --reference --as-json-lines | \
      dataformat tsv genome --fields accession,organism-name --elide-header | \
      sed 's/\[//g' | \
      sed 's/\]//g' | \
      sed 's/["'\'']//g' | \
      sed 's/endosymbiont of /endosymbiont_of_/g' > \
      ids.txt

    # Step 4. Download the reference files and sketch them
    # Note: Since this is done in Github Actions (GA), I need to keep everything below 30G.
    # The best way to do this is to download and process each reference file individually,
    # and then combine it into the whole.
    # This obviously does not need to be followed if not under those same limitations.
    # Note: ${version} is not set in this snippet and must be defined beforehand.
    while read -r line
    do
      # Each line of ids.txt holds: accession, genus, species
      id=$(echo "$line" | awk '{print $1}')
      ge=$(echo "$line" | awk '{print $2}')
      if [ -z "$ge" ] ; then ge="unknown" ; fi
      sp=$(echo "$line" | awk '{print $3}')
      if [ -z "$sp" ] ; then sp="unknown" ; fi

      datasets download genome accession "$id"
      unzip ncbi_dataset.zip
      cp ncbi_dataset/data/*/*_genomic.fna "${ge}_${sp}_${id}.fasta"

      if [ ! -f "RefSeqSketches_${version}.msh" ]
      then
        # First genome: start the combined sketch
        mash sketch "${ge}_${sp}_${id}.fasta" -o "RefSeqSketches_${version}"
      else
        # Later genomes: sketch individually, then paste onto the combined sketch
        mash sketch "${ge}_${sp}_${id}.fasta" -o "${ge}_${sp}_${id}"
        mv "RefSeqSketches_${version}.msh" tmp.msh
        mash paste "RefSeqSketches_${version}" tmp.msh "${ge}_${sp}_${id}.msh"
        rm tmp.msh "${ge}_${sp}_${id}.msh"
      fi

      # Clean up to stay under the disk limit
      rm "${ge}_${sp}_${id}.fasta"
      rm -rf ncbi_dataset/
      rm ncbi_dataset.zip
      rm README.md
      rm md5sum.txt
    done < ids.txt

To use:

    # download file
    wget <URL of the .msh file>
    mash dist sample.fasta RefSeqSketches_<version>.msh > mash_results.txt

    # These results are unsorted, so many find it useful to sort them.
    sort -gk3 mash_results.txt > sorted_mash_results.txt

They should look like the following:

    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_agalactiae_GCF_001552035.1.fasta 0.164662 1.32611e-72 16/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_castoreus_GCF_000425025.1.fasta 0.174408 2.34302e-58 13/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_didelphis_GCF_000380005.1.fasta 0.182269 8.30736e-49 11/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_uberis_GCF_900475595.1.fasta 0.186761 5.62934e-44 10/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_iniae_GCF_000831485.1.fasta 0.191731 3.33152e-39 9/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_ictaluri_GCF_000188015.2.fasta 0.197292 1.75608e-34 8/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_phocae_GCF_001302265.1.fasta 0.203604 2.46548e-30 7/1000
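Once the results are sorted by distance, the closest reference is simply the first row. The following is a minimal sketch of pulling a genus, species, and distance out of that row. It is not part of the original workflow: the shortened query name `sample.fa` and the variable name `top_hit` are made up for illustration, and the three rows are taken from the example output and deliberately shuffled. It also assumes the reference names follow the genus_species_accession.fasta pattern produced by the script, so names such as `endosymbiont_of_...` would need extra handling.

```shell
# Hypothetical input: three rows from the example output, out of order,
# tab-separated as mash dist writes them (query name shortened to sample.fa).
printf '%s\t%s\t%s\t%s\t%s\n' \
  sample.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000 \
  sample.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000 \
  sample.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000 \
  > mash_results.txt

# Sort by the third column (mash distance), smallest first.
sort -g -k3,3 mash_results.txt > sorted_mash_results.txt

# Take the closest reference and split its file name into genus and species.
top_hit=$(head -n 1 sorted_mash_results.txt | awk -F'\t' '{
  ref = $2
  sub(/\.fasta$/, "", ref)      # drop the file extension
  split(ref, parts, "_")        # genus, species, then the GCF accession pieces
  print parts[1], parts[2], $3  # genus species distance
}')
echo "$top_hit"
```

A small distance with many shared hashes (643/1000 in the example) points to a close match, while rows with only a handful of shared hashes are much weaker evidence.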
Created:
2025-03-14