ITS NCBI Qiime2 format no uncultured fungi
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2024-07-28.
Download link:
https://figshare.com/articles/dataset/ITS_NCBI_Qiime2_format_no_uncultured_fungi/14702727/1
Description:
QIIME 2-formatted NCBI ITS database (FASTA + taxonomy) for the analysis of fungal ITS amplicon sequencing data. All sequences not identified to at least phylum level were removed.

Data download:

esearch -db nuccore -query "\"\(internal transcribed spacer 1\"[All Fields] AND \"fungi\"[Filter] AND \(250[SLEN] : 10000[SLEN]\)\) NOT \"uncultured Neocallimastigales\"[porgn] NOT \"bacteria\"[Filter] NOT \"uncultured fungus\"[Filter] NOT \"Uncultured fungus\"[Filter] NOT \"fungal sp.\"[Filter]" | efetch -format fasta -mode text > ./NCBI_ITS1_DB_raw.fasta

Data processing (https://github.com/gzahn/tools/blob/master/make_qiime_database_from_fasta.sh):

### Search for and remove any empty sequences ###
gawk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} {if ($2) print ">"$0}' NCBI_ITS1_DB_raw.fasta > NCBI_ITS1_DB_raw.fasta.tidy

# Obtain NCBI taxonomy lineages for your input fasta
python2 /home/bioinf/bin/entrez_qiime.py -i NCBI_ITS1_DB_raw.fasta.tidy -o NCBI_Taxonomy.txt -r kingdom,phylum,class,order,family,genus,species -a /media/bioinf/Data/NCBI_tax2021/nucl_gb.accession2taxid -n /media/bioinf/Data/NCBI_tax2021

### Validate and Tidy up files ###
### Edit output file to include rank IDs (QIIME needs them for some scripts)
cat NCBI_Taxonomy.txt | sed 's/\t/\tk__/' | sed 's/;/>p__/' | sed 's/;/>c__/' | sed 's/;/>o__/' | sed 's/;/>f__/' | sed 's/;/>g__/' | sed 's/;/>s__/' | sed 's/>/;/g' > NCBI_QIIME_Taxonomy.txt
### Edit database to single-line fasta format
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < NCBI_ITS1_DB_raw.fasta.tidy > NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
### Remove first blank line
sed -i '/^$/d' NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
### Remove trailing descriptions after Accession No.
sed -i 's/ .*//' NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
### Compare read counts in fasta and txt files
grep -c "^>" NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
wc -l NCBI_QIIME_Taxonomy.txt
# If the numbers are different, there are duplicates introduced by entrez_qiime.py
### If duplicates appear in the fasta file (i.e., more reads than taxonomy IDs), get lists of Seq/Taxonomy IDs and remove duplicates from the fasta file
cut -f 1 NCBI_QIIME_Taxonomy.txt > Tax_Names
grep "^>" NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta | cut -d " " -f 1 | sed 's/>//g' > DB_Names
sort DB_Names | uniq -d > Duplicated_IDs
grep -A1 -f Duplicated_IDs NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta | sed '/^--/d' > Duplicated_fastas
# The next line is truncated in this listing; see the linked script for the complete command
for fn in Duplicated_fastas; do count=$(wc -l add_back; done
grep -v -f Duplicated_IDs NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta > tidy.no_reps.fasta
cat tidy.no_reps.fasta add_back > DB_raw.fasta
### Sort fasta database to same order as taxonomy map
echo "Sorting Database...This will take some time."
cut -f 1 NCBI_QIIME_Taxonomy.txt > IDs_in_order.txt
while read ID ; do grep -m 1 -A 1 "^>$ID" DB_raw.fasta ; done < IDs_in_order.txt > DB.fasta #This will take quite a long time to run
mv NCBI_QIIME_Taxonomy.txt Taxonomy.txt
rm DB_Names DB_raw.fasta Duplicated_fastas Duplicated_IDs IDs_in_order.txt NCBI_Taxonomy.txt Tax_Names tidy.no_reps.fasta NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta NCBI_ITS1_DB_raw.fasta.tidy add_back
cat NCBI_ITS1_DB_raw.fasta.log
grep "^>" DB.fasta | sed 's/>//' >good_acc_list
echo "Cleaning Taxonomy to match Database...This may take some time."
while read ID ; do grep -m 1 $ID Taxonomy.txt ; done < good_acc_list > Taxonomy_ordered.txt
#mv $4/Taxonomy_ordered.txt $4/Taxonomy.txt
#rm $4/good_acc_list
grep "k__NA;p__NA;c__NA;o__NA;f__NA;g__NA;s__NA\|^:" Taxonomy_ordered.txt | cut -f1 > bad_acc_list

sed -e '/k__NA;p__NA;c__NA;o__NA;f__NA;g__NA;s__NA/d' Taxonomy_ordered.txt > Taxonomy_clean1.txt
sed -e '/^:/d' Taxonomy_clean1.txt > Taxonomy.txt
echo "Final cleanup to remove bad accessions..."
while read bad; do echo "Removing $bad" ; sed -i -e "/$bad/,+1d" DB.fasta ; done < bad_acc_list
sed -i -e '/^>:/,+1d' DB.fasta

grep "^>" DB.fasta | sed 's/>//' > DB_IDs_ordered
while read ID; do grep $ID Taxonomy_ordered.txt ; done < DB_IDs_ordered > Taxonomy_final.txt

rm Taxonomy_clean1.txt Taxonomy_ordered.txt
mv bad_acc_list bad_acc_list.txt

echo -e "Process complete. Final database is DB_ordered.fasta, and associated taxonomy is Taxonomy_ordered.txt\nAccessions that were removed are in bad_acc_list.txt"
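The rank-prefix step in the processing script (the chain of sed substitutions that writes NCBI_QIIME_Taxonomy.txt) is easier to follow on a single line. The sketch below wraps that same chain in a helper and applies it to one tab-separated record. The helper name `add_rank_prefixes` and the sample accession and lineage are made up for illustration and are not taken from the dataset; GNU sed is assumed, since the chain relies on `\t`.

```shell
#!/usr/bin/env bash
# Illustrative sketch: the accession and lineage below are invented for the demo.
# Assumes GNU sed, which understands \t in both the pattern and the replacement.

# Hypothetical helper wrapping the sed chain from the processing script.
# Each un-flagged substitution rewrites only the first remaining ";" on a line,
# using ">" as a temporary marker so already-labelled ranks are not matched again.
add_rank_prefixes() {
  sed 's/\t/\tk__/' \
    | sed 's/;/>p__/' | sed 's/;/>c__/' | sed 's/;/>o__/' \
    | sed 's/;/>f__/' | sed 's/;/>g__/' | sed 's/;/>s__/' \
    | sed 's/>/;/g'
}

# One record in "accession<TAB>kingdom;phylum;class;order;family;genus;species" form.
sample=$(printf 'AB000001.1\tFungi;Ascomycota;Sordariomycetes;Hypocreales;Nectriaceae;Fusarium;Fusarium oxysporum')

result=$(printf '%s\n' "$sample" | add_rank_prefixes)
printf '%s\n' "$result"
```

For this sample the lineage column becomes `k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Fusarium;s__Fusarium oxysporum`, with the accession and tab left unchanged. The approach depends on every lineage having exactly seven ranks in a fixed order and containing no literal `>` character.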
Provider:
figshare
Created:
2021-06-01



