
ITS NCBI Qiime2 format no uncultured fungi

Source: DataCite Commons. Updated 2025-06-01; indexed 2024-07-28.
Download link:
https://figshare.com/articles/dataset/ITS_NCBI_Qiime2_format_no_uncultured_fungi/14702727/1
Description:
QIIME 2-formatted NCBI ITS database (FASTA + taxonomy) for analysis of fungal ITS amplicon sequencing. All sequences not identified to at least phylum level were removed.

Data download:

esearch -db nuccore -query "\"\(internal transcribed spacer 1\"[All Fields] AND \"fungi\"[Filter] AND \(250[SLEN] : 10000[SLEN]\)\) NOT \"uncultured Neocallimastigales\"[porgn] NOT \"bacteria\"[Filter] NOT \"uncultured fungus\"[Filter] NOT \"Uncultured fungus\"[Filter] NOT \"fungal sp.\"[Filter]" | efetch -format fasta -mode text > ./NCBI_ITS1_DB_raw.fasta

Data processing (https://github.com/gzahn/tools/blob/master/make_qiime_database_from_fasta.sh):

### Search for and remove any empty sequences ###
gawk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} {if ($2) print ">"$0}' NCBI_ITS1_DB_raw.fasta > NCBI_ITS1_DB_raw.fasta.tidy

# Obtain NCBI taxonomy lineages for your input fasta
python2 /home/bioinf/bin/entrez_qiime.py -i NCBI_ITS1_DB_raw.fasta.tidy -o NCBI_Taxonomy.txt -r kingdom,phylum,class,order,family,genus,species -a /media/bioinf/Data/NCBI_tax2021/nucl_gb.accession2taxid -n /media/bioinf/Data/NCBI_tax2021

### Validate and Tidy up files ###
### Edit output file to include rank IDs (QIIME needs them for some scripts)
cat NCBI_Taxonomy.txt | sed 's/\t/\tk__/' | sed 's/;/>p__/' | sed 's/;/>c__/' | sed 's/;/>o__/' | sed 's/;/>f__/' | sed 's/;/>g__/' | sed 's/;/>s__/' | sed 's/>/;/g' > NCBI_QIIME_Taxonomy.txt
### Edit database to single-line fasta format
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < NCBI_ITS1_DB_raw.fasta.tidy > NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
### Remove first blank line
sed -i '/^$/d' NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
### Remove trailing descriptions after Accession No.
sed -i 's/ .*//' NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
### compare read counts in fasta and txt files
grep -c "^>" NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta
wc -l NCBI_QIIME_Taxonomy.txt
# if numbers are different, there are duplicates introduced by entrez_qiime.py
### if some duplicates may appear in fasta file (i.e., more reads than taxonomy IDs), get lists of Seq/Taxonomy IDs and remove duplicates from fasta file
cut -f 1 NCBI_QIIME_Taxonomy.txt > Tax_Names
grep "^>" NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta | cut -d " " -f 1 | sed 's/>//g' > DB_Names
sort DB_Names | uniq -d > Duplicated_IDs
grep -A1 -f Duplicated_IDs NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta | sed '/^--/d' > Duplicated_fastas
for fn in Duplicated_fastas; do count=$(wc -l add_back; done
grep -v -f Duplicated_IDs NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta > tidy.no_reps.fasta
cat tidy.no_reps.fasta add_back > DB_raw.fasta
### Sort fasta database to same order as taxonomy map
echo "Sorting Database...This will take some time."
cut -f 1 NCBI_QIIME_Taxonomy.txt > IDs_in_order.txt
while read ID ; do grep -m 1 -A 1 "^>$ID" DB_raw.fasta ; done < IDs_in_order.txt > DB.fasta # This will take quite a long time to run
mv NCBI_QIIME_Taxonomy.txt Taxonomy.txt
rm DB_Names DB_raw.fasta Duplicated_fastas Duplicated_IDs IDs_in_order.txt NCBI_Taxonomy.txt Tax_Names tidy.no_reps.fasta NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta NCBI_ITS1_DB_raw.fasta.tidy add_back
cat NCBI_ITS1_DB_raw.fasta.log
grep "^>" DB.fasta | sed 's/>//' > good_acc_list
echo "Cleaning Taxonomy to match Database...This may take some time."
while read ID ; do grep -m 1 $ID Taxonomy.txt ; done < good_acc_list > Taxonomy_ordered.txt
#mv $4/Taxonomy_ordered.txt $4/Taxonomy.txt
#rm $4/good_acc_list
grep "k__NA;p__NA;c__NA;o__NA;f__NA;g__NA;s__NA\|^:" Taxonomy_ordered.txt | cut -f1 > bad_acc_list

sed -e '/k__NA;p__NA;c__NA;o__NA;f__NA;g__NA;s__NA/d' Taxonomy_ordered.txt > Taxonomy_clean1.txt
sed -e '/^:/d' Taxonomy_clean1.txt > Taxonomy.txt
echo "Final cleanup to remove bad accessions..."
while read bad; do echo "Removing $bad" ; sed -i -e "/$bad/,+1d" DB.fasta ; done < bad_acc_list
sed -i -e '/^>:/,+1d' DB.fasta

grep "^>" DB.fasta | sed 's/>//' > DB_IDs_ordered
while read ID; do grep $ID Taxonomy_ordered.txt ; done < DB_IDs_ordered > Taxonomy_final.txt

rm Taxonomy_clean1.txt Taxonomy_ordered.txt
mv bad_acc_list bad_acc_list.txt

echo -e "Process complete. Final database is DB_ordered.fasta, and associated taxonomy is Taxonomy_ordered.txt\nAccessions that were removed are in bad_acc_list.txt"
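
The sorting step in the script above runs one `grep` over the whole FASTA file per accession, which is why it warns that it will take a long time. The sketch below shows one possible single-pass alternative using `awk`; it is not part of the original script. It assumes a single-line FASTA whose headers contain only the accession, and a tab-separated taxonomy file with the accession in column 1, as produced by the earlier steps. The helper names `reorder_fasta` and `ids_match` are hypothetical.

```shell
# Sketch only, not the author's method: reorder a single-line FASTA to match
# the taxonomy file in one pass instead of one grep per accession.

# reorder_fasta TAXONOMY FASTA
# Prints FASTA records in the order of column 1 of TAXONOMY.
# For duplicated accessions the first record wins (like `grep -m 1`).
reorder_fasta() {
    awk -F'\t' '
        NR == FNR {                              # first file: the FASTA
            if ($0 ~ /^>/) { id = substr($0, 2); next }
            if (!(id in seq)) seq[id] = $0       # keep first copy only
            next
        }
        ($1 in seq) { print ">" $1; print seq[$1] }  # second file: taxonomy
    ' "$2" "$1"
}

# ids_match TAXONOMY FASTA
# Succeeds only if both files list the same accessions in the same order.
ids_match() {
    tmp=$(mktemp)
    cut -f 1 "$1" > "$tmp"
    grep '^>' "$2" | sed 's/^>//' | cmp -s - "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `reorder_fasta Taxonomy.txt DB_raw.fasta > DB.fasta` followed by `ids_match Taxonomy.txt DB.fasta` would check that the two files line up before importing them. The whole FASTA is held in memory, so this trades RAM for speed.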
Provider:
figshare
Created:
2021-06-01