EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life
Source: Figshare. Updated 2020-07-01; indexed 2026-04-08.
Download link:
https://figshare.com/articles/EukProt_a_database_of_genome-scale_predicted_proteins_across_the_diversity_of_eukaryotic_life/12417881/1
Resource description:
<b>Version 1</b> (8 May, 2019)<br><br>A database of published/publicly available predicted protein sets and unannotated genomes selected to represent eukaryotic diversity, including 708 species across all major supergroups (Amorphea, Archaeplastida, CRuMs, Cryptista, Discoba, Haptista, Hemimastigophora, Metamonada, TSAR) and orphan taxa (Ancyromonadida, Malawimonadidae, Picozoa) (Burki <i>et al</i>. 2019, DOI: 10.1016/j.tree.2019.08.008).<br><br><b>EukProt_proteins.v01.2019_05_08.tgz</b>: predicted protein sets, for 694 species with either a genome with predicted proteins (242 species) or a transcriptome (452 species).<br><br><b>EukProt_unannotated_genomes.v01.2019_05_08.tgz</b>: genomes, for 14 species with genomic data lacking predicted proteins (these are almost exclusively single-cell genomes).<br><br><b>EukProt_assembled_transcriptomes.v01.2019_05_08.tgz</b>: contigs, for 46 species with publicly available reads but no publicly available transcriptome assembly. The proteins predicted from these assemblies are included in the proteins file.<br><br><b>EukProt_included_data_sets.v01.2019_05_08.txt</b> and <b>EukProt_not_included_data_sets.v01.2019_05_08.txt</b>: tables of information on data sets either included or not included in the database. Both tables are tab-delimited; multiple entries in the same cell are comma-delimited, and missing data are represented by the value “N/A”. The columns are as follows:<br><br><i>EukProt_ID</i>: the unique identifier associated with the data set. This will not change among versions. 
If a new data set becomes available for the species, it will be assigned a new unique identifier.<br><i>Name_to_Use</i>: the name of the species for protein/genome/assembled transcriptome files.<br><i>Strain</i>: the strain(s) of the species sequenced.<br><i>Previous_Names</i>: any previous names that this species was known by, not including cases where a species was originally assigned to a genus but not identified to the species level (e.g., <i>Goniomonas</i> sp., now identified as <i>Goniomonas avonlea</i>, is not listed as a previous name).<br><i>Replaces_EukProt_ID</i>/<i>Replaced_by_EukProt_ID</i> (included for forward compatibility): if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).<br><i>Genus_UniEuk</i>, <i>Epithet_UniEuk</i>, <i>Supergroup_UniEuk</i>, <i>Taxogroup_UniEuk</i>: taxonomic identifiers at different levels of the UniEuk taxonomy (based on Adl <i>et al</i>. 2018, DOI: 10.1111/jeu.12691).<br><i>Taxonomy_UniEuk</i>: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).<br><i>Merged_Strains</i>: whether multiple strains of the same species were merged to create the data set.<br><i>Data_Source_URL</i>: the URL(s) from which the data were downloaded.<br><i>Data_Source_Name</i>: the name of the data set (as assigned by the data source).<br><i>Paper_DOI</i>: the DOI(s) of the paper(s) that published the data set.<br><i>Actions_Prior_to_Use</i>: the action(s) that were taken to process the publicly available files in order to produce the data set in this database, excluding genomes lacking annotations (these are provided as is, with the label ‘translated sequence search’ indicating that proteins of interest can be identified with translated sequence homology search software). Actions taken:<br>‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/<br>‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/<br>‘extractfeat’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/<br>‘translate mRNA’: TransDecoder v. 5.3.0, http://transdecoder.github.io/<br>All parameter values were default, unless otherwise specified.<br><i>Data_Source_Type</i>: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).<br><i>Notes</i>: additional information on the data set (for example, why it was not included).
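
The table conventions described above (tab-delimited columns, comma-delimited multi-value cells, semicolon-delimited lineage, “N/A” for missing data) could be read along the following lines. This is a minimal sketch, not an official loader: the function names are hypothetical, the choice of which columns to split on commas is an assumption based on the plural wording of the column descriptions, and the sample row uses placeholder values rather than real database entries.

```python
import csv
import io

MISSING = "N/A"  # missing-data marker stated in the description

# Assumption: these are the columns that may hold several comma-delimited
# entries, inferred from the "(s)" wording in the column descriptions.
MULTI_VALUE_COLUMNS = {
    "Strain",
    "Previous_Names",
    "Data_Source_URL",
    "Paper_DOI",
    "Actions_Prior_to_Use",
}


def parse_cell(column, value):
    """Convert one raw cell into None, a list, or a plain string."""
    if value is None or value == MISSING:
        return None
    if column == "Taxonomy_UniEuk":
        # The full lineage is semicolon-delimited.
        return [part.strip() for part in value.split(";")]
    if column in MULTI_VALUE_COLUMNS:
        return [part.strip() for part in value.split(",")]
    return value


def read_eukprot_table(handle):
    """Read a tab-delimited table from an open text handle into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {column: parse_cell(column, value) for column, value in row.items()}
        for row in reader
    ]


if __name__ == "__main__":
    # Placeholder data for illustration only; not taken from the real table.
    sample = (
        "EukProt_ID\tName_to_Use\tStrain\tPaper_DOI\tTaxonomy_UniEuk\n"
        "PLACEHOLDER_ID\tGenus_species\tstrainA, strainB\tN/A\tEukaryota;Group\n"
    )
    for record in read_eukprot_table(io.StringIO(sample)):
        print(record)
```

To read the real table, the same function could be given an open file handle, for example `open("EukProt_included_data_sets.v01.2019_05_08.txt", newline="", encoding="utf-8")`; the encoding is an assumption, since the description does not state it.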
Provider:
Colomban De Vargas
Created:
2020-07-01



