LukProt - an animal evolution-centric eukaryotic protein database

Mendeley Data. Updated 2024-06-19. Indexed 2024-06-28.

Download link:

https://zenodo.org/records/11321046

Description:

LukProt is the EukProt database extended with additional species, mostly from undersampled animal taxa and some holozoan taxa. Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.

The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. Its main purpose is to serve as a resource for checking whether a given protein or domain is present in large clades and for reconstructing its pedigree. The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).

The datasets that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:

    (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY

where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence is assigned a number that is unique within its taxon. All the IDs are compatible with the BLAST v5 "-parse_seqids" option, and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.

A publicly available BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.
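The ID format described above can be taken apart with plain shell parameter expansion. The following is a minimal sketch, not part of the LukProt distribution: the function name parse_lukprot_header and the example header are hypothetical, and it assumes that the dataset ID contains no underscore, that the protein number is the last underscore-separated field, and that any strain label is simply part of the species field.

```shell
#!/bin/sh
# Sketch: split a LukProt FASTA header into tab-separated fields:
#   dataset ID, species (plus strain, if any), protein number, source identifier.
# Assumed header layout, taken from the description above:
#   >(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY source_identifier
# The function name and the example header below are hypothetical.

parse_lukprot_header() {
    header=${1#>}                 # drop a leading '>' if present
    lukprot_id=${header%% *}      # text before the first blank space
    case $header in
        *" "*) source_id=${header#* } ;;   # text after the first blank space
        *)     source_id= ;;               # no source identifier present
    esac

    # Reject anything that does not match the documented ID pattern.
    printf '%s\n' "$lukprot_id" |
        grep -Eq '^[AEL]P[0-9]{5}_.+_P[0-9]{6}$' || return 1

    dataset=${lukprot_id%%_*}     # e.g. LP00001
    protein=${lukprot_id##*_}     # e.g. P000042
    species=${lukprot_id#*_}      # strip the dataset ID and its underscore
    species=${species%_*}         # strip the protein number and its underscore

    printf '%s\t%s\t%s\t%s\n' "$dataset" "$species" "$protein" "$source_id"
}

# Example call with an invented header:
parse_lukprot_header '>LP00001_Genus_species_P000042 original_id_123'
```

The function returns a non-zero status for headers that do not follow the pattern, so it can be used in a loop to flag unexpected IDs. The variables are global because POSIX sh has no local declarations.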
Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:

Taxogroup                    EukProt v2  EukProt v3  LukProt v1.4.1  LukProt v1.5.1
Holozoa (excluding Metazoa)  31          40          39              43
Ctenophora                   2           2           35              38
Porifera                     4           5           30              47
Placozoa                     2           2           3               6
Cnidaria                     3           5           65              88
Bilateria                    51          51          94              142

Included with the database are:

Main database files (ready to use):
- LukProt_v1.5.1_single_species_FASTA.7z -- FASTA files with the sequences, 7-zipped, uncompressed size: 17.6 GB.
  To concatenate all of them into one file, run this in the parent directory:
    for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done
  This will create a single FASTA file with all the sequences in the parent directory. awk is used to insert a line break at each file boundary because cat would sometimes merge the last sequence of one file with the header of the first sequence of the next.
- LukProt_v1.5.1_full_BLAST_db.7z -- a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB.
- LukProt_v1.5.1_taxogroup_BLAST_db.7z -- a collection of BLAST databases where each dataset is one taxogroup and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB.
- LukProt_v1.5.1_single_species_BLAST_db.7z -- a collection of BLAST databases where each dataset is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB.

Metadata:
- a README file
- LukProt_IDs_mapped.txt.gz -- a text file mapping the LukProt IDs to the AniProtDB IDs and to the EukProt IDs that are different
- BUSCO_tables.ods -- a spreadsheet with full result tables generated by BUSCO analysis
- OMArk_output.zip -- a folder with the results of all OMArk analyses
- LukProt_metadata_sheet.ods -- a spreadsheet with information about each dataset (in the open .ods format, most compatible with LibreOffice)

Data manipulation scripts:
- a recoloring script (modified by LFS, originally by Dr. Celine Petitjean).
The script is in the public domain and is reuploaded here only for convenience.
- other files (see README)
- changelog

Words of caution:
- The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change in future updates.
- Many datasets, especially transcriptome-based ones, may contain contamination from other species. In addition, the translation algorithms often introduce errors (e.g. the transcript is not full). For this reason, to get accurate sequences from each organism, users are directed to the source data. Refer to the included OMArk and BUSCO data for details.
- The taxonomy differs from UniEuk/EukMap, but UniEuk data were integrated where possible.
- A few NCBI taxids are missing.
- A number of datasets present in some metadata are unpublished and were held back.
- While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future, efforts will be made to name clades officially once they are more firmly established.

Acknowledgements:
- Andrew E. Allen Lab for creating the original PhyloDB.
- Daniel Richter et al. for creating EukProt and keeping it updated.
- Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.
- All the authors of the original datasets.
- National Science Centre of Poland for funding the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.

Created:

2024-05-29
