LukProt - an animal evolution-centric eukaryotic protein database
Source: Mendeley Data. Updated 2024-06-27; indexed 2024-06-28.
Download link:
https://zenodo.org/records/10522407
Description:
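LukProt is the EukProt database extended with additional species, mostly from undersampled animal taxa and some other holozoan taxa. Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala@hirszfeld.pl.
The database is composed of sequences translated from annotated genomes, transcriptomes or expressed sequence tags (ESTs). Its main purpose is to serve as a resource for checking whether a given protein or domain is present in large clades and for reconstructing its evolutionary history.
The current version of the database (v1.5.1) is based on EukProt v3. This page (Zenodo) is the home of all public versions of LukProt.
Datasets that are new in LukProt are denoted LPXXXXX, and those taken from AniProtDB are denoted APXXXXX. Sequence IDs from EukProt are preserved in LukProt, so every sequence has an ID of the form `(A/E/L)PXXXXX_Species_epithet_(strain)_PXXXXXX`, where each X is a digit from 0 to 9. The trailing number is unique to each sequence within a taxon.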
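As a rough illustration of the ID layout described above, the following bash sketch splits an ID into its dataset accession, taxon name and per-taxon sequence number. The function name `parse_lukprot_id` and the example ID are hypothetical, and the digit counts (five and six) are an assumption read off the number of X placeholders in the pattern.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a LukProt-style sequence ID into its parts.
# Assumed layout: (A/E/L)P + 5 digits, "_", taxon name (may include a strain),
# "_P" + 6 digits. The digit counts are inferred from the "X" placeholders.
parse_lukprot_id() {
  local id=$1
  local re='^([AEL]P[0-9]{5})_(.+)_P([0-9]{6})$'
  if [[ $id =~ $re ]]; then
    # Tab-separated output: dataset accession, taxon name, sequence number
    printf '%s\t%s\t%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
  else
    echo "not a recognised ID: $id" >&2
    return 1
  fi
}

# Example with a made-up ID:
parse_lukprot_id "LP00001_Genus_species_strainX_P000042"
```

The first field tells the source of a dataset at a glance: `EP` for EukProt, `AP` for AniProtDB and `LP` for datasets new in LukProt.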
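All IDs are compatible with the BLAST v5 `-parse_seqids` option, and the database can be readily deployed, for example on a server running SequenceServer. In each source FASTA file, the original sequence identifier is kept after a blank space following the new ID, so it can still be retrieved if needed.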
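Because the original identifier follows the new ID after a blank space, it can be recovered from a header line by splitting at the first space. The following is a minimal bash sketch; the function name `list_source_ids` and the sample record are made up for illustration.

```bash
# Hypothetical helper: print "new_id<TAB>source_id" for every header in a FASTA stream.
# Assumes headers look like ">NEW_ID original identifier text".
list_source_ids() {
  awk '/^>/ {
    line = substr($0, 2)                      # drop the leading ">"
    sp = index(line, " ")                     # position of the first blank
    if (sp == 0) { print line "\t"; next }    # header without a source identifier
    print substr(line, 1, sp - 1) "\t" substr(line, sp + 1)
  }' "$@"
}

# Example with a made-up record (reads from stdin when no file is given):
printf '>LP00001_Genus_species_P000001 TRINITY_DN1_c0_g1_i1.p1\nMKV\n' | list_source_ids
```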
A public BLAST server for searching LukProt is available at https://lukprot.hirszfeld.pl/.
Comparison of EukProt v2, EukProt v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:
| Taxogroup | EukProt v2 | EukProt v3 | LukProt v1.4.1 | LukProt v1.5.1 |
|--------|-----------|-----------|----------------|---------------|
| Holozoa (excluding Metazoa) | 31 | 40 | 39 | 43 |
| Ctenophora | 2 | 2 | 35 | 38 |
| Porifera | 4 | 5 | 30 | 47 |
| Placozoa | 2 | 2 | 3 | 6 |
| Cnidaria | 3 | 5 | 65 | 88 |
| Bilateria | 51 | 51 | 94 | 142 |
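The main database files, ready to use, are:
1. `LukProt_v1.5.1_single_species_FASTA.7z`: the sequences as FASTA files, 7-zipped; uncompressed size: 17.6 GB. To concatenate them all into one file, run this in the parent directory:
```bash
# Append every *.fasta file to one file; awk prints an empty line before each file's first line.
for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done
```
This creates a single FASTA file with all the sequences in the parent directory. awk is used to insert a line break between files, because plain `cat` would sometimes merge the last sequence of one file with the first header of the next (this happens when a file does not end with a newline).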
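A variant of the same idea, sketched under two assumptions: some file names might contain spaces, and empty lines in the merged file are unwanted. `awk 1` re-emits every input line terminated by a newline, so a file that lacks a final newline is still closed properly. The function name `concat_fasta` and its arguments are hypothetical. The output file should use an extension other than `.fasta`, or live outside the searched directory, so that it is not picked up as input.

```bash
# Hypothetical helper: merge all *.fasta files under a directory into one file.
# Usage: concat_fasta <source_dir> <output_file>
concat_fasta() {
  local src_dir=$1 out_file=$2
  # "-exec ... {} +" hands the file names straight to awk, so spaces are safe.
  # "awk 1" prints every input line followed by a newline, which supplies the
  # final newline of any file that lacks one. Files are merged in find's order.
  find "$src_dir" -type f -name '*.fasta' -exec awk 1 {} + > "$out_file"
}

# Example: concat_fasta . LukProt_v1.5.1.fa
```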
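2. `LukProt_v1.5.1_full_BLAST_db.7z`: a preformatted BLAST database (NCBI BLAST database format version 5, masked with segmasker); uncompressed size: 28.3 GB.
3. `LukProt_v1.5.1_taxogroup_BLAST_db.7z`: a collection of BLAST databases, one per taxogroup, placed within a directory structure that follows the eukaryotic tree of life; uncompressed size: 26.3 GB.
4. `LukProt_v1.5.1_single_species_BLAST_db.7z`: a collection of BLAST databases, one per dataset, placed within a directory structure that follows the eukaryotic tree of life; uncompressed size: 26.4 GB.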
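Since the stated purpose of the database is to check in which clades a protein occurs, one simple follow-up to a search against any of these BLAST databases is to count hits per dataset. The sketch below assumes the results were saved in BLAST tabular format (`-outfmt 6`, where the second column is the subject ID) and that the subject column carries the bare LukProt ID in the layout described earlier. The function name `hits_per_dataset` and the file name `hits.tsv` are hypothetical.

```bash
# Hypothetical helper: count distinct subject sequences per dataset accession
# in BLAST tabular output (-outfmt 6; column 2 = subject ID).
hits_per_dataset() {
  awk -F '\t' '
    {
      split($2, parts, "_")            # parts[1] is the accession, e.g. "LP00001"
      key = parts[1] SUBSEP $2
      if (!(key in seen)) {            # count each subject sequence only once
        seen[key] = 1
        count[parts[1]]++
      }
    }
    END { for (d in count) print d "\t" count[d] }
  ' "$@" | sort
}

# Example: hits_per_dataset hits.tsv
```

The accessions in the output can then be looked up in the dataset spreadsheet to see which taxogroups they belong to.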
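Metadata files:
- a README file
- data manipulation scripts
- a recoloring script (modified by LFS, originally by Dr. Celine Petitjean); the script is in the public domain and is reuploaded here only for convenience
- a spreadsheet with information about each dataset (in the open .ods format, most compatible with LibreOffice)
- other files: see the README and the changelog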
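Words of caution:
1. The database was synchronized to EukProt v3 in version v1.5.1, which means that earlier identifiers were modified (although the sequence numbers should be the same).
2. Many datasets, especially transcriptome-based ones, may contain contamination from other species.
3. In addition, translation algorithms often introduce errors (e.g. an incomplete transcript).
For this reason, users who need accurate sequences from a given organism are directed to the source data.
The taxonomy differs from UniEuk/EukMap, but UniEuk data were integrated where possible. A few NCBI taxids are missing. Some datasets that appear in the metadata are unpublished and were held back.
While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, the author(s) make no particular claims or hypotheses. In the future, efforts will be made to name clades officially once they are more firmly established.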
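Acknowledgements:
- The Andrew E. Allen Lab, for creating the original PhyloDB.
- Daniel Richter et al., for creating EukProt and keeping it updated.
- Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for bioinformatics support and for doing great science.
- All the authors of the original datasets.
- The National Science Centre of Poland, for funding project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.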
Created:
2024-02-02



