Update and Upgrade of the Small Protein Database SmProt

National Tibetan Plateau Data Center; updated 2023-02-09; indexed 2024-03-01

Download link:

https://data.tpdc.ac.cn/zh-hans/data/6bb5f520-f5d6-449c-8d87-81ae3f78980f

Resource description:

Small proteins are proteins shorter than 100 amino acids. SmProt holds records of gene-encoded small proteins, particularly those derived from untranslated regions (UTRs) and non-coding RNAs (ncRNAs). The small proteins were identified in human, mouse, and other species from ribosome profiling data, the literature, mass spectrometry (MS), and other sources. For each small protein, SmProt also records sequence features, genomic location, tissue or cell line, assessments of coding potential, function, variants, and verified or predicted disease associations. Particular emphasis is placed on reliability, variants, the relationship between small proteins and disease, a substantial increase in tissues, cell lines, and datasets, translation initiation, PhyloCSF scores, translation levels, and other details.

For data processing and quality control, FASTQ files for 547 Ribo-seq datasets were downloaded from the GEO and ENA databases. Each dataset was inspected manually to confirm its sequencing adapters. Adapters were removed with cutadapt 1.18, and only reads of 25–35 bp were kept. Reads were then mapped to the latest reference genome with STAR 2.5.2a in EndToEnd mode, allowing at most 2 mismatches. Ribo-seq quality and P-site offsets were assessed with the Ribo-TISH quality module; for TI-seq data, more attention was paid to translation initiation site (TIS) quality (-t). The offset values were then verified manually, and datasets without clear triplet periodicity were removed. After quality control, 419 Ribo-seq datasets remained.

Translated open reading frames (ORFs) were predicted with the Ribo-TISH prediction module. Biological and technical replicates with the same treatment within a dataset were merged.
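
The read-length filter and the triplet-periodicity check described above can be illustrated with a minimal Python sketch. It is not the SmProt pipeline, which relied on cutadapt, the Ribo-TISH quality module, and manual inspection. The function names and the 0.5 threshold are assumptions made only for illustration.

```python
from collections import Counter

MIN_LEN, MAX_LEN = 25, 35  # read-length window stated in the description


def keep_read(trimmed_read):
    """Return True if an adapter-trimmed read is 25 to 35 bases long."""
    return MIN_LEN <= len(trimmed_read) <= MAX_LEN


def frame_fractions(psite_positions, cds_start):
    """Fraction of P-site positions in frames 0, 1, and 2 relative to cds_start."""
    counts = Counter((pos - cds_start) % 3 for pos in psite_positions)
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts.get(frame, 0) / total for frame in (0, 1, 2)]


def has_periodicity(fractions, threshold=0.5):
    """Assumed rule (not from the source): frame 0 must exceed `threshold`."""
    return fractions[0] > threshold


# Toy usage: three trimmed reads and six P-site positions on one transcript.
reads = ["A" * 24, "A" * 28, "A" * 36]
kept = [r for r in reads if keep_read(r)]
fractions = frame_fractions([100, 103, 106, 109, 101, 112], cds_start=100)
periodic = has_periodicity(fractions)
```

In this toy input only the 28-base read passes the length filter, and five of the six P-sites fall in frame 0, so the dataset would count as periodic under the assumed threshold.
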
The minimum length of a candidate ORF was set to 5 amino acids. To account for both ATG and near-cognate start codons (those differing from ATG by a single nucleotide), Ribo-seq datasets treated only with cycloheximide (CHX) and lacking matched TI-seq data were analyzed twice: once to predict ORFs with the canonical ATG start codon, and once to predict ORFs with near-cognate start codons (--alt). Because the database favors evidence from the data over prior assumptions, only the best frame test result (--framebest) among multiple candidate start codons of the same ORF is reported. For datasets that include TI-seq data, alternative start codons (--alt) were included, and different parameters were set for LTM-based and HARR-based TI-seq (--harr).
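
To make "near-cognate start codon" and the 5-amino-acid minimum concrete, the following Python sketch enumerates the codons that differ from ATG at one position and scans a sequence for candidate ORFs. It is a sequence-only illustration under stated assumptions: Ribo-TISH predicts ORFs from Ribo-seq signal and frame tests, which this sketch does not attempt. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_AA = 5  # minimum candidate ORF length stated in the description


def near_cognate_codons(canonical="ATG"):
    """Codons that differ from the canonical start codon at exactly one position."""
    codons = []
    for i, ref in enumerate(canonical):
        for base in "ACGT":
            if base != ref:
                codons.append(canonical[:i] + base + canonical[i + 1:])
    return sorted(codons)


def find_orfs(seq, starts, min_aa=MIN_AA):
    """Yield (start, end, start_codon) for forward-strand candidate ORFs.

    An ORF runs from a start codon to the first in-frame stop codon.
    `end` is the index just past the stop codon. The length in amino
    acids counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon not in starts:
            continue
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                if (j - i) // 3 >= min_aa:
                    yield (i, j + 3, codon)
                break


# Two passes, mirroring the "analyzed twice" idea for CHX-only datasets.
seq = "CTG" + "GCA" * 4 + "TGA"
canonical_orfs = list(find_orfs(seq, {"ATG"}))
alt_orfs = list(find_orfs(seq, set(near_cognate_codons())))
```

There are nine near-cognate codons for ATG (three positions, three alternative bases each). In the toy sequence the only candidate ORF starts at CTG, so it appears in the near-cognate pass and not in the canonical pass.
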

Provider:

Chen Runsheng (陈润生)

Created:

2023-01-17

Collected summary

Dataset introduction

Background and challenges

Background overview

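SmProt is a database of proteins shorter than 100 amino acids. It focuses on gene-encoded small proteins, especially those derived from UTRs and non-coding RNAs. Small proteins were identified in human, mouse, and other species from ribosome profiling data, the literature, and mass spectrometry, and each record includes sequence features, genomic location, function, variants, and related information. For data processing, 547 Ribo-seq datasets were downloaded from GEO and ENA; after strict quality control and manual inspection, 419 datasets were retained for predicting translated ORFs.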

The content above was collected and summarized by 遇见数据集.