Reversible auto-encoding of amino-acid residues in reduced space: an application to predicting DNA-binding proteins

Name: Reversible auto-encoding of amino-acid residues in reduced space: an application to predicting DNA-binding proteins
Creator: Monash University
Published: 2026-02-12 08:00:52
License: No description available

DataCite Commons. Updated 2026-02-12; indexed 2026-05-04.

Download link:

https://bridges.monash.edu/articles/dataset/Reversible_auto-encoding_of_amino-acid_residues_in_reduced_space_an_application_to_predicting_DNA-binding_proteins/5619529/1

Description:

A number of recent studies have aimed to predict binding sites and other structural and sequence features of proteins using the local amino-acid sequence as input to a machine learning system. This requires representing amino acids in numerical space, typically with a sparse encoding of 20 bits per residue. The number of trainable parameters grows substantially with each neighboring residue added to the input, so the technique is restricted to predicting properties for which large amounts of data are available. Thus, there is a need to find alternatives to this type of sparse encoding. Here a method of auto-encoding the 20-dimensional sparse representation into a lower-dimensional space is developed with amino acids in mind, although the method is general. It is shown that the 20-bit sparse encoding can be reduced to a 6-dimensional real space without loss of information, and to even lower dimensions with varying degrees of information loss. An application to predicting DNA-binding sites was tested to assess the validity of the proposed method, and the auto-encoded neural network prediction was observed to be comparable or superior to that of the sparse encoding system.

PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia)

Coverage:

Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
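
The parameter-growth argument in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 7-residue window, the 10 hidden units, and the helper names `one_hot_window` and `first_layer_weights` are hypothetical choices made only for illustration. It builds the 20-bit sparse input for a sequence window and compares the number of input-to-hidden weights a fully connected network would need at 20 versus 6 dimensions per residue.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_window(window: str) -> np.ndarray:
    """Concatenate one 20-bit sparse vector per residue in the window."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = np.zeros((len(window), len(AMINO_ACIDS)))
    for pos, aa in enumerate(window):
        encoded[pos, index[aa]] = 1.0
    return encoded.ravel()


def first_layer_weights(window_size: int, dims_per_residue: int,
                        hidden_units: int) -> int:
    """Count input-to-hidden weights of a dense layer (biases excluded)."""
    return window_size * dims_per_residue * hidden_units


# Hypothetical 7-residue window and 10 hidden units (assumptions).
x = one_hot_window("ACDEFGH")
sparse_weights = first_layer_weights(7, 20, 10)
reduced_weights = first_layer_weights(7, 6, 10)
print(x.shape, sparse_weights, reduced_weights)
```

Under these assumed sizes, each extra neighboring residue adds 200 first-layer weights with the sparse encoding but only 60 with a 6-dimensional code, which is the kind of saving the description refers to.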

Provider:

Monash University

Created:

2026-02-11