Reversible auto-encoding of amino-acid residues in reduced space: an application to predicting DNA-binding proteins

Name: Reversible auto-encoding of amino-acid residues in reduced space: an application to predicting DNA-binding proteins
Creator: Monash University
License: Not specified

Indexed from Research Data Australia on 2024-12-14

Download link:

https://researchdata.edu.au/reversible-auto-encoding-binding-proteins/1948748

Description:

A number of recent studies aim to predict binding sites and other structural and sequence features of proteins, using local amino acid sequence as input to a machine learning system. This requires representing amino acids numerically, typically with a sparse encoding of 20 bits per residue. The number of trainable parameters grows substantially with each neighboring residue added to the input, so the technique is restricted to predicting properties for which large amounts of data are available. There is therefore a need for alternatives to this type of sparse encoding. Here, a method is developed for auto-encoding the 20-dimensional sparse representation into a lower-dimensional space; it is presented with amino acids in mind, although the method is general. It is shown that the 20-bit sparse encoding can be reduced to a 6-dimensional real space without loss of information, and to even lower dimensions with varying degrees of information loss. The method was validated on the prediction of DNA-binding sites, where neural network predictions based on the auto-encoded inputs were comparable or superior to those of the sparse encoding system.

Published in: PRIB 2008 proceedings, http://dx.doi.org/10.1007/978-3-540-88436-1

Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology; Chetty, Madhu; Ahmad, Shandar; Ngom, Alioune; Teng, Shyh Wei; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia)

Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
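
The input-size argument in the description can be illustrated with a minimal Python sketch. It is not the authors' code: the residue ordering, the example window `MKVLA`, the window sizes, and the hidden-layer width of 10 are all assumptions chosen for illustration, and the 6-dimensional case only counts weights, since the auto-encoder itself is not reproduced here.

```python
# Illustrative sketch only; names and sizes below are assumptions, not from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (arbitrary order)
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_window(window):
    """Concatenate one 20-bit sparse vector per residue in a sequence window."""
    vector = []
    for aa in window:
        bits = [0] * len(AMINO_ACIDS)
        bits[INDEX[aa]] = 1  # exactly one bit is set for each residue
        vector.extend(bits)
    return vector


def first_layer_weights(window_size, dims_per_residue, hidden_units):
    """Weights between the input layer and one fully connected hidden layer
    (bias terms are ignored to keep the comparison simple)."""
    return window_size * dims_per_residue * hidden_units


if __name__ == "__main__":
    x = one_hot_window("MKVLA")  # hypothetical 5-residue window
    print(len(x), sum(x))        # 100 inputs, of which 5 are non-zero

    hidden = 10                  # assumed hidden-layer width
    for w in (5, 11, 21):        # assumed window sizes
        sparse = first_layer_weights(w, 20, hidden)
        reduced = first_layer_weights(w, 6, hidden)
        print(w, sparse, reduced)
```

Under these assumptions, each residue added to the window costs 200 first-layer weights with the 20-bit encoding and 60 with a 6-dimensional encoding, which is the kind of saving the description refers to.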

Provider:

Monash University