A Method to Generate Complex Predictive Features for ML-Based Prediction of the Local Protein Structure

Y. V. Milchevskiy; Мильчевский Ю. В.; V. Y. Milchevskaya; Мильчевская В. Ю.; Y. V. Kravatsky; Кравацкий Ю. В.

doi:10.31857/S0026898423010093

A Method to Generate Complex Predictive Features for ML-Based Prediction of the Local Protein Structure

Authors: Milchevskiy Y.V.¹, Milchevskaya V.Y.¹^,2, Kravatsky Y.V.¹^,3
Affiliations:
1. Engelhardt Institute of Molecular Biology, Russian Academy of Sciences
2. Institute of Medical Statistics and Bioinformatics, Faculty of Medicine, University of Cologne
3. Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences
Issue: Vol 57, No 1 (2023)
Pages: 127-138
Section: БИОИНФОРМАТИКА
URL: https://journals.rcsi.science/0026-8984/article/view/138629
DOI: https://doi.org/10.31857/S0026898423010093
EDN: https://elibrary.ru/AWNZLZ
ID: 138629

Cite item

Full Text

Abstract
About the authors
References
Supplementary files
Statistics

Abstract

Recently, the prediction of protein structure and function from its sequence underwent a rapid increase in performance. It is primarily due to the application of machine learning methods, many of which rely on the predictive features supplied to them. It is thus crucial to retrieve the information encoded in the amino acid sequence of a protein. Here, we propose a method to generate a set of complex yet interpretable predictors, which aids in revealing factors that influence protein conformation. The proposed method allows us to generate predictive features and test them for significance in two scenarios: for a general description of the protein structures and functions, as well as for highly specific predictive tasks. Having generated an exhaustive set of predictors, we narrow it down to a smaller curated set of informative features using feature selection methods, which increases the performance of subsequent predictive modelling. We illustrate the effectiveness of the proposed methodology by applying it in the context of local protein structure prediction, where the rate of correct prediction for DSSP Q3 (three-class classification) is 81.3%. The method is implemented in C++ for command line use and can be run on any operating system. The source code is released on GitHub: https://github.com/Milchevskiy/protein-encoding-projects.

Keywords

local structure prediction, protein secondary structure prediction, protein function, protein sequence encoding, protein conformation, stepwise regression, stepwise discriminant analysis

References

Anfinsen C.B. (1973) Principles that govern the folding of protein chains. Science. 181, 223‒230.
Yang Y., Gao J., Wang J., Heffernan R., Hanson J., Paliwal K., Zhou Y. (2018) Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief. Bioinform. 19, 482‒494.
Zimmermann O., Hansmann U.H. (2008) LOCUSTRA: accurate prediction of local protein structure using a two-layer support vector machine approach. J. Chem. Inf. Model. 48, 1903‒1908.
Wuyun Q., Zheng W., Peng Z., Yang J. (2018) A large-scale comparative assessment of methods for residue-residue contact prediction. Brief. Bioinform. 19, 219‒230.
Zhang J., Kurgan L. (2018) Review and comparative assessment of sequence-based predictors of protein-binding residues. Brief Bioinform. 19, 821‒837.
Min S., Lee B., Yoon S. (2017) Deep learning in bioinformatics. Brief. Bioinform. 18, 851‒869.
Hu H.J., Pan Y., Harrison R., Tai P.C. (2004) Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier. IEEE Trans Nanobioscience. 3, 265‒271.
Yoo P.D., Sikder A.R., Zhou B.B., Zomaya A.Y. (2008) Improved general regression network for protein domain boundary prediction. BMC Bioinformatics. 9(Suppl. 1), S12.
Miyazawa S., Jernigan R.L. (1999) Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins. 34, 49‒68.
Lin K., May A.C., Taylor W.R. (2002) Amino acid encoding schemes from protein structure alignments: multi-dimensional vectors to describe residue types. J. Theor. Biol. 216, 361‒365.
Asgari E., Mofrad M.R. (2015) Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One. 10, e0141287.
Jing X., Dong Q., Hong D., Lu R. (2020) Amino acid encoding methods for protein sequences: a comprehensive review and assessment. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1918‒1931.
Kawashima S., Pokarowski P., Pokarowska M., Kolinski A., Katayama T., Kanehisa M. (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202‒205.
Milchevskaya V., Nikitin A.M., Lukshin S.A., Filatov I.V., Kravatsky Y.V., Tumanyan V.G., Esipova N.G., Milchevskiy Y.V. (2021) Structural coordinates: a novel approach to predict protein backbone conformation. PLoS One. 16, e0239793.
Taha K., Yoo P.D. (2015) Predicting protein function from biomedical text. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2015, 3275‒3278.
Dayhoff M.O. (1972) Atlas of protein sequence and structure. Silver Spring, Md.: National Biomedical Research Foundation.
de Brevern A.G., Etchebest C., Hazout S. (2000) Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins. 41, 271‒287.
Kabsch W., Sander C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 22, 2577‒2637.
Hocking R.R. (1983) Developments in linear regression methodology: 1959‒1982. Technometrics. 25, 219‒223.
Ralston A., Wilf H.S., Enslein K. (1960) Mathematical methods for digital computers. New York: Wiley.
Wertz D.H., Scheraga H.A. (1978) Influence of water on protein structure. An analysis of the preferences of amino acid residues for the inside or outside and for specific conformations in a protein molecule. Macromolecules. 11, 9‒15.
Wang G., Dunbrack R.L., Jr. (2005) PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 33, W94‒98.
Cuff J.A., Barton G.J. (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins. 34, 508‒519.
Rout S.B., Mishra S., Sahoo S.K. (2021) Q3 Accuracy and SOV measure analysis of application of GA in protein secondary structure prediction. Revue d’Intelligence Artificielle. 35, 403‒408.
Yang Y., Heffernan R., Paliwal K., Lyons J., Dehzangi A., Sharma A., Wang J., Sattar A., Zhou Y. (2017) SPIDER2: a package to predict secondary structure, accessible surface area, and main-chain torsional angles by deep neural networks. Methods Mol. Biol. 1484, 55‒63.
Drozdetskiy A., Cole C., Procter J., Barton G.J. (2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Res. 43, W389‒394.
Xie S., Li Z., Hu H. (2018) Protein secondary structure prediction based on the fuzzy support vector machine with the hyperplane optimization. Gene. 642, 74‒83.
Magnan C.N., Baldi P. (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics. 30, 2592‒2597.
Ma Y., Liu Y., Cheng J. (2018) Protein secondary structure prediction based on data partition and semi-random subspace method. Sci. Rep. 8, 9856.
Guo Z., Hou J., Cheng J. (2021) DNSS2: improved ab initio protein secondary structure prediction using advanced deep learning architectures. Proteins. 89, 207‒217.
Wang S., Peng J., Ma J., Xu J. (2016) Protein secondary structure prediction using deep convolutional neural fields. Sci. Rep. 6, 18962.
Zhang B., Li J., Lu Q. (2018) Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinformatics. 19, 293.
Krieger S., Kececioglu J. (2020) Boosting the accuracy of protein secondary structure prediction through nearest neighbor search and method hybridization. Bioinformatics. 36, i317‒i325.