Nnessy: nearest-neighbor-based prediction of protein secondary structure without searching for homology

Spencer Krieger and John Kececioglu
March, 2020

Overview

Protein secondary structure prediction is a fundamental precursor to many bioinformatics tasks. Nearly all state-of-the-art tools do not explicitly leverage the vast number of proteins whose structure is known when computing their predictions. Leveraging this additional information in a so-called template-based method has the potential to significantly boost prediction accuracy.

Methods

We provide source code for accuracy estimators for several state-of-the-art protein secondary structure prediction methods, and for a hybrid ensemble of Nnessy and Porter that surpasses the accuracy of any single tool on standard benchmark datasets. We also provide the evaluation datasets referenced in our paper.

Nearest neighbor search

Nnessy takes as input an amino acid sequence and a preprocessed template database of proteins with known secondary structure. It splits the input sequence into fixed-length words and finds the nearest neighbors of each word in the template database. Nnessy then combines these words in an overlapping-words procedure, in which the secondary structure states of the template nearest neighbors influence the state probabilities of the input residues they overlap.
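
The overlapping-words idea can be illustrated with a minimal sketch. This is not Nnessy's implementation: the Hamming distance, the brute-force search, the 1/(1+distance) vote weight, and the names words, nearest_neighbors, and state_probabilities are all assumptions chosen to keep the example short, and the template data is made up.

```python
from collections import defaultdict

STATES = "HEC"  # assumed 3-state alphabet: helix, strand, coil

def words(seq, k):
    """Return every overlapping length-k word of seq with its start index."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def hamming(a, b):
    """Number of mismatching positions (a stand-in for the real distance)."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbors(word, template_words, n):
    """Brute-force search for the n template words closest to `word`."""
    return sorted(template_words, key=lambda t: hamming(word, t[0]))[:n]

def state_probabilities(seq, templates, k=3, n=2):
    """Estimate per-residue state probabilities from overlapping neighbor words."""
    # Pair every template word with the known states of its residues.
    template_words = []
    for amino_acids, structure in templates:
        for i, w in words(amino_acids, k):
            template_words.append((w, structure[i:i + k]))
    # Each neighbor votes for its states at the residues the query word covers.
    votes = [defaultdict(float) for _ in seq]
    for start, w in words(seq, k):
        for t_word, t_states in nearest_neighbors(w, template_words, n):
            weight = 1.0 / (1.0 + hamming(w, t_word))  # assumed weighting
            for offset, state in enumerate(t_states):
                votes[start + offset][state] += weight
    # Normalize votes to probabilities; fall back to uniform if no votes.
    probs = []
    for v in votes:
        total = sum(v.values())
        if total == 0:
            probs.append({s: 1.0 / len(STATES) for s in STATES})
        else:
            probs.append({s: v[s] / total for s in STATES})
    return probs

# Made-up toy template database: (amino acid sequence, known structure).
templates = [("AAAAKK", "HHHHCC"), ("VVVVGG", "EEEECC")]
for residue, p in zip("AAVV", state_probabilities("AAVV", templates)):
    print(residue, p)
```

Because consecutive words overlap, each residue receives votes from several words, which is what lets neighboring context shape its state probabilities.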

Maximum-likelihood approach

The state probabilities are passed to a maximum-likelihood approach which, instead of greedily choosing the secondary structure state of highest probability at each position, finds the physically valid secondary structure of maximum likelihood by dynamic programming. This approach takes into account the state probabilities calculated by the overlapping-words procedure; transition probabilities, or the likelihood that one state transitions to another; and run-length probabilities, or the likelihood that a run of a specific length occurs in nature. It also accounts for minimum run lengths of secondary structure states, e.g. alpha helices must be at least 5 residues long.
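
A dynamic program of this kind can be sketched as a segment-based recurrence over runs of states. The sketch below is an illustration under assumptions, not Nnessy's code: the function name most_likely_structure, the dictionary-based inputs, and the default run-length probability of 1.0 are hypothetical, and the probabilities in the usage example are made up.

```python
import math

def _log(x):
    """Logarithm that maps non-positive probabilities to -infinity."""
    return math.log(x) if x > 0 else float("-inf")

def most_likely_structure(probs, min_run, trans, run_len_prob=lambda s, n: 1.0):
    """Return the maximum-likelihood state string that respects minimum run lengths.

    probs: list of {state: probability}, one dict per residue.
    min_run: {state: minimum run length}.
    trans: {(previous_state, state): probability} for previous_state != state.
    run_len_prob(state, length): probability of a run of that length.
    Returns None if no valid labeling exists.
    """
    states = list(min_run)
    n = len(probs)
    neg_inf = float("-inf")
    # prefix[s][j] = sum of log P(residue p has state s) over p < j.
    prefix = {s: [0.0] for s in states}
    for p in probs:
        for s in states:
            prefix[s].append(prefix[s][-1] + math.log(max(p[s], 1e-12)))
    # best[j][s] = best log-likelihood of residues [0, j) whose last run has state s.
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            # The last run covers residues [i, j) and must be long enough.
            for i in range(0, j - min_run[s] + 1):
                run = prefix[s][j] - prefix[s][i] + _log(run_len_prob(s, j - i))
                if i == 0:
                    candidates = [(run, None)]
                else:
                    candidates = [(best[i][t] + _log(trans[(t, s)]) + run, t)
                                  for t in states
                                  if t != s and best[i][t] > neg_inf]
                for score, t in candidates:
                    if score > best[j][s]:
                        best[j][s] = score
                        back[j][s] = (i, t)
    # Trace back from the best final state, emitting one run at a time.
    s = max(states, key=lambda x: best[n][x])
    if best[n][s] == neg_inf:
        return None
    runs, j = [], n
    while j > 0:
        i, t = back[j][s]
        runs.append(s * (j - i))
        j, s = i, t
    return "".join(reversed(runs))

# Made-up example: greedy labeling would give the invalid "CHCC" (helix run of 1).
probs = [{"H": 0.1, "C": 0.9}, {"H": 0.6, "C": 0.4},
         {"H": 0.1, "C": 0.9}, {"H": 0.1, "C": 0.9}]
trans = {("H", "C"): 0.5, ("C", "H"): 0.5}
print(most_likely_structure(probs, {"H": 3, "C": 1}, trans))
```

Working over whole runs rather than single positions is one way to make minimum run lengths and run-length probabilities part of the recurrence itself, so invalid structures are never scored.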

Accuracy estimation

Nnessy also includes an accuracy estimator, which uses an output of the prediction process to estimate the unknown true accuracy of its prediction. We map the average distance to the nearest neighbors to an estimated accuracy, which is output with the prediction.
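
One simple way to map a distance to an estimated accuracy is to calibrate on proteins whose true accuracy is known and interpolate between calibration points. The sketch below assumes that approach; the source does not specify the form of the mapping, so the binning, the piecewise-linear interpolation, and the names fit_calibration and estimate_accuracy are hypothetical, and the numbers are made up.

```python
from bisect import bisect_left

def fit_calibration(distances, accuracies, n_bins=4):
    """Bin training proteins by average neighbor distance.

    Returns one (mean distance, mean accuracy) point per bin, sorted by distance.
    """
    pairs = sorted(zip(distances, accuracies))
    size = max(1, len(pairs) // n_bins)
    points = []
    for i in range(0, len(pairs), size):
        chunk = pairs[i:i + size]
        points.append((sum(d for d, _ in chunk) / len(chunk),
                       sum(a for _, a in chunk) / len(chunk)))
    return points

def estimate_accuracy(avg_distance, points):
    """Interpolate linearly between calibration points, clamping at the ends."""
    xs = [d for d, _ in points]
    if avg_distance <= xs[0]:
        return points[0][1]
    if avg_distance >= xs[-1]:
        return points[-1][1]
    hi = bisect_left(xs, avg_distance)  # first point at or beyond avg_distance
    (x0, y0), (x1, y1) = points[hi - 1], points[hi]
    return y0 + (y1 - y0) * (avg_distance - x0) / (x1 - x0)

# Made-up calibration data: larger distances go with lower accuracy.
points = fit_calibration([1.0, 2.0, 3.0, 4.0], [0.9, 0.8, 0.7, 0.6], n_bins=2)
print(estimate_accuracy(2.5, points))
```

The intuition is that a query whose words have close neighbors in the template database is more likely to be predicted well, so a smaller average distance maps to a higher estimated accuracy.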

Results

On PDB datasets, which are designed to evaluate Nnessy's performance on proteins that are added to the Protein Data Bank each year, Nnessy exceeds the state-of-the-art accuracy by 1% for 3-state and 8% for 8-state prediction, achieving 85.7% and 82.4% accuracy, respectively. On CASP datasets, which are designed to be difficult for template-based methods, Nnessy does not exceed the state of the art.

Publication

The methods implemented in Nnessy are described in the following publication, which should be cited in any noteworthy use of Nnessy:

Spencer Krieger and John Kececioglu, “Boosting the accuracy of protein secondary structure prediction through nearest neighbor search and method hybridization”, Proceedings of the 28th Conference on Intelligent Systems for Molecular Biology (ISMB 2020).

Source code

Source code for Nnessy along with documentation is available on GitHub.

Videos

The following video was presented at ISMB 2020 and gives more detailed information on Nnessy:

A shorter version was presented at SCS 2020: