[Source Code on GitHub]
Background: Protein secondary structure prediction is a fundamental precursor to many bioinformatics tasks. Nearly all state-of-the-art prediction tools do not explicitly leverage the vast number of proteins whose structure is already known. Using this additional information in a so-called template-based method has the potential to significantly boost prediction accuracy.
Nearest neighbor search: Nnessy takes as input the amino acid sequence of a protein and a preprocessed template database containing proteins with known secondary structure. It splits the input sequence into fixed-length words and finds the nearest neighbors of these words in the template database. Nnessy then combines these words in an overlapping words procedure, where the secondary structure states of the template nearest neighbors influence the state probabilities of the input residues they overlap.
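The overlapping words idea can be illustrated with a minimal sketch. This is not Nnessy's implementation: the Hamming distance, the brute-force search, the 1/(1 + distance) weighting, and the names `state_probabilities`, `k`, and `neighbors` are all assumptions chosen to keep the example short.

```python
from collections import defaultdict

STATES = "HEC"  # helix, strand, coil (3-state alphabet)

def hamming(a, b):
    """Stand-in word distance: number of mismatched positions."""
    return sum(x != y for x, y in zip(a, b))

def template_words(templates, k):
    """Enumerate every length-k word of each (sequence, structure) template."""
    for seq, ss in templates:
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], ss[i:i + k]

def state_probabilities(query, templates, k=5, neighbors=2):
    """Estimate per-residue state probabilities from overlapping words."""
    words = list(template_words(templates, k))
    votes = [defaultdict(float) for _ in query]
    for start in range(len(query) - k + 1):
        word = query[start:start + k]
        # Brute-force nearest neighbor search (assumption: a real tool
        # would use an index over the preprocessed template database).
        nearest = sorted(words, key=lambda w: hamming(word, w[0]))[:neighbors]
        for t_word, t_ss in nearest:
            weight = 1.0 / (1.0 + hamming(word, t_word))  # closer neighbors count more
            for offset, state in enumerate(t_ss):
                votes[start + offset][state] += weight
    probs = []
    for v in votes:
        total = sum(v.values())
        # Residues that received no votes fall back to a uniform distribution.
        probs.append({s: (v[s] / total if total else 1.0 / len(STATES)) for s in STATES})
    return probs
```

Each residue is covered by up to `k` query words, so its probabilities pool the votes of every neighbor that overlaps it, which is the effect the procedure above describes.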
Maximum-likelihood approach: The state probabilities are passed to a maximum-likelihood approach which, instead of greedily choosing the highest-probability secondary structure state at each position, finds the physically valid secondary structure of maximum likelihood by dynamic programming. This approach takes into account the state probabilities calculated by the overlapping words procedure; transition probabilities, or the likelihood that one state transitions to another state; and run-length probabilities, or the likelihood that a run of a specific length occurs in nature. It also takes into account minimum run lengths of secondary structure states, e.g. alpha helices must be at least 5 residues long.
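One way to picture such a dynamic program is a segmental Viterbi recurrence over runs rather than single residues. The sketch below is an illustration under stated assumptions, not Nnessy's algorithm: the function name `most_likely_structure`, its parameters, the single-character state labels, and the default of ignoring transition and run-length terms when they are not supplied are all hypothetical.

```python
import math

def most_likely_structure(probs, min_len, log_trans=None, log_run=None):
    """Segmental Viterbi: best labeling made of runs that respect min_len.

    probs     : list of {state: probability}, one dict per residue
    min_len   : {state: minimum run length}, states are single characters
    log_trans : optional {(prev, state): log-probability}, 0 if absent
    log_run   : optional function (state, length) -> log-probability
    """
    n, states = len(probs), list(min_len)
    if n == 0:
        return ""
    tiny = 1e-12  # avoids log(0)
    # prefix[s][j] = sum of log P(state s) over residues 0..j-1
    prefix = {s: [0.0] for s in states}
    for p in probs:
        for s in states:
            prefix[s].append(prefix[s][-1] + math.log(max(p.get(s, 0.0), tiny)))
    neg_inf = float("-inf")
    # best[j][s] = (score, run start, previous state) for a run of s ending at j
    best = [{s: (neg_inf, None, None) for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            # Only run starts i that leave at least min_len[s] residues.
            for i in range(0, j - min_len[s] + 1):
                run = prefix[s][j] - prefix[s][i]
                if log_run:
                    run += log_run(s, j - i)
                if i == 0:
                    candidates = [(run, 0, None)]
                else:
                    candidates = [
                        (best[i][t][0] + (log_trans or {}).get((t, s), 0.0) + run, i, t)
                        for t in states if t != s and best[i][t][0] > neg_inf
                    ]
                for cand in candidates:
                    if cand[0] > best[j][s][0]:
                        best[j][s] = cand
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s][0])
    if best[n][state][0] == neg_inf:
        return None  # no valid structure, e.g. sequence shorter than every minimum
    labels, j = [], n
    while j > 0:
        _, i, prev = best[j][state]
        labels.append(state * (j - i))
        j, state = i, prev
    return "".join(reversed(labels))
```

With a minimum helix length of 5, two residues that weakly favor helix followed by five that strongly favor coil are labeled as coil throughout, whereas a greedy per-position choice would emit an invalid two-residue helix. This quadratic-time version favors readability over speed.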
Accuracy estimation: Nnessy also includes an accuracy estimator, which uses an output of the prediction process to estimate the unknown true accuracy of its prediction. We take the average distance to the nearest neighbors and map it to an estimated accuracy, which is output with the prediction.
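How the distance is mapped to an accuracy is not specified here. As one plausible illustration, the sketch below interpolates linearly over calibration points measured on proteins whose true structure is known. The function `estimate_accuracy`, the piecewise-linear mapping, and the calibration values in the usage lines are assumptions made up for the example, not figures from Nnessy.

```python
from bisect import bisect_left

def estimate_accuracy(avg_distance, calibration):
    """Map an average nearest-neighbor distance to an estimated accuracy.

    calibration: list of (distance, accuracy) pairs, sorted by distance
    with distinct distances (hypothetical calibration data).
    """
    xs = [d for d, _ in calibration]
    ys = [a for _, a in calibration]
    # Clamp to the calibrated range instead of extrapolating.
    if avg_distance <= xs[0]:
        return ys[0]
    if avg_distance >= xs[-1]:
        return ys[-1]
    hi = bisect_left(xs, avg_distance)  # first point at or beyond the query
    lo = hi - 1
    frac = (avg_distance - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])

# Made-up calibration points: larger distances mean weaker templates.
calibration = [(0.0, 0.95), (1.0, 0.85), (2.0, 0.70)]
estimate = estimate_accuracy(0.5, calibration)
```

The design intuition is that a query whose words sit close to their template neighbors is likely to be predicted well, so the estimate decreases as the average distance grows.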
Results: On PDB datasets, which are designed to evaluate Nnessy's performance on proteins that are added to the Protein Data Bank each year, Nnessy exceeds state-of-the-art accuracy by 1% for 3-state and 8% for 8-state prediction, achieving 85.7% and 82.4% accuracy, respectively. On CASP datasets, which are designed to be difficult for template-based methods, Nnessy does not exceed the state of the art.
A hybrid approach that combines Nnessy with the template-free tool Porter exceeds the state of the art by nearly 4% for 3-state and 8% for 8-state prediction on all CASP and PDB datasets. This approach is available at http://ssylla.cs.arizona.edu
Citation: Noteworthy uses of Nnessy should cite the following publication: Spencer Krieger and John Kececioglu, “Boosting the accuracy of protein secondary structure prediction through nearest neighbor search and method hybridization”, Proceedings of the 28th Conference on Intelligent Systems for Molecular Biology (ISMB 2020).
Funding: Research supported by the US National Science Foundation through grants IIS-1217886 and CCF-1617192.