Experimental results

Spencer Krieger and John Kececioglu
March, 2020

Datasets

Our template databases are drawn from proteins in the Protein Data Bank (PDB). For comparison with state-of-the-art methods, we use the benchmark CASP datasets. Our evaluation used all 129 proteins of the CASP10 dataset, all 105 proteins of the CASP11 dataset, all 55 proteins of the CASP12 dataset, and all 49 proteins of the CASP13 dataset. The yearly datasets, called PDB2014, ..., PDB2019, each contain a random subset of the proteins deposited into PDB in that year, except PDB2019, which contains all proteins deposited between January 1, 2019 and May 15, 2019.

Accuracy

Hybridization combines the strengths of template-based and non-template-based tools. Template-based methods predict accurately when a close template match is found, while non-template-based tools predict consistently well and generalize to new data. With hybridization, we achieve a more robust, higher-accuracy tool than either tool alone. Nnessy is the template-based tool in every hybrid, because the hybrid approach needs an accuracy estimator, which other template-based tools lack. The best hybrid method outperforms all other tools by more than 3% on every evaluation dataset, and by more than 10% on CASP13 and PDB2019. Hybridization also raises the accuracy of Nnessy substantially, by around 15% on average for the CASP datasets.
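To illustrate why an accuracy estimator is needed, the following is a minimal sketch of one way a hybrid could choose between two predictions. It is an assumption for illustration, not the method used in our experiments: the per-protein selection rule, the threshold value, and the names Prediction and choose_prediction are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """A secondary-structure prediction for one protein (hypothetical type)."""
    states: str                                  # one state symbol per residue
    estimated_accuracy: Optional[float] = None   # None if the tool has no estimator


def choose_prediction(template_based: Prediction,
                      non_template_based: Prediction,
                      threshold: float = 0.85) -> Prediction:
    """Return the template-based prediction when its estimated accuracy is high.

    Assumption: selection is made per protein against a fixed threshold.
    The value 0.85 is a placeholder, not a tuned or reported parameter.
    """
    if template_based.estimated_accuracy is None:
        # Without an estimate there is no basis for choosing between tools.
        raise ValueError("template-based tool must supply an accuracy estimate")
    if template_based.estimated_accuracy >= threshold:
        return template_based
    return non_template_based


# Example: a confident template match is kept; a weak one falls back.
confident = Prediction("HHHHEEEECC", estimated_accuracy=0.93)
weak = Prediction("HHHHHHHHCC", estimated_accuracy=0.61)
fallback = Prediction("CHHHEEEECC")

print(choose_prediction(confident, fallback).states)
print(choose_prediction(weak, fallback).states)
```

Under this rule, a template-based tool that cannot estimate its own accuracy gives the hybrid no way to decide when to trust it.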

Running time

The core method of Nnessy circumvents a traditional homology search, offering a speedup over a PSI-BLAST search of large protein sequence databases. We measured the processing time for proteins in the CASP datasets. Nnessy takes 1.4 seconds per residue, or about 5 minutes on average to process a 200-residue protein. PSI-BLAST over Uniref50 (used by SSpro) takes 2.7 seconds per residue, or about 9 minutes on average for a 200-residue protein. PSI-BLAST over Uniref90 (used by PSIPRED) takes 9.6 seconds per residue, or 32 minutes for a 200-residue protein. By these measurements, our current implementation is nearly twice as fast as SSpro, and over six times faster than PSIPRED. The nearest neighbor searches of Nnessy are readily parallelized by distributing queries to separate jobs.
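The per-protein times and speedups above follow directly from the per-residue rates. The short sketch below reproduces that arithmetic; the rates are taken from the text, while the function and dictionary names are hypothetical and chosen only for illustration.

```python
# Per-residue processing times in seconds, as reported in the text.
SECONDS_PER_RESIDUE = {
    "Nnessy": 1.4,
    "PSI-BLAST over Uniref50 (SSpro)": 2.7,
    "PSI-BLAST over Uniref90 (PSIPRED)": 9.6,
}


def minutes_for_protein(seconds_per_residue: float, length: int = 200) -> float:
    """Total processing time in minutes for a protein of the given length."""
    return seconds_per_residue * length / 60.0


def speedup_of_nnessy_over(baseline: str) -> float:
    """Ratio of the baseline's per-residue time to that of Nnessy."""
    return SECONDS_PER_RESIDUE[baseline] / SECONDS_PER_RESIDUE["Nnessy"]


for name, rate in SECONDS_PER_RESIDUE.items():
    print(f"{name}: {minutes_for_protein(rate):.1f} minutes per 200-residue protein")

for name in SECONDS_PER_RESIDUE:
    if name != "Nnessy":
        print(f"Nnessy speedup over {name}: {speedup_of_nnessy_over(name):.2f}x")
```

The Nnessy figure works out to 280 seconds, which the text rounds to 5 minutes, and the two ratios correspond to the "nearly twice" and "over six times" comparisons.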

Source code

Source code for Nnessy along with documentation is available on GitHub.