SVM-HUSTLE—an iterative semi-supervised machine learning approach for pairwise protein remote homology detection

Anuj R Shah, Christopher S Oehmen, Bobbie-Jo Webb-Robertson
Bioinformatics 2008 March 15, 24 (6): 783-90

MOTIVATION: As the amount of biological sequence data continues to grow exponentially we face the increasing challenge of assigning function to this enormous molecular 'parts list'. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence) which often are a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect homology using semi-supervised iterative learning (SVM-HUSTLE) that identifies significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets, recruits additional sequences and assigns a statistical measure of homology between a pair of sequences. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers to iteratively detect and refine, on-the-fly, patterns indicating homology.

RESULTS: When compared against existing methods for identifying protein homologs (BLAST, PSI-BLAST, COMPASS, PROF_SIM, RANKPROP and their variants) on two different benchmark datasets SVM-HUSTLE significantly outperforms each of the above methods using the most stringent ROC(1) statistic with P-values less than 1e-20. SVM-HUSTLE also yields results comparable to HHSearch but at a substantially reduced computational cost since we do not require the construction of HMMs.

AVAILABILITY: The software executable to run SVM-HUSTLE can be downloaded from

Full Text Links

Find Full Text Links for this Article


You are not logged in. Sign Up or Log In to join the discussion.

Related Papers

Remove bar
Read by QxMD icon Read

Save your favorite articles in one place with a free QxMD account.


Search Tips

Use Boolean operators: AND/OR

diabetic AND foot
diabetes OR diabetic

Exclude a word using the 'minus' sign

Virchow -triad

Use Parentheses

water AND (cup OR glass)

Add an asterisk (*) at end of a word to include word stems

Neuro* will search for Neurology, Neuroscientist, Neurological, and so on

Use quotes to search for an exact phrase

"primary prevention of cancer"
(heart or cardiac or cardio*) AND arrest -"American Heart Association"