Development of an Enhanced Fold Recognition Ensemble System for Protein Structure Prediction

Riccardo Matjaž Bennett-Lovsey

September 2006

A dissertation submitted for the degree of Doctor of Philosophy of the University of London and for the diploma of Imperial College of Science, Technology and Medicine

Structural Bioinformatics Group
Faculty of Natural Sciences
Division of Molecular Biosciences
Imperial College London
South Kensington, London SW7 2AZ

Abstract

Recognition of protein homology is an effective method for assigning a putative function to an uncharacterised protein or gene. Recent trends show that Meta servers (ensemble recognition systems) have superior performance when compared to individual recognition methods; however, no systematic analysis has been performed to show why these systems perform so well and how they can be further improved. This thesis describes the complete systematic development and benchmarking of an ensemble system for protein fold recognition, and examines the reasons behind the resultant improvements in performance. A software suite for protein sequence analysis was developed to carry out a wide variety of fold recognition methods, searching a database of known structures. These methods include profile-profile, secondary structure, and structure-specific gapped alignment algorithms. Each of these methods was optimised and tested using stringently selected protein data sets consisting of disparate subsets of the Structural Classification of Proteins (SCOP) database. An analysis of the different methodologies used for the selection and combination of the fold recognition classifiers is presented, together with a benchmark of the effect of ensemble systems on recognition accuracy. Data from testing protein fold and superfamily clustering, structural model clustering, Bagging, Boosting, and support vector machines under various conditions are also discussed. A thorough benchmarking procedure is applied to optimised ensemble systems: in comparison to the improvement that the single best fold recognition algorithm has over PSI-BLAST, the best performing ensemble identifies 29.6% more correct homologous query-template relationships, and correctly annotates 46.2% more queries, at 95% precision or higher. Analyses show that this increase in recognition accuracy is largely due to the `noise filtering' effect of using multiple recognition algorithms in a consensus approach, i.e. there is a greater likelihood of being consistently right than being consistently wrong.

Contents

1  Introduction . . . 16
   1.1  Summary . . . 16
   1.2  Motivations behind Protein Structure Prediction . . . 17
   1.3  Homology and Analogy . . . 18
        1.3.1  Structure-based Classification Of Proteins -- SCOP . . . 20
   1.4  Sequence-based Recognition Methods . . . 23
        1.4.1  Homology Modelling . . . 23
        1.4.2  Dynamic Programming . . . 24
        1.4.3  Substitution Matrices . . . 30
        1.4.4  The FASTA and BLAST Heuristic Methods . . . 34
        1.4.5  Alignment Statistics . . . 39
        1.4.6  Position Specific Scoring Matrices and Profiles . . . 40
        1.4.7  The PSI-BLAST Heuristic Method . . . 42
        1.4.8  Hidden Markov Models . . . 43
   1.5  Fold Recognition and Threading . . . 48
        1.5.1  Threading . . . 49
        1.5.2  Fold Recognition Using Profiles . . . 51
        1.5.3  Critical Assessment of Techniques for Protein Structure Prediction (CASP) -- The Development of Fold Recognition . . . 54
        1.5.4  Critical Assessment of Fully Automated Structure Prediction -- CAFASP . . . 58
        1.5.5  The `3D-PSSM' Server . . . 59
   1.6  CASP5 -- Fold Recognition with Ensemble Systems . . . 63
        1.6.1  Meta Servers . . . 63
        1.6.2  Evaluation of Fold Recognition Predictions . . . 65
        1.6.3  The State-of-the-Art -- the Results of CASP5 for Fold Recognition . . . 72
   1.7  Ensemble Theory -- Links to Fold Recognition . . . 83
        1.7.1  Protein Fold Recognition Ensembles . . . 84
        1.7.2  Ensemble Notation . . . 89
        1.7.3  Bagging . . . 90
        1.7.4  Boosting -- AdaBoost . . . 91
        1.7.5  Support Vector Machines -- SVMs . . . 93
   1.8  Scope and Outline of this Thesis . . . 96

2  Development of `Phyre' . . . 99
   2.1  Summary . . . 99
   2.2  Aims and Objectives . . . 100
   2.3  Overview of `Phyre' Development . . . 101
        2.3.1  Designing the Assessment . . . 101
        2.3.2  `Dynamic' . . . 103
        2.3.3  Analysis of Individual Recognition Algorithms . . . 104
        2.3.4  Ensemble Analysis . . . 104
        2.3.5  The Final `Phyre' System . . . 106
   2.4  Fold Library and Data Sets . . . 107
        2.4.1  Fold Library . . . 109
        2.4.2  Building Training and Testing Data . . . 113
   2.5  Benchmarking Assessment . . . 116
        2.5.1  Average Precision -- Metric of Recognition Quality . . . 117
        2.5.2  Simplex Method for Function Minimisation . . . 119
        2.5.3  Empirical Precision -- Standardised Scoring Framework . . . 120

3  Assessment and Optimisation of Recognition Algorithms Using `Dynamic' . . . 126
   3.1  Summary . . . 126
   3.2  Introduction . . . 127
   3.3  `Dynamic' Data Processing . . . 128
        3.3.1  Parameters . . . 128
        3.3.2  Structure Specific Gaps . . . 129
        3.3.3  Probability and Log-odds Score Extrapolation . . . 129
        3.3.4  Calculating E-Values . . . 131
   3.4  Alignment Algorithms . . . 132
        3.4.1  Sequence-Sequence Comparison . . . 132
        3.4.2  Profile-Sequence and Sequence-Profile Comparison . . . 133
        3.4.3  Profile-Profile Comparisons . . . 133
   3.5  Benchmarking Results . . . 142
        3.5.1  Methods 001 to 006 (sequence-sequence and sequence-profile methods) . . . 154
        3.5.2  Methods 007 to 021 (profile-profile methods) . . . 156
        3.5.3  Methods 022 to 031 (profile-profile with secondary structure) . . . 159
        3.5.4  ROC Analysis of Methods . . . 162
   3.6  Benchmarking Discussion . . . 164

4  Development and Optimisation of an Enhanced Fold Recognition Ensemble . . . 170
   4.1  Summary . . . 170
   4.2  Introduction . . . 171
        4.2.1  State-of-the-Art Ensembles . . . 172
        4.2.2  A New Approach to Fold Recognition Ensembles . . . 173
   4.3  Bagging and Boosting Ensembles . . . 173
        4.3.1  Bagging Benchmarking . . . 174
        4.3.2  Boosting Benchmarking . . . 175
        4.3.3  Discussion . . . 179
   4.4  CASP-like Training and Testing . . . 182
   4.5  Support Vector Machine Clustering . . . 185
        4.5.1  SVM 1 . . . 188
        4.5.2  SVM 2 . . . 189
        4.5.3  SVM 3 and 4 . . . 192
        4.5.4  Results and Discussion . . . 194
   4.6  3D-JURY and 3D-COLONY Clustering . . . 196
        4.6.1  3D-JURY . . . 196
        4.6.2  3D-COLONY . . . 197
        4.6.3  Constructing Ensembles . . . 199
        4.6.4  Results and Discussion . . . 202
   4.7  Final Ensemble Analysis . . . 205
        4.7.1  Results and Discussion . . . 205

5  Discussion . . . 217
   5.1  Summary . . . 217
   5.2  Aims and Objectives . . . 217
   5.3  Final Analysis . . . 218
        5.3.1  Improved Precision . . . 218
        5.3.2  Improved Model Quality . . . 219
        5.3.3  Sequence/Structural Features of Remote Homologies . . . 221

6  Conclusions . . . 223
   6.1  Summary . . . 223
   6.2  Outlook . . . 225
   6.3  Overall . . . 227
   6.4  Server Availability . . . 227

Acknowledgements . . . 228

Appendices . . . 230

A  Alignment Statistics . . . 231
   A.1  Derivation of E-values . . . 231
   A.2  Maximum Likelihood Fitting of Extreme Value Distributions . . . 236
   A.3  Fitting Censored Data to Extreme Value Distributions . . . 240

B  Constructing a Profile with PSI-BLAST . . . 243
   B.1  Multiple Alignment Construction . . . 243
   B.2  Sequence Weights . . . 244
   B.3  Target Frequency Estimation . . . 245

C  Training and Testing Sets . . . 247
   C.1  Training Set . . . 247
   C.2  Testing Set . . . 254

References . . . 258

List of Figures

1.1   Sequence-to-structure-to-function paradigm . . . 19
1.2   The SCOP architecture . . . 22
1.3   Diagram showing the repeated step in the calculation of a dynamic programming alignment matrix . . . 26
1.4   An example of a global dynamic programming alignment matrix . . . 27
1.5   An example of a local dynamic programming alignment matrix . . . 29
1.6   The BLOSUM62 amino acid substitution matrix . . . 33
1.7   The FASTA method . . . 35
1.8   The BLAST method . . . 38
1.9   The PSI-BLAST method . . . 44
1.10  A two state hidden Markov model . . . 46
1.11  The structure of a Profile HMM . . . 47
1.12  Flow diagram of the `3D-PSSM' system . . . 62
1.13  Three fundamental reasons why an ensemble may work better than a single classifier . . . 87
1.14  A 2D example of a decision algorithm in an SVM . . . 94
1.15  An example of higher dimensional mapping by an SVM . . . 95
2.1   Protein structure prediction methods . . . 102
2.2   An illustration of a query-template comparison in the `Phyre' system . . . 108
2.3   An example illustrating the simplex search algorithm . . . 121
2.4   An example of training data labelled with actual precision values . . . 123
2.5   An example of testing data labelled with empirical precision values . . . 124
3.1   Dynamic programming alignment algorithms . . . 134
3.2   ROC50 analysis of all optimised `Dynamic' classifiers . . . 165
3.3   ROC100 analysis of all optimised `Dynamic' classifiers . . . 166
4.1   An illustrated example of complementarity between classifiers in an ensemble . . . 183
4.2   Illustration showing the construction of CASP-like data . . . 186
4.3   An illustration of CASP-like data . . . 187
4.4   Diagrammatic representation of the first fold recognition SVM . . . 190
4.5   Diagrammatic representation of the second fold recognition SVM . . . 193
4.6   Schematic illustration of 3D-JURY and 3D-COLONY ensembles . . . 200
4.7   Benchmarking recall at 95% precision using CASP-like 3D-JURY and 3D-COLONY ensembles . . . 213
4.8   Percentage recall found and queries correctly annotated at 95% precision for all 3D-JURY and 3D-COLONY ensembles . . . 214
4.9   McNemar's test analyses of full testing results from all SVM, 3D-JURY, and 3D-COLONY ensembles . . . 215
5.1   Histogram illustrating the difference between the true positives and false positives shared across the 10 algorithms in the best ensemble . . . 220
5.2   Histogram showing the RMSD of high confidence models produced by the best ensemble compared to those produced by the single best method . . . 221

List of Tables

1.1   Selected evaluation measures used to assess the quality of 3D models . . . 67
1.2   Top 20 predictors from CASP5, ranked by combined scores (all domains) . . . 74
2.1   Rossman and Rossman-like folds and superfamilies . . . 116
3.1   Profile-profile comparison algorithms in `Dynamic' . . . 135
3.2   Optimal search parameters and testing results for a variety of dynamic programming recognition methods . . . 143
3.3   Testing results for a variety of optimised dynamic programming recognition methods . . . 151
4.1   CASP-like SVM ensemble benchmarking results . . . 196
4.2   CASP-like 3D-JURY and 3D-COLONY ensemble benchmark results . . . 209
4.3   Ensemble and single-method benchmarking results using the full testing set . . . 216
C.1   Training set SCOP Unique Identifiers, fold names, and superfamily names . . . 247
C.2   Testing set SCOP Unique Identifiers, fold names, and superfamily names . . . 254

List of Algorithms

1   The Bagging Algorithm . . . 90
2   The AdaBoost Algorithm . . . 92

List of Abbreviations

1D-PSSM -- One dimensional PSSM.
2D-PSSM -- Two dimensional PSSM.
3D-COLONY -- A structural clustering algorithm designed to reflect structural entropy (not an abbreviation).
3D-GENOMICS -- A database for processing and administering large amounts of genomic and proteomic data (not an abbreviation).
3D-JIGSAW -- A comparative modelling server (not an abbreviation).
3D-JURY -- A structural clustering algorithm (not an abbreviation).
3D-PSSM -- Three dimensional PSSM.
`3D-PSSM' -- A threading server (not an abbreviation).
AdaBoost -- Adaptive Boosting.
AEROSPACI -- Aberrant Entry Re-Ordered SPACI.
AP -- Average Precision.
ASTRAL -- Sequence and structure database, supplement to SCOP (not an abbreviation).
BASIC -- Bilaterally Amplified Sequence Information Comparison.
B-DHIP -- Bi-Directional Heterogeneous Inner Product.
BLAST -- Basic Local Alignment Search Tool.
BLOCKS -- A database of sequence alignment blocks (not an abbreviation).
BLOSUM -- BLOCKS SUbstitution Matrix.
CAFASP -- Critical Assessment of Fully Automated Structure Prediction.
CASP -- Critical Assessment of techniques for protein Structure Prediction.
CATH -- Class (C), Architecture (A), Topology (T), and Homologous superfamily (H) (a structural classification of proteins).
CE -- Combinatorial Extension.
CM -- Comparative Modelling.
DALI -- Distance matrix ALIgnment.
DSSP -- Definition of Secondary Structure of Proteins.
EP -- Empirical Precision.
EsyPred3D -- A comparative modelling server (not an abbreviation).
EVD -- Extreme Value Distribution.
FASTA -- Fast Alignment Search Tool.
FN -- False Negative.
FP -- False Positive.
FR -- Fold Recognition.
FR(A) -- Fold Recognition of Analogues.
FR(H) -- Fold Recognition of Homologues.
GDT TS -- Global Distance Test Total Score.
HMM -- Hidden Markov Model.
HMMer -- Hidden Markov Model software package (not an abbreviation).
HSP -- High-scoring Segment Pair.
Indel -- An insertion or deletion (i.e. gap) within a sequence alignment.
InterPro -- Integrated resource of Protein families, domains and functional sites.
LGA -- Local-Global Alignment.
LG-score -- Levitt-Gerstein score.
MAMMOTH -- MAtching Molecular Models Obtained from THeory.
MaxSub -- A structural superposition program that calculates a variant of the LG-score (not an abbreviation).
MSP -- Maximal-scoring Segment Pair.
NF -- New Fold.
NMR -- Nuclear Magnetic Resonance.
PAM -- Point Accepted Mutation.
PDB -- Protein Data Bank.
PDF -- Probability Density Function.
`Phyre' -- Protein Homology/analogY Recognition Engine.
PMF -- Probability Mass Function.
PROCHECK -- A suite of programs to assess the stereochemical quality of a given protein structure (not an abbreviation).
PROF SIM -- A profile-profile alignment algorithm (not an abbreviation).
ProFit -- Protein least-squares Fitting.
PROSITE -- A protein sequence pattern database (not an abbreviation).
PSI-BLAST -- Position Specific Iterated BLAST.
PSIPRED -- Position Specific Iterated PREDiction.
PSSM -- Position Specific Scoring Matrix.
QT -- Query-Template.
RBF -- Radial Basis Function.
RMS -- Root Mean Square.
RMSD -- Root Mean Square Deviation.
ROC -- Receiver Operating Characteristic.
SAM-T2K -- Sequence Alignment and Modelling software (2000).
SAM-T99 -- Sequence Alignment and Modelling software (1999).
SCOP -- Structural Classification Of Proteins.
SMART -- Simple Modular Architecture Research Tool.
SOV O -- Segment Overlap Measure Observed score.
SPACI -- Summary PDB ASTRAL Check Index.
STRIDE -- Protein secondary structure assignment from atomic coordinates (not an abbreviation).
SVM -- Support Vector Machine.
SWISS-MODEL -- An automated comparative modelling server (not an abbreviation).
THREADER -- A threading algorithm (not an abbreviation).
TM -- Template Modelling.
TN -- True Negative.
TP -- True Positive.
VERIFY3D -- A tool designed to help in the refinement of crystallographic structures (not an abbreviation).
WHAT CHECK -- A suite of programs to assess the stereochemical quality of a given protein structure (not an abbreviation).


Chapter 1 Introduction

1.1 Summary

This chapter presents the background information necessary to describe the development of an ensemble system of fold recognition servers for protein structure prediction. § 1.2 gives a brief outline of the main motivations behind the field of protein structure prediction and why it is considered so important. § 1.3 explores the definitions of protein homology and analogy, and details how these concepts were used in the development of the SCOP database. § 1.4 examines the underlying principles and techniques used in sequence-only homology modelling methods, such as dynamic programming, the FASTA and BLAST procedures, sequence-based profiles, and Hidden Markov Models. § 1.5 charts the development of fold recognition/threading techniques and outlines the use of structural information to enhance protein structure prediction. This section also describes the development of the Critical Assessment of Techniques for Protein Structure Prediction (CASP) over the past decade and the success of the `3D-PSSM' server at CASP4. § 1.6 details the results of the CASP5 meeting in 2002, and the establishment of the Meta server technique as a powerful means for fold recognition. § 1.7 outlines the theories of computational ensembles that were used during this research as part of the development of an ensemble system for enhanced fold recognition. Finally, § 1.8 details the scope and overall structure of this thesis.

1.2  Motivations behind Protein Structure Prediction

Advances in DNA sequencing techniques during the 1990s and into the 21st century have provided the scientific community with a wealth of sequence information (International Human Genome Sequencing Consortium, 2001). Over the past several decades, the number of proteins successfully sequenced per annum has grown exponentially. Likewise, the number of new protein structures determined experimentally per annum has also grown exponentially. At the time of writing, the number of publicly available genomes was 525, including 100 eukaryotic genomes (data from Integrated Genomics Inc., http://wit.integratedgenomics.com/ERGO_supplement/genomes.html; Overbeek et al., 2003), which provided approximately 3 million different protein sequences. However, the number of known protein structures held in the Protein Data Bank (PDB; Berman et al., 2000) was 31,123: two orders of magnitude fewer. Since the ultimate aim of characterising a protein is to determine its structure and function, this bottleneck between sequence determination and structure determination has created a need for computational techniques capable of suggesting possible structures and functions for the products of newly sequenced genes (Durbin et al., 1998; Jaroszewski et al., 2002). The ability to predict the probable structure and function of a novel protein can greatly reduce the time expended in analysis by providing initial hypotheses for experimental tests.


1.3  Homology and Analogy

An effective method for annotating a newly sequenced protein is through inference from an homologous (i.e. related) protein (or family of proteins) of known structure and/or function (Shapiro & Harris, 2000). There are publicly available services that provide powerful tools for annotating new sequences, such as InterPro (Mulder et al., 2005). By definition, two genes or proteins that have diverged from a common ancestor are known as homologues. Homologues that occur as a result of gene duplication within a species are known as paralogues (e.g. the human immunoglobulin family), and homologues that occur in separate species (as a result of speciation) are known as orthologues (e.g. primate haemoglobins). Homologues share common folds and have varying degrees of secondary and tertiary structural similarity, depending on when they diverged from the common ancestor (Chothia & Lesk, 1986). Therefore, if one can reliably predict that a newly sequenced gene product is related to a well characterised protein, then this may be a helpful step in determining its potential structure and function. This method for annotating a newly sequenced protein or gene product is sometimes referred to as the sequence-to-structure-to-function paradigm (Fetrow et al., 1998) (see Figure 1.1, page 19).

Identifying amino acid sequence similarity between proteins is often the quickest and most reliable way of determining if they are related. By matching a query sequence to the sequence of a closely related annotated template protein, it is usually possible to build an accurate structural model of the query and assign a hypothetical function to it. However, it must be acknowledged that, since the biological functions of proteins can diverge during evolution, more remotely related proteins may have very different functions (Hegyi & Gerstein, 1999; Devos & Valencia, 2000).

Another problem that arises as proteins become more remotely related, even if their higher-level structures are preserved, is that their amino acid sequences may diverge to the point where their primary structure similarity is very low.

[Figure 1.1 panels, left to right: Bacteriocin AS-48; ab initio model; NK-lysin.]

Figure 1.1: Sequence-to-structure-to-function paradigm. The leftmost picture shows the structure of Bacteriocin AS-48 (1e68, left) from Enterococcus faecalis, a 70-residue cyclic bacterial lysin (Gonzalez et al., 2000). This protein is structurally and functionally related to mammalian NK-lysin (Liepinsh et al., 1997) (1nkl, right), despite undetectable sequence similarity, as only 4% of residues are identical after structural superposition. The Bacteriocin sequence was target T0102 in the CASP4 experiment (Moult et al., 2001). An excellent model (middle) was obtained by the Baker group (Bonneau et al., 2001a) using the ab initio method Rosetta, with an RMSD (root mean square deviation) of 3.5 Å over all 70 residues. No other method was able to predict this fold with similar accuracy. A search of the protein structure database using this model yielded NK-lysin as the first structural match of comparable length. This illustrates an approach that was able to predict the structure that could, in turn, be used to predict the function of the protein. Taken from Ginalski et al. (2005), originally published under open access by Oxford University Press.

The alternative mechanism whereby two proteins can share a related fold, but still have low sequence similarity, is convergent evolution; in this case, two unrelated proteins evolve to adopt the same structural fold despite the fact that they share no common heritage. Two structurally similar proteins with no common ancestor are known as analogues. Two structurally similar proteins with an extremely distant common ancestor are known as remote homologues. Although analogues may share a certain degree of sequence similarity, it is usually less than that shared between even remote homologues (Russell et al., 1997).

1.3.1  Structure-based Classification Of Proteins -- SCOP

A useful way of classifying proteins is to cluster them into familial hierarchies. This can be achieved in many ways; the most commonly used methods are sequence clustering, functional motif clustering, or structural similarity clustering. For example, the Protein families database (Pfam; Bateman et al., 2004) is a collection of Hidden Markov Models (HMMs, see § 1.4.8, page 43) constructed from curated multiple sequence alignments, representing many common protein domains and families. The PROSITE database (Sigrist et al., 2002) is another database of protein families and domains, which uses biologically significant sites, described as patterns or profiles, that help to identify the family (if any) to which a novel sequence belongs.

The CATH (class, architecture, topology, and homologous superfamily) Protein Structure Classification (Orengo et al., 1997) is one of the most commonly used structural databases. It is a hierarchical classification of protein domain structures, which only considers crystal structures solved to a resolution of 3.0 Å or better, together with NMR structures. Classification within CATH is largely automated, with some manual curation, and (at the time of writing) steps are being taken to automate the procedure still further.

The Structural Classification Of Proteins (SCOP) database (Murzin et al., 1995; Hubbard et al., 1999) uses a combination of sequence, structural, and functional similarities to highlight evolutionary relationships between proteins of known structure; analysis of new proteins is aided by a series of automated procedures. However, final classifications are refined by manual inspection and based on the available evidence. One of its key focuses is the importance of distinguishing homologous and analogous relationships between proteins. The SCOP hierarchy (see Figure 1.2, page 22) was used extensively in this research.

1.3.1.1  SCOP Architecture

Entries within SCOP are classified into a hierarchy comprising six levels:

1. Class -- Summary of secondary structure content. The top level of the hierarchy summarises domains according to their secondary structure content. In SCOP version 1.65 there are five main classes: all-α proteins; all-β proteins; mixed α/β proteins; α+β proteins (domains containing separated α and β parts); and small proteins (short domains that usually contain disulphide bridges or a complexed metal). Two other minor classes are grouped independently: multi-domain proteins (those with domains of different folds and those with no known homologues); and membrane and cell surface proteins.

2. Fold -- Major structural similarity. Proteins are classified as sharing a common fold if they have similar major secondary structures in the same arrangements, and similar topological connectivity. It is possible for members of a fold to have peripheral secondary structure elements that differ in size and conformation, sometimes to the extent that they comprise up to half of the domain. Similarly, members of a fold may not share any significant sequence similarity and, therefore, may have different evolutionary origins (i.e. they are analogues); high level structural similarities may have arisen from convergent evolution favouring certain packing arrangements and topologies over others.

3. Superfamily -- Probable common evolutionary origin. Each fold contains at least one superfamily, which groups together domains of low sequence identity based on evidence that suggests a common evolutionary origin and, therefore, homology. Evidence for homology between domains may be largely based on structure and function, but can include sequence similarity.

4. Family -- Clear evolutionary relationship. Each superfamily contains at least one family that groups together domains that are clearly evolutionarily related. Most family members share at least 30% pairwise sequence identity; however, some members may show definitive evidence of common ancestry from their structural and/or functional information.

5. Protein Domain -- Individual protein domains.

6. Species -- Protein domains present in different species.

[Figure 1.2 pyramid, top to bottom: CLASS, FOLD, SUPERFAMILY, FAMILY, PROTEIN DOMAIN, SPECIES.]

Figure 1.2: The SCOP (Structural Classification Of Proteins) architecture. The hierarchy begins with the most general classification of domains at the top, and the most specific at the bottom of the pyramid. At the time of writing, there are 54,745 domains in SCOP. A detailed description of the different SCOP levels is given in § 1.3.1.1 (page 21).

Domains from the same fold, but different superfamilies, are regarded as analogues, having evolved similar structures independently. However, if two proteins are structurally similar, the final decision of their classification is largely subjective: the curators of SCOP tend to be fairly conservative, and if they deem that there is insufficient evidence with which to group two proteins into the same superfamily, then they will most likely be grouped into the same fold, but separate superfamilies.
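The homologue/analogue distinction encoded by the SCOP hierarchy can be made concrete with a minimal sketch. Assuming each domain carries its SCOP lineage identifiers (class, fold, superfamily, family), the relationship between two domains follows directly from which levels they share; the domain names and identifiers below are hypothetical, not real SCOP entries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScopDomain:
    """A protein domain with its SCOP lineage (illustrative identifiers)."""
    sid: str          # domain identifier (hypothetical)
    scop_class: str   # secondary structure class
    fold: str         # major structural similarity
    superfamily: str  # probable common evolutionary origin
    family: str       # clear evolutionary relationship

def relationship(x: ScopDomain, y: ScopDomain) -> str:
    """Classify a pair of SCOP domains.

    Same superfamily -> (remote) homologues: probable common ancestry.
    Same fold only   -> analogues: similar structure, independent origin.
    Otherwise        -> unrelated at the fold level.
    """
    if x.superfamily == y.superfamily:
        return "homologues"
    if x.fold == y.fold:
        return "analogues"
    return "unrelated"

# Hypothetical domains sharing a fold ("c.2"); d1 and d2 also share a superfamily.
d1 = ScopDomain("d1aaa_", "c", "c.2", "c.2.1", "c.2.1.1")
d2 = ScopDomain("d2bbb_", "c", "c.2", "c.2.1", "c.2.1.3")
d3 = ScopDomain("d3ccc_", "c", "c.2", "c.2.2", "c.2.2.1")

print(relationship(d1, d2))  # homologues (same superfamily)
print(relationship(d1, d3))  # analogues (same fold, different superfamily)
```

Note that, as in SCOP itself, the sketch treats shared superfamily membership as evidence of homology and shared fold membership alone as analogy; the curators' subjective judgement described above is hidden inside the identifiers.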


1.3.1.2 SCOP Subsets

Due to the high level of sequence redundancy between proteins of known structure, the SCOP database also lists various subsets of sequences clustered according to the maximum percentage of pairwise identity. For example, SCOP version 1.65 contains 54,745 annotated domains; the subset of domains that share less than 95% sequence identity (SCOP95) numbers approximately 9,400, and there are approximately 4,800 domains in the subset that share less than 30% sequence identity (SCOP30). SCOP30 version 1.65 was used extensively in this research (see § 2.4.1, page 109).

1.4 Sequence-based Recognition Methods

1.4.1 Homology Modelling

When the first computational methods for protein structure determination were being devised, simple sequence-based approaches were key in making structural and functional predictions, since statistically significant sequence similarity usually signifies homology, related structure, and related function (see § 1.4.5, page 39). The overall quality of a predicted protein structure often depends on how much information from known structures is used in its construction. At one extreme of structure prediction techniques are those that fall under the category of comparative or homology modelling. These methods are based on the idea that proteins that have evolved from a common ancestor will have a substantial degree of sequence similarity (usually 30% or more). When the sequence of a query protein is similar to that of one or more template proteins, it is highly likely that the structures will also be similar. By comparing the amino acid sequence of an unknown query protein against a database of template proteins of known structure, close homologues can be identified and potential structures assigned to the query (Durbin et al., 1998). Due to the vast amount of data now available to those using such prediction methods, these techniques are often used to predict structures for novel gene products. Homology modelling servers include SWISS-MODEL (Schwede et al., 2003), CPHmodels (Lund et al., 2002), 3D-JIGSAW (Bates & Sternberg, 1999; Bates et al., 2001), and ESyPred3D (Lambert et al., 2002).

1.4.2 Dynamic Programming

Dynamic programming is an automated technique commonly used in protein structure prediction. It is a general algorithm that guarantees an optimal alignment of two sequences for a given set of alignment parameters (Bellman, 1957). It is widely used for searching template sequence databases, and for aligning query sequences to template sequences in order to build models from the template structures. Dynamic programming was used extensively in this research.

1.4.2.1 Global Alignment: Needleman-Wunsch Algorithm

The most established biological sequence comparison method, derived from dynamic programming, is the Needleman-Wunsch or global alignment algorithm (Needleman & Wunsch, 1970). This algorithm determines an optimal alignment for the entire length of a query sequence and a template sequence, from the first residue to the last residue in each, allowing for gaps, hence the name global. The following section outlines a more efficient version of the Needleman-Wunsch algorithm, introduced by Gotoh (1982).

For a given template sequence (x) and query sequence (y) with lengths of n and m respectively, both are indexed by i and j, one index per sequence, where xi represents the residue at position i in the template sequence and yj represents the residue at position j in the query sequence. For each position in the substitution score matrix S (see § 1.4.3, page 30) a numerical score is given according to how favourable it would be to replace a given amino acid in the template sequence with either an amino acid found in the query sequence or, alternatively, an insertion or deletion (indel). So the value S(xi, yj) is the score that results from matching template residue xi to query residue yj. These scores are usually negative for unfavourable substitutions (e.g. matching a proline to a glycine), and positive for favourable substitutions (e.g. matching a leucine to a valine). The algorithm can be refined by also using the affine gap model; this usually results in a highly negative score for starting an indel, but a less negative score for extending one. The assumption is that, since gaps tend to be more than one residue long, it must be easier to extend an indel than to start it (e.g. starting a gap may have a score cost of -10, whilst extending a gap may have a score cost of just -1).

For x and y, an (n + 1) × (m + 1) dynamic programming alignment matrix (F) is constructed. The matrix F is indexed by i and j, where the value F(i, j) is the score of the best alignment between the segment x1 to xi of x, and the segment y1 to yj of y. F is built up recursively. After initialising F(0, 0) = 0, the matrix is filled from top left to bottom right. If F(i - 1, j - 1), F(i - 1, j) and F(i, j - 1) are known then it is possible to calculate F(i, j): if xi is aligned to yj then F(i, j) = F(i - 1, j - 1) + S(xi, yj); if xi is aligned to a gap (i.e. a deletion) then F(i, j) = F(i - 1, j) - d (where d represents the score for an indel); if yj is aligned to a gap (i.e. an insertion) then F(i, j) = F(i, j - 1) - d. The optimal score will be the largest of these three options.

Therefore:

F(i, j) = max of:
    F(i - 1, j - 1) + S(xi, yj)    Match or Mismatch
    F(i - 1, j) - d                Deletion (in the template)
    F(i, j - 1) - d                Insertion (in the template)        (1.1)

Equation 1.1 is applied repeatedly until matrix F is full, calculating the value in the bottom right-hand corner of each square of four cells from one of the other three values (above-left, left, or above) as shown in Figure 1.3.


[Diagram omitted: F(i - 1, j - 1), F(i - 1, j), and F(i, j - 1) each point into F(i, j).]

Figure 1.3: Diagram showing how the value in the bottom right-hand corner of each square of four cells in the alignment matrix is calculated from one of the other three values (above-left, left, or above) using Equation 1.1. Based on a figure taken from Durbin et al. (1998).

To complete the specification of the algorithm, boundary conditions must be defined. Down the first column, where j = 0, the values of F(i - 1, j - 1) and F(i, j - 1) do not exist. Since the values of F(i, 0) represent an initial deletion in the template sequence, the entire column can be defined as F(i, 0) = -id. Likewise, along the top row, F(0, j) = -jd.

While matrix F is being filled, a dynamic programming traceback matrix (R) records the pointers showing from which cell each value of F (i, j) was derived. The value in the bottom right-hand corner of the matrix (F (n, m)) is, by definition, the final score for a global alignment between x and y. The alignment itself is constructed by finding the path that led to the final value. This is done using a procedure called traceback. Traceback constructs the alignment in reverse order by starting from the final cell in R and following the pointers that were stored when filling the matrix. By starting in R(n, m) and eventually ending in R(0, 0), it is possible to build an alignment by prepending a pair of symbols (representing the template and the query) for each step taken: xi and yj if the step was to F (i - 1, j - 1), xi and a gap character (`-') if the step was to F (i - 1, j), or the gap character and yj if the step was to F (i, j - 1) (Figure 1.4). It should be noted that traceback describes just one alignment of optimal score; if there are two or more possible alignments with the same optimal score then an arbitrary decision must be made as to which is preferred.
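The fill and traceback procedure described above can be sketched as follows. This is an illustrative implementation with a linear gap penalty; the scoring function passed in is an invented stand-in, not one of the substitution matrices used in this work.

```python
def needleman_wunsch(x, y, S, d):
    """Global alignment of template x and query y (Equation 1.1).

    S(a, b) -> substitution score; d -> linear indel penalty (positive).
    Returns (score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    # F[i][j]: best score aligning x[:i] with y[:j]; R stores traceback pointers.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    R = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # boundary condition: F(i, 0) = -i*d
        F[i][0], R[i][0] = -i * d, "up"
    for j in range(1, m + 1):          # boundary condition: F(0, j) = -j*d
        F[0][j], R[0][j] = -j * d, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + S(x[i - 1], y[j - 1]), "diag"),  # match/mismatch
                (F[i - 1][j] - d, "up"),                            # deletion
                (F[i][j - 1] - d, "left"),                          # insertion
            ]
            F[i][j], R[i][j] = max(options)   # ties broken arbitrarily
    # Traceback from F(n, m) to F(0, 0), prepending one aligned pair per step.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = R[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

For example, with the toy scoring function `S = lambda a, b: 2 if a == b else -1` and `d = 2`, `needleman_wunsch("AW", "A", S, 2)` returns `(0, "AW", "A-")`: the match of A to A (+2) pays for the deletion of W (-2).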

The algorithm can be modified to construct more than one optimal alignment; the set of all possible optimal alignments can be described using a sequence graph structure (Altschul & Erickson, 1986; Hein, 1989).

[Matrix omitted: the global dynamic programming alignment matrix for template PAWHEAE (indexed by i) against query HEAGAWGHEE (indexed by j).]

Figure 1.4: The global dynamic programming alignment matrix for a pair of example sequences showing the implementation of Equation 1.1 on page 25; values on the optimal alignment path (from the traceback matrix) are shown in red. The optimal alignment, with a total score of 1, is shown below. The indel penalty, d, is given a value of 8 (i.e. for every gap in the sequence, subtract 8 from the alignment score). Based on a figure taken from Durbin et al. (1998).

Template (i):  - - P - A W - H E A E
Query (j):     H E A G A W G H E - E

1.4.2.2 Local Alignment: Smith-Waterman Algorithm

Global sequence alignments work very well for comparing complete sequences that are related. However, if only part of a sequence shares homology with part of another sequence then the Needleman-Wunsch algorithm will still try to align the sequences over their full lengths. This usually results in a drop in the overall score of the alignment because the unrelated segments of the sequences may contribute negative substitution scores.

A more common objective than constructing a global sequence alignment is to search for the best alignment between two subsequences of x and y. This is often necessary when there is a significant probability that two protein sequences share a common domain. It is also a highly sensitive method for detecting similarity between two highly diverged sequences. This is because, usually, only part of such a sequence is under strong enough selection pressure to preserve detectable similarity; the remainder of the sequence will have built up so much noise, through mutation, that it is no longer comparable.

The technique of aligning subsequences of proteins using dynamic programming was introduced in 1981 and is known as the Smith-Waterman or local alignment algorithm (Smith & Waterman, 1981). This algorithm was used extensively in this research. The Smith-Waterman algorithm is closely related to the Needleman-Wunsch algorithm with two key differences. Firstly, an extra option is added to the equation determining F (i, j), allowing it to take a value of 0 if all other options are negative.

Therefore:

F(i, j) = max of:
    F(i - 1, j - 1) + S(xi, yj)    Match or Mismatch
    F(i - 1, j) - d                Deletion (in the template)
    F(i, j - 1) - d                Insertion (in the template)
    0                              New Alignment        (1.2)
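A minimal sketch of this local-alignment recurrence and its traceback, again with an illustrative scoring function rather than a real substitution matrix:

```python
def smith_waterman(x, y, S, d):
    """Local alignment of template x and query y (Equation 1.2).

    Returns (best_score, aligned_x, aligned_y) for the highest-scoring
    local alignment; traceback stops at the first zero cell.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # top row/left column stay zero
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                     # new alignment
                          F[i - 1][j - 1] + S(x[i - 1], y[j - 1]),  # match/mismatch
                          F[i - 1][j] - d,                       # deletion
                          F[i][j - 1] - d)                       # insertion
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    # Traceback from the highest-scoring cell until a zero cell is reached.
    ax, ay = [], []
    i, j = best_cell
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + S(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

With the same toy scoring as above, only the conserved core of two sequences is reported, and flanking unrelated residues are ignored.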

Taking the option of zero is equivalent to starting a new alignment in the alignment matrix F. As a result, the top row and left-most column of F are filled with zeros, and not -jd and -id as before, and a local alignment can never have a score less than zero. Secondly, it is now possible to end an alignment anywhere in the alignment matrix. As a result, the traceback starting position changes from F(n, m) to the highest value of F(i, j) over the whole matrix. The traceback ends once a cell containing the value zero is reached.

[Matrix omitted: the local dynamic programming alignment matrix for template PAWHEAE (indexed by i) against query HEAGAWGHEE (indexed by j).]

Figure 1.5: The local dynamic programming alignment matrix for a pair of example sequences showing the implementation of Equation 1.2 on page 28; values on the optimal alignment path (from the traceback matrix) are shown in red. The optimal alignment, with a total score of 28, is shown below. The indel penalty, d, is given a value of 8 (i.e. for every gap in the aligned sequence, subtract 8 from the alignment score). Based on a figure taken from Durbin et al. (1998).

Template (i):  A W - H E
Query (j):     A W G H E

1.4.2.3 Semi-Global Alignments

There are two more dynamic programming alignment algorithms that are sometimes grouped under semi-global alignments: the local-global alignment and the global-local alignment algorithms.

1.4.2.3.1 Local-Global Alignment

As the name suggests, the local-global alignment algorithm is a hybrid of the local and global algorithms, and involves finding the largest segment in a query that will align to the total length of a template (i.e. the alignment is global over the template and local over the query). Such a method is useful when comparing a multi-domain query protein to a single-domain template, since the intention is to align part of the query to the whole of the template.

The differences in performing this type of alignment are: F(0, j) = 0, but F(i, 0) = -id, so the whole length of the template is aligned against a segment of the query. The traceback starts from the highest score in the bottom row of F and ends once it reaches the top row. F(i, j) is defined as in the global alignment (see Equation 1.1, page 25).


1.4.2.3.2 Global-Local Alignment

The global-local alignment algorithm is effectively the transpose of the local-global alignment algorithm. The intention of the algorithm is to find the largest segment in a template that will align to the total length of a query (i.e. the alignment is global over the query and local over the template). The differences for this algorithm are: F(i, 0) = 0, but F(0, j) = -jd, so the whole length of the query is aligned against a segment of the template. The traceback starts from the highest score in the right-most column of F and ends once it reaches the left-most column. F(i, j) is defined as in the global alignment (see Equation 1.1, page 25).
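The two variants differ from the global algorithm only in their boundary conditions and in where the best score may be read out. This can be summarised in a short score-only sketch; the function name and the boolean flag are inventions for this illustration:

```python
def semi_global_score(x, y, S, d, whole_template):
    """Score-only sketch of the two semi-global alignment variants.

    whole_template=True  (local-global, § 1.4.2.3.1):
        F(0, j) = 0, F(i, 0) = -i*d; the best score is taken from the bottom
        row, so the whole template x is aligned to a segment of the query y.
    whole_template=False (global-local, § 1.4.2.3.2):
        F(i, 0) = 0, F(0, j) = -j*d; the best score is taken from the
        right-most column, so the whole query y is aligned to a segment of x.
    The recurrence itself is Equation 1.1.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d if whole_template else 0
    for j in range(1, m + 1):
        F[0][j] = 0 if whole_template else -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + S(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    if whole_template:
        return max(F[n])                          # end anywhere in the bottom row
    return max(F[i][m] for i in range(n + 1))     # end anywhere in the right-most column
```

For instance, aligning the two-residue template "AW" against the query "HAWK" with the whole-template variant finds the embedded "AW" without penalising the query's flanking residues.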

1.4.3 Substitution Matrices

The substitution matrix is a vital part of the dynamic programming algorithm, and the matrix S referred to in § 1.4.2 (page 24) can be any combination of substitution scores. Substitution matrices used in protein sequence alignment are 20×20 matrices (sometimes 24 × 24 when the extra alignment characters `B', `Z', `X' and the gap character are included; see Figure 1.6, page 33) where each cell represents a score for a particular amino acid substitution. Ideally, a substitution matrix would score biologically significant alignments as positive and all other alignments as negative (or vice versa depending on the scoring scheme for the alignment algorithm). The easiest way to create a matrix is by deconstructing actual observed biological probabilities of substituting an amino acid (a) with another amino acid (b). The general formula for generating all substitution matrices with a negative expected score is:

S(a, b) = λ ln( q_{a,b} / (p_a p_b) )        (1.3)

where S(a, b) is the element in matrix S representing the substitution score for replacing amino acid a with amino acid b, and q_{a,b} is the target substitution frequency, i.e. the observed frequency with which amino acid a is replaced by amino acid b, represented as a probability, which is usually calculated from homologous protein alignments. All target substitution frequencies, for each amino acid, are greater than zero and sum to 1. For the sake of statistical theory, a simple model is assumed in which amino acids occur randomly at all positions. Factors p_a and p_b are background frequencies that represent the overall probabilities of observing a and b in nature respectively. The product of the background frequencies can be regarded as the overall probability of substituting a for b purely by chance. Normalising the target substitution frequency against the background frequencies ensures that conservative exchanges for rare amino acids are weighted more favourably. A logarithm is taken, then a scaling factor specific to the scoring system (λ) is applied, and finally S(a, b) is rounded to the nearest integer. These scores are stored in the substitution matrix S and are usually referred to as log-odds scores. The logarithm step also increases computational efficiency: instead of multiplying probabilities, their logarithms are simply added.

A substitution matrix is uniquely determined by its target substitution frequencies (the background frequencies are the same for most substitution matrices). A key assumption is that the expected score for any S(a, b) must be negative; if not, long alignments between unrelated sequences will have high scores simply because of their length.
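As a toy illustration of Equation 1.3 and the negative-expected-score condition, consider a two-letter alphabet; all frequencies and the scale factor below are invented purely for this example:

```python
import math

def log_odds_score(q_ab, p_a, p_b, lam=1.0):
    """Substitution score of Equation 1.3: lambda * ln(q_ab / (p_a * p_b)),
    rounded to the nearest integer."""
    return round(lam * math.log(q_ab / (p_a * p_b)))

def expected_score(q, p, lam=1.0):
    """Expected score over random residue pairs; for a usable matrix this
    must come out negative (the condition formalised in Equation 1.4)."""
    return sum(p[a] * p[b] * log_odds_score(q[(a, b)], p[a], p[b], lam)
               for a in p for b in p)

# Invented frequencies for a hypothetical two-letter alphabet:
background = {"A": 0.6, "B": 0.4}
targets = {("A", "A"): 0.5, ("A", "B"): 0.1,
           ("B", "A"): 0.1, ("B", "B"): 0.3}
```

With these numbers, identical pairs score positively (they occur more often in homologues than by chance), mismatches score negatively, and the expected score over random pairs is negative, as required.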

The most commonly used matrices in protein sequence alignment are the Point Accepted Mutation (PAM) matrices and the BLOCKS Substitution Matrix (BLOSUM) series (see § 1.4.3.1, page 31, and § 1.4.3.2, page 32, respectively). The choice of substitution matrix is often critical in determining the quality of the resulting sequence alignments, although no single substitution matrix is ideal for every instance. The best matrix would be one where the target substitution frequencies specifically characterise the protein family being analysed. 1.4.3.1 The PAM Matrices

The PAM matrices model the evolutionary distance between homologous proteins (Dayhoff et al., 1978); each cell contains a probability of an amino acid (a) being replaced by another amino acid (b) over an interval of evolutionary time known as a Point Accepted Mutation (PAM). A PAM is the probability of an amino acid being mutated during the period of evolution in which a single point mutation occurred in 100 residues (i.e. a 1% mutation rate). Similarly, 100 PAMs represents the period of evolution in which 100 point mutations occurred in 100 residues. Note that this does not necessarily mean that all 100 residues have been mutated, merely that 100 incidences of mutation occurred between them; some may have mutated several times, while others may not have mutated at all.

The data used in the original construction of the PAM matrix were collected from closely related proteins. As a result, PAM matrices generally perform better when aligning closely related sequences. Matrices representing n PAMs (i.e. a PAMn matrix) can be obtained by raising the PAM1 matrix to the power of n. Unfortunately, this theoretical extrapolation of the PAM1 matrix does not take into account conservation pressure on structurally or functionally important regions within sequences, and can easily overestimate mutation rates.

1.4.3.2 The BLOSUM Matrices

The BLOSUM matrices (Henikoff & Henikoff, 1992) were originally constructed using target substitution and background frequencies derived from the BLOCKS database (Henikoff et al., 1999), hence the name BLOCKS Substitution Matrix (BLOSUM). In order to build matrices of different identity levels, sequences were clustered according to their minimum percentage identity, e.g. the BLOSUM50 matrix was constructed using sequences that were 50% or more identical. Likewise, the BLOSUM80 matrix was constructed using sequences that were 80% or more identical. The most commonly used BLOSUM matrix is the BLOSUM62, as it is generally regarded as being based on the optimal level of sequence identity for identifying closely related homologues, while not missing more distantly related homologues (see Figure 1.6, page 33).


     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0  -2  -1   0  -4
R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3  -1   0  -1  -4
N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3   3   0  -1  -4
D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3   4   1  -1  -4
C    0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1  -3  -3  -2  -4
Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2   0   3  -1  -4
E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3  -1  -2  -1  -4
H   -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3   0   0  -1  -4
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3  -3  -3  -1  -4
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1  -4  -3  -1  -4
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2   0   1  -1  -4
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1  -3  -1  -1  -4
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1  -3  -3  -1  -4
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2  -2  -1  -2  -4
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2   0   0   0  -4
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0  -1  -1   0  -4
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3  -4  -3  -2  -4
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1  -3  -2  -1  -4
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4  -3  -2  -1  -4
B   -2  -1   3   4  -3   0   1  -1   0  -3  -4   0  -3  -3  -2   0  -1  -4  -3  -3   4   1  -1  -4
Z   -1   0   0   1  -3   3   4  -2   0  -3  -3   1  -1  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
X    0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1  -1  -1  -4
*   -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Figure 1.6: The BLOSUM62 amino acid substitution matrix. Cells contain the log-odds score for a particular amino acid substitution during a sequence-sequence alignment. Note that the matrix is symmetric and 24 × 24.

BLOSUM matrices tend to perform better in sequence alignments and homology searches than the PAM matrices, particularly when analysing more distantly related homologues (Henikoff & Henikoff, 1993). Since the matrices are constructed without the theoretical extrapolation used for the PAM matrices, they demonstrate significant improvements in estimating the mutation rates between distantly related sequences (Henikoff & Henikoff, 1992).


1.4.4 The FASTA and BLAST Heuristic Methods

Dynamic programming algorithms are guaranteed to provide an optimal alignment between two sequences for a given set of parameters. However, since the algorithms are exhaustive, they become considerably slower as the size of the sequences increases. Since the 1980s, several heuristic (i.e. non-exhaustive) techniques have been developed to efficiently process the huge number of publicly available protein sequences. Significant sequence similarities can be found by comparing relatively short regions of a query and a template without actually performing full dynamic programming (see § 1.4.2, page 24). If the initial search steps are successful, then dynamic programming may be used to produce final results for a substantially reduced list of templates. Such heuristic techniques are fast but cannot guarantee an optimal alignment or search result. However, their speed and efficiency have made them very popular. The two most commonly used heuristics are FASTA and BLAST.

1.4.4.1 The FASTA Method

The first widely used heuristic method was developed by Wilbur & Lipman (1983). It was later further developed into the FASTA algorithm (Pearson & Lipman, 1988; Pearson, 1990), which can be used for both DNA and protein sequence searches.

The FASTA algorithm (Figure 1.7, page 35) can be summarised as follows:

1. For each template sequence within a databank, search for any words (unbroken strings) of length ktup that it shares with the query sequence (e.g. when ktup = 3, the algorithm searches for words three residues in length).

2. Search for regions of high percentage identity by examining the identity plot between the template and the query and identifying tightly packed clusters of words along the same diagonal. If the distance (in residues) between an initial word and an adjacent word is smaller than the match score for the initial word, then the words are extended until they merge to form a region.


[Figure panels omitted. (a) Find runs of identities between sequence A and sequence B. (b) Re-score using the PAM matrix and keep the top-scoring segments. (c) Apply a 'joining threshold' to eliminate segments that are unlikely to be part of the alignment that includes the highest-scoring segments. (d) Use dynamic programming to optimise the alignment in a narrow band encompassing the top-scoring segments.]

Figure 1.7: A summary of the FASTA sequence comparison method. Based on a figure taken from Barton (1996).


3. The 10 highest-scoring individual regions are re-scored, using a PAM250 matrix (see § 1.4.3.1, page 31), and trimmed or extended to maximise the score. This is known as partial gapless alignment.

4. If there are enough regions above a given cut-off score, then these are joined, using dynamic programming, to produce a gapped alignment. The final score (initn) is used to rank the template sequence.

5. For each of the top-ranked template sequences, a local alignment (see § 1.4.2.2, page 27) is constructed with the query sequence using a 32-residue window superimposed on the highest-scoring initn region. The final total is the reported score.

These techniques do not necessarily reduce the number of template sequences that are examined, but they do limit the amount of exhaustive dynamic programming to a substantially reduced search space.

1.4.4.2 The BLAST Method

The Basic Local Alignment Search Tool (BLAST; Altschul et al., 1990) originally used heuristic (i.e. non-exhaustive) techniques to rapidly search large databanks and produce ungapped alignments. It used the BLOSUM62 matrix (see § 1.4.3.2, page 32) to perform rapid segment (unbroken string) searches within an indexed sequence databank. Any segments found to score above a specified threshold were then extended using dynamic programming. It was later refined to make it more sensitive and capable of producing gapped alignments (Altschul & Koonin, 1998).

In its current incarnation, BLAST is a sequence comparison algorithm (optimised for speed over precision) used to search template sequence databanks for optimal local gapped alignments to a query sequence. The initial stage of the algorithm searches a databank index for a segment of length W that scores at least a value of T when compared to the query sequence using an ungapped local alignment (these are referred to as hits). Segment hits are then extended in either direction to see if they form an ungapped alignment with a score no less than a given threshold (S). The extension is halted when either the cumulative alignment score drops below a given threshold (X) from its maximum achieved value, or the cumulative score falls to zero or less, or the end of either sequence is reached. If successful, the alignment is known as a high-scoring segment pair (HSP). The maximal-scoring segment pair (MSP) is defined as the highest scoring HSP. The T parameter dictates the speed and sensitivity of the search, while S tends to affect its selectivity. BLAST is much faster than FASTA and permits almost all HSPs (under given parameters) to be located in a sequence databank.

The BLAST algorithm (Figure 1.8, page 38) can be summarised as follows:

1. For a given segment size (W, usually 3) and a given substitution matrix, list all possible segments from a given query that score at least T.

2. Use these identified query segments to find corresponding segments in the template databank (referred to as hits).

3. BLAST checks whether any given query-template (QT) pair shares at least two non-overlapping hits within a given distance threshold (A) on the same diagonal of their identity plot. If any two hits overlap, then the most recent hit is ignored.

4. If the previous step succeeds, then an ungapped bidirectional extension of the second hit is triggered to find an HSP. If the cumulative score of the ungapped alignment is greater than S, then the extension ends when it can no longer be improved. The extension is halted when either the cumulative alignment score drops below a given threshold (X) from its maximum achieved value, the cumulative score falls to zero or less, or the end of either sequence is reached.

5. The highest-scoring HSP (the MSP) is further extended in both directions using a gapped alignment. The highest-scoring gapped alignments are then realigned with relaxed parameters in order to maximise the alignment length.


[Figure panels omitted. (1) For the query, find the list of high-scoring words of length w (typically w = 3 for proteins; a query of length L yields at most L - w + 1 words). For each word from the query sequence, find the list of words that will score at least T when scored using a pair-score matrix (e.g. PAM250); for typical parameters there are around 50 words per residue of the query. (2) Compare the word list to the database and identify exact matches. (3) For each word match, extend the alignment in both directions to find alignments that score greater than the score threshold S, giving maximal segment pairs (MSPs).]

Figure 1.8: A summary of the BLAST sequence alignment method. Based on a figure taken from Barton (1996).


This two-hit method reduces the number of bidirectional extensions, which is the most time-consuming step in the BLAST algorithm. Since fewer local alignments are performed by BLAST, compared to FASTA, it can be faster without compromising its sensitivity.
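Step 1 of the algorithm above can be sketched as follows. The function name, parameter defaults, and the brute-force neighbourhood enumeration are illustrative simplifications of what BLAST actually does with its pre-built databank index:

```python
from itertools import product

def high_scoring_words(query, S, W=3, T=11, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Sketch of BLAST step 1: for every length-W window of the query, list
    the words that score at least T against it under substitution function S.

    Returns a dict mapping each query position to its set of neighbourhood
    words; these are the words later looked up in the databank as 'hits'.
    """
    hits = {}
    for pos in range(len(query) - W + 1):
        window = query[pos:pos + W]
        words = set()
        # Enumerate every possible word of length W over the alphabet and
        # keep those whose ungapped score against the window reaches T.
        for word in product(alphabet, repeat=W):
            score = sum(S(a, b) for a, b in zip(window, word))
            if score >= T:
                words.add("".join(word))
        hits[pos] = words
    return hits
```

Raising T shrinks each neighbourhood, making the search faster but less sensitive, which is exactly the speed/sensitivity trade-off attributed to T in the text.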

1.4.5 Alignment Statistics

As stated previously, for an alignment algorithm to work, the expected score for a random match must be negative. If this condition is ignored, then alignments between long but unrelated sequences are likely to have high alignment scores simply because of their length. Similarly, for local alignments, there must be at least one value in the substitution matrix S greater than 0, otherwise no alignments will be found at all (Altschul & Gish, 1996). To expand on why the expected score of a random match must be negative, consider an ungapped alignment of fixed length; since consecutive positions within the alignment are independent, only a single position need be considered, and the condition that must be satisfied is:

Σ_{a,b} p_a p_b S(a, b) < 0        (1.4)

where p_a is the probability of amino acid a appearing at any position in a sequence. Since S(a, b) is defined as a log-odds ratio (see Equation 1.3, page 30; the positive scaling factor λ does not affect the sign and is omitted here), this condition is always satisfied because:

Σ_{a,b} p_a p_b S(a, b) = Σ_{a,b} p_a p_b ln( q_{a,b} / (p_a p_b) ) = - Σ_{a,b} p_a p_b ln( p_a p_b / q_{a,b} ) = -H(p² || q)        (1.5)

where H(p² || q) is the relative entropy of the distribution p² with respect to the distribution q (Durbin et al., 1998, chap. 11). By definition, the relative entropy is always positive unless p² = q, so -H(p² || q) will always be negative. Unfortunately, no equivalent analytical method for optimal gapped alignments exists.


1.4.5.1 Calculating E-Values

A critical function of systematic sequence alignments is the ability to distinguish between those that are biologically significant and those that are biologically insignificant: those that are biologically significant are most likely to be homologous and (therefore) most likely to provide valuable information for structural and functional characterisation.

Based on the assumed conservation of biologically important residues within homologous families, the simplest way of quantifying biological relevance is to calculate how likely a pair of sequences, of a given length and composition, is to produce a given alignment score purely by chance. These statistical measurements can be represented as P-values or E-values. A P-value is the probability of seeing at least one score (T) greater than or equal to some score (x) in a database search of n sequences. An E-value is the expected number of biologically insignificant sequence alignments with scores greater than or equal to a score x in a database search of n sequences. For the purposes of this research, only the E-value calculation is of interest. For a full description of the derivation and calculation of E-values, see Appendix A (page 231).
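Appendix A gives the full derivation; as a hedged sketch, E-values for ungapped local alignments follow the standard Karlin-Altschul form E = Kmn·e^(-λS), where K and λ depend on the scoring system. The parameter values below are illustrative placeholders only:

```python
import math

def evalue(S, m, n, K=0.13, lam=0.32):
    """Expected number of chance hits scoring >= S for a query of length m
    against a databank of total length n (Karlin-Altschul statistics; the
    derivation for this work is in Appendix A). K and lam are placeholder
    values here; in practice they are fitted to the scoring system used."""
    return K * m * n * math.exp(-lam * S)

def pvalue(E):
    """Probability of at least one chance hit given E expected: P = 1 - e^(-E)."""
    return 1.0 - math.exp(-E)
```

Note how the E-value grows linearly with both query and databank length but falls off exponentially with the alignment score, which is why a fixed score threshold becomes less significant as databanks grow.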

1.4.6 Position Specific Scoring Matrices and Profiles

Several successful homology recognition methods use position specific scoring matrices (PSSMs; Bork & Gibson, 1996). These are special substitution matrices, specifically constructed for a particular protein from a multiple sequence alignment. They are capable of finding much more remote protein homologues since their substitution scores apply only to the family of proteins used in the multiple sequence alignment. PSSMs are often referred to as profiles, although, strictly speaking, a PSSM is a reinterpretation of a profile (Gribskov et al., 1987). In this work, the terms are used interchangeably.


A profile is constructed using data from a multiple sequence alignment that describes the probability of matching a particular amino acid to a particular point in a protein. It is a matrix of dimensions n × 20, where n is the length of a sequence (x). At each position along the sequence (xi), a substitution score for each of the 20 amino acids is given. The biggest difference between a profile and a typical substitution matrix is that the score for matching a given amino acid is usually different, depending on its position along the sequence. The different types of PSSM are generally divided into three categories: 1D, 2D and 3D-PSSMs (terms coined by Kelley et al., 2000). These names are not meant to refer to the number of dimensions in the PSSMs, but rather to the type of data used in their construction.

1.4.6.1 1D-PSSMs

A 1D-PSSM is designed to reflect the position specific primary structure substitution propensities of a given protein sequence. Whereas a generic substitution matrix (see § 1.4.3, page 30) is derived by averaging mutational frequencies across a wide range of proteins, a 1D-PSSM is constructed using the multiple sequence alignment of a query protein and its known homologues (Gribskov, 1994; Henikoff & Henikoff, 1992). The values of a 1D-PSSM are usually calculated in the same way as for a substitution matrix, using Equation 1.3 (page 30); the probabilities of substituting amino acid a with amino acid b are based upon the number of observations in the multiple sequence alignment for the query protein.
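The construction just described can be sketched directly. The toy example below builds an n × 20 log-odds PSSM from a small gapless alignment, using a uniform background distribution and a flat pseudocount; real profile builders use observed background frequencies and more sophisticated sequence weighting, so this is a simplified illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative background frequencies (uniform here for simplicity;
# real profiles use observed amino acid frequencies).
BACKGROUND = {aa: 1.0 / 20 for aa in AMINO_ACIDS}

def build_pssm(alignment, pseudocount=1.0):
    """Build an n x 20 log-odds PSSM from a gapless multiple sequence
    alignment (a list of equal-length sequences):
    score = log(observed column frequency / background frequency).
    The pseudocount avoids log(0) for unobserved residues."""
    ncols = len(alignment[0])
    pssm = []
    for i in range(ncols):
        counts = Counter(seq[i] for seq in alignment)
        total = len(alignment) + pseudocount * 20
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            row[aa] = math.log(freq / BACKGROUND[aa])
        pssm.append(row)
    return pssm

msa = ["MNLYD", "MNLFD", "MKLYD", "MNLYE"]
pssm = build_pssm(msa)
print(round(pssm[0]["M"], 2))  # conserved M scores positively
```

A residue conserved in a column (here M at position 1) receives a positive score, while residues never observed there score negatively, which is exactly how position-specific information sharpens a search relative to a generic substitution matrix.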

A 1D-PSSM provides a greater depth of information with regard to a specific query, offering insights into evolutionary relationships with other proteins, and suggesting key residues that may perform vital structural and/or functional roles. The use of 1D-PSSMs often greatly increases the power of homology recognition techniques, in addition to increasing the accuracy of secondary structure prediction. The accuracy of profile-based homology recognition can be further augmented by using 1D-PSSMs for both query and template proteins (see § 3.4.3, page 133).


1.4.7 The PSI-BLAST Heuristic Method

Further development of the BLAST algorithm (see § 1.4.4.2, page 36) produced the more powerful Position Specific Iterated BLAST (PSI-BLAST; Altschul et al., 1997), which enhances search sensitivity by using sequence profiles as queries. This was the first tool in which the use of profiles as search queries was fully automated and coupled with robust statistical theory. Searching a database with a profile as a query often provides greater depth and coverage, identifying remotely related homologues as well as closely related homologues.

The first stage of a PSI-BLAST search is a standard BLAST, using a query sequence, a standard substitution matrix (e.g. BLOSUM62) and a sequence databank. Results with an E-value lower (i.e. better) than or equal to a given cut-off E-value are stored, and a multiple sequence alignment is constructed from these results. The multiple sequence alignment is converted to a sequence profile, which is then used in the subsequent iteration searches. After more homologues have been detected, the profile is refined to include the extra information from the new alignments. This iterative method allows PSI-BLAST to find more sequence homologues; it combines all the necessary steps to construct a profile, which it then uses to scan against a databank. The iteration continues for as many cycles as desired, or until no new homologues can be detected. The result is a complete list of sequences, from the final iteration, with E-values better than the cut-off.
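The cycle described above can be summarised as a schematic loop. In the sketch below, `search(query, profile)` is a hypothetical stand-in for a single BLAST pass, and the multiple-alignment and PSSM construction steps are reduced to a placeholder; it is meant to show the control flow, not the BLAST internals.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "name evalue")

def psi_blast(query, database, search, e_cutoff=0.005, max_iterations=5):
    """Schematic sketch of the PSI-BLAST iteration loop. The first
    pass runs with profile=None (i.e. BLOSUM62 plus the raw query);
    subsequent passes search with a profile refined from the hits."""
    profile, hits = None, set()
    for _ in range(max_iterations):
        results = search(query, profile)
        new = {h for h in results if h.evalue <= e_cutoff} - hits
        if not new:                  # converged: no new homologues
            break
        hits |= new
        profile = frozenset(hits)    # placeholder for MSA -> PSSM refinement
    return sorted(hits, key=lambda h: h.evalue)

# Toy search pass: the profile "unlocks" a remote homologue on pass two.
def toy_search(query, profile):
    first = [Hit("close_homologue", 1e-30)]
    remote = [Hit("remote_homologue", 1e-4)] if profile else []
    return first + remote

print([h.name for h in psi_blast("MNLYD", None, toy_search)])
```

The toy search illustrates the key behaviour: a homologue invisible to the first (matrix-based) pass becomes detectable once the profile incorporates the initial hits.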

PSI-BLAST is approximately three times more sensitive in the detection of remote homologues than BLAST alone (Park et al., 1998). However, as the process repeats itself, the profile will either become more refined, or will become more generalised (often referred to as drift). Some regions of a profile may become highly specific -- usually representing structurally or functionally important residues of a protein. Others may become more generalised with each iteration -- representing residues (and probably structures) of high variability. Occasionally the search will


drift after a remote protein of a different structure and/or function enters the profile, causing the direction of the search iterations to change. For a detailed description of how PSI-BLAST constructs its profiles see Appendix B (page 243).

1.4.8 Hidden Markov Models

A technique that is closely related to profile searching is that of using hidden Markov models (HMMs) to search databases. An overview of this technique and its application to sequence comparison is provided in a review by Eddy (1998).

A HMM is a model consisting of a set of states, with each transition between states associated with a probability. Sequences of any sort can be represented by first order Markov chains. Within one of these chains, a character in a sequence depends on the previous character, but not on the full series of previous characters in the sequence. For the purpose of biological sequence comparison, a HMM may contain several different states that infer biologically meaningful properties. For example, characters in a sequence may be either hydrophobic (H) or polar (P). The process of switching between these states is called a transition, and each transition is associated with a probability (tab), where a is the starting state and b is the finishing state. All transition probabilities from one state to all other possible states must sum to one. For example, a Markov chain that exists in the hydrophobic state H can either switch to a polar state P (tHP) or remain in the hydrophobic state (tHH). Whatever the values are for tHP and tHH, their sum must be equal to one.

In addition to this process, each state can emit one of the 20 amino acids for a protein sequence, and each emission is associated with a probability (eab ) where a is the state and b is the emitted character. The sum of the probabilities for all possible emissions in a given state must also sum to one. Only the emission characters (the single letter amino acid codes) of the HMM are observed; both the states and the


[Figure 1.9 (flow chart): an input/query sequence and the BLOSUM62 matrix seed an initial BLAST search of a protein sequence database; hits passing the E-value filter are collected into a multiple sequence alignment (purging highly similar sequences), converted to a position specific scoring matrix (PSSM), and the PSSM drives the next iterative search until no new sequence hits are found.]

Figure 1.9: A summary of the PSI-BLAST sequence database search method. The procedure starts by running BLAST for a query sequence against the sequence database, using a standard matrix (here BLOSUM62). Next, the PSSM, instead of the query sequence and the BLOSUM62 matrix, is used for the database search. A new PSSM is constructed in every cycle until no new sequences can be found, or a set number of cycles have been completed. A search cycle is called an iteration. See § 1.4.7 (page 42) for more details. Based on a figure taken from Müller (2002).


transitions between them are hidden. Therefore, the Markov chain is referred to as a hidden Markov chain. The dependency of a character in a sequence, on the previous character in the sequence, is actually the transition state between two emissions. Inferring a hidden state sequence (i.e. hydrophobic and polar) from a given amino acid sequence marks it with higher order biological information. As such, a HMM can be used to model the higher level biological characteristics of a protein family. Therefore, sequences generated randomly by that HMM should also be typical of that protein family. Sequences with a high probability of being derived from the HMM are likely to belong to the protein family that the HMM describes. These methods are highly effective in genome annotation (Krogh et al., 1994).

Figure 1.10 (page 46) represents a two-state HMM for hydrophobic and polar states, with the transitions between these respective states. The probability that the sequence ARDE is modelled via the state path HHPP is given by:

Pr(ARDE|HHPP) = eHA × (tHH × eHR) × (tHP × ePD) × (tPP × ePE)
              = 0.30 × (0.80 × 0.25) × (0.20 × 0.40) × (0.90 × 0.20)
              = 8.64 × 10^-4

where eab is the probability of emitting character b while in state a. The probability that the sequence can be modelled by the HMM is the sum of the probabilities for the sequence, given every possible combination of states. Dynamic programming is often used to find the optimal path through a HMM for a given input sequence, where the rows and columns of the programming matrix contain the sequence characters and the states respectively.
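Using the parameter values from Figure 1.10, this calculation can be reproduced directly. The brute-force sum over all state paths below is exponential in sequence length and is only practical for toy examples, which is exactly why dynamic programming (the forward algorithm) is used in practice.

```python
from itertools import product

# Two-state HMM from Figure 1.10: hydrophobic (H) and polar (P).
trans = {"H": {"H": 0.80, "P": 0.20}, "P": {"H": 0.10, "P": 0.90}}
emit = {"H": {"A": 0.30, "R": 0.25, "D": 0.05, "E": 0.01},
        "P": {"A": 0.02, "R": 0.05, "D": 0.40, "E": 0.20}}

def path_probability(seq, path):
    """Probability of emitting `seq` along a fixed state path, taking
    the first state as given (no initial-state term), matching the
    Pr(ARDE|HHPP) calculation above."""
    p = emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return p

def total_probability(seq, states="HP"):
    """Sum of path probabilities over every possible state path: the
    quantity the forward algorithm computes efficiently."""
    return sum(path_probability(seq, path)
               for path in product(states, repeat=len(seq)))

print(round(path_probability("ARDE", "HHPP"), 6))  # 0.000864
```

The total probability over all 2^4 = 16 paths is necessarily at least as large as the probability of the single HHPP path, and the ratio of the two indicates how dominant that path is.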

Homology based sequence searches using carefully crafted HMMs, to represent protein families, perform better than PSI-BLAST in detecting distant protein relationships (Park et al., 1998). However, systems that involve high quality HMMs also tend to use manual curation as part of their process (Bateman et al., 1999, 2002; Letunic et al., 2002; Gough & Chothia, 2002).


[Figure 1.10 (diagram): two states, H and P, with transition probabilities tHH = 0.80, tHP = 0.20, tPH = 0.10, tPP = 0.90; emission probabilities eHA = 0.30, eHR = 0.25, eHD = 0.05, eHE = 0.01, ... for state H and ePA = 0.02, ePR = 0.05, ePD = 0.40, ePE = 0.20, ... for state P.]

Figure 1.10: Schematic example of a two state hidden Markov model, to assign a residue in a protein sequence to either a hydrophobic (H) or a polar (P) state. The transition probabilities are represented by tab, where a is the starting state and b is the finishing state. The emission probabilities are represented by eab, where a is the state and b is the emitted character.

1.4.8.1 Profile HMMs

Krogh et al. (1994) introduced an HMM architecture capable of representing profiles of multiple sequence alignments. For each column of the multiple sequence alignment, a match state models the distribution of the residues that appear in the alignment column. An insert state and a delete state in each column allow for insertion of one or more residues between that alignment column and the next, or for deletion of the residue in the alignment column (see Figure 1.11, page 47).

Profile HMMs that model a protein or domain family, such as those used in Pfam (Bateman et al., 2004) and SMART (Schultz et al., 2004), usually derive the probabilities for e and t from multiple sequence alignments. An initial HMM may only model a limited number of relatively closely related protein family members. However, such HMMs can be iteratively refined in a similar way to the method PSI-BLAST uses to refine its profiles (Bateman et al., 1999). With each iteration, the HMM is able to model progressively more divergent members of the protein


[Figure 1.11 (diagram): a Begin state leads into a chain of match states M0-M3, with insert states I0-I4 and delete states D0-D3 above them, ending in an End state; arrows connect the states.]

Figure 1.11: The structure of a Profile HMM. The bottom row of squares represents the match states, which are used to model the columns of the multiple sequence alignment by emitting amino acid characters. In these states, the probability distribution is the frequency of the amino acids in the multiple sequence alignment. The second row (diamonds) represents the insert states, which are used to model highly variable regions in the alignment. They emit amino acid characters according to their own probability distributions. The top row of circles represents the delete states. These are a different type of state, called a silent or null state. They do not match any residues; they merely make it possible to jump over one or more columns in the multiple sequence alignment by emitting gap characters. The arrows between each of the states represent the transition probabilities for moving between the respective states.

family. The most commonly used profile HMM packages are HMMer (Eddy, 1998) and SAM (Hughey & Krogh, 1996); these programs construct, refine, and manage HMMs, and also search libraries of HMMs against a query sequence. They use a mixture of Dirichlet priors on most distributions to avoid over-fitting and to limit the number of free parameters (Sjölander et al., 1996).

Profile HMMs are similar to profiles in that they encode the probability of the occurrence of each of the 20 amino acids at each position in a query protein, based upon a multiple sequence alignment. However, instead of affine gap penalties, profile HMMs operate with position-specific gap penalties, based on the occurrence of insertions or deletions within the same multiple sequence alignment. Since HMM search tools do not use any prefiltering heuristic techniques, their power and sensitivity for finding remote homologues exceeds that of PSI-BLAST in most benchmarking experiments. However, HMMs need to be carefully constructed in order


to produce their best results and the majority of end users lack the expertise to do this effectively; as a result, PSI-BLAST tends to be the more favoured search tool. PSI-BLAST was the tool used for profile construction in this research.

1.5 Fold Recognition and Threading

In the past few years, it has become more common to see new structures, determined by X-ray crystallography or nuclear magnetic resonance (NMR), that belong to known folds but have no obvious sequence relationship to any known template protein. This has stimulated the development of techniques that use additional information, in conjunction with sequence data, in order to identify folds. Since the primary goal of such methods is to identify protein relationships through structural similarity, rather than sequence similarity, they are known as fold recognition methods.

Fold recognition was originally devised to recognise relationships between structural analogues, such as TIM barrels, but was later found to be very powerful for detecting remote homologies as well. The boundary between homology modelling (see § 1.4.1, page 23) and fold recognition is dynamic with no strict definition. This constitutes the protein structure prediction Twilight Zone, so called because there is no strict blanket level of sequence similarity at which homology modelling fails and it becomes necessary to resort to fold recognition in order to improve prediction reliability. Continuously improving sequence comparison methods, and increasing numbers of known protein structures in the PDB, have shifted this zone further and further down the sequence similarity scale. It is now possible to identify structures, which would have once required fold recognition to produce reliable models, using sequence comparison homology modelling. Presently, the Twilight Zone is estimated to lie around the 20% sequence similarity level (Jaroszewski et al., 2002).

Although no one technique currently leads the field, it should be noted that the


more successful fold recognition methods all use advanced sequence comparison or a combination of sequence and structural information (see § 1.6.3, page 72).

1.5.1 Threading

Threading was first inspired by the discovery of analogous proteins, which revealed that non-random primary structure similarity is not always necessary for secondary and tertiary structure similarity. At the same time, analyses of amino acid arrangements in space suggested a general tendency of given residue types to lie within certain distances from each other. These ideas were later developed into what are known today as pair potentials or contact potentials: interaction propensities between amino acids represented in the form of a substitution matrix (Sippl, 1990). Threading takes its name from the idea of conceptually threading a query protein sequence through the three-dimensional structure of a given template; as the query sequence incrementally slides through the template structure, the positions of its residues are translated into scores representing each residue's spatial environment (the residue fitness). The sum of these scores provides an overall measure of the quality of the final query model, according to the residue interaction scoring scheme. Extending this idea even further, by performing alignments of proteins based on tertiary structure, rather than primary structure, it became possible to construct N × 20 profiles (where N is the length of the protein sequence) reflecting interaction propensities for specific protein families.
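The scoring idea can be illustrated with a deliberately tiny sketch: a two-class "contact potential" (values invented for illustration, not taken from Sippl, 1990) scores a query sequence against a template's contact map.

```python
# Illustrative contact potential over residue classes:
# negative = favourable interaction, positive = unfavourable.
CONTACT_POTENTIAL = {
    frozenset(["hydrophobic"]): -1.0,           # hydrophobic-hydrophobic
    frozenset(["polar"]): -0.3,                 # polar-polar
    frozenset(["hydrophobic", "polar"]): 0.5,   # mixed contact
}

HYDROPHOBIC = set("AVLIMFWC")

def residue_class(aa):
    return "hydrophobic" if aa in HYDROPHOBIC else "polar"

def threading_score(query, contacts):
    """Score a query sequence threaded onto a template structure.
    `contacts` lists residue-index pairs (i, j) in spatial contact in
    the template. The frozen approximation would keep the template's
    own residues as interaction partners; here both partners are read
    from the query, as in the defrosted approximation."""
    score = 0.0
    for i, j in contacts:
        pair = frozenset([residue_class(query[i]), residue_class(query[j])])
        score += CONTACT_POTENTIAL[pair]
    return score

contacts = [(0, 4), (1, 3), (0, 2)]   # toy template contact map
print(threading_score("AVKLE", contacts))
```

Because every contact involving a shifted or gapped residue must be rescored, the full defrosted calculation repeats this loop for every candidate placement, which is the source of the severe computational cost discussed below.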

There was, and still is, one fundamental problem with the above procedure: when threading is performed between a query sequence and a template structure that share a high percentage of primary structure identity, it is likely that each evaluated residue in the template structure and query model will share similar environmental surroundings. As a result, when analysing the fitness of the query sequence to the template structure, the surrounding structural environments for each residue of the query can be kept identical to those observed in the template structure. This


threading method is known as the frozen approximation, and is as quick to perform as a sequence-profile comparison. However, since threading is designed to identify structural similarity between analogous proteins, or remote homologues where the primary structure similarity is very low, there is a greater tendency for each of the evaluated residues in the respective structures to be surrounded by different amino acids. As a result, it becomes necessary to use the defrosted approximation (Jones et al., 1992; Bryant & Lawrence, 1993) to update the surrounding amino acids of the template with the aligned amino acids of the query protein when recalculating the fitness of a given residue. This slows down the threading algorithm dramatically because, every time the query sequence is threaded further onto the template structure, the fitness of every residue in the alignment must be recalculated (since each residue has a corresponding contact potential with every other residue). When gaps are introduced into the threaded query sequence, the computational complexity of the algorithm increases exponentially. Use of the frozen approximation on analogous proteins produces poor models, but the defrosted approximation is prohibitively slow, taking hours for a single result. Use of the defrosted approach for large scale annotations or database searches is computationally infeasible; however, it can produce highly accurate models when the correct template is known (Moult et al., 1999).

Interactions between hydrophobic residues are easy to explain given the fact that such residues are expected to be packed together in the interior of a protein, forming a hydrophobic core. Similarly, hydrophilic residues tend to be in close proximity when jointly exposed to the aqueous solution surrounding the exterior of a protein. While these tendencies have an undoubted influence on the calculation of pair potentials, it was hoped that contact-based scoring systems would help to capture elusive information about other essential, specific residue interactions that help to shape the native structures of proteins. Pair potentials have been used in fold recognition with a substantial degree of success; however, some groups have failed to obtain any additional improvement in the quality of their recognition algorithms, through


using pair potentials, after considering predicted secondary structure (Sternberg, personal communication, 2002). These observations suggest that pair potentials offer a greater insight into secondary structure than they do into higher-level tertiary structure. At the same time, the use of predicted secondary structure in fold recognition algorithms has been rapidly increasing because of its ease of use and the improvement it provides (Rost, 2001).

1.5.2 Fold Recognition Using Profiles

Despite the fact that its slowness renders it computationally impractical for large-scale, exhaustive searches, threading experiments demonstrated that empirical prediction of protein structures was possible. This spurred on the development of new fold recognition methods that could be applied to the rapidly expanding amount of available genomic data. As the number of publicly available protein sequences increased, new light was shed on the concept of threading: if local structural environment within a protein could dictate a preference for certain positional mutations over others, then it would be theoretically possible to model these preferences within a primary structure profile if enough evolutionary data were available. It might even be possible to detect analogous relationships using profiles, the logic being that the distribution of residues occurring at key structural positions within a remote homology alignment may reflect the tendency of certain regions of a protein to possess particular properties; for example, a given family of proteins may have a tendency to possess a high density of hydrophobic residues within crucial core structural regions without being particularly discerning about which hydrophobic residues are used. In such a case, it may be possible to recognise a protein that belongs to the same fold, but not to the same superfamily because it, too, may have such a tendency. As the amount of genomic information available to the scientific community grew, the number of protein families of known structure, with insufficient numbers of sequentially diverse homologues to model position-specific mutational tendencies, began to fall. Threading techniques are now only necessary in the small number


of cases where there are not enough homologues to construct a profile for a given query protein.

1.5.2.1 2D-PSSMs

The idea of using predicted structural information, in order to increase remote homology detection, is well established (Fischer & Eisenberg, 1996). When constructing a databank of template proteins of known structure, each template entry is given an amino acid sequence, representing its primary structure, and another sequence, representing its secondary structure. Secondary structure sequences are one-dimensional strings of characters from a given alphabet (ranging from three letters to more than 10; Kabsch & Sander, 1983; Frishman & Argos, 1995) usually represented by `H' for helical regions, `E' for strand regions, and `C' for coil regions. In fold recognition techniques, this secondary structure data is used in conjunction with the primary structure data when determining the quality of a query-template (QT) match. This involves calculating an additional score, representing the measure of the secondary structure similarity. This is usually done in parallel with the primary structure comparison. The simplest way of comparing secondary structures is to align them using dynamic programming and calculate the alignment score based on a given secondary structure substitution matrix. Since the final structure of a given query protein is hardly ever known, it is necessary to predict its secondary structure. The exact method for calculating the secondary structure substitution matrix varies from technique to technique, but it generally involves a degree of trial and error, empirical determination, and educated guess-work. As with primary structure, secondary structure can also be represented as a PSSM (a 2D-PSSM), designed to reflect the position specific secondary structure substitution propensities of a given protein. In this case, each position along a protein sequence is given a series of scores to represent each possible secondary structure substitution at that point.
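The "simplest way" described above can be sketched as a standard Needleman-Wunsch global alignment over the three-letter alphabet; the substitution matrix and gap penalty below are illustrative values of the kind that would be determined empirically.

```python
# Illustrative secondary structure substitution scores over {H, E, C}
# (helix, strand, coil); real values are tuned empirically.
SS_MATRIX = {("H", "H"): 2, ("E", "E"): 2, ("C", "C"): 1,
             ("H", "E"): -2, ("H", "C"): -1, ("E", "C"): -1}
GAP = -2

def ss_score(a, b):
    # The matrix is symmetric; look up either ordering of the pair.
    return SS_MATRIX.get((a, b), SS_MATRIX.get((b, a)))

def align_secondary_structure(s, t):
    """Needleman-Wunsch global alignment score of two secondary
    structure strings, filling the dynamic programming matrix with
    the best of match/mismatch, gap-in-s, and gap-in-t moves."""
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * GAP
    for j in range(1, cols):
        dp[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(dp[i - 1][j - 1] + ss_score(s[i - 1], t[j - 1]),
                           dp[i - 1][j] + GAP,
                           dp[i][j - 1] + GAP)
    return dp[-1][-1]

# Predicted query string vs. observed template string.
print(align_secondary_structure("HHHHCCEEE", "HHHHCEEE"))  # 13
```

In a fold recognition pipeline this score would be computed in parallel with the primary structure comparison and combined into the overall query-template match quality.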

When used in fold recognition, secondary structure data appears to be the most


information-rich feature. However, other data can also be used to enhance the quality of a QT alignment. These include: solvent accessibility, residue hydrophobicity, residue size, and hydrogen-bonding capacity.

1.5.2.2 3D-PSSMs

Logic would suggest that a PSSM containing tertiary structural information (a 3D-PSSM) would enhance fold recognition accuracy even further. However, at this level, it becomes harder to accurately encode such variable data within a profile. The seminal work by Bowie et al. (1991) was the first to introduce the idea of using tertiary structure data in fold recognition (they referred to such data as three-dimensional profiles). Their method involved categorising each residue within a protein into an environment class based upon: (i) the total area of the side-chain that is buried by other protein atoms; (ii) the fraction of the side-chain area that is covered by polar atoms or water; and (iii) the local secondary structure. Using this technique, the three-dimensional structure of a protein could be converted into a one-dimensional string. This string was then aligned to a query protein primary structure, and an ideal score was calculated based on the preferences of each amino acid for different environmental classes. For example, it would be rare to find a charged residue buried in a non-polar environment.
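A toy rendition of this environment-class idea is sketched below; the class thresholds and preference scores are invented for illustration, the secondary structure component (iii) is omitted, and the classes are far simpler than those used by Bowie et al. (1991).

```python
# Illustrative amino acid preferences for each environment class
# (subset of residues only; values invented for this sketch).
PREFERENCE = {
    ("buried", "nonpolar"): {"L": 1.2, "K": -1.5, "D": -1.3, "S": -0.4},
    ("buried", "polar"):    {"L": -0.2, "K": 0.3, "D": 0.4, "S": 0.5},
    ("exposed", "polar"):   {"L": -0.8, "K": 0.9, "D": 0.8, "S": 0.3},
}

def environment_class(buried_area, polar_fraction):
    """Map a template residue's environment to a class label from
    (i) buried side-chain area and (ii) the fraction of the side
    chain covered by polar atoms or water. Thresholds are invented."""
    burial = "buried" if buried_area > 40.0 else "exposed"
    polarity = "polar" if polar_fraction > 0.5 else "nonpolar"
    if burial == "exposed":
        polarity = "polar"   # exposed residues contact water
    return (burial, polarity)

def fitness(sequence, environments):
    """Score a query sequence against the template's one-dimensional
    string of environment classes; e.g. a charged K buried in a
    nonpolar environment scores badly."""
    return sum(PREFERENCE[environment_class(area, polar)][aa]
               for aa, (area, polar) in zip(sequence, environments))

envs = [(60.0, 0.1), (10.0, 0.8), (55.0, 0.7)]
print(fitness("LKD", envs))  # L buried-nonpolar, K exposed, D buried-polar
```

The essential point is that the template's 3D structure is reduced to a 1D class string, so the comparison against a query sequence can reuse ordinary sequence alignment machinery.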

Compared to 1D-PSSMs, which are constructed according to multiple alignments of primary structure homologues, 3D-PSSMs are constructed from alignments based on structural superpositions between proteins that share the same SCOP superfamily but no discernible sequence similarity. A more detailed description of the 3D-PSSM construction can be found in § 1.5.5 (page 59). 3D-PSSMs provide a greater depth of information about the evolution of a given structural family, which can, subsequently, be used to increase the accuracy of fold recognition techniques that employ them. This is due to the fact that there are often many proteins within the same fold group (i.e. they share similar three-dimensional structure) that have no discernible amino acid sequence similarity. Therefore, it is possible to extend the


multiple sequence alignment for a template protein using structural alignments to more remote homologues. The resulting PSSM provides a greater coverage of protein homology search space and is able to detect homologies too remote for standard heuristic sequence search tools.

1.5.3 Critical Assessment of Techniques for Protein Structure Prediction (CASP) -- The Development of Fold Recognition

The last few decades have seen a plethora of techniques developed to tackle the problem of protein structure prediction. As a means of assessing the quality of the individual methods, the Critical Assessment of Techniques for Protein Structure Prediction (CASP -- http://predictioncenter.org/) meeting was founded, with the first meeting taking place in Asilomar, California, in 1994. Since then, the biennial evaluation has tracked the progress made by individual scientific teams, and within the field of protein structure prediction as a whole. The purpose of the CASP meeting is to simulate the real-world scenario of a blind prediction. During evaluation, each prediction method is presented with a series of amino acid sequences, the three-dimensional structures of which have recently been solved but not released into the scientific domain, and attempts are made to produce the most accurate models. Once all candidate models have been handed back to the meeting organisers, they are assessed for quality, according to the true tertiary structures of the query proteins.

The impetus behind orchestrating such an elaborate method of assessment arose in response to the complications faced when cross-validating systems that rely primarily on empirical information. When attempting to cross-validate a prediction system, the usual method is to test its inherent accuracy by removing any training data that may artificially sway its results. For example, when testing to see if a given system can accurately predict the three-dimensional structure of a query protein (Q), the automatic reaction is to remove any training data that may make the task too easy -- that is to say, all close (and the majority of distant) homologues are removed from the training data before the test is performed. However, by removing training data, the inherent accuracy of the system is reduced and, therefore, it is no longer certain that the results are a reliable representation of its capabilities. Similarly, comparison of independent benchmarks is unreliable since different groups tend to assess their systems using different testing sets. In the end, the dilemma of what to do in order to judge a variety of systems without giving unfair advantage by over-training, or unfair disadvantage by under-training, remains.

To circumvent this problem, the CASP meeting was devised to simulate how the different structure prediction methods would be used when made available to the scientific community, and to compare how accurately they would perform in a real-world situation (which is ultimately what structural biologists want to know). As a true blind trial, the CASP evaluation provides a powerful means of comparing the vast range of prediction techniques.

1.5.3.1 CASP1

The first CASP meeting, in 1994, demonstrated that recognition of correct folds for the threading targets was possible, as each target had its fold correctly identified by at least one group. However, the quality of alignments with those folds was generally quite poor (Lemer et al., 1995).

1.5.3.2 CASP2

CASP2, in 1996, had nearly 500 fold recognition predictions, compared to around 100 at CASP1 (Marchler-Bauer & Bryant, 1997), showing the increased interest in the field. The techniques used by the most successful groups were varied. However, it was clear that there had been dramatic progress since CASP1 in fold recognition as a whole (Levitt, 1997). Murzin & Bateman (1997) used a largely manual analysis


relying on sequence information, and any functional or experimental information available, to predict target structures. They used the SCOP database as a guide to structural classification and HMMs for detailed sequence analysis. Two other teams made extensive use of pair potentials. Flöckner et al. (1997) used ProFit (pair potentials used with dynamic programming; Lemer et al., 1995; Flöckner et al., 1995) to identify potential structures, purposely avoiding any multiple sequence alignment information, secondary structure predictions, or any other available data. Rice et al. (1997) used a combination of techniques in their recognition algorithm. Along with 3D-profiles and environmental profiles, they also made considerable use of secondary structure predictions in order to enhance their accuracy. The power of HMMs was demonstrated by the performance of two of the most successful groups. Karplus et al. (1997) used the SAM HMM software suite (Hughey & Krogh, 1996) to recognise folds, using mainly sequence-based information. Di Francesco et al. (1997) made further use of secondary structure by searching their database with sequence-based HMMs that were calibrated by aligning sequences of experimentally determined secondary structure states of protein family members.

1.5.3.3 CASP3

By CASP3, in 1998, it was apparent that threading methods were still the most successful fold recognition methods; however, they were coming under increasing pressure. Additionally, as threading methods had begun to incorporate more sequence information, and sequence-based techniques had begun to include more structural information, the approaches were starting to converge in terms of the results they produced (Murzin, 1999).

Jones et al. (1999) used GenTHREADER (Jones, 1999b) to successfully search for appropriate templates. GenTHREADER initially used a sequence-profile method to search its fold library, and then used THREADER (Jones, 1998) to refine and evaluate the final model. THREADER used a `double dynamic programming' threading approach (Jones et al., 1992), which was refined by using predicted secondary structure from PSIPRED (Jones, 1999a). Domingues et al. (1999) used ProFit (Flöckner et al., 1997; Sippl & Weitckus, 1992), along with several other additional assessment methods, to search their fold library. Panchenko et al. (1999) used sequence comparisons against a library of template profiles, combined with pair potential scores within the conserved core elements of the template structures. Their aim was to combine evolutionary background information with assessments of physical plausibility; their results showed considerable improvement over separate sequence-profile and pair potential threading systems. Ota et al. (1999) used a cooperative approach between several different predictors, using a combination of sequence-sequence and sequence-profile searches, enhanced by secondary structure predictions, to search template libraries. In addition, they used sequence motif searches and several different threading methods before deciding on their final submissions by committee. Koretke et al. (1999) utilised HMMs, built from PSI-BLAST searches, to search the PDB for potential templates; any potential template families were incorporated into the HMM for a more sensitive search. When this method failed to produce any confident hits, predicted secondary structure information was added into the search algorithm. Karplus et al. (1999) purposely used a sequence-only HMM approach to show that successful fold recognition could still be achieved using such techniques.

1.5.3.4 CASP4

The assessment of fold recognition methods at CASP4 showed that the field of fold recognition was growing rapidly and that the best-performing groups were making substantial progress (Sippl et al., 2001). There had been rapid advances in automated methods, and threading techniques, using pair potentials, were being outperformed by approaches that combined sequence information (i.e. profiles and HMMs) and structural information.

Bates et al. (2001) used a combination of 1D-PSSMs and 3D-PSSMs to produce some of the best automated predictions, as well as some of the most accurate manually curated submissions. For a full description of their method see § 1.5.5 (page 59).


Bonneau et al. (2001b) performed well using the Rosetta ab initio protocol (Simons et al., 1999), outperforming many of the other groups on many of the more difficult targets. Koretke et al. (2001) had found that much of their previous success, at CASP3 (Koretke et al., 1999), was largely due to their sequence-based methods; with the ever increasing number of known sequences available, they focused on template identification using intensive PSI-BLAST searches and used HMMs for aligning the target sequence to the template structure. Karplus et al. (2001) used two methods: SAM-T99 and SAM-T2K. SAM-T99 was a fully automated sequence-only HMM server, while SAM-T2K was similar to SAM-T99 but used the predicted secondary structure of the target protein and the known secondary structure of the template proteins in its HMM algorithm (plus additional manual inspection). The addition of secondary structure information produced a marked improvement in SAM-T2K compared to SAM-T99. Williams et al. (2001) used the FUGUE server (Shi et al., 2001), which utilised sequence-profile and profile-profile comparisons to search a fold library, as well as environment-specific substitution tables based on solvent accessibility, hydrogen bonds, and backbone conformation. Murzin & Bateman (2001) used the same techniques that they had used in CASP2 (Murzin & Bateman, 1997), while taking advantage of newly available known structures and sequences. Their success showed that such approaches, based on human expertise, were still comparable to highly automated methods.

1.5.4 Critical Assessment of Fully Automated Structure Prediction -- CAFASP

Following the first two meetings of CASP in 1994 and 1996, it became clear that there was a growing need to provide automated structure prediction to the scientific community. Up until this point, the CASP evaluation had allowed full human intervention in the process of protein structure prediction; this would usually increase its accuracy considerably. This practice was accepted as it drew the knowledge of individual scientific teams into the process and employed state-of-the-art technologies at the same time. However, there were concerns that this made it difficult to delineate the quality of the prediction algorithm and the quality of the expert knowledge being applied. Since human expertise cannot be recreated between laboratories, it was crucial to separate the human element from the algorithmic element in order for progress to be made in the development of high-quality prediction algorithms. Furthermore, the vast number of prediction requests received from the molecular biology community meant that the expert contribution could not be recreated on a sufficiently large scale, as the computational results required constant vetting by human experts. In order to provide a more reliable measure for the scientific community at large, a second evaluation was devised in 1998 to run alongside CASP3: the Critical Assessment of Fully Automated Structure Prediction (CAFASP -- http://www.cs.bgu.ac.il/~dfischer/CAFASP3/).

The goal of CAFASP is to evaluate the performance of fully automatic protein structure prediction servers available to the scientific community. In contrast to the normal CASP evaluation procedure, the CAFASP evaluation aims to judge how well automated prediction servers do without any intervention from human experts. The motivation behind this system is to compare the overall performance of automated methods, capable of processing many thousands of requests, without the human intervention allowed in CASP. Furthermore, additional servers (Cyber servers), such as LiveBench (http://bioinfo.pl/LiveBench/; Rychlewski et al., 2003), are able to continuously assess and reassess the quality of results produced by the prediction servers.

1.5.5 The `3D-PSSM' Server

By the fourth meeting of CASP, in 2000, the Imperial College Structural Bioinformatics Group (http://www.sbg.bio.ic.ac.uk/ -- then part of the Imperial Cancer Research Fund) had developed a fold recognition server `3D-PSSM' (Kelley et al., 2000) that used 1D and 3D-PSSMs to recognise remote homologues. Benchmarking of `3D-PSSM' demonstrated a marked improvement in performance compared to PSI-BLAST, showing its ability to recognise remote homologies missed by standard sequence alignment methods. `3D-PSSM' was found to be the best-performing fully automatic method for structure prediction, and the best-performing method for fold recognition. For a review of CASP4 see Murzin (2001).

1.5.5.1 The `3D-PSSM' Fold Library

Many of the fold recognition servers available at present work in a similar way to `3D-PSSM'. However, one of the key elements in the success of `3D-PSSM' was the data used in the construction of its template library, which was based upon a subset of the Structural Classification Of Proteins (SCOP) database (see § 1.3.1, page 20; Murzin et al., 1995; Hubbard et al., 1999), using representative protein structures from each functional superfamily. Each of these structures was processed to create four separate information files: (i) a secondary structure string; (ii) a solvent accessibility string; (iii) a 1D-PSSM; and (iv) a 3D-PSSM.

The secondary structure and solvent accessibility for each template protein were stored as data strings, with each character representing a particular value for a given residue. These data were determined using STRIDE (Frishman & Argos, 1995) and DSSP (Kabsch & Sander, 1983) respectively. Even though DSSP also determines secondary structure, STRIDE was chosen to carry out this function because of its consistency with human expert secondary structure determination (Frishman & Argos, 1995).
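The four per-template information files described above can be pictured as one record per fold-library entry. The following is a minimal illustrative sketch, not the actual `3D-PSSM' data format; all field and class names are hypothetical:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TemplateEntry:
    """One fold-library template and its four derived information files.

    Field names are illustrative, not taken from the `3D-PSSM' source.
    """
    scop_id: str
    ss: str             # per-residue secondary structure string, e.g. "CCCHHH" (STRIDE)
    accessibility: str  # per-residue solvent accessibility codes, e.g. "122499" (DSSP)
    pssm_1d: List[List[float]]  # L x 20 scores from PSI-BLAST homologues of the template
    pssm_3d: List[List[float]]  # L x 20 scores pooled across the SCOP superfamily

    def __post_init__(self):
        # All four data types must describe the same number of residues.
        length = len(self.ss)
        assert len(self.accessibility) == length
        assert len(self.pssm_1d) == length
        assert len(self.pssm_3d) == length

entry = TemplateEntry("d1abc__", "CCCHHH", "122499",
                      [[0.0] * 20 for _ in range(6)],
                      [[0.0] * 20 for _ in range(6)])
```

The consistency check in `__post_init__` reflects the requirement that the strings and PSSMs all describe the same template sequence.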

The amino acid sequence of each template protein (A0 in Figure 1.12, page 62) was scanned against a complete sequence databank using PSI-BLAST to collate a number of homologues (A0, A1, A2, etc.), which were then used to construct a 1D-profile and then a 1D-PSSM. The same SCOP superfamily, which contained A0, also held a number of protein sequences (B0, C0) that shared a common three-dimensional structure with A0 but no discernible sequence similarity. 1D-PSSMs
were constructed for B0 and C0 in the same way as for A0. The members of the superfamily were structurally superposed and hierarchically ordered, and a multiple alignment of all superfamily members (and their close homologues) was built. The 1D-profiles of B0 and C0 were aligned to the 1D-profile of A0, according to the structural superpositions within the superfamily multiple alignment, and then used to generate a 3D-profile for A0. Finally, the 3D-profile was converted to a 3D-PSSM and stored. This procedure was repeated to generate secondary structure data, solvent accessibility data, a 1D-PSSM, and a 3D-PSSM for every template in the library.

1.5.5.2 Query Data

To further enhance the search space coverage of `3D-PSSM', two extra pieces of information were used for every submitted query protein. Firstly, the amino acid sequence of a given query protein (Q0 in Figure 1.12, page 62) was used to construct a 1D-PSSM in the same way as for the template proteins (see § 1.5.5.1, page 60). Secondly, a secondary structure prediction for the query protein was made using the neural-network-based program PSIPRED (Jones, 1999a).

1.5.5.3 Scanning the `3D-PSSM' Fold Library

Another key element in the success of `3D-PSSM' was the nature of the functions used to assess the final score of a given QT match. In order to enhance the quality of the results in `3D-PSSM', each query protein was aligned using dynamic programming against every template protein in the fold library. The score for aligning a residue (Qi) in the query with a residue (Tj) in a template was calculated as the sum of scores contributed from secondary structure matches, solvation potential matches, and a PSSM alignment score. The ideal score was found using a three-pass approach whereby the query sequence was first aligned to the template, using the 1D-PSSM of the library entry, then the 3D-PSSM of the library entry, and, finally, the process was reversed and the template sequence was aligned to the 1D-PSSM of the query (bi-directional scoring). The procedure for secondary structure matching and solvation potential matching was carried out in parallel with these alignments and was the same throughout. The highest score of the three passes was used as the final result for the specific QT match. This procedure was repeated against all the templates in the fold library, the scores assigned a level of statistical significance, and the top answers displayed.

Figure 1.12: A diagram of the flow of information in the `3D-PSSM' system. For each master protein (A0) in the structural library, four types of information are derived: (1) solvent accessibility; (2) three-state secondary structure; (3) a 1D-PSSM of homologues to A0 (A1, A2) found using PSI-BLAST; and (4) a 3D-PSSM from the multiple structural alignment of SCOP-derived structural homologue 1D-PSSMs. For the query sequence, two further types of information are derived: (1) a 1D-PSSM of homologues found with PSI-BLAST; and (2) a secondary structure prediction made using PSIPRED. All of these types of information are combined in a three-pass dynamic programming algorithm, and in the resulting score. See § 1.5.5.3 (page 61) for a detailed explanation. Dotted lines indicate the flow of information during bi-directional scoring, where a library sequence is matched to a query PSSM. Based on a figure taken from Kelley et al. (2000).
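The per-residue score and the best-of-three-passes selection described above can be sketched as follows. This is an illustrative toy, not the published implementation: the weights and function names are hypothetical, and the published method combines the terms inside a full dynamic programming alignment.

```python
def residue_score(q_ss, t_ss, q_acc, t_acc, pssm_score,
                  ss_bonus=1.0, acc_bonus=0.5):
    """Toy per-cell score for aligning query residue Qi with template residue Tj:
    the PSSM score plus secondary structure and solvent accessibility match
    terms.  The bonus weights here are illustrative, not the published values."""
    score = pssm_score
    if q_ss == t_ss:      # secondary structure states agree
        score += ss_bonus
    if q_acc == t_acc:    # solvent accessibility classes agree
        score += acc_bonus
    return score

def best_of_three_passes(pass_scores):
    """`3D-PSSM' keeps the highest score of its three alignment passes
    (query vs template 1D-PSSM, query vs 3D-PSSM, and the reversed
    template-vs-query-PSSM pass)."""
    return max(pass_scores)
```

For example, `best_of_three_passes([12.1, 15.3, 9.8])` keeps the second pass's score as the final result for that QT match.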

The success of `3D-PSSM' at CASP4 can be credited to several factors in its design: good engineering of individual components, a coherent strategy backed up by a robust theoretical foundation, and a continuously updated fold library utilising the most recent structural data in order to produce the most cutting-edge predictions.

1.6 CASP5 -- Fold Recognition with Ensemble Systems

The CASP5 meeting, in 2002, highlighted new trends in the field of protein structure prediction (Valencia, 2003), in particular, the emergence of Meta servers (Bujnicki et al., 2001). By combining the results of many different stand-alone servers (including `3D-PSSM'), Meta servers attempt to mask the shortcomings of individual methods by drawing on the strengths of the group as a whole. Other servers tested at CASP5 extended this idea by mixing and recombining the results of the Meta servers to further enhance prediction accuracy (Meta N servers). The overall result was that many of the better stand-alone servers became victims of their own success by being assimilated into various Meta servers.

1.6.1 Meta Servers

The first fully automated Meta server, Pcons (Lundstrom et al., 2001), worked by collecting the outputs of six different, publicly available protein fold recognition servers. It used a set of neural networks to predict the quality and accuracy of all the collected models. Even though Pcons was specifically trained to predict the quality of the final models, rather than whether or not they were of the correct fold, it did allocate higher final scores to folds that were predicted by more than one server. All Meta servers made available since then work on a similar basis; they select their final answer from a set of results, using a consensus approach. The strength of Meta servers lies in the theory that mistakes in predicted models are likely to be random, whilst accurate models will occur at a frequency greater than random.
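The consensus principle can be sketched very simply: reward each model for every structurally similar model returned by a different server. This is a minimal sketch under stated assumptions (the `similar` predicate stands in for a real structural comparison such as a MaxSub score above a threshold; all names are hypothetical):

```python
from collections import defaultdict

def consensus_scores(models, similar):
    """Score each (server, model) pair by the number of models from *other*
    servers that are structurally similar to it, under some pairwise
    similarity predicate `similar`.  Correct folds should recur across
    independent servers, while random errors should not -- the
    `noise filtering' effect of the consensus approach."""
    scores = defaultdict(int)
    for server_a, model_a in models:
        for server_b, model_b in models:
            if server_a != server_b and similar(model_a, model_b):
                scores[(server_a, model_a)] += 1
    return scores

# Toy example: models are labelled directly by their (hidden) fold.
models = [("s1", "foldA"), ("s2", "foldA"), ("s3", "foldB"), ("s4", "foldA")]
scores = consensus_scores(models, lambda a, b: a == b)
```

Here the recurring "foldA" models reinforce each other, while the isolated "foldB" prediction receives no support.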

There are many issues that must be addressed in the development of a Meta server. First and foremost, any structural comparisons must be quick (see § 1.6.2, page 65). Collecting the top n results from m individual servers requires ((nm)² - nm)/2 separate comparisons, so doubling the number of models per server (or the number of servers used) increases the number of comparisons approximately fourfold. Secondly, when no structural grouping can be found between models it may be advantageous to consult the scores assigned to each model by their respective prediction method; however, since different scoring schemes and structural libraries are used by different servers, it is unlikely that such scores will be comparable. Thirdly, if scores are not comparable, it becomes necessary to adopt a server-specific protocol in order to normalise the values. Such systems are often weak because of server dependency; if any one server suddenly becomes unavailable, the results obtained from the remaining servers may be inaccurately skewed by the normalisation process. Some protocols try to make allowances for such occurrences (e.g. Pcons), while others choose to ignore the server-assigned score altogether (e.g. 3D-JURY; see § 1.6.3.2, page 76). Finally, the end result from the Meta server will either be one of the initial models used in the calculation (such a system is used in both Pcons and 3D-JURY), or further modifications will be made before it is returned to the user. The 3D-SHOTGUN server (Fischer, 2003) combines fragments from the initial models based upon the clustering of individual residues during structural superposition. The Robetta server (Chivian et al., 2003) takes a given Meta server result and uses an ab initio protocol (Rosetta; Simons et al., 1999) to restructure various regions using fragment assembly (see § 1.6.3.3, page 77). Usually it is technical aspects, such as time delays between results obtained from external servers, that limit the overall complexity of a Meta server system.
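The quadratic growth in comparison counts can be checked directly: all-against-all comparison of the n x m collected models is just the number of distinct pairs. A minimal sketch (function name is illustrative):

```python
def pairwise_comparisons(n_models, n_servers):
    """Number of pairwise structural comparisons needed when the top n models
    are collected from each of m servers: every distinct pair of the n*m
    models must be compared once, i.e. ((nm)^2 - nm) / 2."""
    total = n_models * n_servers
    return total * (total - 1) // 2

assert pairwise_comparisons(10, 6) == 1770
# Doubling the models per server roughly quadruples the work:
assert pairwise_comparisons(20, 6) == 7140
```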

Some groups have developed the necessary components of a Meta server in-house. The Shotgun-INBGU server (Fischer, 2003) uses a Meta predictor layer on top of the five prediction components of the original INBGU server (Fischer, 2000): a sequence-sequence comparison; a comparison between a template sequence and a multiple sequence alignment of a query (constructed using PSI-BLAST); a sequence-profile comparison; a profile-sequence comparison; and a comparison between a template profile and a multiple sequence alignment of a query (constructed using PSI-BLAST). Meta-BASIC (Bilaterally Amplified Sequence Information Comparison; Rychlewski et al., 1998; Ginalski et al., 2004) uses two search algorithms to perform gapped alignments using meta-profiles (see Figure 2.1(a), page 102), similar to those used by ORFeus (Ginalski et al., 2003). A meta-profile adds predicted secondary structure preferences as additional parameters to the PSSM. The secondary structure predictions are based solely on the sequence profiles themselves, so no actual higher level structures are needed; however, they do seem to improve the overall accuracy of the algorithm. The primary advantage of in-house development is that it grants the developer complete control over all aspects of the process: not only can results from individual methods be rescaled and standardised, but template libraries can be tailored for consistency, and breakages in the pipeline (e.g. a server failing) can be fixed as and when needed.
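The meta-profile idea, appending predicted secondary structure preferences to each PSSM column, can be sketched as follows. This is an illustrative sketch only; the exact encoding and scaling used by ORFeus/Meta-BASIC may differ, and the function name is hypothetical:

```python
def meta_profile(pssm_rows, ss_probs):
    """Augment each PSSM row (20 amino acid scores) with the predicted
    three-state secondary structure probabilities (helix, strand, coil) for
    that position, giving a 23-dimensional meta-profile column."""
    assert len(pssm_rows) == len(ss_probs)
    return [list(row) + list(probs) for row, probs in zip(pssm_rows, ss_probs)]

# Two positions: one predicted mostly helical, one mostly strand.
profile = meta_profile([[0.1] * 20, [0.2] * 20],
                       [(0.7, 0.1, 0.2), (0.1, 0.8, 0.1)])
```

Two such meta-profiles can then be compared column against column during gapped alignment, so that secondary structure agreement contributes to the match score alongside the amino acid preferences.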

1.6.2 Evaluation of Fold Recognition Predictions

In order to keep the distinction between the different protein structure prediction disciplines, the evaluation at CASP5 was divided into three major categories: Comparative or Homology Modelling (CM), Fold Recognition (FR), and New Fold methods (NF), which were previously categorised as ab initio methods. Originally, the assessment categories were named after the most effective types of techniques used to generate structure predictions for the target proteins (the query proteins whose structures are unknown to the evaluation participants). However, more recently, the distinctions separating these categories have blurred as prediction methods have improved. For example, fold recognition target proteins have traditionally included domains that fell in between the more clearly defined comparative modelling targets (displaying sequence similarity to known folds) and the new fold targets (displaying no structural similarity to known folds), with some overlap (Kinch et al., 2003b). However, due to the increasing power of sequence similarity detection methods, the distinction between comparative modelling and fold recognition domains has become harder to define.

The target proteins used in the CASP evaluation are now classified on the basis of the degree of sequence and structural similarity to known folds; as a result, the assessment categories have become more like graduation points along a sliding scale than distinctive subgroups. To reflect these changes, the FR category has been further subdivided into homologues (FR(H)) and analogues (FR(A)) based on evolutionary considerations, and the overlaps between assessment categories are classified as CM/FR(H) and FR(A)/NF. A full description of how the target proteins were classified for CASP5 can be found in Kinch et al. (2003a).

In order to evaluate the overall quality of a set of structure predictions for a given target protein, each prediction must be compared with the experimentally determined structure by considering both structural similarities and alignment quality. There are many varied programs that can be used to compare a protein model with a native structure. Table 1.1 provides a short description of some of the most frequently used assessment methods. Most of the methods use rigid body superposition algorithms to find the best structural alignment in a manner that is either dependent on or independent of sequence. Sequence-independent methods ignore the identities of the residues in the predicted model, thereby ignoring potential alignment errors between the query and the template, and concentrating on the assessment of the general shape of the model. This is essentially the same as verifying that the architecture and topology of the template are an accurate match to the native structure of the query protein. Such methods are computationally intensive, since they must scan all possible residue-residue superpositions between the model and the native structure. Sequence-dependent methods perform similar comparisons but make their assessments based upon residue equivalences between the model and the native structure. Since no single standard measure to judge these comparisons exists, a combination of various structural comparison methods was used by the organisers of CASP5.
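The sequence-dependent case is the cheap one: the residue equivalences are fixed by the alignment, so a single optimal rigid-body superposition (here the standard Kabsch algorithm) suffices before computing an RMSD. A minimal sketch, assuming coordinates are supplied as N x 3 NumPy arrays with row i of the model equivalent to row i of the native structure (the only non-standard-library dependency is NumPy):

```python
import numpy as np

def kabsch_rmsd(model, native):
    """Sequence-dependent comparison: residue equivalences come from the
    alignment, so one rigid-body superposition (Kabsch algorithm) is computed
    and the C-alpha RMSD returned.  Sequence-independent methods must instead
    search over possible residue-residue correspondences, which is far more
    expensive."""
    p = model - model.mean(axis=0)          # centre both coordinate sets
    q = native - native.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)       # SVD of the covariance matrix
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against improper rotations
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = (rot @ p.T).T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))
```

Rotating and translating a structure and superposing it back onto itself should give an RMSD of essentially zero.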

Table 1.1: Selected evaluation measures used to assess the quality of 3D models. Based on a table taken from Ginalski et al. (2005).

GDT TS (Global Distance Test) (Zemla et al., 1999b) performs sequence-dependent or sequence-independent superposition of the model and the native structure and calculates the number of structurally equivalent pairs of C-alpha atoms that are within a specified distance d. The GDT TS score is the average of four scores obtained with d = 1, 2, 4, 8 Å, divided by the number of residues of the target. Despite being slow, GDT TS is the standard measure used in CASP, but it is not part of the LiveBench evaluation.

LG-score (Levitt & Gerstein, 1998) superimposes the model with the native structure to maximise the Levitt-Gerstein score, as in MaxSub (see below). The final score is translated into a P-value (see Appendix A.1, page 231), which estimates the chance of obtaining this score given the length of the model. LG-score can operate in sequence-dependent and sequence-independent modes; the second is much slower. Because of limited computational resources, it has been removed from standard LiveBench evaluations.


MAMMOTH (Ortiz et al., 2002) first computes the optimal similarity of the local backbone chains to establish residue correspondences between residues in both structures. In the second step, the largest subset of residues found within a given distance threshold is calculated with MaxSub (see below). This sequence-independent structural similarity is translated into P-values.

MaxSub (Siew et al., 2000) identifies the largest subset of C-alpha atoms of a model that superimpose well (below 3.5 Å) over the experimental structure. MaxSub calculates a variant of the Levitt-Gerstein score (Levitt & Gerstein, 1998), which equals 1/[1 + (d/3.5 Å)²], summed over all superimposed pairs of C-alpha atoms and divided by the number of residues in the target. MaxSub is the official CAFASP evaluation method.

3D-score (Rychlewski et al., 2003) optimises the sum of exp[-ln(2) × (d/3 Å)²], where d is the distance between the superimposed C-alpha atoms. This sum behaves very similarly to the score used in MaxSub or LG-score, but it has no cutoff value and it decays faster with higher distance. The final score is not divided by the length of the target.

CA-atoms < 3 Å (Rychlewski et al., 2003) returns the maximum number of atoms within 3 Å after the superposition generated by optimisation of the 3D-score. This very simple measure shows good performance in distinguishing biologically relevant predictions and is very intuitive and easy to understand.

Q(CA-atoms < 3 Å) is aimed at evaluating the specificity of the alignment and penalises incorrectly aligned sections of the models. It is equal to the square of CA-atoms < 3 Å divided by the number of residues in the model. This is the only measure used in LiveBench that penalises overpredictions (overly long alignments). Servers that always return coordinates for all residues of the target perform worse than if evaluated with other measures.

Contact(A&B) (Rychlewski et al., 2003) calculates the distance map overlap between the model and the native structure. The calculation is performed in a sequence-dependent manner and no rigid body superposition is required. Two ways to normalise the overlap are used, resulting in two scores: Contact(A) and Contact(B). These are the only contact measures used in LiveBench.


Methods performing sequence-independent superposition (the first three) are relatively slow and are not used in current LiveBench experiments. Only one measure, Q(CA-atoms < 3 Å), penalises wrong parts of models. All methods, except the contact measure Contact(A&B), conduct rigid body superposition. The contact measure can handle the evaluation of multiple domains. GDT TS and MaxSub divide the score by the size of the target. MAMMOTH and LG-score estimate the probability of non-random structural similarity expressed as E-values. The scores of the others are proportional to the size of the model.

The Global Distance Test Total Score (GDT TS) and the Segment Overlap Measure Observed score (SOV O) are two sequence-dependent (i.e. aligned according to residue equivalence) techniques that have been developed over the course of the CASP evaluations (Zemla et al., 1999b, 2001). The GDT TS represents the average percentage of residues that can be superimposed within a given distance over four optimal sequence-dependent superpositions (1, 2, 4, and 8 Å). GDT TS analysis has been generally accepted by the structural prediction community and has provided the basis for previous CASP assessments (Kinch et al., 2003b). Although the GDT TS reflects the overall tertiary structural quality of a model prediction, the SOV O score provides an assessment of prediction quality that depends on segment-based evaluation of secondary structure (Zemla et al., 1999a). These two scores provide alternative automated evaluations of CASP predictions that address both global (GDT TS) and local (SOV O) aspects of model quality.
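Given the per-residue C-alpha distances from one superposition, the GDT TS average is straightforward to compute; the search for the optimal superposition at each cut-off is omitted from this illustrative sketch (the function name is hypothetical):

```python
def gdt_ts(distances, target_length):
    """GDT TS from per-residue C-alpha distances (in angstroms) after an
    optimal superposition: the average, over cut-offs of 1, 2, 4 and 8
    angstroms, of the fraction of target residues within that cut-off,
    expressed as a percentage."""
    fractions = [sum(1 for d in distances if d <= cut) / target_length
                 for cut in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * sum(fractions) / 4.0

# A near-perfect model of a 4-residue target scores 100:
assert gdt_ts([0.2, 0.3, 0.1, 0.4], 4) == 100.0
```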

Three recognised structural evaluation methods from other laboratories were also used as part of the assessment. These were: DALI (http://www.ebi.ac.uk/dali/; Holm & Park, 2000; Holm & Sander, 1993), CE (http://cl.sdsc.edu/ce.html; Shindyalov & Bourne, 1998), and MAMMOTH (http://fulcrum.physbio.mssm.edu:8083/mammoth/; Ortiz et al., 2002). Both DALI and CE compare intramolecular C-alpha geometries (distances and angles respectively) of a target structure with those of a predicted structure. Even though they use different procedures to generate optimal sequence-independent structural alignments, they both define the quality of the results in terms of a Z-score (i.e. how far, and in what direction, a value deviates from its distribution's mean, expressed in terms of its distribution's standard deviation). MAMMOTH is a relatively new structural comparison method developed to evaluate structural similarities at the fold level (ideal for scoring target domains classed within the fold recognition category). It produces a sequence-independent structural alignment based on unit-vector root-mean-square distances (i.e. geometrical distances between arbitrary points along a target structure and a predicted structure, regardless of their residue position), and associates a statistical significance value with the similarity score produced. This technique allows for structural comparisons at a more general level than DALI or CE do.
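The Z-score definition in the parenthesis above is just the standardised deviation from the mean, as this small sketch shows:

```python
from statistics import mean, pstdev

def z_score(x, values):
    """How far, and in which direction, x deviates from the mean of `values`,
    expressed in units of the population standard deviation."""
    return (x - mean(values)) / pstdev(values)
```

A raw alignment score exactly one standard deviation above the mean of the background score distribution therefore receives a Z-score of 1.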

The final structural evaluation technique used at CASP5 was a sequence-dependent contact distance scoring method. This method was based on the similarities between intramolecular C-alpha contact distances of the target and predicted model structures, and was partially dependent on the GDT TS.
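A contact-based comparison of this kind can be sketched as a distance-map overlap, computed sequence-dependently and requiring no rigid-body superposition. The 8 Å cut-off, the sequence-separation rule, and the function names below are illustrative assumptions, not the CASP5 protocol:

```python
def contact_map(coords, cutoff=8.0):
    """Set of residue pairs (i, j), with j > i + 2, whose C-alpha atoms lie
    within `cutoff` angstroms of each other."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 3, len(coords)):
            d = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])) ** 0.5
            if d <= cutoff:
                contacts.add((i, j))
    return contacts

def contact_overlap(model, native, cutoff=8.0):
    """Sequence-dependent distance-map comparison: the fraction of native
    contacts that the model reproduces.  No superposition is needed because
    only intramolecular distances are compared."""
    m, n = contact_map(model, cutoff), contact_map(native, cutoff)
    return len(m & n) / len(n) if n else 0.0
```

A model identical to the native structure reproduces every native contact and scores 1.0.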

To evaluate the alignment quality of fold recognition predictions, the CASP assessors used four sequence-independent structural superposition methods. These were: DALI, CE, MAMMOTH, and the Local-Global Alignment (LGA; Zemla, 2003). The quality of alignments produced by these four structural superposition methods was scored independently on the basis of the percentage of correctly aligned residues. Therefore, the overall evaluation of each model included scores generated by six different structural measures and four different alignment measures for every fold recognition target prediction.

Predictors were allowed to submit up to five models for each target protein, and final evaluations were made according to the first models (i.e. the models that the predictors classed as their best results), and the best models (i.e. the models that the assessors deemed to be the best-quality -- defined as the model with the highest
score for a given measure from the 10 described above). Overall rankings appeared to be sensitive to whether the first or best models were used in the evaluation (Kinch et al., 2003b).

1.6.2.1 MaxSub and TM Score

Two other noteworthy structural assessment methods, not used at CASP5, are the MaxSub score (Siew et al., 2000) and the template modelling (TM) score (Zhang & Skolnick, 2005), both of which are sequence-dependent. MaxSub is the official evaluation method of CAFASP (see § 1.5.4, page 58) and calculates a variant of the Levitt-Gerstein (LG) score (Levitt & Gerstein, 1998). It works by attempting to identify the maximum substructure in which the distances between equivalent residues of two structures, after superposition, are below some threshold value, usually 3.5 Å. The final MaxSub score is a normalised value between zero and 1. Since the MaxSub scoring function only counts those residues included in the substructure, the spatial information of the templates outside the substructure is omitted.

The TM score is an expansion of the MaxSub and GDT TS scores. It uses a variation of the LG score (Levitt & Gerstein, 1998) and produces a value between zero and 1, with better models having higher TM scores:

    TM score = max [ (1/LN) × Σ(i=1..LT) 1 / (1 + (di/d0)²) ]        (1.6)

where LN is the length of the native structure, LT is the number of residues aligned to the template structure, di is the distance between the ith pair of aligned residues, and d0 is a scale to normalise the match difference. The `max' denotes the maximum value after optimal spatial superposition. A similar formula is used in MaxSub, but the summation is limited to those residues with di < d0. Rather than setting a specific distance cut-off, and calculating only the fraction of aligned
residues with distances below that cut-off, all residue pairs in the structural alignment are evaluated in the TM score. One of the major motivations behind the development of the TM score was to rescale the structure modelling errors so that the evaluation score was independent of protein size. A protein size-dependent scale, d0, is used to eliminate the inherent protein size dependence of MaxSub. Both MaxSub and the TM score were frequently used in this research.
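The two scoring functions differ only in whether the LG-style term is summed over all aligned pairs or only over those within the cut-off. The following sketch computes both for a single given superposition; in practice the scores are maximised over superpositions and d0 depends on protein size, so both are fixed parameters here for simplicity (function names are illustrative):

```python
def tm_score(distances, native_length, d0):
    """TM score for one superposition: the LG-style term 1/(1 + (d/d0)^2)
    summed over *all* aligned residue pairs, normalised by the native length."""
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / native_length

def maxsub_like(distances, native_length, d0=3.5):
    """MaxSub-style variant: the same term, but summed only over pairs closer
    than the cut-off d0, so spatial information outside the identified
    substructure is ignored."""
    return sum(1.0 / (1.0 + (d / d0) ** 2)
               for d in distances if d < d0) / native_length
```

A perfectly superposed model scores 1 under both functions, while badly misplaced residues contribute a little to the TM score and nothing at all to the MaxSub-style score.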

1.6.3 The State-of-the-Art -- the Results of CASP5 for Fold Recognition

Several different rankings were calculated for the groups participating in CASP5's fold recognition assessment. Based on the combined scores of the 10 evaluation methods, for the first models for all defined fold recognition domains (46 domains in total), the results of three groups stood out as noticeably better than those of all the other participants. These were (in descending order) the Baker group (see § 1.6.3.1, page 75; Bradley et al., 2003), the Ginalski group (see § 1.6.3.2, page 76; Ginalski & Rychlewski, 2003), and the Rychlewski group (see § 1.6.3.2, page 76; von Grotthuss et al., 2003). Fourth place, at some distance, went to the Robetta group (see § 1.6.3.3, page 77; Bradley et al., 2003), which used the automated Robetta server.

The rankings changed when considering the mean of the first model scores, or the combined and mean scores of the best model predictions.

The highest score for the combined best model predictions was, again, achieved by the Baker group. However, they were closely followed by the Skolnick group (see § 1.6.3.5, page 82; Skolnick et al., 2003), who were distantly followed by the Ginalski group. Interestingly, the three automated servers included in the overall top 20 all ranked highly in the combined best model predictions: the Pmodel3 group placed third alongside the Ginalski group; the Robetta group placed a respectable sixth;


and the Pmodel group placed seventh.

The Venclovas group and the Murzin group, who did not rank highly in the first model combined scores list, emerged in the top 20 when the first model mean score was used (positions 1 and 5 respectively). The Baker and Ginalski groups occupied second and third ranking positions, respectively, behind the Venclovas group. Both the Venclovas and Murzin groups failed to place in the first model combined score rankings because they predicted fewer target domains than the other groups (the Venclovas group with 7 domains and the Murzin group with 22 domains, out of a possible 46).

In the best model mean score rankings, the Baker group were ranked first while the Skolnick group ranked third.

Many of the top performing fold recognition groups also performed well in the comparative modelling and the new fold categories. The strengths of the individual groups were emphasised when the fold recognition domains were divided into homologues (30 domains) and analogues (16 domains) of known folds. The top-achieving groups for fold recognition of homologues performed well in the comparative modelling assessment (Tramontano & Morea, 2003). These were the Ginalski and Rychlewski groups, and the Bujnicki group (see § 1.6.3.4, page 80; Kosinski et al., 2003). For fold recognition of analogues, the Baker group outscored all other methods. Notably, the automatic server of the Robetta group (developed by the Baker laboratory) also ranked an impressive third and fifth, for best models and first models respectively, for fold recognition of analogues.

When comparing the structural and alignment quality of target models for individual groups, the same groups tended to perform better than others: using structural measures, the Ginalski, Baker and Rychlewski groups ranked highest; using alignment measures, the Baker and Ginalski groups ranked highest.


Table 1.2: The top 20 predictors from CASP5 ranked by combined scores (all domains). Based on a table taken from Kinch et al. (2003b).

| Rank First (sum)^a | Rank Best (sum) | Rank First (mean)^b | Rank Best (mean) | Group | Predictor^c    | Predictions scored |
|--------------------|-----------------|---------------------|------------------|-------|----------------|--------------------|
| 1                  | 1               | 2                   | 1                | 2     | Baker          | 46                 |
| 2                  | 3               | 3                   | 6                | 453   | Ginalski       | 46                 |
| 3                  | 8               | 6                   | 13               | 6     | Rychlewski     | 46                 |
| 4                  | 6               | 10                  | 12               | 29    | Robetta (S)    | 46                 |
| 5                  | 13              | 7                   | 14               | 517   | Bujnicki       | 43                 |
| 6                  | 2               | 12                  | 3                | 10    | Skolnick       | 46                 |
| 7                  | 25              | 14                  | 32               | 96    | Bates          | 46                 |
| 8                  | 18              | 17                  | 23               | 427   | Fischer        | 46                 |
| 9                  | 11              | 4                   | 5                | 20    | Bujnicki       | 33                 |
| 10                 | 23              | 13                  | 21               | 110   | Honig          | 36                 |
| 11                 | 16              | 15                  | 16               | 28    | Shi            | 39                 |
| 12                 | 5               | 22                  | 9                | 12    | Xu             | 46                 |
| 13                 | 14              | 19                  | 22               | 450   | Labesse        | 42                 |
| 14                 | 9               | 27                  | 19               | 373   | Brooks         | 46                 |
| 15                 | 31              | 18                  | 28               | 153   | Takeda-Shitaka | 39                 |
| 16                 | 17              | 31                  | 29               | 112   | Friesner       | 46                 |
| 17                 | 45              | 32                  | 72               | 67    | Jones          | 46                 |
| 18                 | 20              | 29                  | 31               | 1     | Karplus        | 46                 |
| 19                 | 7               | 23                  | 10               | 40    | Pmodel (S)     | 46                 |
| 20                 | 3               | 21                  | 8                | 45    | Pmodel3 (S)    | 46                 |

^a Sum refers to combining target scores by summation. ^b Mean refers to combining target scores by averaging Z-scores. ^c Automatic servers are indicated with (S) following the predictor name.

Overall, other than the exceptions in the first model mean score rankings, the rankings produced by each method followed the same general pattern; most of the top performing groups predicted all 46 target domains. The outstanding groups in the field were the Baker, Ginalski, Rychlewski, Robetta, Bujnicki, and Skolnick groups. A summary of the rankings is listed in Table 1.2 (page 74).


1.6.3.1 Rosetta -- the Baker Group

The Baker group used the Rosetta fragment insertion protocol as the basis for their human-assisted predictions. Rosetta was originally developed as a solution to the problem of de novo protein structure prediction (Simons et al., 1997, 1999; Bonneau et al., 2002). Later, it was extended to model evolutionarily variable regions (e.g. extended loops, domain insertions, and N- and C-terminal extensions) within models of template proteins built using established comparative modelling methods (i.e. homologous structure information). The Rosetta method of de novo protein structure prediction is based on the assumption that the distribution of structural conformations possible for each three- and nine-residue protein segment is reasonably well approximated by the distribution of structures adopted by identical protein segments (and closely related sequences) in proteins of known structure. For the purposes of prediction, a fragment library of each three- and nine-residue segment is constructed by extracting fragments from a protein structure database using a primary and secondary structure profile-profile comparison method. The secondary structure for a query protein is predicted using various algorithms.

When building a model using the Rosetta protocol, the conformational space spanned by the fragments is searched using Monte Carlo simulations with an energy function that favours burial of hydrophobic residues and strand pairing, and is weighted against steric clashes (Bradley et al., 2003). For each target protein sequence, a large number of potential structures are generated and then clustered; the largest clusters are then chosen as the final predictions. When performed around a template model, the protocol is refined so that the conformational space of the fragments can be searched in the context of the template model itself -- i.e. energetically favourable conformations are determined on the basis of the structure suggested by the template. For the purposes of CASP5, template-based models were first built, then insertions, loops, and extensions with low sequence similarity to the template homologue were modelled using fragment insertion. In cases where the


system was unable to detect a homologous template, the entire target sequence was modelled by fragment insertion.
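The core fragment-insertion search can be caricatured in a few lines. The sketch below is a toy Metropolis Monte Carlo over backbone torsion angles driven by a user-supplied fragment library and energy function; it illustrates the idea only and is not Rosetta's actual energy function or move set, and all names are invented.

```python
import math
import random

def fragment_monte_carlo(length, fragments, energy, n_steps=300, kT=1.0, seed=0):
    """Toy fragment-insertion Monte Carlo.
    fragments: dict mapping a start position to candidate torsion fragments,
    each a list of (phi, psi) tuples; energy: callable scoring a full
    torsion list (lower is better)."""
    rng = random.Random(seed)
    conf = [(-120.0, 120.0)] * length          # start from an extended chain
    e = energy(conf)
    positions = list(fragments)
    for _ in range(n_steps):
        pos = rng.choice(positions)
        frag = rng.choice(fragments[pos])
        trial = conf[:pos] + list(frag) + conf[pos + len(frag):]
        e_trial = energy(trial)
        # Metropolis criterion: always accept downhill moves, occasionally uphill
        if e_trial <= e or rng.random() < math.exp((e - e_trial) / kT):
            conf, e = trial, e_trial
    return conf, e
```

In the full protocol many such trajectories are run, the resulting decoys are clustered, and the largest clusters are taken as the final predictions.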

The success of the Baker group at CASP5 was unquestioned. However, the Rosetta protocol did display several limitations, namely problems with domain parsing and complex topologies. In some cases, Rosetta was unable to correctly identify the individual domains contained within some of the larger multi-domain target sequences and, therefore, the Baker group was unable to begin predicting accurate models for these targets. Rosetta also struggled with especially complex topologies, sometimes predicting overly simplified alternatives to the ideal structure. However, its ability to predict reasonably accurate models of increasingly complex topologies had improved since CASP4.

1.6.3.2 3D-JURY -- the Ginalski and Rychlewski Groups

The 3D-JURY system (Ginalski et al., 2003) was used by both the Ginalski and Rychlewski groups to great effect. The premise of this system is very simple; like other Meta prediction methods, the 3D-JURY system incorporates a comparison of models as the main processing step. It follows an approach similar to that employed in the field of ab initio fold recognition, where experience has led to the conclusion that averages of low-energy conformations, obtained most frequently by folding simulations, are closer to the native structure than the conformation with the lowest overall energy. It is also assumed that the higher the number of prediction techniques that are pooled, the less likely they are to all make the same mistakes; this is the main rationale behind the 3D-JURY technique.

A full description of the 3D-JURY algorithm is available from Ginalski et al. (2003). Briefly, the algorithm takes as its input groups of models, generated by a variety of servers, regardless of their assigned confidence scores. All the models are compared with each other, and a similarity score (MaxSub; see § 1.6.2.1, page 71) is assigned to each pair, which is equal to the number of Cα atom pairs


that are within 3.5 Å of each other after optimal superposition. If the score is below a certain threshold (40, by default), the pair of models is annotated as not similar and the score is reset to zero. The final 3D-JURY score of a model is the sum of all similarity scores of considered model pairs, divided by the number of considered pairs plus one. The algorithm does not modify the final structure chosen as the best.
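The scoring step described above can be sketched directly. This is a simplified reading of Ginalski et al. (2003): variable names are illustrative, and `considered pairs' is taken here to mean all pairs involving the model.

```python
def jury_scores(pairwise, threshold=40.0):
    """3D-JURY-style consensus scores. pairwise[i][j] holds the MaxSub-like
    similarity between models i and j; values below threshold are treated
    as `not similar' and reset to zero. Each model's score is the sum of
    its pair similarities divided by the number of considered pairs plus one."""
    n = len(pairwise)
    scores = []
    for i in range(n):
        sims = [pairwise[i][j] if pairwise[i][j] >= threshold else 0.0
                for j in range(n) if j != i]
        scores.append(sum(sims) / (len(sims) + 1))
    return scores
```

The model with the highest consensus score is returned unmodified, exactly as produced by its original server.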

The Ginalski and Rychlewski groups used the 3D-JURY system as their main method for template selection during CASP5. Once a list of reliable template models had been chosen from individual servers, the target sequence was aligned to each template structure and the final models assessed for quality. The Ginalski group employed a more manual approach to align the target sequence to the template model, using human intervention to enhance the progress at each stage. The Rychlewski group used a more automated approach, constructing the final model by splicing together fragments of different template models, usually of the size of supersecondary structures. The splicing procedure was conducted by superimposing all chosen template models and selecting the model fragments of structurally diverged regions that showed the highest VERIFY3D scores (Eisenberg et al., 1997). Structurally conserved regions were usually taken from the model with the highest 3D-JURY score.

The final rankings of both the Ginalski and Rychlewski groups at CASP5 demonstrated the power of Meta servers (and specifically the 3D-JURY system) when compared to other fold recognition methods.

1.6.3.3 Robetta Server -- the Robetta Group

Robetta is a fully automated prediction server developed by the Baker laboratory (Chivian et al., 2003). Its power and success rate are based on the combination of prediction algorithms it uses when generating potential model structures. The first step in the Robetta algorithm is the automatic detection of the locations of domains within a query sequence and the selection of the appropriate modelling protocol for each domain. If a homologue of experimentally determined structure


can be found using an available search tool (e.g. PSI-BLAST or the Pcons Meta server), Robetta uses its own alignment algorithm, K*Sync (paper in preparation), to align the query sequence to the template structure. It then models the variable regions of the template by exploring conformational space with protein fragments in a manner similar to the Rosetta protocol, but in the context of the template structure.

When no structural homologue is available, domains are modelled with the Rosetta ab initio fragment insertion method (see § 1.6.3.1, page 75), which explores the conformational space for the full length of the query sequence, using fragment insertion. This alternative de novo method produces a range of potential structures, from which any non-protein-like conformations are filtered out, and, subsequently, the remaining structures are clustered to identify broad low energy minima. The final step, in the de novo modelling protocol, consists of selecting four final models from the most densely populated clusters and one model that is the lowest energy structure remaining outside of the top cluster.
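The final selection step can be sketched as follows; this is an assumed simplification in which each cluster is represented by its lowest-energy member, and all names are illustrative.

```python
from collections import Counter

def select_final_models(cluster_of, energy, k=4):
    """Pick one model from each of the k most populated clusters, plus the
    lowest-energy model lying outside the single largest cluster.
    cluster_of: dict model_id -> cluster_id; energy: dict model_id -> energy."""
    sizes = Counter(cluster_of.values())
    top = [c for c, _ in sizes.most_common(k)]
    finals = [min((m for m in cluster_of if cluster_of[m] == c), key=energy.get)
              for c in top]
    outside = [m for m in cluster_of if cluster_of[m] != top[0]]
    finals.append(min(outside, key=energy.get))
    return finals
```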

If a given target sequence possesses more than one domain, separate domain models are combined into one full-length model, using fragment-insertion to include a putative linker region in order to provide chain connectivity and attempt domain association.

Part of the success of Robetta at CASP5 was due to the group's use of its K*Sync alignment algorithm, which simultaneously uses primary structure profile-profile comparison, secondary structure prediction, and information about elements that are obligate (i.e. evolutionarily important) to the fold in a local or global-local dynamic programming approach (see § 1.4.2, page 24), to produce a single default alignment. The profile-profile comparison matrix used in the primary structure alignment is constructed to produce a score distribution, which is adjusted to possess a mean value just below zero and a standard deviation of 1.0, in the same way as in FFAS (Rychlewski et al., 2000). Template residue profiles are adjusted to include


counts from an FSSP multiple structural alignment (Holm & Sander, 1996) to allow residue sampling from more distant homologues. Secondary structure is added to the pairwise scoring scheme by giving extra points to matches between a predicted query regular secondary structure from PSIPRED (Jones, 1999a) and an assigned template regular secondary structure from DSSP (Kabsch & Sander, 1983), and penalising any mismatches. Whether an element is obligate is determined by the FSSP multiple structural alignment (in terms of the template), and by the PSI-BLAST multiple sequence alignment (in terms of the query). Positions that are usually occupied in multiple alignments are assumed to be obligate to the fold, whereas infrequently aligned positions are likely to be insertions or conformationally variable with respect to the core elements of the structure. Finally, the comparison matrix is readjusted again to restore the mean and standard deviation.

When aligning a query and a template, dynamic programming is used to produce a single default alignment, which is used to generate a model. The gap penalties and gap extension penalties used in the alignment are position specific, adjusted from a base value in order to penalise failure to align obligate elements (by increasing the gap extension penalty at such positions) or the insertion of a gap between two obligate elements (by increasing the gap initiation penalty at such positions). Any loop regions are modelled in the context of the fixed template structure using the Rosetta fragment insertion protocol.
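The position-specific gap scheme can be illustrated with a minimal global dynamic-programming alignment. This is a simplified sketch (linear rather than affine gap costs, and invented names) showing only how per-position gap penalties plug into the recurrence:

```python
def global_align_score(score, gap_q, gap_t):
    """Global alignment score with position-specific linear gap penalties.
    score[i][j]: profile-profile match score for query i vs template j;
    gap_q[i]: penalty for leaving query residue i unaligned (raised at
    obligate positions); gap_t[j]: likewise for the template."""
    n, m = len(score), len(score[0])
    NEG = float("-inf")
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                F[i][j] = max(F[i][j], F[i - 1][j] - gap_q[i - 1])
            if j > 0:
                F[i][j] = max(F[i][j], F[i][j - 1] - gap_t[j - 1])
            if i > 0 and j > 0:
                F[i][j] = max(F[i][j], F[i - 1][j - 1] + score[i - 1][j - 1])
    return F[n][m]
```

Raising gap_q or gap_t at obligate positions steers the optimal path away from gapping core elements, which is the effect K*Sync achieves with its adjusted gap initiation and extension penalties.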

The de novo method employed by Robetta uses an automated (but reduced) version of the Rosetta protocol (§ 1.6.3.1 page 75). Full details of both procedures can be found in Chivian et al. (2003).

The success of Robetta in the CASP5 assessment is further evidence that combining many levels of structural data (including using profile-profile alignments for primary structure information) increases the quality of the modelling templates that are available for any given query sequence. To create a successful, individual automated server would require merging many different types of structural data in order to produce a range of high quality decoy models from which the final models could be selected or built.

1.6.3.4 FRankenstein's Monster -- the Bujnicki Group

The Bujnicki group applied a novel multi-step protocol during CASP5 to predict the structures of all types of target sequences, regardless of their potential modelling category (Kosinski et al., 2003). The approach was named `FRankenstein's Monster' (FR -- fold recognition) because of the modular nature of the final models.

As a prerequisite to the modelling procedure, for each target protein, a sequence profile is constructed with PSI-BLAST, which is then used to construct a multi-homologue alignment. These alignments are then manually refined and divided into potential domain-sized fragments, each of which is submitted to the CAFASP Meta server (http://bioinfo.pl/cafasp) and a Meta server developed by the Bujnicki group. In the first stage, various query-template (QT) alignments generated by the Meta servers are converted into preliminary, full-atom models, using comparative modelling. All preliminary models are evaluated by VERIFY3D (Eisenberg et al., 1997) and given a score to identify well- and poorly-folded fragments.

In the second stage, preliminary models with similar three-dimensional folds are superimposed and clustered, and the superposition is used to generate a multiple sequence alignment. Structurally superposable regions are identified and analysed for the consistency of the alignment and the quality of the sequence-structure fit, according to VERIFY3D. Low-scoring regions are deleted and a hybrid `consensus' model is created from high-scoring fragments of models corresponding to the cluster with the biggest population. If no recurring fold can be identified among the results, then the highest-scoring preliminary model is selected for further analysis.


In the third stage, a multiple structure alignment is created by pairwise superposition of all template structures plus the consensus model of the target protein. The most diverged elements of the template structures (e.g. large insertions not present in other templates or the consensus model) are removed and the remaining parts of the template structures are regarded as a composite multiple-structure template. Based upon this superposition, a structure-based target-template sequence alignment is inferred. This alignment is then used to build a new (intermediate) comparative model of the target.

In the fourth stage, the intermediate model is again evaluated with VERIFY3D. For all low-scoring regions, a series of alternative alignments is generated by progressively shifting `unfit' sequence fragments in either direction. At this point, additional information is considered, such as secondary structure, placement of insertions and deletions in loop regions, conservation of putative catalytic residues, and the need to have a compact, well-folded structure. As a result, the sequence/structure space is explored beyond the alignment variants reported by the fold recognition servers.

Finally, for all the alternative alignments, new models are built and evaluated by VERIFY3D. All the models are superimposed and the `FRankenstein's Monster' model is built from the highest-scoring segments. The final model is obtained after limited energy minimisation is carried out to remove steric clashes between sidechains from different fragments.
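The recombination step can be caricatured as a per-residue argmax over the superposed models. This is a deliberately simplified sketch: in the real protocol VERIFY3D scores whole fragments rather than single residues, and the assembled model is then energy-minimised.

```python
def best_source_model(per_residue_scores):
    """For each residue position, pick the index of the superposed model
    with the highest quality score (e.g. a VERIFY3D profile).
    per_residue_scores[m][i]: score of residue i in model m."""
    n_models = len(per_residue_scores)
    n_res = len(per_residue_scores[0])
    return [max(range(n_models), key=lambda m: per_residue_scores[m][i])
            for i in range(n_res)]
```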

The novel aspect of this approach is its emphasis on the recombination of structure fragments, more common in ab initio methods, rather than on sequence-based alignments, which are typical of comparative modelling. However, the fact that `FRankenstein's Monster' was outperformed in the CASP5 fold recognition category by consensus prediction methods (e.g. 3D-JURY) only serves to demonstrate the power of Meta techniques.


1.6.3.5 TOUCHSTONE -- the Skolnick Group

The Skolnick group competed in CASP5 with the TOUCHSTONE structure prediction algorithm (Skolnick et al., 2003), one of the few top-performing methods that did not utilise input from Meta servers.

TOUCHSTONE uses a threading algorithm called PROSPECTOR (Skolnick & Kihara, 2001), an iterative algorithm, which uses two types of sequence profile as input: a closely related sequence profile (composed of sequences of 35-90% primary structure identity), and a remotely related sequence profile. These sequence profiles are used in a threading step to screen the structural database and provide an initial alignment of the query sequence to template structures. The same database is then screened again with the same sequence profiles, but, this time, with additional information from secondary structure and pair iteration profiles. For each of these four scoring functions, the top five structures in the database are recorded. These top 20 structures are pooled and used to identify which residue contacts within the structures are likely to be important. This is done by constructing a protein-specific pair potential library, based on consensus side-chain contacts that occur in at least five of the 20 structures. Finally, the closely and remotely related sequence profiles, secondary structure profiles, and the newly constructed protein-specific pair potential library, are used again to screen the database and record the top 10 highest scoring template structures.
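The consensus-contact step can be sketched as a simple counting filter (assumed from the description above; names are illustrative):

```python
from collections import Counter

def consensus_contacts(structure_contacts, min_count=5):
    """Keep residue-pair contacts seen in at least min_count of the pooled
    template structures. structure_contacts: list of sets of (i, j) pairs."""
    counts = Counter()
    for contacts in structure_contacts:
        counts.update(contacts)
    return {pair for pair, c in counts.items() if c >= min_count}
```

The surviving pairs seed the protein-specific pair potential used in the final screening pass.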

Models for the query sequence are built from the top templates and refined using random permutations for poorly aligned regions and ab initio folding.

TOUCHSTONE performed well in fold recognition of homologues. However, its heavy reliance on primary structure data limited its success in the harder prediction categories.


1.7 Ensemble Theory -- Links to Fold Recognition

The evaluation of fold recognition techniques at CASP5 showed that the Meta servers performed better than any stand-alone server. The review by Kinch et al. (2003b) observed that, since different groups performed better on different targets, sharing techniques would probably lead to better overall performance.

The power of the Meta approach is unquestionable. However, it is difficult to isolate the fundamental source of its enhanced performance over stand-alone methods. A similar result is observed in the study of ensemble systems in machine learning. Ensemble methods are learning algorithms that construct a set of classifiers and then combine their individual decisions in some way (typically by weighted or unweighted voting) to classify new examples. Such classifiers have been successfully applied to diverse areas of research that aim to emulate human skills, e.g. face recognition (Gutta et al., 1996; Huang et al., 2000), character recognition (Mao, 1998), scientific image analysis (Kuncheva et al., 2000), medical diagnosis (Breiman, 1999; Zhou et al., 2002), etc. From the point of view of protein structure prediction, the classification of a protein can be defined by the fold or superfamily to which it belongs (i.e. when comparing an unknown query to a known template, are they homologous/analogous or unrelated). Hence, the individual fold recognition algorithms can be viewed as non-binary classifiers. Ensembles are of great interest to the machine learning community since they tend to be more accurate than the individual classifiers that constitute their component parts (Dietterich, 2000).

It has been proposed that a necessary and sufficient condition for an ensemble of classifiers (or hypotheses) to be more accurate than any of its individual members is that the classifiers are accurate and diverse (Hansen & Salamon, 1990). An accurate classifier is one that has an error rate (i.e. proportion of misclassified instances) less than that of a random classifier. Any two classifiers are diverse if they make different


errors when processing new inputs. This reasoning is demonstrated in the following example from Dietterich (2000) involving three separate binary classifiers h1, h2, and h3 and an input x. If the three classifiers are identical (i.e. not diverse), then when h1(x) is wrong, h2(x) and h3(x) will also be wrong. However, if the errors made by the classifiers are uncorrelated, then when h1(x) is wrong, h2(x) and h3(x) may be correct, so a majority vote will correctly classify x. More precisely, if the error rates of L binary classifiers are all < 0.5 (an error rate of 0.5 for a binary classifier would be considered random), and if the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution where more than L/2 classifiers are wrong. For a simulated ensemble of 21 binary classifiers, each having an error rate of 0.3, the area under the probability distribution for 11 or more classifiers being simultaneously wrong is 0.026, which is much less than the error rate of the individual classifiers. However, if the individual classifiers make uncorrelated errors at rates higher than 0.5, then the error rate of the ensemble will increase as a result of the majority voting scheme. As a result, simple weighted binary ensemble systems require that each of their classifiers has an error rate below 0.5, and that their combined errors are at least reasonably uncorrelated. When dealing with non-binary continuous classifiers, as in fold recognition, determining the minimum required accuracy is much more difficult.
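The binomial argument above is easy to verify numerically. The sketch below computes the probability that a strict majority of L independent classifiers, each with error rate p, are simultaneously wrong:

```python
from math import comb

def majority_vote_error(n_classifiers, p_error):
    """Tail of the binomial distribution: probability that more than half
    of n independent binary classifiers (each wrong with probability
    p_error) err simultaneously, so that the majority vote is wrong."""
    wrong_majority = range(n_classifiers // 2 + 1, n_classifiers + 1)
    return sum(comb(n_classifiers, k) * p_error**k * (1 - p_error)**(n_classifiers - k)
               for k in wrong_majority)
```

For 21 classifiers at an individual error rate of 0.3 this gives roughly 0.026, as quoted; pushing the individual rate above 0.5 makes the ensemble worse than its members.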

1.7.1 Protein Fold Recognition Ensembles

This general characterisation of the ensemble problem is helpful, but does not address the fundamental question of whether it is possible to rationally construct a good ensemble based on a prior theory about which combination of classifiers should work best. From the perspective of protein fold recognition, such a question is important if there is to be any possibility of harnessing the power of ensembles and maximising their efficiency. Generally, there are three factors that contribute to the success of an individual protein structure prediction method:

1. Nature of the algorithm. This is the area of greatest variability with respect


to structure prediction. Algorithms range from sequence-sequence alignment methods in homology modelling, to fragment assembly new fold approaches (see § 1.6.3.1, page 75).

2. Search parameters. Of equal importance to the underlying algorithm are the parameters used in it. This is usually the subject of vast amounts of benchmarking, designed to find the optimal parameter sets in order to maximise the efficiency of the chosen algorithm. Together, the algorithm and the parameter sets make up the programmable part of the protein structure prediction classifier.

3. Nature of the data being searched. For fold recognition methods, the nature of the data amounts to the part of the classifier that defines its search space. Any such classifier is only as good as the potential answers it can give, so, in order to classify a protein of unknown structure, an existing example needs to be present in the databank. For new fold methods, the search space is the allowed physical space as defined by energy functions.

The nature of the protein structure prediction problem compounds the difficulty of attempting to rationally construct an effective ensemble classifier (i.e. using considered logic) because there are so many variables to take into account. The success of the Meta server ensemble approach might lie in the fact that detecting different remotely homologous proteins requires an emphasis on different biological features. Alternatively, it may be a signal-boosting procedure that permits weakly detectable homology signals to rise above the background noise. This problem is particularly relevant given that the classification of proteins into categories (i.e. fold or superfamily) has no strict, definable rules; sometimes it can be just as difficult to decide to which category a protein belongs even when its structure is known.
Thus, there is the additional problem of cases where the correct classification for a given query protein may not necessarily be the most useful classification, from the point of view of building an accurate model. See § 2.4.2 (page 113) for details of the measures taken to avoid the latter problem during this research.

In order to use ensemble theory for the purposes of designing a Meta predictor, it is important to understand some of the fundamental theories as to how ensembles work. When used in machine learning, ensembles are relatively easy to construct. According to Dietterich (2000), this stems from three main issues relating to any typical machine learning classifier:

1. Statistical. A learning algorithm can be viewed as searching a space H of hypotheses to identify the best hypothesis in that space. Statistical problems arise when the amount of training data available is too small compared to the size of the hypothesis space. Without sufficient data, the learning algorithm may find many different hypotheses in H that all give the same accuracy when trained on the training data. By constructing an ensemble from these accurate classifiers, the results can be `averaged' (see Figure 1.13(a), page 87).

2. Computational. Many learning algorithms work by performing some form of local search that may become trapped in local optima in the hypothesis space. In cases where there is enough training data (so the statistical problem is absent), it may still be computationally difficult for the learning algorithm to find the best hypothesis. An ensemble constructed by running the local search with many different starting parameters may provide a better approximation to the true unknown function than any of the individual classifiers (see Figure 1.13(b), page 87).

3. Representational. In most applications of machine learning, the true function (f) cannot be represented by any of the hypotheses in H. By forming weighted sums of hypotheses drawn from H, it may be possible to expand the space of representable functions (see Figure 1.13(c), page 87).

In fold recognition, since algorithm, parameters, and search data all contribute to the structure classifier in varying degrees, they all influence the statistical issue



Figure 1.13: Three fundamental reasons why an ensemble may work better than a single classifier; for a detailed description see § 1.7.1 (page 84). (a) Statistical. The outer curve denotes the hypothesis space H. The inner curve denotes the set of imperfect hypotheses that all give good accuracy on the training data. The point labelled f is the true function, which can be approximated by averaging the imperfect hypotheses. (b) Computational. Starting the search algorithm with a range of starting parameters causes the hypotheses to converge in local optima. Again, averaging the hypotheses gives a closer approximation of f than any single hypothesis. (c) Representational. The true function (f ) cannot be represented within the search space H. However, the average of the three hypotheses still gives a closer approximation than any single hypothesis. Based on a figure taken from Dietterich (2000) (figure 2).


in some way. The same is true of the computational issue. However, this issue is more highly influenced by the nature of the function used to scan the search space (e.g. genetic algorithm, simplex, etc). In machine learning, the representational issue is rather subtle because there are many algorithms for which H is, in principle, the space of all possible classifiers. For example, neural networks are very flexible algorithms; changing the parameters of a neural network changes the fundamental workings of the algorithm, e.g. feedback weights on the network layers. Given enough training data, neural networks will explore the space of all possible classifiers (Hornik et al., 1990). However, with a finite training sample, these algorithms will explore only a finite set of hypotheses and will stop searching when they find an adequate hypothesis to fit the training data. From the point-of-view of fold recognition, the representational issue is, perhaps, the most important of the three; it is highly unlikely that any single prediction method in current use has the potential to be a perfect structural classifier, simply because the hypothesis space they encompass cannot represent the true function. The structure prediction problem (as a whole) is currently too complex for human experts to describe in terms of fundamental principles. As a result, the statistical and computational issues become markedly less important in comparison, and so less necessary to address.

The three fundamental machine learning issues are the most important ways in which existing machine learning algorithms fail. Hence, ensemble methods have the potential to reduce such shortcomings. However, this theory does not extend so readily to the real-world scenario of protein structure prediction. The nature of a particular prediction algorithm, e.g. sequence-sequence comparison, may have a theoretical maximum success rate far below that of a perfect classifier, regardless of what the search parameters or the databank are. In fact, its theoretical maximum may be well below the ideal minimum success rate, depending on the difficulty of the inputs to be classified. Since most structure prediction methods in current use are inflexible (compared to neural networks or decision tree algorithms), it is unlikely that any one of them could eventually perform as a perfect classifier. Additionally, it must be remembered that protein structure classification (i.e. folds, superfamilies, etc) is a human-devised description of a natural phenomenon, and, sometimes (though nowadays less often), the decision to classify a novel protein in a particular way is based on the opinion of human experts rather than the results of computational measures. By machine learning standards, the biological categorisation of protein structures is an imperfect hierarchy, undefined by mathematical delimiters. As a result, the above rules of machine learning ensemble classifiers cannot hold in all cases. However, the success of Meta server methods has shown that such ensemble approaches can work in a way that is practical for the purposes of fold recognition. Therefore, factors which contribute to the success of machine learning ensembles, and some of their techniques of construction, may still be useful.

As part of this research, several fold recognition ensembles were built using well-established techniques from the field of computer science (see § 4, page 170); these were Bagging, Boosting, and Support Vector Machines (SVMs). The summaries below give a brief description of these methods and are taken, in part, from Bauer & Kohavi (1999).

1.7.2

Ensemble Notation

In order to understand the workings of ensemble theory, it is important to be familiar with the necessary notation. In the context of an ensemble classifier, a labelled instance is a pair ⟨x, y⟩, where x is an element from space X and y is an element from a discrete space Y. Let x represent an attribute vector of a fixed size and y the classification label associated with x for a given instance.

A sample S is a set of labelled instances S = {⟨x1, y1⟩, ⟨x2, y2⟩, . . . , ⟨xn, yn⟩}. The instances in the sample are assumed to be independently and identically distributed (i.i.d.).


A classifier (or hypothesis) is a mapping from X to Y . A deterministic inducer is a mapping from a sample S, referred to as the training set, which contains n labelled instances, to a classifier (i.e. the algorithm is trained to optimise itself according to the training set).
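To make this notation concrete, the following sketch renders it in Python (an illustrative example only; the 1-nearest-neighbour inducer shown is a hypothetical stand-in, not a method used in this work):

```python
from typing import Callable, List, Tuple

X = Tuple[float, ...]          # an attribute vector of fixed size
Y = str                        # a discrete classification label
Sample = List[Tuple[X, Y]]     # S = {<x1, y1>, ..., <xn, yn>}
Classifier = Callable[[X], Y]  # a classifier: a mapping from X to Y

def nearest_neighbour_inducer(s: Sample) -> Classifier:
    """A deterministic inducer: maps a training sample S to a classifier
    (here, a 1-nearest-neighbour rule, chosen purely for illustration)."""
    def classify(x: X) -> Y:
        _, label = min(s, key=lambda inst: sum((a - b) ** 2
                                               for a, b in zip(inst[0], x)))
        return label
    return classify

# Usage: a toy sample of labelled instances.
s = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos")]
c = nearest_neighbour_inducer(s)
assert c((0.9, 0.8)) == "pos"
```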

1.7.3

Bagging

The Bagging algorithm (Bootstrap aggregating) by Breiman (1996) votes classifiers generated by different bootstrap samples (replicates; see Algorithm 1, page 90). A bootstrap sample (Efron & Tibshirani, 1993) is generated by uniformly sampling n instances from a given training set with replacement. T bootstrap samples are generated according to computational feasibility and performance trade-offs, and a classifier (Ci) is built for each sample. A final classifier (C*) is built from these classifiers; its output is the classification that is predicted most often.

Algorithm 1 The Bagging algorithm. Based on an algorithm taken from Bauer & Kohavi (1999).

Require: Training set S, Inducer I, integer T (number of bootstrap samples).

1: for i = 1 to T do
2:   S′ = bootstrap sample from S (i.i.d. sample with replacement).
3:   Ci = I(S′)
4: end for
5: C*(x) = arg max_{y∈Y} Σ_{i : Ci(x)=y} 1 (the most often predicted label y)
6: return classifier C*.
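For illustration, Algorithm 1 can be sketched in a few lines of Python (a minimal sketch, not the implementation used in this work; the decision-stump inducer below is a hypothetical example):

```python
import random
from collections import Counter

def bagging(train, inducer, T, rng=random.Random(0)):
    """Sketch of Algorithm 1: build T classifiers on bootstrap replicates
    of the training set and combine them by majority vote."""
    n = len(train)
    classifiers = []
    for _ in range(T):
        # i.i.d. sample of n instances from the training set, with replacement
        boot = [train[rng.randrange(n)] for _ in range(n)]
        classifiers.append(inducer(boot))
    def c_star(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]   # the most often predicted label
    return c_star

# Hypothetical inducer: a one-dimensional decision stump thresholded at the mean.
def stump_inducer(sample):
    mean = sum(x for x, _ in sample) / len(sample)
    return lambda x: "pos" if x >= mean else "neg"

train = [(i, "neg") for i in range(5)] + [(i, "pos") for i in range(5, 10)]
c_star = bagging(train, stump_inducer, T=11)
assert c_star(0) == "neg" and c_star(9) == "pos"
```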

For a given bootstrap sample, an instance in the training set has probability 1 − (1 − 1/n)^n of being selected at least once in the n instances that are randomly selected from the training set. For large values of n, this is approximately 1 − 1/e ≈ 63.2%, which means that each bootstrap sample contains about 63.2% unique instances from the training set. According to Breiman (1994), this causes different classifiers to be built when the inducer is unstable (e.g. neural networks, decision trees). The performance of the ensemble may improve if the induced classifiers are accurate and not correlated; however, Bagging may slightly degrade the performance of stable algorithms (e.g. k-nearest neighbour) because effectively smaller training sets are used for training each classifier (Breiman, 1996); because the algorithms are stable, the variability of the training sets has less of an effect on the classifiers that are produced.
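The 1 − (1 − 1/n)^n figure is easily verified by simulation (a quick numerical check, not part of the thesis benchmarks):

```python
import random

n, trials = 1000, 200
rng = random.Random(42)
# fraction of unique training-set instances appearing in each bootstrap sample
fractions = [len({rng.randrange(n) for _ in range(n)}) / n for _ in range(trials)]
observed = sum(fractions) / trials
expected = 1 - (1 - 1 / n) ** n     # approaches 1 - 1/e ~ 0.632 for large n
assert abs(observed - expected) < 0.01
```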

The application of the Bagging technique to fold recognition is difficult because the algorithm requires a large training set. Due to the relatively small number of known protein structures (approximately 4,800 in SCOP30 version 1.65) it would be impossible to perform thorough Bagging without resorting to using close homologues of previous instances in the training set. Since the focus of much structural prediction is on extremely remote homologues, this reduces the effective size of the training set to the point at which it becomes too small to be useful. Similarly, since Bagging relies on the underlying inducer (in the case of structure prediction, the recognition algorithm and parameters) being unstable (i.e. flexible), there is a strong chance that it may decrease the overall accuracy of the ensemble. For example, Bagging is used in computational classifiers to increase the accuracy of neural networks; as an inducer, a neural network is very flexible since altering its parameters fundamentally alters its underlying classification algorithm. By comparison, fold recognition methods have far fewer tunable parameters, and their fundamental algorithms hardly change when these parameters are altered.

1.7.4

Boosting - AdaBoost

Boosting was introduced by Schapire (1990) as a method for improving the performance of a weak learning algorithm (i.e. an algorithm that, with positive probability, produces a classifier with an error rate of less than 0.5). Improvements were later added by Freund (1990, 1996); AdaBoost (Adaptive Boosting) was introduced by Freund & Schapire (1995). Like Bagging, the AdaBoost algorithm (see Algorithm 2, page 92) generates a set of classifiers and uses them in a voting consensus. Apart from this, the two algorithms differ substantially. The AdaBoost algorithm generates the classifiers sequentially, while Bagging generates them in parallel. AdaBoost also changes the weightings of the training instances, provided as input to each inducer, based on the classifiers that were previously built. The goal of AdaBoost is to force the inducer to learn from training instances it was previously unable to classify correctly. Given an integer (T) specifying the number of trials, T weighted training sets S1, S2, . . . , ST are generated in sequence, and T classifiers C1, C2, . . . , CT are built. A final classifier C* is formed using a weighted voting scheme in which the weight of each classifier depends on its performance on the training set used to build it.

Algorithm 2 The AdaBoost algorithm. Based on an algorithm taken from Bauer & Kohavi (1999).

Require: Training set S of size n, Inducer I, integer T (number of trials).

1: S′ = S with instance weights assigned to be 1.
2: for i = 1 to T do
3:   Ci = I(S′)
4:   εi = (1/n) Σ_{xj∈S′ : Ci(xj)≠yj} weight(xj) (weighted error on training set).
5:   If εi > 1/2, set S′ to a bootstrap sample from S with weight 1 for every instance and goto step 3 (this step is limited to 25 times, after which the loop is exited).
6:   βi = εi/(1 − εi)
7:   For each xj ∈ S′, if Ci(xj) = yj then weight(xj) = weight(xj) × βi.
8:   Normalise the weights of the instances so that the total weight of S′ is n.
9: end for
10: C*(x) = arg max_{y∈Y} Σ_{i : Ci(x)=y} log(1/βi)
11: return classifier C*.
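As with Bagging, Algorithm 2 can be sketched in Python (an illustrative sketch under simplifying assumptions: the zero-error case is guarded with a small constant, the restart on ε > 1/2 simply resets the weights rather than drawing a bootstrap sample, and the weighted decision stump is a hypothetical inducer):

```python
import math
import random
from collections import defaultdict

def adaboost(train, inducer, T):
    """Sketch of Algorithm 2: sequentially reweight instances and combine
    classifiers by a vote weighted with log(1/beta_i)."""
    n = len(train)
    weights = [1.0] * n
    classifiers, betas = [], []
    while len(classifiers) < T:
        c = inducer(train, weights)
        eps = sum(w for (x, y), w in zip(train, weights) if c(x) != y) / n
        if eps > 0.5:
            weights = [1.0] * n   # simplified restart (thesis: bootstrap, max 25)
            continue
        eps = max(eps, 1e-9)      # guard against zero training error
        beta = eps / (1 - eps)
        # down-weight correctly classified instances, then renormalise to n
        weights = [w * beta if c(x) == y else w
                   for (x, y), w in zip(train, weights)]
        total = sum(weights)
        weights = [w * n / total for w in weights]
        classifiers.append(c)
        betas.append(beta)
    def c_star(x):
        votes = defaultdict(float)
        for c, beta in zip(classifiers, betas):
            votes[c(x)] += math.log(1 / beta)
        return max(votes, key=votes.get)
    return c_star

# Hypothetical inducer: a weighted one-dimensional decision stump.
def weighted_stump(sample, weights):
    best = None
    for t in {x for x, _ in sample}:
        for lo, hi in (("neg", "pos"), ("pos", "neg")):
            err = sum(w for (x, y), w in zip(sample, weights)
                      if (lo if x < t else hi) != y)
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x, t=t, lo=lo, hi=hi: lo if x < t else hi

train = [(i, "neg") for i in range(5)] + [(i, "pos") for i in range(5, 10)]
c_star = adaboost(train, weighted_stump, T=5)
assert c_star(0) == "neg" and c_star(9) == "pos"
```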

The reweighting and normalisation steps (steps 7 and 8) in Algorithm 2 (page 92) effectively amount to: for each xj, divide weight(xj) by 2εi if Ci(xj) ≠ yj, and by 2(1 − εi) otherwise. Therefore, incorrectly classified instances are weighted by a factor inversely proportional to the error on the training set, i.e. 1/(2εi). Small training set errors will cause weights to grow by several orders of magnitude. The proportion of misclassified instances is εi, and these instances are boosted by a factor of 1/(2εi), causing the total weight of the misclassified instances (after updating) to be half the original training set weight. Similarly, the correctly classified instances will have a total weight equal to half the original weight, and, thus, no normalisation is required.
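A toy arithmetic check of this halving property (the numbers are illustrative only):

```python
# n = 10 instances, 2 of which are misclassified, so eps_i = 0.2.
n, wrong = 10, 2
eps = wrong / n
# apply the combined update: divide by 2*eps if misclassified, 2*(1 - eps) otherwise
total_wrong = wrong * (1.0 / (2 * eps))              # total weight of misclassified
total_right = (n - wrong) * (1.0 / (2 * (1 - eps)))  # total weight of correct
# each group now carries half the original total training set weight n
assert abs(total_wrong - n / 2) < 1e-9
assert abs(total_right - n / 2) < 1e-9
```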

Like Bagging, the AdaBoost algorithm requires a weak learning algorithm that produces binary classifiers with an error rate of less than 0.5. As a result, it is theoretically subject to the same disadvantages. The fact that it does not rely on the overall size of the training set is a bonus, given the limited number of available inputs for fold recognition.

1.7.5

Support Vector Machines -- SVMs

Support Vector Machines (SVMs) are a type of machine learning method introduced by Vapnik (1995, 1998). They are capable of classifying data based on an input vector of feature information (usually called a feature vector). The approach is based on the concept of decision planes, where the training data is mapped to a higher dimensional space and separated by a plane defining the two or more classes of data. When used as binary classifiers (i.e. giving either the answer `yes' or `no'), an SVM creates a hyperplane (a projective subspace one dimension smaller than the classification space) that splits the represented training data into groups separated by a maximum margin. A maximum-margin hyperplane splits the positive and negative training examples such that the distance from the closest examples (the margin) to the hyperplane is maximised. An advantage of this machine learning methodology is that SVMs are less prone to overfitting their classification models to training data. The SVM uses the training examples that best describe the correct answers (the support vectors) as part of the classification algorithm; as a result, an SVM is a `black-box' classifier (i.e. it offers answers but does not explain the reasoning behind those answers). A simple example is shown in Figure 1.14 (page 94).
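To illustrate the idea (this is not the SVM solver used in this work), the following sketch trains a toy linear maximum-margin classifier by stochastic subgradient descent on the regularised hinge loss, in the style of the Pegasos algorithm; the data and all parameter values are hypothetical:

```python
import random

def train_linear_svm(data, epochs=5000, lam=0.01, seed=0):
    """Toy maximum-margin linear classifier (Pegasos-style subgradient
    descent on the hinge loss); labels y must be -1 or +1."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for t in range(1, epochs + 1):
        x, y = data[rng.randrange(len(data))]
        eta = 1.0 / (lam * t)                     # decaying step size
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        w = [wi * (1.0 - eta * lam) for wi in w]  # regularisation widens the margin
        if margin < 1:                            # inside the margin: hinge step
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            b += eta * y
    return w, b

# Two linearly separable toy clusters.
data = [((-2.0, -2.0), -1), ((-2.5, -1.5), -1), ((2.0, 2.0), 1), ((2.5, 1.5), 1)]
w, b = train_linear_svm(data)

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

assert all(predict(x) == y for x, y in data)
```

The training examples that end up with margins closest to 1 play the role of the support vectors: they alone determine where the separating hyperplane settles.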

The original concept for an optimal hyperplane algorithm was a linear classifier.


[Figure 1.14 graphic: class 1 and class 2 data points, support vectors, margins, and the separating hyperplane.]

Figure 1.14: A 2D example of a decision algorithm in an SVM. The content of each data point's feature vector determines its position in the classification space. The training data is split into two classes: class one (squares) and class two (circles). The SVM finds the data points that describe the optimal margin between the two classes; these support vectors (blue squares for class one, and red circles for class two) are used to determine the classification hyperplane, which is then used to classify any additional testing data. Because SVMs use a maximal-margin hyperplane, they are less prone to overfitting their classification models to training data than other machine learning algorithms.

However, this was later expanded to create non-linear classifiers by applying the kernel trick, proposed by Aizerman et al. (1964), to maximum-margin hyperplanes. A linear classifier uses a dot-product of feature vectors as part of its algorithm (when mapping to higher dimensional space); the result is an n × n matrix of values from a training set of n vectors. The kernel trick replaces the dot-product with a non-linear kernel function; this allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The resulting transformed space may have a higher dimensionality, thus adding a greater degree of flexibility to the hyperplane. A simple example is shown in Figure 1.15 (page 95). Since kernels are used, the operation performed on each of the input vectors in the transformed space is never explicitly computed (i.e. the step-by-step computational process is bypassed and the final answer is calculated using a shortcut), another reason why SVMs are `black-box' classifiers. This is useful because it is possible to use kernels that create a transformed feature space of infinite dimension (since it is not necessary to make all the step-by-step calculations), such as the Radial Basis Function (RBF), theoretically creating maximum-margin hyperplanes of infinite dimension.

[Figure 1.15 graphic: data that cannot be linearly separated in the lower dimensional space becomes separable by a hyperplane in the higher dimensional space.]

Figure 1.15: An example of higher dimensional mapping by an SVM. When attempting to distinguish between two types of data (in this case squares and circles), sometimes there is no simple solution in lower dimensional space. In such cases, an SVM can transform the feature space to a higher dimensionality by applying a kernel function. In a higher dimensional space, a decision hyperplane may be easier to construct.
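A minimal numerical illustration of the kernel trick (using a degree-2 polynomial kernel rather than the RBF, since its explicit feature map is finite and easy to write down):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(u):
    """Explicit degree-2 feature map for 2D input: (x^2, y^2, sqrt(2)*x*y)."""
    x, y = u
    return (x * x, y * y, math.sqrt(2) * x * y)

u, v = (1.0, 2.0), (3.0, 4.0)
k_direct = dot(u, v) ** 2        # kernel k(u, v) = (u.v)^2, evaluated in input space
k_mapped = dot(phi(u), phi(v))   # same value via the explicit higher-dim mapping
assert abs(k_direct - k_mapped) < 1e-9
```

The kernel gives the inner product in the transformed space without ever constructing that space; for the RBF kernel the corresponding feature map is infinite-dimensional, so the shortcut is not merely convenient but essential.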

SVMs are universal learners. Remarkably, they can learn independently of the dimensionality of their feature space. They measure the complexity of their hypotheses based on the margin with which they separate the input data, and not on the number of features. Effectively, this means that they can generalise a classification hypothesis using only relevant features, even when presented with many others, if the data is separable with a wide margin using functions from the hypothesis space. When using many different forms of input data (as is often the case in a fold recognition ensemble), this ability may be very useful when rational construction of an ensemble is not possible. However, because SVMs learn independently of feature space, they cannot specify which particular feature information is used by the classifier; in practice, this means that each information feature must have a value even if it is redundant. As a result, when computational resources and processing time are limited, SVMs may be impractical.

1.8

Scope and Outline of this Thesis

The methods used in protein fold recognition and ensemble development, as described in the previous sections, underline a complex problem: how best to combine such an array of disparate technologies into a single fold recognition ensemble in order to identify the known template most suitable for modelling a given query. This work examines the development and optimisation of a series of different recognition algorithms. In addition, it provides a comparative analysis of how these recognition algorithms might be combined in an ensemble using some of the more successful techniques currently available. Based on these analyses, an explanation is proposed as to why fold recognition ensembles often produce more accurate results than single methods. Finally, by examining the shortcomings of the current state-of-the-art, this work attempts to improve recognition efficiency by proposing new ideas for ensemble construction.

This thesis will not attempt to address the task of improving the quality of models built for a given query; the focus is purely on the task of developing a better way of identifying suitable template proteins, which can then be used to build query models.


Briefly, the main aspects of this work are:

· Chapter 2 (page 99) provides a more detailed account of the problems addressed by this thesis, including the practical considerations and benefits of an improved fold recognition ensemble; it then provides a short outline of what was done in an attempt to solve these problems, and the eventual creation of `Phyre' (the Protein Homology/analogY Recognition Engine), the successor to the `3D-PSSM' server. In addition, this chapter describes the underlying requirements for constructing an effective ensemble, including the management of the data used for training and testing, and the metrics used in the final assessments.

· Chapter 3 (page 126) describes the development of `Dynamic', a powerful dynamic programming recognition package, which is the main recognition engine in `Phyre'. `Dynamic' is a robust and flexible search tool that includes many features such as E-value calculation, data integrity checking, structure-specific alignment parameters, and an easily extendible range of dynamic programming alignment algorithms (including sequence-profile and profile-profile algorithms). The chapter also includes details about the development of a new (at the time) profile-profile algorithm: the Bi-Directional Heterogeneous Inner Product (B-DHIP) algorithm. It ends with a description of the optimisation and benchmarking of 31 different recognition algorithms that were later used in the development of the ensemble.

· Chapter 4 (page 170) describes the development of the final ensemble algorithm used in `Phyre', and how it was shown to be superior to previous methods of generating ensembles used in Meta servers. The chapter includes a description of a failed attempt to produce a fold recognition ensemble using Bagging and Boosting (two methods well-known in the field of machine learning for producing high quality classifier ensembles); potential reasons for this failure are also discussed. The remaining sections describe how ensembles were built using SVMs, 3D-JURY clustering, and 3D-COLONY clustering (a new protocol developed for this research), and how 3D-COLONY was ultimately found to be the most accurate method for enhancing fold recognition.

· Chapter 5 (page 217) includes a brief analysis examining how much of an increase in accuracy the single best ensemble system offers over the single best recognition algorithm; the analysis concludes with an explanation of how this improvement is achieved.

· Chapter 6 (page 223) ends the thesis with a brief discussion of the final results and provides suggestions for possible future work; in particular, the potential for including additional protein structural data into the recognition ensemble.


Chapter 2 Development of `Phyre'

2.1 Summary

This chapter describes the overall scope of this thesis, describing the nature of the problem of constructing an effective fold recognition ensemble, how this problem was addressed, and what resources were necessary for assessing the success of the proposed solution. § 2.2 provides a brief description of the challenges of carrying out this research and the methods that were used to overcome them. § 2.3 outlines the development of `Phyre' (the Protein Homology/analogY Recognition Engine) and its two main components: `Dynamic' (the central recognition algorithm engine), and the 3D-COLONY ensemble system. § 2.4 describes, in detail, the requirements for a thorough fold recognition benchmark, including the nature of the template fold library developed from the SCOP30 database, the selection of a suitable set of query proteins to train and optimise the recognition algorithms, and the selection of a suitable set of query proteins to test the optimised algorithms. Finally, § 2.5 provides details about the selection of the methods used for the benchmarking assessment. These include the average precision metric (for measuring the quality of the recognition algorithms), the simplex method (used in the optimisation of the algorithm parameters), and the empirical precision standardised scoring framework (used extensively during the development of the ensemble system).

2.2

Aims and Objectives

The primary objectives of this research were: to build on the lessons of the first five CASP evaluations, to develop a new fold recognition server to succeed `3D-PSSM', and to attempt to understand the reasons why Meta predictors perform so well (when compared to stand-alone methods) in order to carry out further work to improve their accuracy.

The main focus of this work is the development of an enhanced method of determining the correct structural fold for a given query protein sequence, using an ensemble of optimised recognition algorithms. All analyses are performed on ordered, high quality, well annotated template protein structures taken from the ASTRAL compendium (Brenner et al., 2000); as a result, issues such as taking into account the model quality of the templates used are not addressed. This work focuses purely on the development of a better way to identify suitable template proteins from a databank of known structures; other features necessary to the construction of a successful fold recognition server, such as domain boundary determination and model refinement, are not discussed.

In order to create an enhanced fold recognition server, it was first necessary to identify the potential avenues of research that were likely to yield success. Recent events at CASP5 had shown that enhancements in fold recognition could be made through strategic use of:

· profile-profile comparisons (see Figure 2.1(a), page 102);

· inclusion of structural information (see Figure 2.1(a), page 102);

· and methods that combine results in an ensemble (see Figure 2.1(b), page 102).

Therefore, it was necessary to systematically analyse each of these areas and develop new methods to improve overall recognition performance. Through careful examination, it was possible to identify areas of potential improvement in several state-of-the-art systems, and to devise novel ways of circumventing these shortcomings. Using the results from this research, `Phyre' (the Protein Homology/analogY Recognition Engine) was developed as a stand-alone fold recognition Meta server.

2.3

Overview of `Phyre' Development

The motivations behind the development of `Phyre' were manifold. First and foremost was the desire to develop a fold recognition server capable of succeeding `3D-PSSM' and contending with the current state-of-the-art servers at the CASP evaluation. To succeed, it was also necessary to examine the reasons behind the success of ensemble methods and advanced recognition techniques, and attempt to build an understanding of how to employ such systems in order to expand the boundaries of fold recognition. Finally, the research aimed to demonstrate that stand-alone servers have the capacity to match the success of Meta servers, and, possibly, to exceed their capabilities by exploiting the advantages that a centralised and controlled ensemble system has over a diffuse ensemble system.

2.3.1

Designing the Assessment

In general, fold recognition systems rely on the detection of similarity between a protein sequence of interest and another sequence of known three-dimensional structure. As described in § 1.7.1 (page 84), three major factors contribute to the success of these template-based structure prediction methods:

1. Nature of the algorithm for query-template comparisons.

2. Search parameter estimation.


[Figure 2.1 graphic: (a) a sequence profile aligned against a (meta) profile augmented with predicted local structure; (b) models collected from prediction servers combined into a 3D-JURY consensus model.]

Figure 2.1: Protein structure prediction methods. (a) Profile-profile comparison methods utilise the profiles generated by the above mentioned sequence alignment methods. Instead of referencing a substitution score, these methods compare two vectors with each other when building the dynamic programming matrix used to draw the alignment. The comparison is usually conducted by calculating a dot-product of the two positional vectors (as shown in the figure), or by multiplying one vector by a substitution matrix by the other vector; however, there is no set algorithm. Depending on the choice of the comparison function, the vectors are often rescaled before the operation. The sequence variability vectors are sometimes also augmented with Meta information, such as predicted secondary structure, as indicated in the figure. (b) Meta predictors represent a statistical approach to improving the accuracy of protein structure predictions. Simple Meta predictors collect models from prediction servers, compare the models, and then select the one most similar to all the other models. The consensus model corresponds to a model selected from the collected set and represents the final prediction. Derived from a figure in Ginalski et al. (2005), originally published under open access by Oxford University Press.


3. Nature of the template structures and their profiles.

Existing Meta servers are powered by a set of algorithms for which all three of these factors vary. This hinders an understanding of their relative contribution to the overall effectiveness of the ensemble, and limits the feasibility of combining them optimally. In contrast, all fold recognition algorithms used in this research have been trained on the same data, under the same optimisation strategy, and tested under identical conditions. Thus, the performance of the ensembles built for this work is generated solely by algorithmic and parametric variety.

Combining classifiers or predictive algorithms in ensembles to improve performance is an established research area shared between statistical pattern recognition and machine learning. Unfortunately, even after several decades of research, the theoretical groundwork of ensemble theory does not yet provide us with a recipe for creating optimal ensembles (Kuncheva & Whitaker, 2003; Zhou et al., 2002). As a result, various heuristics must be explored.

In order to design and build an enhanced fold recognition ensemble system, there are several prerequisites: first, to develop and benchmark software capable of performing a variety of different recognition algorithms while utilising as much information as possible; second, to develop a standardised scoring framework so that all methods can be reliably compared; third, to explore a variety of different techniques that can be used to combine individual fold recognition systems into an ensemble; and finally, to devise a procedure to select an optimal, or quasi-optimal, subset of individual methods that constitute the final ensemble.

2.3.2

`Dynamic'

In order to analyse and test many different fold recognition algorithms, it was necessary to develop software that was flexible enough to perform all of these. The `Dynamic' package was developed to meet this need; `Dynamic' is a powerful application, capable of aligning any number of sequences and/or profiles, using a multitude of alignment algorithms, and calculating all the necessary statistics to provide a measure of recognition accuracy. Many different sequence-sequence, sequence-profile, and profile-profile comparison methods are included in `Dynamic', as well as a capability to use structural sequences and profiles in the alignments. This, in turn, provides a large number of possible algorithm combinations to generate sufficient diversity for an effective ensemble.

2.3.3

Analysis of Individual Recognition Algorithms

Each of the individual recognition algorithms used in this research was trained and optimised on a common set of remote homologies taken from SCOP30. Each method was then benchmarked using a carefully chosen testing set. PSI-BLAST successfully identified 21.1% of all correct homologous query-template (QT) relationships in the testing set at high confidence (i.e. 95% precision or above); these QT relationships provided enough coverage to accurately annotate 23 queries (out of a possible 50) in the same testing set. Similarly, the single best method, from the pool of optimised recognition algorithms, confidently identified 54.2% of all QT relationships, which accurately annotated 36 queries. A full description of the benchmarking performed using `Dynamic' is included in § 3 (page 126).

2.3.4

Ensemble Analysis

Following the benchmarking of the many different recognition algorithms using `Dynamic', the next task was to analyse the best methods for combining all these results into a reliable ensemble that could extract the correct answers.

When combining the results of different algorithms, it is useful to have a common scoring framework. For this purpose, the empirical precision (or EP) metric was developed (see § 2.5.3, page 120). The EP metric is a standardised scoring scheme from 0 to 1, reflecting the empirically derived probability that a match is correct, based on the results of the training for a given recognition algorithm. The results from all the individual fold recognition algorithms were converted to this common scoring framework.
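By way of illustration only (the precise EP definition is given in § 2.5.3), a score-to-precision mapping of this general kind can be sketched as follows; the benchmark data below are hypothetical:

```python
def empirical_precision(raw_score, benchmark):
    """Illustrative sketch (not the thesis's exact EP definition): the
    fraction of training-benchmark matches scoring at least raw_score
    that were correct, i.e. an empirically derived probability in [0, 1]."""
    above = [correct for score, correct in benchmark if score >= raw_score]
    return sum(above) / len(above) if above else 0.0

# (raw algorithm score, was the match correct?) pairs from a training benchmark
benchmark = [(9.0, True), (8.0, True), (7.0, False), (6.0, True), (5.0, False)]
assert empirical_precision(8.5, benchmark) == 1.0   # only correct matches above 8.5
assert empirical_precision(5.0, benchmark) == 0.6   # 3 of 5 matches were correct
```

A mapping of this form places every algorithm's raw scores on the same 0-to-1 scale, which is what makes their results directly comparable within an ensemble.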

Combining the results of fold recognition algorithms can be performed in several ways. However, in general terms, these fall into two categories: those that work in three-dimensional space, using the models produced by the individual algorithms, e.g. by making structural comparisons between models; and those that work purely on the scores and protein identifiers, e.g. counting the number of semi-confident matches a query may have to the members of a given superfamily. After examining many different methods for building ensembles, a new algorithm called 3D-COLONY was developed, which proved to be more accurate than any of the other ensembles that were benchmarked during this research. 3D-COLONY is a hybrid of the two clustering schemes described above, and combines structural comparison methods from 3D-JURY (see § 1.6.3.2, page 76) with techniques that attempt to model the free energies of potential structures in order to estimate their reliability (see § 4.6, page 196). The protocol is similar to the colony energy approach, used in loop modelling, where the confidence measure is analogous to an enthalpy term, and the structural similarity score is analogous to an entropy term (Xiang et al., 2002).
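Purely as an illustration of the colony-energy analogy (the actual 3D-COLONY formulation is described in § 4.6; the functional form and parameter below are hypothetical), a consensus score of this kind rewards a model that sits in a well-populated, high-confidence structural neighbourhood:

```python
import math

def colony_style_score(conf, sim, i, alpha=1.0):
    """Hypothetical colony-energy-style score for model i: each neighbour j
    contributes an 'enthalpy' (confidence) term conf[j], boosted by an
    'entropy'-like structural similarity term sim[i][j]."""
    return math.log(sum(math.exp(conf[j] + alpha * sim[i][j])
                        for j in range(len(conf))))

# Three models: 0 and 1 are structurally similar and confident; 2 is isolated.
conf = [1.0, 1.0, 0.5]
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
assert colony_style_score(conf, sim, 0) > colony_style_score(conf, sim, 2)
```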

Given a means of combining many different fold recognition algorithms, it is then necessary to determine an optimal, or quasi-optimal, subset of methods for use in an ensemble. A simple pooling of all methods is rarely the best option. Commonly occurring false positive results, generated by different algorithms that are not sufficiently diverse, can lead to an ensemble that performs worse than the single best component algorithm (Kuncheva & Whitaker, 2003). The rigorous selection of a combination of methods, which optimises ensemble performance, leads to a combinatorial explosion. To avoid this problem, a heuristic method was used, which performed a `greedy' build-up of the ensemble by adding component methods one at a time; searching for the best pair, best triplet, etc. This allowed a patchwork of algorithms to be selected, with the aim of maximising ensemble performance on a stringent testing set of remote homologues.
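The greedy build-up can be sketched as follows (an illustrative sketch; the toy payoff table stands in for benchmarking an ensemble on the testing set and is entirely hypothetical):

```python
def greedy_ensemble(methods, evaluate):
    """Greedy forward selection: starting from the empty ensemble, repeatedly
    add whichever method most improves performance; stop when no single
    addition helps. `evaluate` scores a set of methods on a benchmark."""
    chosen, best = [], float("-inf")
    while True:
        candidates = [m for m in methods if m not in chosen]
        if not candidates:
            break
        score, method = max((evaluate(chosen + [m]), m) for m in candidates)
        if score <= best:
            break                  # no addition improves the ensemble
        chosen.append(method)
        best = score
    return chosen

# Toy payoff: "a" and "b" are complementary, "c" adds only correlated noise.
payoff = {frozenset(): 0, frozenset("a"): 3, frozenset("b"): 2, frozenset("c"): 1,
          frozenset("ab"): 5, frozenset("ac"): 3, frozenset("bc"): 3,
          frozenset("abc"): 4}
evaluate = lambda ms: payoff[frozenset(ms)]
assert greedy_ensemble(list("abc"), evaluate) == ["a", "b"]
```

Note that the heuristic evaluates only best-pair, best-triplet, etc. extensions of the current set, so it avoids the combinatorial explosion of exhaustive subset search at the cost of possibly missing the global optimum.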

As an alternative method of circumventing the problem of component selection, an investigation was conducted into the use of Support Vector Machines (SVMs) in classifying candidate superfamily matches as correct or incorrect, using information from all individual fold recognition methods simultaneously. Alignment scores and structural clustering scores were combined into high dimensional feature vectors for both the training and testing data sets (see § 4.5, page 185).

The best ensemble in this research (3D-COLONY) detected 64.0% of all correct homologous QT relationships at 95% precision or above. Overall, 41 of the 50 individual query proteins in the testing set could be accurately annotated at the same precision level. In comparison to the improvement that the single best method (from the benchmark described above) had over PSI-BLAST, this represented a 29.6% increase in the number of correct homologous QT relationships, and a 46.2% increase in the number of accurately annotated queries. This best performing ensemble forms the core of `Phyre'. A full description of the analyses performed during ensemble development is included in § 4 (page 170).

2.3.5

The Final `Phyre' System

The final `Phyre' fold recognition system, developed as a result of this research, is summarised in Figure 2.2 (page 108). For every template protein in the databank, structural and sequence data is extracted (i.e. primary structure, secondary structure, primary structure 3-iteration PSI-BLAST PSSM, and PSIPRED predicted secondary structure and secondary structure profile). For a given query protein, the same data is extracted except for the actual secondary structure, which is unknown. The template and query data are passed through 10 different recognition algorithms (selected from the ensemble analysis), the results of which are combined in the 3D-


COLONY structural clustering ensemble. Finally, the clustering ensemble allocates a 3D-COLONY score to the query-template pair. The 3D-COLONY scores are used to select the best template proteins from which to model the query protein structure.
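The overall flow just described can be sketched as follows. This is a hypothetical illustration, not Phyre's actual code: `algorithms` stands for the 10 recognition methods, and `combine` stands in for the 3D-COLONY clustering step, whose internals are not shown.

```python
def rank_templates(query, templates, algorithms, combine):
    """Score every template against the query with each recognition
    algorithm, then reduce the score vector to a single ensemble score."""
    scored = []
    for template in templates:
        scores = [algo(query, template) for algo in algorithms]
        scored.append((combine(scores), template))
    # Highest ensemble score first: the best modelling templates at the top.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

With toy scoring functions, `rank_templates("ACDE", ["WXYZ", "ACDF"], algorithms, sum)` would place the template with the higher combined score first.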

2.4

Fold Library and Data Sets

When benchmarking a protein fold recognition algorithm, three things are necessary: a template fold library against which a query protein sequence can be scanned in order to find its closest structural match, a training set of query protein sequences that can be scanned against the fold library (the results of which will be used to determine the optimal search parameters for the given recognition algorithm), and a testing set of query protein sequences that can be used to ascertain how well the optimised search parameters perform. The accuracy of a specific recognition algorithm, using a given set of optimised parameters, is measured by how well it performs against the testing set, as the only results that are of interest, for benchmarking purposes, are those achieved using previously unseen queries. This stems from the assumption that the testing set should provide a reliable estimate of how well the recognition algorithm will perform in a blind trial.

Following on from these requirements, there are three important issues that must be addressed:

1. It is important to select a template fold library that is suitably representative of the known protein structure space, and diverse enough to ensure that it is evenly distributed throughout that structure space.

2. A diverse training set must be constructed for the purposes of optimising recognition algorithm parameters. Each training query must be suitably different from all the other training queries so that the final set of search parameters is not skewed towards a particular region in structure space.


Figure 2.2: An illustration of a query-template comparison in the `Phyre' fold recognition system. For every template protein in the databank, structural and sequence data is extracted (i.e. primary structure, secondary structure, primary structure 3-iteration PSI-BLAST PSSM, and PSIPRED predicted secondary structure and secondary structure profile). For the query protein, the same data is extracted except for the actual secondary structure, which is unknown. The template and query data are passed through 10 different recognition algorithms (selected from the ensemble analysis), the results of which are combined in the 3D-COLONY structural clustering ensemble. Finally, the clustering ensemble allocates a 3D-COLONY score to the query-template pair. The 3D-COLONY scores are used to select the best template proteins from which to model the query protein structure.


3. An equally diverse query testing set must be constructed in order to assess the robustness and reusability of the optimised parameters. As in the training set, each testing query must be non-homologous to all the other testing queries in order to avoid skewed measures of accuracy. However, each of the testing queries must also be suitably different from each of the training queries in order to ensure a truly blind assessment.

2.4.1

Fold Library

The SCOP database (version 1.65) was selected to be the basis of the template fold library because of its high quality curation and well-delimited hierarchy (see § 1.3.1, page 20). In order to get as even a distribution across structure space as possible, it was decided that a subset of SCOP (based on sequence clustering according to maximum percentage pairwise identity) should be used. The SCOP30 subset was chosen because of its proximity to the protein homology/analogy Twilight Zone (see § 1.5, page 48); because of this proximity, it was considered that this subset would be likely to provide the most rigorous training and testing. In addition, in order to remove any issues of contiguity that might have arisen in later structural analyses, only single chain domains from SCOP30 were included (reducing the number of entries from 4,821 to 4,804).

The fold library used in the benchmarking contained three major types of protein data: primary structure, predicted secondary structure, and actual secondary structure. The 3D-GENOMICS database (Müller et al., 2002; Fleming et al., 2004) was used to administer the fold library data.

2.4.1.1

Primary Structure

For primary structure, BLOSUM62 was used as the standard background substitution matrix. The 20 amino acid single letter codes and the four additional characters used in the BLOSUM matrix (i.e. `B', `Z', `X', and the gap character), and their respective background frequencies, constituted the accepted alphabet and amino acid probability distribution.

A key design feature in `Phyre' was its ability to use primary structure profile data from readily available sources. The alternative would have been to design elaborate new methods of interpreting multiple sequence alignments in such a way as to capture the essence of the protein family that they represented. It was assumed that using a well-established method of building profiles would make the software more maintainable and more portable; the developmental focus could therefore be on designing effective comparison algorithms rather than on the building of profiles. PSI-BLAST was chosen as the standard method for constructing primary structure profiles because of its stability, established success, and usability. All the primary structure profiles used were 3-iteration PSI-BLAST profiles constructed against a non-redundant protein sequence database derived from GenBank (Benson et al., 2005). Default PSI-BLAST parameters were used during profile construction.

In order to gain a greater understanding of the relative contribution of the factors that affect an ensemble's accuracy, all individual fold recognition methods used in this research were trained on the same data, under the same optimisation strategy, and tested under identical conditions. Thus, the performance of the ensembles constructed from these methods was generated solely by algorithmic and parametric variety (see § 2.3.1, page 101). As a result, no analysis was performed into the importance of variability of profile generation (for example, using 5-iteration PSI-BLAST profiles rather than 3-iteration profiles), as it would have added an extra dimension of complexity beyond the scope of this thesis. However, implementing such changes would be highly likely to produce a more diverse pool of methods, which could potentially improve the accuracy of any ensembles generated in future work.


2.4.1.2

Predicted Secondary Structure

For predicted secondary structure, a simple substitution matrix of +1 for matches and -1 for mismatches was used, as described in Kelley et al. (2000). A three letter alphabet of `C', `H', and `E' was used to represent predicted coil, helix, and sheet respectively. Finally, the respective frequencies of the three predicted structure types (derived from SCOP95 version 1.65 with pseudo-count correction to ensure non-zero values; see Equation 3.2, page 131) were used as the accepted probability values: 0.4200 for coil, 0.3479 for helix, and 0.2321 for sheet.
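These two conventions (reducing a three-state probability profile to a C/H/E string, and scoring with the +1/-1 substitution matrix) can be sketched as follows. The function names are illustrative, and the profile layout (one coil/helix/sheet probability triple per residue) is an assumption:

```python
def ss_sequence(profile):
    """Reduce a per-residue (p_coil, p_helix, p_sheet) profile to a C/H/E
    string by taking the highest-probability state at each position."""
    states = "CHE"
    return "".join(states[max(range(3), key=lambda k: probs[k])] for probs in profile)

def ss_alignment_score(ss_a, ss_b):
    """Score two equal-length (ungapped) secondary structure strings with
    the simple +1 match / -1 mismatch substitution matrix."""
    return sum(1 if a == b else -1 for a, b in zip(ss_a, ss_b))
```

For example, a three-residue profile whose highest probabilities are coil, helix, and sheet reduces to the string "CHE".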

The predicted secondary structure of the queries and templates in the fold library was determined using the PSIPRED program (Jones, 1999a) because of its high level of accuracy and its successful use in other automated servers, such as Robetta (see § 1.6.3.3, page 77). PSIPRED produces a three-value probability profile (one probability each for coil, helix, and sheet), of which the single highest value is used to construct the predicted secondary structure sequence.

2.4.1.3

Actual Secondary Structure

For actual secondary structure, the only sequences that were produced were for the templates in the fold library (since they were assumed to be the only inputs of known structure). Secondary structure was determined using the STRIDE program because of its consistency with human expert secondary structure determination (Frishman & Argos, 1995). STRIDE uses a seven character alphabet consisting of: `H' for α-helices, `G' for 3₁₀-helices, `E' for extended conformation (i.e. β-sheets), `B' or `b' for isolated β-bridges (`B' was used in this research), `I' for π-helices, `T' for turns, and `C' for coils (i.e. all other structures).

Again a simple substitution matrix of +1 for matches and -1 for mismatches was used when aligning predicted secondary structure sequences. For the purposes of this research, `H' and `G' were both regarded as helix, `E' and `B' were both


regarded as sheet, and the remainder were regarded as coil. Finally, the respective frequencies of the seven characters (derived from SCOP95 version 1.65 with pseudo-count correction to ensure non-zero values) were used as the accepted probability values: 0.3151 for `H', 0.0376 for `G', 0.2220 for `E', 0.0118 for `B', 0.1972 for `C', 0.0001 for `I', and 0.2157 for `T'.

2.4.1.4

Tertiary Structure

As well as considering the sequence similarity between entries in the chosen fold library, it was also necessary to take into account the quality of the structures available for those entries. Every entry in SCOP is likely to correspond to multiple entries in the PDB; the PDB contains coordinate entries of varying quality that may contain irregularities (Brändén & Jones, 1990). Therefore, it is necessary to be highly selective when choosing a particular PDB entry to represent a given SCOP protein. To solve this problem, 3D-GENOMICS (see § 2.4.1, page 109) uses the PDB codes determined by the ASTRAL compendium (Brenner et al., 2000) as the representative structures for each SCOP entry.

The ASTRAL compendium originally used its Summary PDB ASTRAL Check Index (SPACI) score to provide a first-order estimate of the quality of crystallographically determined protein structures (Brenner et al., 2000). The SPACI score was designed to act as a guide to selecting the best structure, rather than as a replacement for manual curation. It incorporated three different quality assessment metrics: the resolution of the original data (the minimum plane separation within the crystal lattice for which a diffraction pattern is produced), how well the model fitted the data (the R-factor), and stereochemical check parameters, which indicate how well the structure complies with standard molecular geometry according to WHAT CHECK (Hooft et al., 1996) and PROCHECK (Morris et al., 1992). Many enhancements have been added to the compendium (Chandonia et al., 2002). When benchmarking of `Dynamic' began (2004), the latest SCOP release was version 1.65, and ASTRAL had recently begun selecting PDB representatives of SCOP entries


using Aberrant Entry Re-Ordered SPACI (AEROSPACI) scores (Chandonia et al., 2004). AEROSPACI scores are similar to SPACI scores except that PDB entries manually annotated by the SCOP authors as aberrant (i.e. chimeric, circularly permuted, disordered, missing large regions, erroneous, misfolded, mistraced, mutant, or truncated) are penalised so that they are less likely to be chosen as representative structures.

For the purposes of large-scale analysis, the ASTRAL compendium provides a robust system of verifying large numbers of protein structures based on expert knowledge. It is primarily for this reason that it was chosen as the source of structures for the template fold library.

2.4.2

Building Training and Testing Data

In order to select training and testing sets that were suitably diverse, a number of entries from the SCOP30 database were randomly selected and used as queries. This ensured that all the queries would be less than 30% identical to any single entry in the template fold library (since that too was based on the SCOP30 database). Similarly, all the queries in the training and testing sets would be less than 30% identical to each other.

Despite the great care taken in the curation of the SCOP database and the ASTRAL compendium, several additional precautions were necessary, when choosing the training and testing sets, to avoid skewed results during benchmarking as well as to provide the most useful results. The primary goal of this research was to develop an automated ensemble system capable of recognising appropriate folds for an unknown query, and selecting a template from which an accurate model of the query could be built. Therefore, it was necessary to select an appropriately diverse and dissimilar query set from the fold library, and to identify the most structurally similar templates as the best for using in model construction. The selection procedure


was as follows:

1. All domains annotated as Rossmann or Rossmann-like were excluded from selection since many such domains are putatively homologous between folds and superfamilies (Sadreyev & Grishin, 2003). For a full list of excluded folds and superfamilies, see Table 2.1 (page 116).

2. Representative queries for both the training and testing sets were selected at random from SCOP30. No fold (and therefore no superfamily) was represented more than once. Similarly, no fold present in the training set was allowed to appear in the testing set.

3. To prevent skewing of the benchmarking results, no queries from superfamilies with more than 100 members in SCOP30 were selected.

4. Since the aim of fold recognition is to provide plausible structures for novel proteins, it was necessary to identify the fold and superfamily members in SCOP30 that would act as the most accurate templates for the selected queries. As was the case at CAFASP3 (Fischer et al., 2003), two domains were regarded as structurally similar if they shared a MAMMOTH Z-score of at least 5.0 (Ortiz et al., 2002).

5. In order to provide adequate representation, each of the selected training set queries had at least 5 superfamily members in SCOP30 (itself excluded) that shared with it a MAMMOTH Z-score of 5.0 or more. Due to size limitations, and the previously listed restrictions, testing set queries were only required to have 2 superfamily members (itself excluded) that fulfilled this condition.

Using this filtering and selection procedure, a final training set of 105 non-redundant queries (comprising 1,124 individual query-template relationships), and a final testing set of 50 non-redundant queries (comprising 247 individual query-template relationships) was constructed. A complete list of both sets can be found


in Appendix C (page 247).

When assessing the accuracy of a given recognition algorithm in this research, only superfamily members that share a MAMMOTH Z-score of at least 5.0 with their respective query were considered to be correct. However, it would have been counterproductive to count the remaining superfamily members, or other members of the respective fold, as incorrect because (even if they did not share enough structural similarity with the query to be useful in model building) such results could still provide additional useful sequence, structural, or functional information when used in a real-world analysis. As a result, any such matches were simply ignored when assessing the benchmarks.
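This three-way assessment rule can be expressed as a small helper. The function name and arguments are hypothetical; `z_min` encodes the MAMMOTH Z-score threshold of 5.0:

```python
def assess_match(same_superfamily, same_fold, mammoth_z, z_min=5.0):
    """Benchmark assessment rule: superfamily members with a MAMMOTH
    Z-score >= z_min are correct; other superfamily or fold members are
    ignored; everything else is incorrect."""
    if same_superfamily and mammoth_z >= z_min:
        return "correct"
    if same_superfamily or same_fold:
        return "ignored"  # excluded from precision/recall counts
    return "incorrect"
```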

It should be noted that both the training and testing sets were also adapted into what were termed CASP-like data (in homage to the CASP evaluation; see § 4.4, page 182). These data sets were used as part of the benchmark designed to reflect a `real-world' situation of difficult structure prediction targets, where no individual constituent method within an ensemble can provide a confident template match to a query. All ensembles in this research were constructed using the CASP-like training set, and then tested using the CASP-like testing set. This meant that any homologous relationships that were confidently detected by the ensemble were found solely because of the ensemble and not because of any individual constituent methods of high accuracy, i.e. the ensemble was able to combine many weak structure predictions in a synergistic manner and produce a strong overall structure prediction. The best performing ensembles were also tested using the full testing set (i.e. the standard test data containing all answers) to see how well they performed in comparison to the individual fold recognition algorithms and PSI-BLAST.


Table 2.1: Rossmann and Rossmann-like folds and superfamilies from SCOP version 1.65.

Rossmann and Rossmann-like fold and superfamily names:
1. NAD(P)-binding Rossmann-fold domains
2. Nucleotide-binding domain
3. MurCD N-terminal domain
4. Nucleoside phosphorylase/phosphoribosyltransferase catalytic domain
5. Cryptochrome/photolyase, N-terminal domain
6. PreATP-grasp domain
7. DHS-like NAD/FAD-binding domain
8. GckA/TtuD-like
9. Dehydroquinate synthase-like
10. Prismane protein-like

2.5

Benchmarking Assessment

Benchmarking a recognition algorithm is a relatively simple task which involves determining the ideal search parameters for producing the most accurate results. The difficulty is in deciding which measure of accuracy -- or fitness -- to use. In the case of protein fold recognition algorithms, their measure of fitness should reflect how well they perform in finding template proteins that are structurally similar to a given query protein, while ignoring unrelated template proteins.

For this research, the measures of precision and recall were used extensively. Precision and recall values for a fold recognition algorithm can be measured at any point in a set of results, and are defined as:

Precision = \frac{TP}{TP + FP}    (2.1)

and:

Recall = \frac{TP}{TP + FN}    (2.2)

where TP represents the number of true positives (correct matches that have


been found), FP represents the number of false positives (incorrect matches that have been found), and FN represents the number of false negatives (correct matches that have not yet been found). Essentially, precision is the proportion of answers that have been accurately identified as correct (out of all the answers that the algorithm has identified as correct), and recall is the proportion of correct answers that have been identified (out of all the correct answers that exist).

One method of measuring the fitness of an algorithm is to concatenate the results from all the queries in a training set, sort the results by E-value, and record the highest recall value that occurs at 95% precision; as it is possible to have more than one recall value, when precision is at 95%, only the highest value is of interest. The higher the recall value, the `fitter' the set of search parameters used to create the results. The problem with this method is that it is possible for a given query, with a particularly large superfamily (e.g. Immunoglobulins), to saturate the results at 95% precision simply because it has so many homologues.
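The fitness measure described above can be sketched as follows, assuming the concatenated results have already been sorted by E-value and reduced to 1 (TP) / 0 (FP) labels; the function name is illustrative:

```python
def best_recall_at_precision(ranked_labels, total_correct, threshold=0.95):
    """Walk an E-value-sorted list of 1 (TP) / 0 (FP) labels and return the
    highest recall reached while precision is >= threshold."""
    tp = fp = 0
    best = 0.0
    for correct in ranked_labels:
        if correct:
            tp += 1
        else:
            fp += 1
        if tp / (tp + fp) >= threshold:
            best = max(best, tp / total_correct)
    return best
```

Because precision can dip below and later climb back above the threshold, the function keeps the maximum recall seen at or above it, matching the "only the highest value is of interest" rule in the text.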

2.5.1

Average Precision -- Metric of Recognition Quality

A better way of measuring fitness is to use a metric that provides a complete overview of the general quality of an algorithm. The advantage of this type of measure is that results from difficult queries are given the same weighting as results from easier queries, meaning that ideal search parameters must provide good results for all training examples rather than a select few. The metric used in this work is the average precision (AP) criterion (as described in Chen, 2003, originally cited from Salton, 1991).

The AP criterion has been used extensively in the performance evaluation of text and audio database retrieval systems (http://trec.nist.gov). The method is very simple and prevents data loss by taking account of all the results in the final measurement. Firstly, the results from all the queries in the training set are concatenated


and sorted by E-value. Assuming the returned list of entries contains N correct answers, Equation 2.3 is the formula for calculating the AP, where p_i is the rank of the i-th true positive. Note that i/p_i is the precision value at the i-th true positive in this iterative process, and Equation 2.3 is an approximate integral to calculate the area under the resulting Precision-Recall curve:

AP = \frac{1}{N} \sum_{i=1}^{N} \frac{i}{p_i}    (2.3)
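Equation 2.3 translates directly into code over a ranked list of TP/FP labels; this is a minimal sketch with an illustrative function name:

```python
def average_precision(ranked_labels):
    """AP over an E-value-sorted list of 1 (TP) / 0 (FP) labels: p_i is the
    rank of the i-th true positive, and i/p_i the precision at that rank."""
    n_correct = sum(ranked_labels)
    total = 0.0
    seen = 0
    for rank, correct in enumerate(ranked_labels, start=1):
        if correct:
            seen += 1
            total += seen / rank
    return total / n_correct
```

For example, the ranked labels [1, 0, 1] give AP = (1/1 + 2/3) / 2.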

One possible alternative to this metric is the mean-AP. This measure involves averaging the AP scores for a set of queries, easily done since an AP value is a normalised ratio between 0 (worst) and 1 (best). The mean-AP is calculated using Equation 2.4, where n is the number of queries. This measure prevents queries with large numbers of correct answers from saturating the final measure of fitness:

mean-AP = \frac{1}{n} \sum_{i=1}^{n} AP_i    (2.4)

Using the mean-AP to measure recognition accuracy can provide a robust metric when multiple queries are involved. However, in this research, the primary goal was to optimise search parameters that would not only produce accurate results, but would also produce E-values that were as comparable as possible between queries, i.e. a given E-value for a particular query would represent the same level of confidence (in the accuracy of the result) as the same E-value for any other query. Hence, the measure of accuracy that was adopted was the AP of the results for all queries, concatenated together and sorted by E-value.

As noted in § 2.4.2 (page 113), when assessing the accuracy of a given recognition algorithm, only superfamily members that share a MAMMOTH Z-score of at least 5.0 with their respective query were considered correct. However, any other members of the respective superfamily or fold were simply ignored in the AP calculation, since regarding them as incorrect was likely to be counterproductive.


2.5.2

Simplex Method for Function Minimisation

Once the metric of recognition quality had been chosen, the next task was to decide how best to determine the ideal search parameters for each search method. The major assumption here is that there will usually be several different sets of parameters (e.g. insertion cost, affine gap penalty, primary structure weighting, etc) that will produce accurate results for each alignment method. By regarding each parameter as a dimension in a search space, it is possible to use a simulation algorithm to search non-exhaustively through the dimensions and determine which set of parameters provides the most reliable results. The fitness function element of the algorithm, which is used to search the parameter space, measures the quality of the results produced by each set of parameters.

The idea of using a genetic algorithm technique in order to scan the parameter search space was considered; however, the sheer number of alignments that would have needed to be run, to complete the algorithm for a single method, rendered the idea impractical. Instead, the simplex method for function minimisation (Lagarias et al., 1998; Nelder & Mead, 1965) was chosen as a means of scanning the parameter search space and refining the quality of the results. In this research, the fitness function used was 1 - AP of the combined training set. The parameter values (that were used by the simplex to begin its search) were chosen at random. The fitness function is 1 - AP because a simplex minimises a given function; therefore, since AP values range between 0 and 1, the goal of the simplex is to maximise the AP.

A simplex is essentially a polytope of N + 1 vertices, where N is the number of dimensions it is searching (in the case of this research, the number of different parameters being benchmarked). It possesses a series of four scalar parameters which control the searching algorithm: coefficients of reflection (ρ), expansion (χ), contraction (γ), and shrinkage (σ). These control the rate of search by determining how the simplex changes shape or size in order to home in on an ideal score; it


effectively `walks' through the search space, finding the areas of best fitness. An example is shown in Figure 2.3 (page 121). According to Nelder & Mead (1965), the conditions that are most appropriate for efficient searching are:

ρ > 0,   χ > 1,   χ > ρ,   0 < γ < 1,   and 0 < σ < 1.

The relation χ > ρ, while not stated explicitly in Nelder & Mead (1965), is implicit in the algorithm description and terminology (Lagarias et al., 1998). The nearly universal choices used in the standard Nelder-Mead algorithm are:

ρ = 1,   χ = 2,   γ = 1/2,   and σ = 1/2.

The main advantage the simplex method has over a genetic algorithm is that each `step' it takes requires testing with just one set of parameters against each query protein. The main disadvantage is that a simplex can sometimes get caught within local minima without searching far beyond its boundaries.

As part of this research, two object-orientated Perl modules (called `Vertex.pm' and `Simplex.pm') were developed. Together they are able to perform efficient simplex searches for a given set of parameters within a specified search space. The modules are highly annotated, stable, flexible, and capable of using any desired fitness function, provided the aim is to minimise the final value.
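The simplex moves described above can be sketched in a few lines of Python. This is a simplified illustration under hypothetical names, not the Vertex.pm/Simplex.pm implementation; in the research the function being minimised was 1 - AP of the combined training set:

```python
def nelder_mead(f, x0, steps=200, rho=1.0, chi=2.0, gamma=0.5, sigma=0.5):
    """Minimise f over len(x0) parameters using simplified Nelder-Mead moves:
    reflection (rho), expansion (chi), contraction (gamma), shrinkage (sigma)."""
    n = len(x0)
    # Initial simplex: the starting point plus one unit step per dimension.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += 1.0
        simplex.append(vertex)
    for _ in range(steps):
        simplex.sort(key=f)  # best vertex first
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        reflect = [centroid[i] + rho * (centroid[i] - worst[i]) for i in range(n)]
        if f(reflect) < f(best):
            # Reflection found a new best point: try moving even further.
            expand = [centroid[i] + chi * (centroid[i] - worst[i]) for i in range(n)]
            simplex[-1] = expand if f(expand) < f(reflect) else reflect
        elif f(reflect) < f(second_worst):
            simplex[-1] = reflect
        else:
            # Contract the worst vertex towards the centroid.
            contract = [centroid[i] + gamma * (worst[i] - centroid[i]) for i in range(n)]
            if f(contract) < f(worst):
                simplex[-1] = contract
            else:
                # Shrink every vertex towards the current best.
                simplex = [best] + [
                    [best[i] + sigma * (v[i] - best[i]) for i in range(n)]
                    for v in simplex[1:]
                ]
    return min(simplex, key=f)
```

On a smooth two-parameter fitness surface this sketch homes in on the minimum, illustrating why each step needs only a handful of fitness evaluations, in contrast to a genetic algorithm's population-wide evaluations.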

2.5.3

Empirical Precision -- Standardised Scoring Framework

When performing a databank search, the usual method of measuring confidence is by E-value. E-values provide an indirect measure of the precision of the results at a given threshold. That is to say, a given E-value threshold suggests that results with a similar, or better E-value, have a particular chance of being correct or incorrect. In theory, standard E-values should be an adequate measure for comparing different


Figure 2.3: An example illustrating the simplex search algorithm (a two-dimensional parameter space with axes Parameter A and Parameter B). A simplex calculates its `fitness' using a `fitness function' that takes, as its input, parameter values (as defined by the simplex's position in the parameter space). A simplex searches its parameter space until it finds a minimum in the fitness function. In the example, the red area represents the minimum of the target parameter function; the graded colours represent progressive movement away from the minimum. Starting with a simplex of n + 1 points in the n-dimensional parameter space, a series of steps is taken, most of which just move the point of the simplex with the highest objective function through the opposite face of the simplex to a lower point. Other search directions are generated by reflection, expansion, and contraction of the simplex from the previous step. The white circle represents the final position of the simplex, having found the minimum of the target parameter function.


sets of results. However, this is not always the case. E-values are calculated from data fitted to an extreme value distribution (see § 1.4.5, page 39), and such distributions can be inconsistent when used to compare results from different recognition algorithms (e.g. two different profile-profile comparison methods), or when used to compare results from using queries of different lengths. This is usually caused by deviation of the results' score distribution from a true extreme value distribution. As a result, different systems tend to produce E-values in different ranges for a given level of precision. In addition, E-values are measured on a logarithmic scale, making them problematic when used for direct linear comparison. This raises the question of how best to compare sets of results from two different recognition algorithms if E-values are not completely suitable.

During each stage of the parameter optimisation described in § 2.5.2 (page 119), the results for the respective recognition algorithm (using all training queries) are concatenated, re-sorted according their E-values, and the average precision is calculated (as part of the fitness function). In order for a meaningful comparison to be made between two sets of ranked results, it is necessary to have a normalised confidence measure for each QT pair. The simplest method is to label each QT pair in a concatenated results file with its respective precision value (see Figure 2.4, page 123). This provides a normalised score guaranteed to be between 0 and 1, providing an accurate measure of the proportion of correct answers at any given point in the results file. These scores can also be used as a simple probability measure of any given result being correct. As an extension of this reasoning, once a set of results has been labelled in this way, it is possible to use the data as a function for mapping E-values to precision values. Applying this function, it is possible to derive an empirical precision (EP) value for a set of results where the correct answers are unknown (see Figure 2.5, page 124).
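One plausible realisation of this labelling and mapping is sketched below. The thesis does not specify the exact lookup scheme, so the step-function interpolation here is an assumption, and the function names are illustrative:

```python
import bisect

def precision_column(ranked_labels):
    """Label each rank of an E-value-sorted results list with its precision
    (TP / (TP + FP)) at that point, as in a training output file."""
    tp = fp = 0
    out = []
    for correct in ranked_labels:
        if correct:
            tp += 1
        else:
            fp += 1
        out.append(tp / (tp + fp))
    return out

def make_ep_mapper(train_evalues, train_precisions):
    """Return a step function mapping a new E-value to an empirical
    precision, given labelled training results (E-values sorted ascending)."""
    def ep(evalue):
        i = bisect.bisect_right(train_evalues, evalue)
        return train_precisions[0] if i == 0 else train_precisions[i - 1]
    return ep
```

A testing E-value is assigned the precision recorded at the last training result with an equal or better (smaller) E-value; E-values better than anything seen in training inherit the top-of-file precision.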

In this research, for each set of optimised parameters, all training set queries were scanned against the template fold library, concatenated, and re-sorted by E-


# Tmplt   Query    E-value        Prec.   TP/FP
61496     61401    9.533239e-02   1       1
42628     62591    9.724122e-02   1       1
73928     62591    2.945397e-01   1       1
42130     61401    3.058658e-01   1       1
83834     41388    5.243769e-01   1       1
. . .

Figure 2.4: An example of a training output file, consisting of concatenated results for a number of queries against a databank of templates. Based upon whether the query-template pair is correct (i.e. homologous and structurally similar, see § 2.4.2, page 113), the actual precision value for each E-value is calculated using Equation 2.1 (page 116). Column 1 represents the template identifier, column 2 represents the query identifier, column 3 represents the E-value for the query-template alignment, column 4 represents the level of precision at that point in the file, and column 5 states whether the query-template pair is correct (1) or not (0).


# Tmplt   Query    E-value        EP       TP/FP
35598     35600    2.316202e-03   1        1
38382     59771    9.350415e-01   1        1
83990     68325    2.173770e+00   1        1
37507     37499    2.794353e+00   1        1
61164     59541    4.277444e+00   1        1
77530     59771    8.329147e+00   0.9498   1
. . .

Figure 2.5: An example of a testing output file, consisting of concatenated results for a number of queries against a databank of templates. The E-values for each query-template alignment are used to map to a precision value derived from the respective training data (see Figure 2.4, page 123). These precision values are the empirical precisions. Column 1 represents the template identifier, column 2 represents the query identifier, column 3 represents the E-value for the query-template alignment, column 4 represents the empirical precision at that point in the file, and column 5 represents whether the query-template pair is correct (1) or not (0).


value. Every QT pair, in each results file, was labelled with its respective precision value and subsequently used as part of the function mapping E-values to EP values. These mapping functions were used to label the results files derived from scanning all testing set queries against the template fold library.

It should be noted that actual precision values and EP values are indistinguishable for training results; however, they may or may not be equivalent when analysing testing set data (as the mapping function may not be perfect). Therefore, when discussed in this work, `precision values' will refer to true precision values derived from actual correct answers, and `EP values' will refer to the empirical precision values derived from mapping functions. EP values are used extensively throughout the remainder of this work, particularly when discussing ensemble development (see § 4, page 170). All results discussed refer to testing set data, unless otherwise stated.


Chapter 3 Assessment and Optimisation of Recognition Algorithms Using `Dynamic'

3.1 Summary

This chapter describes `Dynamic', a powerful dynamic programming recognition package used as the main system of fold recognition in `Phyre'. § 3.2 provides a brief description of the aims of this chapter and of `Dynamic' itself, and explores the flexibility with which it performs many combinations of alignment searches. § 3.3 details how data are processed by `Dynamic': the search parameters that were optimised as part of this benchmark, the processing of profile data in order to extrapolate additional information required for several of the alignment algorithms, and the method used in E-value calculation. § 3.4 lists the alignment algorithms that can be performed using `Dynamic', which range from simple sequence-sequence comparisons to profile-profile comparisons. The latter algorithms are described in detail, as is the development of a new (at the time) profile-profile algorithm (B-DHIP), designed as part of this research. Finally, § 3.5 describes the individual recognition methods that were optimised and tested as part of this benchmark, and provides brief conclusions, as well as statistical analyses using the truncated receiver operating characteristic (ROC) of the final test results.

3.2 Introduction

The first step in building an enhanced fold recognition ensemble system is to encode and optimise as many different individual recognition methods as possible. The main aim of this chapter is to report the development of a variety of optimised recognition algorithms that provided a wide coverage of the recognition search space. As a result, the benchmark in § 3.5 (page 142) focuses primarily on the relative performance differences between individual methods rather than their underlying relationships and reasons for such differences. To a certain extent, the choice of algorithms used in the benchmark was fairly arbitrary; each method was selected based on its perceived distinctiveness rather than on a fundamental rationale for the analysis as a whole. Generally, methods were selected based on: evidence of success from previous studies; having intrinsically different algorithms that showed reasonable promise in preliminary analyses; algorithms being developed solely for the purposes of this research; sheer curiosity about the potential performance of certain algorithms.

In order to benchmark the variety of algorithms that were needed for this research, the `Dynamic' recognition package was developed. `Dynamic' is a powerful dynamic programming application capable of aligning a given query and template, or databank of templates, using any algorithms that are encoded into its software. It is coded in C++ using an object-orientated bridging design which allows for a large degree of flexibility when performing alignments or adding new alignment algorithms. As a result, it can perform any combination of over a hundred different alignment methods. It performs its own integrity checks on any data that is passed to it, and also contains statistics modules to calculate E-values. `Dynamic' is the main recognition program used in `Phyre'.

3.3 `Dynamic' Data Processing

This work focuses on processing three main types of protein data: primary structure sequences and profiles, predicted secondary structure sequences and profiles, and actual secondary structure sequences. Full descriptions of the nature of the data used in this research can be found in § 2.4.1 (page 109).

3.3.1 Parameters

When optimising dynamic programming algorithms, there are several parameters to consider:

1. Gap Penalties. These are used to penalise insertions or deletions (indels) in alignments. These penalties can either comprise a fixed value or differ depending on whether they are opening a gap or extending a gap (the affine gap model). A detailed description of gap penalties can be found in § 1.4.2.1 (page 24).

2. Weightings. When combining the scoring schemes of several types of data into an alignment (e.g. an alignment between primary structure sequences with an alignment between predicted secondary structure profiles), it is often the case that one of the data types in the recognition algorithm will be more useful than the others. Under such circumstances, it is a good idea to weight each data type according to its contribution to the final accuracy of the overall algorithm. The default weight for each data type in `Dynamic' is 1.

3. Z-Shifts. The z-shift is a value subtracted from every score in the dynamic programming matrix to ensure that the expected score of aligning two positions at random remains below zero. In `Dynamic' a z-shift is specified for each data type for profile-profile comparisons; the default value in `Dynamic' is 0.
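The interplay of the three parameter types can be sketched as follows (a minimal illustration, not `Dynamic''s actual implementation; the ordering of the weight and z-shift operations, and all numeric values, are assumptions):

```python
def combined_score(scores, weights, z_shifts):
    """Combine per-data-type alignment scores for one cell of the
    dynamic programming matrix: each raw score is z-shifted,
    then weighted, then summed over the data types."""
    return sum(w * (s - z) for s, w, z in zip(scores, weights, z_shifts))

# Example: a primary structure score of 2.0 and a predicted secondary
# structure score of 1.0, with default weight 1 and z-shift 0 each.
print(combined_score([2.0, 1.0], [1.0, 1.0], [0.0, 0.0]))  # 3.0
```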


During benchmarking, these parameters are assigned to dimensions in a simplex (see § 2.5.2, page 119), and the values are optimised according to the average precision of the training system. A full description of the methodology can be found in § 2.5 (page 116).

3.3.2 Structure Specific Gaps

Gap penalty configuration in `Dynamic' is slightly more complex than weighting and z-shift configuration. A single penalty is applied to the sum over all data types (e.g. primary structure, secondary structure, etc.) for each gap; this means that a gap penalty is only added once the score of aligning all the requested data types has been calculated. These penalties are the default penalties and can be specified for insertions, deletions, affine insertions, and affine deletions. This functionality can be extended to data type specific gap penalties. However, gap penalties in `Dynamic' are not data type specific in the same way that weightings and z-shifts are; rather than having penalty values for each data type, penalty values can be defined according to the character that appears in the first sequence of a particular data type in the aligned template. Essentially, a specific gap penalty can be attributed to a specific character in a given data type alphabet. For example, `Dynamic' can be configured to use a particular opening insertion penalty when the actual secondary structure of the template is a coil, which differs from the opening insertion penalty when the actual secondary structure of the template is a helix or a sheet. In addition to this structure specific gap functionality, structure specific gap penalties are smoothed over a three residue window. Similar methods, described by Shi et al. (2001), were shown to produce improvements in accuracy.
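A sketch of structure-specific opening penalties with three-residue smoothing (the penalty values, state alphabet, and centred-mean smoothing scheme are all illustrative assumptions, not `Dynamic''s actual settings):

```python
def smooth_gap_penalties(penalties, window=3):
    """Smooth per-position gap penalties over a sliding window
    (here a centred three-residue mean, one plausible scheme)."""
    n = len(penalties)
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(penalties[lo:hi]) / (hi - lo))
    return out

# Hypothetical opening-insertion penalties keyed by the template's
# actual secondary structure state (H = helix, E = sheet, C = coil).
open_ins = {"H": -13.0, "E": -11.0, "C": -9.0}
ss = "HHHCCEE"
raw = [open_ins[s] for s in ss]
smoothed = smooth_gap_penalties(raw)
```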

3.3.3 Probability and Log-odds Score Extrapolation

When the primary structure profiles are constructed with PSI-BLAST, they are produced as a matrix of log-odds scores. In contrast, when the predicted secondary structure profiles are constructed using PSIPRED, they are produced as a matrix of raw probabilities. However, due to the nature of several of the profile-profile alignment algorithms that were developed, it was necessary for both the log-odds scores and the equivalent probability values for each profile to be made available. Therefore, it was necessary to be able to extrapolate the probability distribution from a series of profile log-odds scores (e.g. when using primary structure profiles), and also to convert a profile probability distribution to the representative log-odds scores (e.g. when using predicted secondary structure profiles).

As described in § 1.4.6.1 (page 41), calculating profile log-odds scores is similar to calculating the scores for a substitution matrix. The profile log-odds scores produced by PSI-BLAST will generally scale similarly to those in the substitution matrix used to construct the profile. As such, `Dynamic' is able to extrapolate profile probability values from respective log-odds scores (and vice versa) by simple manipulation of Equation 1.3 (page 30). By rearrangement, `Dynamic' uses the equation:

q_{a,i} = p_a e^{\lambda S(a,i)}    (3.1)

where S(a,i) is the log-odds score for amino acid a at profile position i, p_a is the background frequency of amino acid a, q_{a,i} is the extrapolated probability of amino acid a at profile position i, and \lambda is the normalisation value used in the original substitution matrix (e.g. BLOSUM62). The constant \lambda is included as a simple scaling value; benchmarks in the early stages of development showed this method of scaling produced reliable results. Normalising the resultant probability vectors to sum to 1.0 was considered in the early stages of `Dynamic' development; however, this was prone to errors because 3D-GENOMICS used SEG (Wootton & Federhen, 1993, 1996) to mask regions of low complexity, which were then assigned all-negative log-odds scores by PSI-BLAST. As a result, all probability vectors for low complexity regions were scaled up, removing information content and producing inaccurately high probabilities.


When extrapolating log-odds scores, `Dynamic' uses principles similar to those described in Appendix B.3 (page 245) with the equation:

S(a,i) = \frac{1}{\lambda} \ln\left(\frac{q_{a,i} + p_a/\epsilon}{p_a}\right)    (3.2)

where \epsilon is used as a pseudo-count correction. `Dynamic' uses a value of 1000 for \epsilon.
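Equations 3.1 and 3.2 can be sketched in a few lines (the \lambda and \epsilon values below are illustrative; `Dynamic' derives \lambda from the substitution matrix used to build the profile, and the exact placement of the pseudo-count is an assumption):

```python
import math

def logodds_to_prob(S, p_a, lam):
    """Equation 3.1: q = p_a * exp(lambda * S), extrapolating a
    probability from a profile log-odds score."""
    return p_a * math.exp(lam * S)

def prob_to_logodds(q, p_a, lam, eps=1000.0):
    """Equation 3.2 (pseudo-count placement assumed): recover the
    log-odds score, with p_a/eps acting as a small pseudo-count
    guarding against zero probabilities."""
    return math.log((q + p_a / eps) / p_a) / lam

# Round trip with illustrative values.
p_a, lam = 0.05, 0.32
q = logodds_to_prob(2.0, p_a, lam)
```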

3.3.4 Calculating E-Values

As mentioned in § 1.4.5.1 (page 40), effective calculation of E-values is important for producing accurate recognition results. An important choice was whether to use predetermined calculation parameters or to determine them uniquely for each search performed. Work by Sadreyev & Grishin (2003) used analytical techniques to determine the K and \lambda parameter values for a range of gapped alignment results. Their method included heuristic equations to determine parameters based on the lengths of sequences. Such systems are fast, but are prone to compounded error margins, and they fail to take into account the effect of sequence composition in determining K and \lambda.

`Dynamic' calculates E-values for a given alignment by fitting the results from a databank search to an extreme value distribution using maximum likelihood fitting methods (see Appendix A.2, page 236). Since it was important to be able to produce comparable results between different queries, it was necessary to correct for differences in the length of query proteins as well as template proteins; various methods have been suggested for achieving this (Pearson, 1995; Altschul & Gish, 1996; Pearson, 1998; Mott, 2000). However, `Dynamic' uses a simple correction: dividing by the natural logarithm of the product of the query and template lengths. This correction is fast and simple; however, it does not have any theoretical basis within the intricate statistics of sequence alignment, though similar techniques have been shown to be effective (Pearson, 1995; Tomii & Akiyama, 2004). A possible explanation for its effectiveness may be derived from Equation A.7 (page 234). This equation states that for a given K and \lambda:

E[x] = K m n e^{-\lambda x}

where E[x] is the expected number of unrelated alignments, with a final score greater than or equal to x, between two random sequences, of lengths m and n, based upon a given scoring scheme. This is equivalent to:

E[x] = K e^{\ln(mn) - \lambda x}

suggesting that, when fitting an extreme value distribution to a set of alignment scores, K and \lambda are dependent on \ln(mn) - \lambda x. Therefore, by dividing x by \ln(mn), the equation becomes:

E[x/\ln(mn)] = K e^{-\lambda x}

removing the length dependency from the right-hand side. Therefore, determination of K and \lambda, using length-corrected alignment scores, may help to improve recognition accuracy, though this cannot be proven analytically.
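As a sketch of the overall scheme (a method-of-moments Gumbel fit standing in for the maximum likelihood fitting described above, and a simplified tail approximation; all numeric choices are illustrative):

```python
import math
import statistics

def fit_gumbel_moments(scores):
    """Fit an extreme value (Gumbel) distribution by the method of
    moments: lambda = pi / (sigma * sqrt(6)) and
    mu = mean - gamma / lambda, where gamma is Euler's constant."""
    sigma = statistics.stdev(scores)
    lam = math.pi / (sigma * math.sqrt(6))
    mu = statistics.mean(scores) - 0.5772156649 / lam
    return lam, mu

def evalue(score, qlen, tlen, lam, mu, n_templates):
    """Length-corrected E-value sketch: divide the raw score by
    ln(qlen * tlen), then apply the Gumbel tail approximation
    E = n * exp(-lambda * (x - mu))."""
    x = score / math.log(qlen * tlen)
    return n_templates * math.exp(-lam * (x - mu))
```

Higher scores give smaller E-values; the length division damps the score inflation expected from longer query-template pairs.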

3.4 Alignment Algorithms

3.4.1 Sequence-Sequence Comparison

The first algorithms incorporated into `Dynamic' were the established alignment algorithms described in § 1.4.2 (page 24). These were combined with the standard sequence-sequence comparison method, where a basic substitution matrix (e.g. BLOSUM62) defines the individual scores for aligning an amino acid in the template sequence to an amino acid in the query sequence (see Figure 3.1(a), page 134). Since `Dynamic' is able to extrapolate probabilities from log-odds scores, the individual scores for aligning two amino acids can either be the log-odds scores (the default, as described above) or the raw probabilities.

3.4.2 Profile-Sequence and Sequence-Profile Comparison

The next stage in the development of `Dynamic' was to include highly successful sequence-profile and profile-sequence comparison methods (similar to those used in `3D-PSSM'). As described in § 1.4.6 (page 40), for a profile for a given protein sequence of n residues, the basic substitution matrix is replaced by an n × 20 matrix which defines individual scores for aligning each of the 20 amino acids to each of the n residues of the protein (see Figure 3.1(a), page 134). As in the sequence-sequence comparisons, there is the option to use either the log-odds scores from the profiles (again, the default) or the raw probabilities.
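The lookup at the heart of a profile-sequence comparison can be sketched as follows (the amino acid ordering and toy one-position profile are invented for illustration):

```python
# Hypothetical 20-letter amino acid ordering used to index PSSM columns.
AMINO = "ACDEFGHIKLMNPQRSTVWY"

def profile_sequence_score(pssm, query, i, j):
    """Score for aligning template profile position i with query
    residue j: a lookup in the n x 20 position-specific scoring
    matrix, replacing the substitution matrix lookup used by a
    sequence-sequence comparison."""
    return pssm[i][AMINO.index(query[j])]

# Toy one-position profile that strongly favours alanine.
pssm = [[4.0 if a == "A" else -1.0 for a in AMINO]]
```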

3.4.3 Profile-Profile Comparisons

The natural progression from profile-sequence and sequence-profile alignment methods was to profile-profile comparisons (see Figure 3.1(b), page 134). The difficulty in comparing two profiles concerns how to convert a pair of positional score vectors into a meaningful similarity score. Since there is no standard method of performing a profile-profile comparison, the success of such a comparison is heavily dependent on the algorithm used to calculate this value. Studies have shown that such methods can detect similarities between two protein families that were previously undetectable using profile-sequence comparisons (Rychlewski et al., 2000).

When development of `Dynamic' first began (2002/3), various methods of comparing profiles had been reported. Pietrokovski (1996) compared profiles generated from multiple sequence alignments of protein families, but this method did not allow for gaps in the alignment. Lyngso et al. (1999) used the co-emission probability of two profile HMMs to measure their similarity; however, as noted in Yona & Levitt (2002), the metrics used in the comparison were overly sensitive to differences (rather than similarities) between probability distributions, and may have been unable to discern the subtle similarities between protein profiles.

[Figure 3.1 panels omitted: (a) sequence-sequence and profile-sequence alignment; (b) profile-profile alignment.]

Figure 3.1: Dynamic programming alignment algorithms. (a) Sequence-sequence, profile-sequence, and sequence-profile comparison methods represent a traditional evolutionary-based approach to predict structures of proteins. The simplest method (I) aligns the sequence of the target with the sequence of the template using a substitution matrix. More sensitive methods (II) define scores for aligning different amino acids separately for each position of the target sequence or the template sequence. These scores are derived from multiple sequence alignments of the corresponding sequence families. Such position-specific scores are also called profiles. They are similar in format to the representation of sequence families used by prediction methods based on HMMs. (b) Profile-profile comparison methods utilise the profiles generated by the above mentioned sequence alignment methods. Instead of a lookup of a substitution score, they compare two vectors with each other when building the dynamic programming matrix used to draw the alignment. The comparison is usually conducted by calculating a dot-product of the two positional vectors (as shown in the figure), or by multiplying one vector by a substitution matrix by the other vector; however, there is no set algorithm. Depending on the choice of the comparison function, the vectors are often rescaled before the operation. The sequence variability vectors are sometimes also augmented with Meta information, such as predicted secondary structure, as indicated in the figure. Derived from a figure in Ginalski et al. (2005), originally published under open access by Oxford University Press.

Table 3.1: Profile-profile comparison algorithms in `Dynamic'. The names of the profile-profile algorithms provided in `Dynamic' are listed along with the sections of this work in which they are detailed. Each algorithm uses a given combination of probability vectors and/or log-odds score vectors from the template and query profiles. In addition, some use the background information (again, log-odds scores or probabilities) provided by the substitution matrix (e.g. BLOSUM62).

Algorithm name                                            | Template vectors | Query vectors  | Substitution matrix background
Dot-Product (variant 1 -- Eq. 3.3)                        | Log-odds         | Log-odds       | No
Dot-Product (variant 2 -- Eq. 3.4)                        | Probabilities    | Probabilities  | No
Dot-Product (variant 3 -- Eq. 3.5)                        | Log-odds         | Probabilities  | No
Dot-Product (variant 4 -- Eq. 3.6)                        | Probabilities    | Log-odds       | No
BASIC (variant 1 -- Eq. 3.7)                              | Probabilities    | Probabilities  | Log-odds
BASIC (variant 2 -- Eq. 3.8)                              | Log-odds         | Log-odds       | Log-odds
BASIC (variant 3 -- Eq. 3.9)                              | Probabilities    | Probabilities  | Probabilities
BASIC (variant 4 -- Eq. 3.10)                             | Log-odds         | Log-odds       | Probabilities
PROF SIM (Eq. 3.11)                                       | Probabilities    | Probabilities  | Probabilities
B-DHIP (Eq. 3.17)                                         | Both             | Both           | No
Pearson's Correlation Coefficient (variant 1 -- Eq. 3.18) | Probabilities    | Probabilities  | No
Pearson's Correlation Coefficient (variant 2 -- Eq. 3.19) | Log-odds         | Log-odds       | No

A full list of all the profile-profile methods encoded in the current version of `Dynamic' is given in Table 3.1 (page 135).

3.4.3.1 Dot-Product Algorithms

The power of techniques that use profiles and HMMs was demonstrated at CASP5. Several groups designed new profile-profile comparison methods to improve their performance; however, the success of the Meta servers and the ab initio techniques detracted from their achievements. Many of the profile comparison methods assessed at CASP5 were based on using a dot-product between individual vectors from the template and query profiles. The most commonly used algorithm was the dot-product of the vector scores:

D(i,j) = \sum_{k=1}^{20} [T(i,k) \times Q(j,k)]    (3.3)

where D is the dynamic programming score matrix, D(i,j) is the element in matrix D at coordinate (i,j), T represents the log-odds scores for the template profile, and Q represents the log-odds scores for the query profile. In the above equation, D is an n \times m matrix, where n is the length of the template and m is the length of the query; i indexes the position along the template, j indexes the position along the query, and k indexes the log-odds scores for the 20 different amino acids at position i in the template and position j in the query.
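Equation 3.3 reduces to a single inner product per cell of the dynamic programming matrix; a minimal sketch (the toy profiles are invented, not real profile data):

```python
def dot_product_score(T, Q, i, j):
    """Equation 3.3: the profile-profile score for cell (i, j) is
    the dot-product of the template's and query's 20-element
    log-odds vectors at positions i and j."""
    return sum(T[i][k] * Q[j][k] for k in range(20))

# Toy one-position profiles with matching score vectors.
t = [[1.0] + [0.0] * 19]
q = [[1.0] + [0.0] * 19]
```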

Variants of Equation 3.3 include the dot-product of the probabilities for the template and query profiles:

D(i,j) = \sum_{k=1}^{20} [T'(i,k) \times Q'(j,k)]    (3.4)

where T' represents the probabilities for the template profile, and Q' represents the probabilities for the query profile. The other variants are dot-products between log-odds scores and probabilities in both directions:

D(i,j) = \sum_{k=1}^{20} [T(i,k) \times Q'(j,k)]    (3.5)

D(i,j) = \sum_{k=1}^{20} [T'(i,k) \times Q(j,k)]    (3.6)

Each of these algorithms was included in `Dynamic'.

3.4.3.2 The BASIC Algorithm

One of the earliest, and most well known, profile-profile comparison methods was the BASIC method (Bilaterally Amplified Sequence Information Comparison; Rychlewski et al., 1998), which was successfully applied at CASP3. BASIC compared proteins using dynamic programming and combined the probability data from query and template profiles with the scores from the background comparison matrix (e.g. BLOSUM62):

D(i,j) = \sum_{k=1}^{20} \sum_{l=1}^{20} [T'(i,k) \times S(k,l) \times Q'(j,l)]    (3.7)

where S is the substitution matrix, T' represents the probabilities for the template profile, and Q' the probabilities for the query profile. In the above equation, k indexes the probabilities for the 20 different amino acids at position i in the template, and l indexes the probabilities for the 20 amino acids at position j in the query.
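A direct sketch of Equation 3.7, with a toy identity "substitution matrix" and uniform probability vectors (both invented for illustration); in practice the double sum per cell can equivalently be computed once as the matrix product of the template probabilities, S, and the transposed query probabilities:

```python
def basic_score(Tp, S, Qp, i, j):
    """Equation 3.7 (BASIC): weight every pairwise amino acid
    combination by the substitution matrix,
    D(i,j) = sum_k sum_l T'(i,k) * S(k,l) * Q'(j,l),
    where T' and Q' are profile probability vectors."""
    return sum(Tp[i][k] * S[k][l] * Qp[j][l]
               for k in range(20) for l in range(20))

# Toy data: uniform probabilities, identity substitution matrix.
Tp = [[0.05] * 20]
Qp = [[0.05] * 20]
S = [[1.0 if k == l else 0.0 for l in range(20)] for k in range(20)]
```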

Several alternative versions of the BASIC algorithm are also available in `Dynamic'. The first uses the log-odds scores from the template and query profiles instead of the probabilities:

D(i,j) = \sum_{k=1}^{20} \sum_{l=1}^{20} [T(i,k) \times S(k,l) \times Q(j,l)]    (3.8)

The other two versions are:

D(i,j) = \sum_{k=1}^{20} \sum_{l=1}^{20} [T'(i,k) \times S'(k,l) \times Q'(j,l)]    (3.9)

and:

D(i,j) = \sum_{k=1}^{20} \sum_{l=1}^{20} [T(i,k) \times S'(k,l) \times Q(j,l)]    (3.10)

where S' represents the probabilities of the substitution matrix.

The same group that designed the BASIC algorithm later developed FFAS (Rychlewski et al., 2000), a method that aligns profiles in a similar way (using dot-products between probability vectors, see Equation 3.4, page 136) but also uses a novel method of profile preparation, based on multiple sequence alignments for families of homologous proteins. Both methods were based on using the dot-product between profile vectors, and demonstrated how use of profile-profile methods could improve recognition compared with simple profile-sequence methods. FFAS was successfully applied at CASP4, suggesting that it is possible to achieve greater recognition by developing more elaborate ways of constructing profiles and by increasing the amount of information used in the profile comparisons. A system similar to FFAS, ORFeus, was later developed; this used probability vector dot-products for primary structure profiles and for predicted secondary structure profiles (Ginalski et al., 2003).

3.4.3.3 The PROF SIM Algorithm

A popular algorithm (PROF SIM), developed by Yona & Levitt (2002), approached the problem from a slightly different angle; rather than focusing on more ornate methods of constructing profiles, it used the Jensen-Shannon divergence between probability distributions (a technique rooted in information theory) to align standard probability profiles from programs such as PSI-BLAST:

D(i,j) = \frac{1}{2} \left(1 - JS\left[T(i) \| Q(j)\right]\right) \left(1 + JS\left[c \| P_0\right]\right)    (3.11)

where T(i) is the probability vector from the template profile at position i, Q(j) is the probability vector from the query profile at position j, P_0 is a vector of the background probabilities for the amino acids, and:

JS\left[T(i) \| Q(j)\right] = \lambda KL\left[T(i) \| c\right] + (1-\lambda) KL\left[Q(j) \| c\right]    (3.12)

where:

c = \lambda T(i) + (1-\lambda) Q(j)    (3.13)

and:

KL\left[x \| y\right] = \sum_{k} x_k \times \log_2 \frac{x_k}{y_k}    (3.14)

and where:

JS\left[c \| P_0\right] = \lambda KL\left[c \| d\right] + (1-\lambda) KL\left[P_0 \| d\right]    (3.15)

and:

d = \lambda c + (1-\lambda) P_0    (3.16)

The variable \lambda is a prior weight, usually set to 0.5.
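The PROF SIM scoring scheme can be sketched directly from Equations 3.11 to 3.16 (with the prior weight fixed at 0.5, and generic vector lengths for brevity; a minimal sketch, not Yona & Levitt's implementation):

```python
import math

def kl(x, y):
    """Equation 3.14: Kullback-Leibler divergence in bits
    (zero-probability terms contribute nothing)."""
    return sum(xk * math.log2(xk / yk) for xk, yk in zip(x, y) if xk > 0)

def js(x, y, lam=0.5):
    """Equations 3.12/3.13: Jensen-Shannon divergence with prior
    weight lam, measured against the mixture c."""
    c = [lam * xk + (1 - lam) * yk for xk, yk in zip(x, y)]
    return lam * kl(x, c) + (1 - lam) * kl(y, c)

def prof_sim_score(t, q, p0):
    """Equation 3.11: combine the divergence between the two profile
    columns with the significance of their mixture against the
    background distribution p0."""
    c = [0.5 * tk + 0.5 * qk for tk, qk in zip(t, q)]
    return 0.5 * (1 - js(t, q)) * (1 + js(c, p0))
```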

In their analysis, Yona & Levitt showed that the improvement made by PROF SIM over PSI-BLAST was roughly equivalent to the improvement made by PSI-BLAST over BLAST. Capriotti et al. (2004) also used the Shannon entropy measure as an extra filtering procedure in order to improve alignment accuracy.

3.4.3.4 The B-DHIP Algorithm

The success of the PROF SIM method demonstrated that great improvements could be made in fold recognition when using profile-profile comparisons. The power of the algorithm lay largely in its use of information content from the profile probabilities. The problem with PROF SIM was the large number of computations it required per alignment, rendering it much slower than simpler comparison algorithms. This seemed an unnecessary constraint since the information content of profiles (at least from PSI-BLAST) is implicitly encoded into the calculated log-odds scores (see Appendix B, page 243), which PROF SIM does not use.

The B-DHIP (Bi-Directional Heterogeneous Inner Product) algorithm was developed as a simple comparison method that would combine the profile data from a template and a query in a way that best reflects their meaning, i.e. utilising the probabilities as well as the log-odds scores. B-DHIP was designed to be as much an extension to sequence-profile comparison as sequence-profile comparison was to sequence-sequence comparison; aligning a template sequence to a query sequence is functionally equivalent to aligning a template score profile to a query sequence, when the template profile is constructed from the appropriate vectors of the BLOSUM62 substitution matrix. For each position along the template sequence, the vector from BLOSUM62 (representing the residue at that position) is used as the score vector for the template profile.

By accepting the principle that sequence-sequence comparisons can be equated to sequence-profile or profile-sequence comparisons, the analogy can be extended to profile-profile comparisons; for a given template position (i) and a given query position (j) aligning the score vector at i in the template profile to the residue at j in the query sequence is functionally equivalent to aligning (using a dot-product) the score vector at i in the template profile to the probability vector at j in the query profile, where the probabilities for 19 of the amino acids are all 0 and the probability for the single amino acid at position j in the query sequence is 1. When using this principle, by dispersing the probabilities in the query probability vector, to represent the individual tendencies of all 20 amino acids (as in a typical profile), this is equivalent to a standard dot-product between a template score vector and a query probability vector (see Equation 3.5, page 136). By reversing this process, and using a standard dot-product between a template probability vector and a query score vector (see Equation 3.6, page 136), the results can be combined to form:

D(i,j) = \frac{1}{2} \sum_{k=1}^{20} \left[(T(i,k) \times Q'(j,k)) + (T'(i,k) \times Q(j,k))\right]    (3.17)

where T and Q are the log-odds score vectors, and T' and Q' the probability vectors, of the template and query profiles respectively.
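A sketch of the B-DHIP score for one cell (the constant toy profiles are invented; T and Q denote log-odds vectors, Tp and Qp the corresponding probabilities):

```python
def b_dhip_score(T, Tp, Q, Qp, i, j):
    """Equation 3.17 (B-DHIP): average the two heterogeneous
    dot-products -- template log-odds against query probabilities,
    and template probabilities against query log-odds."""
    fwd = sum(T[i][k] * Qp[j][k] for k in range(20))
    rev = sum(Tp[i][k] * Q[j][k] for k in range(20))
    return (fwd + rev) / 2

# Toy one-position profiles with constant entries.
T, Q = [[2.0] * 20], [[2.0] * 20]
Tp, Qp = [[0.05] * 20], [[0.05] * 20]
```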

This final algorithm combines the score and probability data from both profiles in a simple way that naturally extends sequence-profile comparisons, and makes use of the information content encoded into the log-odds scores by PSI-BLAST. This algorithm is the default setting for profile-profile comparisons in `Dynamic'.

Similar methods were later developed by Panchenko (2003) and Han et al. (2005); both showed considerable success in increasing recognition accuracy.

3.4.3.5 Pearson's Correlation Coefficient of Profile Vectors

Building on the principles exemplified by the PROF SIM algorithm, a method using Pearson's Correlation Coefficient was included to compare the probability vectors from template and query profiles:

D(i,j) = \frac{\sum_{k=1}^{20} \left(T'(i,k) - \overline{T'(i)}\right) \left(Q'(j,k) - \overline{Q'(j)}\right)}{\sqrt{\sum_{k=1}^{20} \left(T'(i,k) - \overline{T'(i)}\right)^2 \sum_{k=1}^{20} \left(Q'(j,k) - \overline{Q'(j)}\right)^2}}    (3.18)

where T'(i,k) and Q'(j,k) are the profile probabilities, \overline{T'(i)} is the mean value of all elements in the probability vector from the template profile at position i, and \overline{Q'(j)} is the mean value of all elements in the probability vector from the query profile at position j.
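Equation 3.18 is simply Pearson's r over the 20 components of each column vector; a minimal sketch (note that a vector with zero variance would make the denominator zero):

```python
import math

def pearson_score(x, y):
    """Equation 3.18: Pearson's correlation coefficient between two
    profile column vectors (probability or log-odds)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((xk - mx) * (yk - my) for xk, yk in zip(x, y))
    den = math.sqrt(sum((xk - mx) ** 2 for xk in x)
                    * sum((yk - my) ** 2 for yk in y))
    return num / den
```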

Though not as thorough as the PROF SIM algorithm, the correlation coefficient was included as a faster alternative. It resembled the algorithm originally used by Pietrokovski (1996) when building LAMA, the main difference being that LAMA compared log-odds score vectors and performed ungapped alignments. This metric has since been used in the profile-profile comparison tool FORTE (Tomii & Akiyama, 2004).

Since Pearson's Correlation Coefficient was designed to measure correlation between linear variables, it was considered that measuring the correlation between distributions of probabilities was likely to produce more robust results than measuring it between variables on a log-based scale. However, despite this concern, a version of the correlation coefficient that compares log-odds score vectors was also included for completeness:

D(i,j) = \frac{\sum_{k=1}^{20} \left[\left(T(i,k) - \overline{T(i)}\right) \left(Q(j,k) - \overline{Q(j)}\right)\right]}{\sqrt{\sum_{k=1}^{20} \left(T(i,k) - \overline{T(i)}\right)^2 \sum_{k=1}^{20} \left(Q(j,k) - \overline{Q(j)}\right)^2}}    (3.19)

where \overline{T(i)} is the mean value of all elements in the log-odds score vector from the template profile at position i, and \overline{Q(j)} is the mean value of all elements in the log-odds score vector from the query profile at position j.

3.5 Benchmarking Results

From the many alignment methods available in `Dynamic', it was necessary to select a representative subset for the purposes of benchmarking. Due to the scale of the benchmark, it was decided to only perform local (Smith-Waterman) dynamic programming alignments for each method. A summary of the testing results can be seen in Table 3.2 (page 143) and Table 3.3 (page 151).

Table 3.2: Optimal search parameters and testing results for a variety of dynamic programming recognition methods. For each method name, the first term refers to the template and the second refers to the query. The methods listed are: seq v seq (a sequence-sequence alignment); seq v pssm (a sequence-profile alignment); pssm v seq (a profile-sequence alignment); BASIC1 (a profile-profile alignment using the first variant of the BASIC algorithm; see Equation 3.7); B-DHIP (a profile-profile alignment algorithm developed for the purposes of this research; see Equation 3.17); dot-product2 (a profile-profile alignment using the second variant of the dot-product algorithm between profile probability vectors); Pearson1 (a profile-profile alignment using the first variant of the Pearson's Correlation Coefficient algorithm to compare profile probability vectors); and Pearson2 (a profile-profile alignment using the second variant of the Pearson's Correlation Coefficient algorithm to compare profile log-odds score vectors). Ins, Del, and Aff refer to insertion, deletion, and affine gap penalties respectively. Z-shift and Weight are defined in § 3.3.1 (page 128). All methods were optimised using the full training set and tested using the full testing set. The figures are testing results: the average precision of each method (see § 2.5.1, page 117); the percentage of correct homologous relationships detected at 95% precision or above; and the number of query proteins correctly annotated (out of 50) at 95% precision or above.

Identifier  Primary Structure     Secondary Structure  % Average   % Recall at      Queries annotated
            Algorithm             Algorithm            Precision   95% Precision    above 95% Precision

PSI-BLAST   n/a                   n/a                  n/a         21.1             23
001         seq v seq             n/a                  26.6        18.6             28
002         seq v seq             n/a                  28.4        18.6             28
003         seq v pssm            n/a                  54.4        44.9             35
004         seq v pssm            n/a                  53.9        43.3             35
005         pssm v seq            n/a                  49.5        38.5             33
006         pssm v seq            n/a                  49.8        38.5             33
007         BASIC1                n/a                  48.8        27.5             30
008         BASIC1                n/a                  49.3        32.4             30
009         BASIC1                n/a                  49.2        28.7             30
010         BASIC1                n/a                  44.3        17.8             23
011         B-DHIP                n/a                  67.9        50.2             36
012         B-DHIP                n/a                  69.2        51.4             34
013         B-DHIP                n/a                  65.5        46.2             33
014         B-DHIP                n/a                  62.0        46.2             34
015         B-DHIP                n/a                  56.0        34.4             31
016         B-DHIP (ungapped)     n/a                  60.0        36.4             32
017         dot-product2          n/a                  40.6        24.7             29
018         dot-product2          n/a                  41.4        25.5             31
019         dot-product2          n/a                  41.2        24.3             30
020         dot-product2          n/a                  40.6        24.7             29
021         Pearson1              n/a                  9.8         2.4              3
022         B-DHIP                seq v seq            60.9        39.3             33
023         B-DHIP                BASIC1               63.5        48.6             35
024         B-DHIP                BASIC1               67.2        52.2             37
025         B-DHIP                B-DHIP               68.9        54.2             36
026         B-DHIP                B-DHIP               67.3        53.4             36
027         B-DHIP                dot-product2         66.3        52.6             37
028         B-DHIP                dot-product2         68.2        53.8             37
029         B-DHIP                Pearson1             68.5        55.0             38
030         B-DHIP                Pearson1             68.9        56.7             38
031         B-DHIP                Pearson2             57.1        38.5             33

[The optimised parameter values (structure-specific gap penalties, z-shifts, and weights) could not be recovered from the extracted text of this table.]

Table 3.3: Testing results for a variety of optimised dynamic programming recognition methods. For each method name, the first term refers to the template and the second refers to the query. The methods listed are: seq v seq (a sequence-sequence alignment); seq v pssm (a sequence-profile alignment); pssm v seq (a profile-sequence alignment); BASIC1 (a profile-profile alignment using the first variant of the BASIC algorithm; see Equation 3.7); B-DHIP (a profile-profile alignment algorithm developed for the purposes of this research; see Equation 3.17); dot-product2 (a profile-profile alignment using the second variant of the dot-product algorithm between profile probability vectors); Pearson1 (a profile-profile alignment using the first variant of the Pearson's Correlation Coefficient algorithm to compare profile probability vectors); and Pearson2 (a profile-profile alignment using the second variant of the Pearson's Correlation Coefficient algorithm to compare profile log-odds score vectors). Pseudo-Boosting Iteration refers to any subsequent round of optimisation performed as part of a pseudo-boost (see § 3.5.2, page 156). EP refers to Empirical Precision (see § 2.5.3, page 120). All methods were optimised using the full training set and tested using the full testing set. The figures are testing results: the average precision of each method (see § 2.5.1, page 117); the percentage of correct homologous relationships detected at 95% precision or above; the number of query proteins correctly annotated (out of 50) at 95% precision or above; the number of query proteins correctly annotated (out of 50) at 0.95 EP or above; and the difference between the number of queries annotated at 95% precision or above, but below 0.95 EP.

Identifier  Primary            Secondary     Pseudo-Boosting  % Average  % Recall at    Queries above  Queries above  Queries above 95% Precision
            Structure          Structure     Iteration        Precision  95% Precision  95% Precision  0.95 EP        and below 0.95 EP

PSI-BLAST   n/a                n/a           n/a              n/a        21.1           23             n/a            n/a
001         seq v seq          n/a           n/a              26.6       18.6           28             28             0
002         seq v seq          n/a           n/a              28.4       18.6           28             28             0
003         seq v pssm         n/a           n/a              54.4       44.9           35             35             3
004         seq v pssm         n/a           n/a              53.9       43.3           35             32             4
005         pssm v seq         n/a           n/a              49.5       38.5           33             35             0
006         pssm v seq         n/a           n/a              49.8       38.5           33             33             0
007         BASIC1             n/a           n/a              48.8       27.5           30             1              29
008         BASIC1             n/a           1st              49.3       32.4           30             31             0
009         BASIC1             n/a           2nd              49.2       28.7           30             17             20
010         BASIC1             n/a           3rd              44.3       17.8           23             9              21
011         B-DHIP             n/a           n/a              67.9       50.2           36             33             12
012         B-DHIP             n/a           n/a              69.2       51.4           34             14             30
013         B-DHIP             n/a           1st              65.5       46.2           33             30             12
014         B-DHIP             n/a           2nd              62.0       46.2           34             35             0
015         B-DHIP             n/a           3rd              56.0       34.4           31             34             0
016         B-DHIP (ungapped)  n/a           n/a              60.0       36.4           32             33             0
017         dot-product2       n/a           n/a              40.6       24.7           29             0              29
018         dot-product2       n/a           1st              41.4       25.5           31             21             20
019         dot-product2       n/a           2nd              41.2       24.3           30             20             18
020         dot-product2       n/a           3rd              40.6       24.7           29             17             21
021         Pearson1           n/a           n/a              9.8        2.4            3              2              1
022         B-DHIP             seq v seq     n/a              60.9       39.3           33             37             0
023         B-DHIP             BASIC1        n/a              63.5       48.6           35             37             0
024         B-DHIP             BASIC1        n/a              67.2       52.2           37             37             0
025         B-DHIP             B-DHIP        n/a              68.9       54.2           36             35             12
026         B-DHIP             B-DHIP        n/a              67.3       53.4           36             34             11
027         B-DHIP             dot-product2  n/a              66.3       52.6           37             36             5
028         B-DHIP             dot-product2  n/a              68.2       53.8           37             36             10
029         B-DHIP             Pearson1      n/a              68.5       55.0           38             37             3
030         B-DHIP             Pearson1      n/a              68.9       56.7           38             36             13
031         B-DHIP             Pearson2      n/a              57.1       38.5           33             30             9


3.5.1  Methods 001 to 006 (sequence-sequence and sequence-profile methods)

A PSI-BLAST search was conducted against the selected sequences in the template fold library for use as a benchmark baseline. The average precision (AP) value for the PSI-BLAST results could not be calculated, as PSI-BLAST does not return E-values for every query-template (QT) pair in a database; only results with E-values at or below 1.0 were returned. However, an analysis of the testing set (using standard recommended parameters) returned 52 of the possible 247 correct QT pairs at 95% precision (21.1% recall); these covered 23 of the possible 50 testing queries. Previous evidence from Rychlewski et al. (2000) has also suggested that an E-value cut-off of 1.0 for PSI-BLAST is enough to find the highest recall at the 95% precision level.
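The recall and query counts at 95% precision used throughout this chapter can be reproduced from a ranked hit list along the following lines (a sketch; the function name and list layout are assumptions, not the actual benchmarking code):

```python
def recall_at_precision(ranked_hits, total_true, min_precision=0.95):
    """Walk a ranked hit list (best first; True marks a correct QT pair)
    and return the largest recall achieved at any cut-off where the
    precision is still at or above `min_precision`."""
    tp = 0
    best_recall = 0.0
    for rank, correct in enumerate(ranked_hits, start=1):
        if correct:
            tp += 1
            if tp / rank >= min_precision:      # precision at this cut-off
                best_recall = tp / total_true
    return best_recall
```

For example, a ranking whose first 19 hits are correct, followed by one error and one further correct hit, keeps its precision at 20/21 (above 0.95), so with 40 true relationships in the databank the recall at 95% precision is 20/40 = 0.5.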

The standard sequence-sequence (seq v seq), sequence-profile (seq v pssm), and profile-sequence (pssm v seq) methods were all included and optimised independently using structure-specific gaps (with affine penalties optimised independently of opening penalties). These are listed in Table 3.2 as methods 001, 003, and 005 respectively. In all cases, the penalties for opening gaps in template helix and sheet structures were greater than those for coil structures. This is unsurprising since gaps in coil structure are assumed to be less detrimental to a protein's core structure than gaps in secondary structure elements. Additionally, opening insertion penalties in helix and sheet structures were reasonably similar within, and across, all three methods. Opening insertion penalties in coils were also similar across all three methods. Extension gap penalties were reasonably similar across all three methods. However, whereas the insertion opening and extension penalty ratio in coil structures was close to the 11:1 ratio employed by PSI-BLAST, the ratio for insertions was between 6:1 and 7:1 in helix, and in sheet was approximately 30:1 for methods 001 and 003, and approximately 20:1 for method 005. These results suggest that, when inserting new residues into a protein, initial insertions are generally less accommodated in secondary structure elements than in coil; however, extending an insertion in sheet is often easier than extending an insertion in coil, whereas extending an insertion in helix is often much harder than extending an insertion in either of the other two structure types.
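The structure-specific affine scheme amounts to a penalty lookup keyed on the gap type and the template's secondary structure state; the sketch below uses purely illustrative round numbers, not the optimised values from Table 3.2:

```python
# Hypothetical penalty table: (opening, extension) affine penalties,
# keyed by gap type and the template's secondary structure state.
# The values here are illustrative placeholders only.
GAP_PENALTIES = {
    ("insertion", "helix"): (-13.3, -2.3),
    ("insertion", "sheet"): (-13.5, -0.4),
    ("insertion", "coil"):  (-10.8, -1.0),
    ("deletion",  "helix"): (-13.2, -0.7),
    ("deletion",  "sheet"): (-13.0, -0.8),
    ("deletion",  "coil"):  (-10.0, -0.6),
}

def gap_cost(gap_type, structure, length):
    """Affine, structure-specific cost of a gap of `length` residues:
    one opening penalty plus (length - 1) extension penalties."""
    opening, extension = GAP_PENALTIES[(gap_type, structure)]
    return opening + (length - 1) * extension
```

The opening-to-extension ratios discussed above fall straight out of such a table: a coil insertion here costs -10.8 to open and only -1.0 per additional residue.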

When examining the results for deletion penalties, the opening penalties were similar across all three methods. Similarly, the penalty for opening deletions in a helix or sheet was noticeably larger than that for coil. However, the penalties for introducing deletions in template helix structures were all approximately midway between the penalties for deletions in sheet and coil. The extension deletion penalties did not show the disparity demonstrated by extension insertion penalties, however, penalties to extend deletions in helix and sheet were all greater than those for coil.

Similar analyses for sequence-sequence, sequence-profile, and profile-sequence were also performed using non-structure-specific gaps (with affine penalties fixed at 1/10 of the value of the opening penalty). These are listed as methods 002, 004, and 006 respectively. It is interesting to note that the gap opening penalties for all three methods were substantially lower than the respective penalties from the structure-specific analyses. Even when the relative frequencies of helix, sheet, and coil were taken into account (as described in § 2.4.1.3, page 111), this did not account for the overall difference in the penalty values. Such a difference could be a result of the condition that gap extension penalties are fixed at 1/10 of the value of gap opening penalties; it may be that the gap extension penalty is a more influential characteristic of the optimisation process. This, in turn, would suggest that a fairer way of describing the penalty values would be to say that the opening penalty is fixed at 10 times the value of the extension penalty, rather than the other way around. Alternatively, the non-structure-specific penalty values may reflect the overall tendency for gaps to occur in remote homology alignments, whereas the structure-specific penalties may reflect the general distribution of gaps across the different structure types.

It should be noted that the results for methods 003 and 004 (primary structure template sequence against primary structure query profile) had greater recall than those for methods 005 and 006 (primary structure template profile against primary structure query sequence), despite the fact that methods 005 and 006 (theoretically) utilised more information in the template profiles. One possible explanation is that using query profiles against a databank of template sequences may help identify remote homologues to the query, whereas scanning a query sequence against a databank of template profiles effectively reverses the logic (i.e. using the remote homology information of the templates). If the focus of the search is to identify templates that are homologous to a given query, it may not be as efficient to perform the search in the opposite direction (where the template becomes the focus).

3.5.2  Methods 007 to 021 (profile-profile methods)

Three of the implemented profile-profile comparison methods were also benchmarked using primary structure profiles. These methods were the first variant of the BASIC method (BASIC1), the B-DHIP method, and the dot-product of probability vectors (dot-product2). These are listed as methods 007, 012, and 017 respectively.

All three profile-profile methods were reoptimised multiple times using a form of Boosting referred to in this work as pseudo-boosting. Pseudo-boosting is very simple: any training query whose individual AP value (i.e. its AP when calculated in isolation) is at least 80% of the final AP of the optimised simplex is excluded from the training set. The aim of this step is to remove any queries that are steering the simplex in a particular direction, in order to give the more difficult queries a greater influence on the optimisation function. The new, reduced training set is then used to reoptimise the simplex. This process was performed three times (on an increasingly reduced training set) for each of the profile-profile methods. These methods are listed as (in descending order of training set size) methods 008, 009, and 010 for BASIC1; 013, 014, and 015 for B-DHIP; and 018, 019, and 020 for the dot-product2 method (see Table 3.3, page 151).
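In outline, pseudo-boosting is a simple filter-and-refit loop; `optimise_simplex` and `query_ap` below are placeholders for the actual `Dynamic' optimisation and scoring routines, so this is a sketch of the procedure rather than the implementation itself:

```python
def pseudo_boost(queries, optimise_simplex, query_ap, rounds=3, cutoff=0.8):
    """Repeatedly optimise search parameters, dropping each training query
    whose individual AP reaches `cutoff` (80%) of the simplex's final AP,
    so that the harder queries steer the later rounds of optimisation.

    optimise_simplex(queries) -> (params, final_ap)
    query_ap(params, query)   -> the query's AP evaluated in isolation
    Returns the parameter set produced by each round."""
    history = []
    training = list(queries)
    for _ in range(rounds + 1):          # the initial fit plus `rounds` re-fits
        if not training:
            break
        params, final_ap = optimise_simplex(training)
        history.append(params)
        # Keep only the queries that are still poorly served.
        training = [q for q in training
                    if query_ap(params, q) < cutoff * final_ap]
    return history
```

Each pass shrinks the training set, which mirrors the observation below that by the fourth stage few queries remain to be removed.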

During the training of the pseudo-boosting for each profile-profile method, it became clear that the majority of queries could be annotated using the first set of optimised parameters. Removing a subset of queries at the first stage of the pseudo-boosting did not have a great effect on the optimisation of the BASIC1 or the B-DHIP methods, with both producing parameter sets similar to before (going from method 007 to 008, and from method 012 to 013, respectively). As more stages of pseudo-boosting were performed, fewer queries were removed at each stage until (by the fourth stage) little improvement could be achieved by continuing with the process. The training set had, effectively, been reduced to only those queries that would either require individually tailored search parameters in order to find any homologues, or could never have their homologues identified with the given recognition algorithm (regardless of the search parameters used). The results from the testing set also reflect this, as the parameter sets from the later stages of the pseudo-boosting are progressively able to find fewer correct QT relationships.

The pseudo-boosting results for the dot-product2 algorithm (methods 017 to 020) show that the parameter sets tended to fluctuate more in the earlier stages than they did for the BASIC1 or B-DHIP algorithms. However, the AP, the recall at 95% precision (or above), and the number of queries annotated at 95% precision (or above) all suggest that the dot-product2 method tends to be more resilient to changes in search parameters, even though, overall, it does not perform as well as the best results from BASIC1 or B-DHIP.

Two extra variations of the B-DHIP algorithm were performed. The first was optimised using structure-specific gaps (similar to those used in methods 001, 003, and 005), listed as method 011. The second was optimised using ungapped alignments (only optimising the z-shift), listed as method 016. As with methods 001 to 006, there was a marked difference between the size of the penalties in methods 011 and 012; the penalties for opening and extending insertions showed a similar distribution in method 011 as they did in methods 001, 003, and 005 (except the low penalty for openings in sheet). However, the deletion penalties were distinctly different; most notably, penalties for opening deletions in sheet were lower than those in coil, whilst the penalty for extending deletions in sheet was higher than that for coil or helix. Additionally, the penalty for deletion extensions in helix was lower than that for coil. It is difficult to extrapolate a potential explanation from these results. However, given that insertions and deletions in sheet seem to be the most favourable of all gaps, this suggests that the tendency of sheets to retain evolutionarily recognisable features (though not necessarily key structural features) is not so prevalent in more remote homologous relationships. Method 016 was carried out to see how well B-DHIP would perform when only allowed to match single contiguous regions between QT pairs; in comparison to methods 011 to 015, this was one of the poorest performing implementations of the B-DHIP algorithm.

The Pearson's Correlation Coefficient of probability vectors comparison method was also used (Pearson1, method 021); however, it was not expanded upon (as the other methods were) due to its initial poor performance. This was surprising given the success of other groups in demonstrating its recognition ability. However, closer examination revealed that the failures were due to the occurrence of low-complexity regions, or regions where no sequence homologues could be detected by PSI-BLAST, in the profiles. Since these regions were all marked by SEG with identical (low) probability distributions, the correlation coefficients between them produced perfect match scores, causing unrelated QT pairs to be scored highly purely on the basis of mutual regions of low complexity. Since SEG marks these regions with negative log-odds scores (across all amino acid types), they are not usually a problem for most alignment algorithms; however, because this version of the correlation coefficient does not take log-odds scores into account, and the procedure used in this research did not prefilter such regions, the problem was prevalent. Despite this, method 021 was still able to identify many remote QT relationships at low E-values, suggesting that it still had much to offer if the issue of low complexity could be addressed.
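The failure mode is easy to demonstrate: two positions from unrelated proteins that SEG has reset to the same distribution correlate perfectly, regardless of what the proteins actually contain (the background frequencies below are illustrative placeholders):

```python
# Two positions from unrelated proteins, both masked by SEG and therefore
# both left holding the same (illustrative) background amino acid frequencies.
background = [0.083, 0.055, 0.040, 0.054, 0.014, 0.039, 0.067, 0.071,
              0.023, 0.059, 0.097, 0.058, 0.024, 0.039, 0.047, 0.066,
              0.054, 0.011, 0.029, 0.070]
t_col, q_col = list(background), list(background)

# Pearson's Correlation Coefficient between the two columns.
n = len(t_col)
t_mean, q_mean = sum(t_col) / n, sum(q_col) / n
num = sum((t - t_mean) * (q - q_mean) for t, q in zip(t_col, q_col))
den = (sum((t - t_mean) ** 2 for t in t_col)
       * sum((q - q_mean) ** 2 for q in q_col)) ** 0.5
score = num / den

# Biologically unrelated columns nevertheless correlate perfectly, so every
# masked-versus-masked position pair scores as an ideal match.
print(score)  # 1.0
```

A metric based on log-odds scores does not suffer in the same way, because SEG's uniformly negative scores penalise rather than reward these regions.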

3.5.3  Methods 022 to 031 (profile-profile with secondary structure)

Many studies, both before and since these benchmarking phases, have shown the advantages of using higher-level structural information when trying to improve recognition accuracy, either through direct use of predicted secondary structure (Kelley et al., 2000; Ginalski et al., 2003; Tang et al., 2003; Ginalski et al., 2004; Przybylski & Rost, 2004) or through assessment of model quality (Zhou & Zhou, 2004; Pettitt et al., 2005). Given the ability of `Dynamic' to extrapolate log-odds scores from probabilities (and vice versa), all three of the main profile-profile methods could be applied to either primary or secondary structure information. Therefore, this research explored the benefits of including predicted secondary structure information by optimising the single best performing primary structure recognition algorithm (B-DHIP) in conjunction with predicted secondary structure profile comparison.

Initially, primary structure B-DHIP comparison was combined with predicted secondary structure sequence-sequence comparison (method 022), with the secondary structure weighting fixed at 1.0. The results were noticeably worse than B-DHIP alone (method 012), probably due to the high weighting given to the predicted secondary structure; this suggests that the simplex optimisation algorithm failed to find the globally optimal parameters for the method. However, given the high relative weighting for the predicted secondary structure comparison in the final parameter set, the experiment did demonstrate the power of adding this additional information to a recognition method, since it outperformed most of the other recognition methods.


Based on the results of method 022, the B-DHIP primary structure comparison algorithm was optimised with the BASIC1, B-DHIP, and dot-product2 predicted secondary structure comparison algorithms, either with no fixed weightings or (this time) with primary structure weighting fixed at 1.0. Given that the predicted secondary structure profiles were derived from PSIPRED, the Pearson's Correlation Coefficient of predicted secondary structure probability vectors comparison method (Pearson1) was also combined with B-DHIP. The reason for this was that PSIPRED produces probability profiles rather than log-odds score profiles for functional use; ignoring log-odds scores was not a problem since no log-odds scores extrapolated by `Dynamic' contained any information absent in the probability profiles.

In each case, there were improvements in either the recall at high precision or the number of queries annotated at high precision, or both, over method 012. However, when compared to method 011, some of these changes were either negligible or were, in fact, reductions in performance. The use of the BASIC1 method when comparing predicted secondary structure profiles had mixed results. Method 023 was outperformed by method 011 in all three measures of efficiency. However, method 024 was comparable to method 012 (and slightly better in terms of high precision results). A possible explanation for this may be that one of the major components in the BASIC1 algorithm (i.e. the log-odds substitution matrix for predicted secondary structure) was artificially created with no analytical evaluation (see § 2.4.1.2, page 111). If an analytically derived substitution matrix were to be used, such as the one that was later suggested by Wang & Dunbrack (2004), the results may have been improved.

Methods 025 and 026 (using B-DHIP to compare predicted secondary structure profiles) failed to show much improvement over primary structure B-DHIP alone, except in the percentage of high precision recall. The predicted secondary structure substitution matrix may have also contributed to this because (when only raw profile probabilities are available) `Dynamic' uses the values of the given substitution matrix in order to extrapolate the log-odds scores of the predicted secondary structure profiles (see § 3.3.3, page 129). Since the predicted secondary structure substitution matrix was constructed using arbitrarily chosen values (i.e. +1 for correct matches and -1 for incorrect matches), any log-odds scores derived by `Dynamic', and used by the B-DHIP method, may have been unrepresentative of their true values.

Methods 027 to 030 compare predicted secondary structure profiles without any use of the respective substitution matrix, and the results appear noticeably better than any result from methods 022 to 026. Method 030 (combining B-DHIP with Pearson1) could be regarded as the single best performing method given its AP value, the percentage recall at high precision, and the number of queries correctly annotated at high precision.

Finally, method 031 (using Pearson's Correlation Coefficient of predicted secondary structure log-odds score vectors, Pearson2) shows a fall in performance, in every measure of recognition, when compared to the B-DHIP primary structure comparison alone.

It is interesting to note that all the methods that make use, either directly or indirectly, of the predicted secondary structure substitution matrix (methods 022 to 026, plus 031) fail to show any substantial improvement in performance over the single best primary structure based recognition method. In several cases, there is actually a fall in performance. All other methods were shown to provide noticeable, though not dramatic, improvements. The fact that the Pearson Correlation Coefficient was able to perform so well when comparing predicted secondary structure profiles shows the power of the algorithm, which was not evident when used on primary structure profiles. Clearly, when constructing a fold recognition algorithm, it is important to take into account the purpose for which the input data is designed. In the case of PSI-BLAST profiles, the information content is contained within both the log-odds scores and the probabilities; therefore, ignoring either one of these would be a failure to make use of all available information. In the case of PSIPRED profiles, the information content is contained purely within the probabilities; therefore, any attempt to calculate log-odds scores using unsubstantiated background information is likely to produce erroneous data.
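The distinction can be made concrete: a PSI-BLAST-style log-odds score is recoverable from the observed probability only with the help of a background model, so the two representations carry the same information whenever a meaningful background exists, and manufacturing a background where none exists adds nothing. A sketch (function names, scale factor, and background frequency are illustrative, not `Dynamic''s actual conversion):

```python
import math

def log_odds(prob, background, scale=2.0):
    """Log-odds score for an observed amino acid probability against its
    background frequency (the half-bit-style scale factor is illustrative)."""
    return scale * math.log2(prob / background)

def prob_from_log_odds(score, background, scale=2.0):
    """Invert log_odds: the round trip loses no information."""
    return background * 2.0 ** (score / scale)

p, b = 0.30, 0.05
s = log_odds(p, b)
assert abs(prob_from_log_odds(s, b) - p) < 1e-12   # lossless round trip
```

For three-state PSIPRED probabilities there is no principled choice of `background`, which is why scores manufactured from them proved unrepresentative.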

3.5.4  ROC Analysis of Methods

When comparing the performance of recognition algorithms, the most common approach is to use a metric that compares the number of true relationships against the number of false relationships as the theoretical measure of accuracy (in this case the EP value) decreases. A widely used measure is the truncated receiver operating characteristic (ROC) (Gribskov & Robinson, 1996). ROCn is calculated as the sum of the number of true positives found before the first 1, 2, 3, . . . , n false positives (t_i), divided by n times the overall number of true positives in the databank (T):

\mathrm{ROC}_{n} = \frac{1}{nT} \sum_{i=1}^{n} t_{i} \qquad (3.20)

This research used the same techniques employed by Panchenko (2003), originally described by Schäffer et al. (2001), in order to perform ROC analyses. In these works, the distribution of ROC values is shown to be approximately normal, and its variance can be calculated analytically as:

\sigma^{2}(\mathrm{ROC}_{n}) = \frac{\sum_{i=1}^{n} \left( t_{n+1} - t_{i} \right)^{2}}{n^{2} T^{2}} \qquad (3.21)

where \sigma^{2}(a) represents the variance of a. They also suggest a methodology for statistically comparing ROC values. They show that the mean of the difference between two ROC values, under typical conditions, can be regarded as:

\mu(\mathrm{ROC}_{n} - \mathrm{ROC}'_{n}) = \mu(\mathrm{ROC}_{n}) - \mu(\mathrm{ROC}'_{n}) \approx \mathrm{ROC}_{n} - \mathrm{ROC}'_{n} \qquad (3.22)

where \mu(a) represents the mean of a, and \mathrm{ROC}'_{n} is a ROC value different from \mathrm{ROC}_{n}. The variance of the difference between two ROC values, under typical conditions, can be regarded as:

\sigma^{2}(\mathrm{ROC}_{n} - \mathrm{ROC}'_{n}) \approx \sigma^{2}(\mathrm{ROC}_{n}) + \sigma^{2}(\mathrm{ROC}'_{n}) \qquad (3.23)

(3.23)

Using these values, it is fairly simple to compare the ROC values for each method using a two-sample unpooled Student's T-test. It should be noted that Equation 3.23 ignores a final correction term from the true analytical calculation, which leads to an overestimation of the standard deviation. As a result, any T-test results are conservative.
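Equations 3.20 to 3.23 can be computed as follows (a sketch; the function names and the handling of t_{n+1} are assumptions, not the exact implementation used here):

```python
import math

def roc_n(hits, total_true, n=50):
    """Truncated ROC (Equation 3.20): `hits` is the ranked list of
    True/False calls; t_i counts the true positives found before the
    i-th false positive.  Returns (ROC_n, [t_1 .. t_n])."""
    t, tp = [], 0
    for correct in hits:
        if correct:
            tp += 1
        else:
            t.append(tp)
            if len(t) == n:
                break
    t += [tp] * (n - len(t))        # fewer than n false positives found
    return sum(t) / (n * total_true), t

def roc_variance(t, total_true, tp_after=None):
    """Equation 3.21.  `tp_after` plays the role of t_{n+1}, the true
    positive count before the (n+1)-th false positive; if unknown, the
    last counted value is used as an approximation."""
    n = len(t)
    t_next = t[-1] if tp_after is None else tp_after
    return sum((t_next - ti) ** 2 for ti in t) / (n ** 2 * total_true ** 2)

def roc_z(roc_a, var_a, roc_b, var_b):
    """Equations 3.22-3.23: test statistic for the difference between two
    ROC values, assuming approximate normality."""
    return (roc_a - roc_b) / math.sqrt(var_a + var_b)
```

For the ranked calls [True, True, False, True, False] with three true positives in the databank, t = [2, 3] and ROC_2 = 5/6.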

Gribskov & Robinson (1996) recommend ROC50 as providing an accurate reflection of accuracy for large databank searches. Based upon the added recommendations of Schäffer et al. (2001), both ROC50 and ROC100 analyses were performed on all 31 optimised `Dynamic' recognition methods. The values of 50 and 100 were chosen on the basis of there being 50 individual query proteins in the testing data set, and of each query having a minimum of two correct answers in the fold library.

Each of the analyses performed showed how significantly different the various recognition methods are, when comparing ROC values. The results for the ROC50 analysis can be seen in Figure 3.2 (page 165), and the results for the ROC100 analysis can be seen in Figure 3.3 (page 166). Any squares coloured blue indicate that the respective method along the top of the chart is significantly better than the respective method down the right-hand-side of the chart. Any squares coloured red indicate that the respective method along the top of the chart is significantly worse than the respective method down the right-hand-side of the chart. In both cases the darker the colour of the square the more significant the difference. White squares represent no significant difference.


The results from these analyses generally agree with the comments made above about each individual method. Methods 011, 012, 025, 028, 029, and 030 are all shown to be statistically better (or no worse) than all other methods (including each other); each of them uses the B-DHIP algorithm for aligning primary protein structure.

3.6 Benchmarking Discussion

As mentioned in § 3.2 (page 127), the primary goal of this chapter was to develop a series of optimised recognition algorithms that could provide substantially more coverage of the recognition search space than any single method. This could be regarded as a success, given the results in Figure 3.2 (page 165) and Figure 3.3 (page 166). Ideally, these figures should have as few white squares as possible (these signify that two methods are not significantly different); whether particular methods are better (blue) or worse (red) is not important. What is important is that all the methods are as different from each other as possible.

The aim of this element of this research was to emulate a situation, similar to the way in which many Meta servers are developed, where the quality of the final ensemble is the main focus, rather than the individual constituent algorithms. This is not to say that the results of this benchmark are uninteresting; as more results were produced, it became possible to notice patterns that could potentially be useful in future work that explores further enhancement of recognition algorithms. The conclusions described below centre on the use of structural information and the nature of the input data.

Reviewing the results from the `Dynamic' benchmark offers some insights into the underlying nature of many of the individual recognition algorithms. Perhaps the most striking aspect of the first six analyses (methods 001 to 006) is that there is very little difference between the AP values, high precision recall values, and the


Figure 3.2: ROC50 analysis of all optimised `Dynamic' classifiers. It is read from the top down, therefore it describes method 012 as being significantly better than method 010 at the 0.05% significance level. White squares indicate that there is no statistically significant difference between the two methods. Significance scores were calculated using a T-test. Mean values, and standard deviations, were calculated using the equations from Schäffer et al. (2001); Panchenko (2003). Images created with matrix2png (Pavlidis & Noble, 2003).


Figure 3.3: ROC100 analysis of all optimised `Dynamic' classifiers. It is read from the top down, therefore it describes method 012 as being significantly better than method 010 at the 0.05% significance level. White squares indicate that there is no statistically significant difference between the two methods. Significance scores were calculated using a T-test. Mean values, and standard deviations, were calculated using the equations from Schäffer et al. (2001); Panchenko (2003). Images created with matrix2png (Pavlidis & Noble, 2003).


number of queries annotated at high precision, when comparing structure-specific and non-structure-specific penalties (although the ROC analyses between methods 001 and 002 do mark them as being significantly different). This suggests that alignment accuracy (from the point-of-view of individual residue alignment) may not necessarily be a prerequisite for efficient template recognition. This would further suggest that features important for recognition (e.g. structural and functional motifs) may occur as discrete, ungapped elements in protein chains, with the accuracy of individual residue alignment between these elements having little effect on the overall efficiency of the recognition algorithm.

If the above assumptions are true, then it would be interesting to see how the final alignments between methods using structure-specific and non-structure-specific gap penalties would compare when analysing homologous relationships. Some recent studies (performed as a side interest to this research), using the same training and testing data, have examined the efficiency of recognition algorithms that use ungapped, equal-sized fragments of query proteins to search through the template fold library (data not shown). The most striking results showed that many of the most recognisable features in the query proteins tended to occur in regions of conserved functionality; in addition, these regions frequently consisted of discrete clusters of overlapping fragments that were separated by large, contiguous gaps.

A further examination of this theory can be made from the results for methods 011, 012, and 016 (using the B-DHIP algorithm on primary structure with structurespecific gaps, non-structure-specific gaps, and ungapped alignments respectively). While the measures of accuracy for methods 011 and 012 were fairly similar, the same values for method 016 were noticeably lower -- particularly the percentage recall at 95% precision (13.8% less than method 011 and 15.0% less than method 012). Since the results obtained with method 016 were ungapped, any homologous relationships could only be identified by aligning stretches of contiguous residues, suggesting that the discrete nature of features is important for recognition. Such


hypotheses could form the basis of future work examining the nature of features that are key to fold recognition.

When analysing the methods that used profile-profile algorithms, there is a noticeable difference in quality between those that used the B-DHIP algorithm in the primary structure alignment and those that did not, i.e. the BASIC and dot-product variants. Given the success of the BASIC and ORFeus fold recognition servers, the question arises as to why these algorithms performed relatively poorly, in comparison to the B-DHIP algorithm, in this research. In both the BASIC and ORFeus servers, much of the recognition quality is dependent on the preprocessing of input profiles. The profiles in this research are (more or less) taken directly from PSI-BLAST outputs, whereas the profiles used in the BASIC and ORFeus server fold libraries are often built using more elaborate methods, including such techniques as the normalisation of score distributions. It is highly likely that these profile-profile alignment algorithms are better suited to these types of data, particularly when compared to the B-DHIP algorithm (which was specifically designed to be used with PSI-BLAST profiles).

Similar conclusions can be drawn on a more general level when considering how profile data is designed to be used, and whether that data has any analytical credibility. For example, using the BASIC1 algorithm to compare secondary structure profiles, in combination with the B-DHIP algorithm for primary structure profiles, actually decreased recognition accuracy, when compared to B-DHIP primary structure profile comparison alone (method 023 compared to method 012). However, when the secondary structure profiles were compared using the Pearson Correlation Coefficient for the probability vectors (methods 029 and 030) there were improvements over methods 011 and 012 (though not significant). There are several ways in which these results can be interpreted. Given that the secondary structure substitution matrix used in this research was not analytically derived, it is possible that this may have actively hindered recognition accuracy. As noted in § 3.5.3 (page 159),


had an analytically derived substitution matrix been used, then the results might have improved. Another possible explanation is that, when comparing profile information, it is important to consider the nature of the data being used. The secondary structure profile vectors used in this research were taken directly from PSIPRED as a series of probabilities; therefore, the Pearson Correlation Coefficient would have been an ideal tool with which to compare them (leading to an improvement in accuracy). However, the BASIC1 algorithm is less likely to function effectively in this instance since it requires a degree of profile preprocessing and use of a (in this case non-analytically derived) substitution matrix. The strength of algorithms that employ background information (e.g. BASIC, PROF SIM) lies in their ability to distinguish a recognition signal from background noise; the B-DHIP algorithm does not need to do this as it already uses the Henikoff weightings that are intrinsically encoded into the PSI-BLAST profiles when they are calculated (see Appendix B.2, page 244); therefore, there is no need to add an extra layer of complexity.
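A comparison of per-residue secondary structure probability vectors via the Pearson Correlation Coefficient, as used in methods 029 and 030, can be sketched as below. This is a minimal illustration only: the example probability values are hypothetical, and the function names are my own.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical PSIPRED outputs for one residue position in the query
# and one in the template: probabilities of coil, helix, and strand.
query_residue    = [0.10, 0.85, 0.05]   # confident helix
template_residue = [0.15, 0.80, 0.05]   # also confident helix
score = pearson(query_residue, template_residue)
```

Because PSIPRED emits raw probability vectors, this comparison needs neither a substitution matrix nor profile preprocessing, which is consistent with the argument above.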


Chapter 4 Development and Optimisation of an Enhanced Fold Recognition Ensemble

4.1 Summary

This chapter describes the methodology and assessments used in developing the enhanced fold recognition ensemble system in `Phyre'. § 4.2 provides a brief overview of the current state-of-the-art with respect to ensemble development in protein fold recognition. This section also describes how the Empirical Precision (EP) metric could be used to circumvent some of the shortcomings of the more successful fold recognition ensemble methods. § 4.3 describes an unsuccessful attempt to produce a fold recognition ensemble using the established methods of Bagging and Boosting. It also offers an explanation as to why these methods failed to work. § 4.4 gives an outline of how the training and testing data for ensemble development were built; it describes how all correct examples within the training and testing data sets that could be confidently identified were removed, so that all correct answers produced by an ensemble


were only produced as a result of the ensemble. § 4.5 details an analysis using Support Vector Machines (SVMs) as a means of generating an ensemble. The analysis shows that, by combining the confidence measures of individual recognition algorithms, an SVM is capable of producing a highly effective ensemble. It goes on to show that, by combining these results with additional structural information, the accuracy of the ensemble could be greatly enhanced. Finally, § 4.6 details an analysis of one of the most successful Meta server techniques, 3D-JURY, and shows how it can be extended to incorporate confidence measures from individual, constituent recognition methods. This new protocol, 3D-COLONY, is shown to be more effective than 3D-JURY at high confidence protein fold recognition. The results suggest that controlled development of Meta servers has the potential to be far more effective than traditional Meta server development which, by necessity, must discard large amounts of potentially useful information.

4.2 Introduction

Following the optimisation and benchmarking of many different fold recognition algorithms using `Dynamic' (see § 3, page 126), the final stage in the development of the `Phyre' system was to design and construct a robust fold recognition ensemble.

Construction of ensemble systems is a well-established field of computational research in its own right. Work in this field is referred to as building an ensemble of classifiers. A classifier can describe any form of algorithm that is capable of categorising input data; however, in the specific context of fold recognition, a classifier refers to any single, stand-alone fold recognition server or method. Therefore, a fold recognition ensemble consists of a number of fold recognition methods combined in such a way as to increase the accuracy of the ensemble over its component classifiers. The accuracy of fold recognition ensembles is measured according to the number of


query-template (QT) pairs that it correctly annotates at high precision (95% or above).

4.2.1 State-of-the-Art Ensembles

The application of ensemble techniques to the problem of fold recognition, in the form of Meta servers, has only been comparatively recent (CASP5 in 2002). As described in § 1.6 (page 63), Meta servers are currently the most successful method of fold recognition available; however, the infrastructure of a typical Meta server is highly dependent on external resources and largely uncontrollable. For example, a given Meta server often utilises the models produced by a number of individual recognition servers as its input data; since each constituent server is likely to use a different scoring scheme and template library for the models it produces, any associated confidence scores are unlikely to be comparable. In some cases, it may be possible to develop a server-specific protocol to normalise any confidence scores produced. However, since the constituent classifiers of a Meta server are usually developed in different laboratories, using different training and testing data, and are only accessible via the internet, any attempt to do this is likely to be inadequate.

Similarly, there is never any guarantee (or degree of control) over how many input servers will be available at any given period of time. If a Meta server develops a dependency whereby it fails to work correctly if any individual classifier suddenly becomes unavailable, then this can lead to problems of inconsistency and lack of reliability. To overcome these limitations, and to try to improve robustness, it is not unusual for a given Meta server to take the top n results from each of its constituent servers (where n is usually a value between 1 and 10) and then disregard their source and associated confidence scores. By doing this, the Meta server is less likely to be susceptible to problems that are beyond its control (e.g. the sudden loss of a constituent server, inconsistencies between fold libraries, etc); however, it also means that a large quantity of potentially useful information is ignored.
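The top-n strategy described above can be sketched in a few lines. All names and the data layout below are hypothetical; real Meta servers differ in their details, but the essential point is the deliberate discarding of server identity and confidence scores.

```python
def pool_top_models(server_results, n=10):
    """Typical Meta-server input handling: keep each server's top-n
    models, discarding server identity and the non-comparable
    confidence scores.

    server_results: {server_name: [(model_id, confidence), ...]},
                    each list already sorted best-first by that
                    server's own (incomparable) scoring scheme.
    Returns a flat list of model identifiers only.
    """
    pooled = []
    for models in server_results.values():
        # Keep only the identifiers of the top-n models per server.
        pooled.extend(model_id for model_id, _score in models[:n])
    return pooled
```

The pooled models would then typically be compared by structural clustering, since no comparable scores survive the pooling step.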


4.2.2 A New Approach to Fold Recognition Ensembles

At the time of writing, no large-scale studies exist that question the optimal methodology of building a fold recognition ensemble, although there have been small-scale experiments that use several methods to build a consensus approach (Fischer, 2000). As discussed above, a major obstacle to performing any such large-scale, analytical benchmark has been the use of confidence measures that were disparate and not comparable across the individual recognition servers. § 2.5.3 (page 120) outlines this issue in greater detail, and proposes the Empirical Precision (EP) metric as a potential solution to the problem. By using a normalised metric for measuring the confidence of an individual recognition method, it becomes possible to apply some of the more well-established techniques of analysing ensemble systems (as well as several others) to fold recognition Meta servers.

This work presents an in-depth analysis of various methods of ensemble construction and techniques of performance enhancement. It also discusses the consequences of centralised design, the major implementation issues encountered, and implications for future development.

4.3 Bagging and Boosting Ensembles

Reviewing the literature on building traditional ensembles of computational machine-learning algorithms, Kuncheva & Whitaker (2003) showed that the rational design approach (i.e. using logic effectively to combine classifiers) is an ongoing challenge. However, their conclusions, and the conclusions taken from Whitaker & Kuncheva (2003), also act as a reminder that Bagging and Boosting (see § 1.7, page 83) are commonly regarded as the best available methods for generating classifiers that will produce an ensemble of increased accuracy. It is believed that the reason for their success is that they force each new classifier included into an ensemble to be progressively more `diverse' than the last; however, these methods do not use any explicit


measure of diversity in their generation. The final diversity of the classifiers is assumed by the nature of the classifier training.

As part of this research, a brief analytical method was designed to test how well several variations of the Bagging and Boosting algorithms would perform when applied to the problem of fold recognition. This section describes the methodologies used and the results obtained during their analysis. In each instance, the resulting ensembles failed to produce any improvement in recognition accuracy, and, in most cases, actually caused a decrease in accuracy.

4.3.1 Bagging Benchmarking

The details of the Bagging algorithm are outlined in § 1.7.3 (page 90). Briefly, Bagging refers to the process by which a sample of finite training data is selected at random (with replacement) and used to train a system. This random sampling is repeated n times, to generate n classifiers (in this case, fold recognition methods), where n is some integer value chosen according to estimated computational feasibility and empirical trade-offs between performance and training time. The final classification of a query is taken as the majority vote of all the individual classifiers.
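The generic Bagging procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: `train_classifier` stands in for whatever fitting procedure is used (here it is a hypothetical callable that returns a trained classifier), and the majority vote assumes discrete class labels.

```python
import random

def bagging(train_set, train_classifier, n_classifiers, seed=0):
    """Generic Bagging: draw n bootstrap samples (with replacement) and
    train one classifier per sample.  `train_classifier` is a
    hypothetical callable that fits on a list of training examples and
    returns a callable classifier."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(n_classifiers):
        # Bootstrap sample: same size as the original, drawn with replacement.
        sample = [rng.choice(train_set) for _ in train_set]
        classifiers.append(train_classifier(sample))
    return classifiers

def majority_vote(classifiers, query):
    """Final classification: the label predicted by most classifiers."""
    votes = {}
    for classify in classifiers:
        label = classify(query)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

As the text notes, the majority-vote step is the part that transfers poorly to fold recognition, where classification is continuous rather than discrete.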

As noted in § 1.7.3 (page 90), it is difficult to apply this technique to fold recognition, while maintaining a strict criterion of focusing on extremely remote homologues; the size of the training data set becomes too small to be useful. Also, classification of query proteins is not a discrete task (i.e. there is no strict threshold defining classification bins), so combining the test results in a majority voting scheme will not necessarily work effectively.

A small Bagging analysis was done on the B-DHIP recognition algorithm (the single best algorithm from § 3.5, page 142), using the training and testing sets described in § 2.4.2 (page 113), plus an additional two training sets (training sets 2 and


3) constructed in the same way as the original. Only two parameters were optimised: the gap penalty (insertion and deletion opening penalties) and the z-shift. The affine gap penalties were taken to be 1/10 of the gap penalty. Each optimised classifier was built using simplex parameter optimisation, and all tests were performed on the single testing set. In order to stay close to the original Bagging algorithm, the EP values for the test results were averaged (rather than voted) when combined, and then re-sorted.
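The EP-averaging combination used in place of voting can be sketched as below. The function name and data layout are illustrative assumptions; only the averaging-then-re-sorting logic is taken from the text.

```python
def combine_by_mean_ep(result_sets):
    """Average Empirical Precision (EP) scores per query-template pair
    across classifiers, then re-sort best-first.

    result_sets: list of {(query, template): ep_value} dicts,
                 one per bagged classifier."""
    totals, counts = {}, {}
    for results in result_sets:
        for pair, ep in results.items():
            totals[pair] = totals.get(pair, 0.0) + ep
            counts[pair] = counts.get(pair, 0) + 1
    averaged = {pair: totals[pair] / counts[pair] for pair in totals}
    # Re-sort the combined results by mean EP, highest first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)
```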

The Bagging analysis included the test results from method 012 from the `Dynamic' benchmarking (see Table 3.2, page 143) and the results from running the testing data set with the optimised parameters from training sets 2 and 3. The parameters from training sets 2 and 3 were different from those used in method 012; however, they were almost identical to each other, and similar to those used in method 014 from the `Dynamic' benchmarking. When the three sets of results were combined, there was no significant improvement in accuracy of the ensemble over the individual constituent methods.

4.3.2 Boosting Benchmarking

The details of the Boosting algorithm (specifically, the AdaBoost algorithm) are described in § 1.7.4 (page 91). Briefly, Boosting generates a series of classifiers by training them on a given set of input data (in this case, query proteins); each element in the training set (i.e. each query protein) begins with its own individual weight (usually 1), and the final weight for the classifier is calculated from the sum of the weights of the incorrectly classified training set elements. For the subsequent rounds of training, the individual weight of each query protein is altered according to two factors: the final error rate of the classifier, and whether or not that query was classified correctly. With these reweighted training elements, the cycle continues until the desired number of classifiers has been built. Essentially, the easier training elements are given lower weights and the harder training elements given


higher weights in order to force the new classifiers to focus on the harder elements.

Once all the classifiers have been built, the final classification of a testing query is taken as a weighted majority vote of all the individual classifiers. The weight of each classifier is dependent on its error rate during its training. The overall aim of Boosting is to force an algorithm to learn how to classify training instances that it was previously unable to classify.
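The continuous adaptation of this loop used in this research (Equations 4.1–4.3) can be sketched as follows. This is an illustrative simplification: `train_classifier` is a hypothetical callable standing in for the simplex optimisation, and the weights are normalised from the outset (weight 1 per query, rescaled to sum to 1) so that the classifier error stays below 1.

```python
def boost(queries, train_classifier, n_rounds):
    """Continuous AdaBoost-style loop as adapted for fold recognition.

    train_classifier(queries, weights) is a hypothetical callable that
    optimises a classifier on the weighted queries and returns, per
    query, an error rate defined as 1 - average precision (AP)."""
    # Weight 1 per query, normalised so the weights form a distribution.
    weights = {q: 1.0 / len(queries) for q in queries}
    ensemble = []
    for _ in range(n_rounds):
        query_errors = train_classifier(queries, weights)  # {query: 1 - AP}
        # Equation 4.1: classifier error = sum of weight x error.
        clf_error = sum(weights[q] * query_errors[q] for q in queries)
        beta = clf_error / (1.0 - clf_error)               # Equation 4.3
        ensemble.append((query_errors, beta))
        # Equation 4.2: reweight each query by its own error rate.
        for q in queries:
            weights[q] *= beta * query_errors[q]
        # Renormalise all query weights to sum to 1.
        total = sum(weights.values())
        weights = {q: w / total for q, w in weights.items()}
    return ensemble
```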

Like Bagging, Boosting is difficult to apply to fold recognition. In order to implement the AdaBoost algorithm as part of this research, several conditions were necessary:

· Since the classification of query proteins is continuous rather than discrete (i.e. there is no such thing as a strictly `correct' or `incorrect' answer, only degrees of confidence in the correct classification), it was not appropriate to use the standard discrete functions (one function for correct and one function for incorrect queries) to recalculate the query weights. Instead, a single continuous function was needed which would vary according to how accurately each query was classified. To solve this problem, the average precision (AP) of a given query was used to measure its error rate (1 − AP).

· Similarly, since the idea of strictly `correct' and strictly `incorrect' queries did not apply to this system (i.e. there was no discrete boundary separating the two), each classifier's error rate was calculated by summing the products of the individual weight and error rate for each query:

    Classifier Error = Σ_{All Queries} Query Weight × Query Error         (4.1)

· All queries in the training set began the first round of training with weights of 1 and error rates of 0.


· Since the algorithm requires an individual error rate and weight for every query in the training set, it was not appropriate to use the AP of the concatenated training results as the measure of overall accuracy (see Equation 2.3, page 118). Instead, the mean-AP (averaging the APs across each query; see Equation 2.4, page 118) was used.

· A classifier was built when the simplex found an optimal set of search parameters by minimising its fitness function. The fitness function of the simplex was defined as the classifier error as determined by its search parameters.

· Since the correctness of queries cannot be discretely classified, for each new classifier the query weights were recalculated according to their individual error rates:

    Query Weight_new = Query Weight_old × β × Query Error_old             (4.2)

where:

    β = Classifier Error / (1 − Classifier Error)                         (4.3)

· All query weights were then normalised to sum to 1.

For the purposes of this analysis, Boosting was tested using the B-DHIP algorithm (see § 4.3.2.1, page 177) and a simple sequence-sequence comparison algorithm (see § 4.3.2.2, page 178).

4.3.2.1 B-DHIP Boosting

The B-DHIP alignment algorithm was chosen because it was the best performing recognition algorithm from the `Dynamic' benchmark (see § 3.5, page 142); the aim was to see if any improvement could be made to its recognition performance by refocusing subsequent training rounds on poorly classified training examples. A simplex was used for each round of training to optimise the search parameters; deletion and


insertion opening penalties were treated as a single parameter (the gap penalty), while the respective affine penalties were taken to be 1/10 of the gap penalty. The only other optimised parameter was the z-shift for the alignment. Overall, ten rounds of Boosting training were performed. However, after the fifth round there was no change in the optimised parameters; the more difficult training queries were weighted more and more heavily with little change in the classifiers' accuracy.

In keeping with the original AdaBoost algorithm (see Algorithm 2, page 92), during testing each classifier in the final ensemble is weighted according to its training error. In the original algorithm, the classifiers were weighted by log(1/β), where β is defined as in Equation 4.3 (page 177). However, as an alternative, weighting the classifiers by 1/β was also tested. Similarly, there was more than one way to assess the accuracy of the final ensemble as more weighted classifiers were added. Since each classifier was trained according to the value of the mean-AP over all the training queries, the natural progression would be to use this metric. However, in order to test the ensembles in a way that would be comparable to the `Dynamic' benchmark, the AP of the concatenated and re-sorted results was also used. Overall, the two methods of weighting classifiers and the two methods of measuring the accuracy of the final ensembles gave a total of four different assessment combinations. In addition, all four combinations were also assessed by measuring the percentage recall, and the number of individual queries correctly annotated at 95% precision, for the concatenated results. The reason for this was that, while these were not the metrics used for optimisation, these measures would still be useful in examining how well the ensembles would perform in a real-world scenario. In each case, and for every metric, the addition of more classifiers to the ensemble had a detrimental effect on accuracy.

4.3.2.2 Sequence-Sequence Boosting

Due to the failure of the Boosting using the B-DHIP algorithm, another attempt was made with a simple sequence-sequence comparison algorithm. This analysis


was performed to see if perhaps the B-DHIP algorithm was too strong to benefit from Boosting, as it requires an intrinsically weak algorithm in order to work. Once again, simplexes were used during each round of training to optimise the search parameters, which consisted solely of an opening gap penalty. Affine gaps were again taken to be 1/10 of the respective gap penalty.

After ten rounds of Boosting training, all results (bar a difference of 0.02 between the first and second rounds) were identical, so no further analyses were performed as any further results would have also been identical.

4.3.3 Discussion

In reviewing the results of the Bagging and Boosting analyses, none of the ensembles constructed produced any significant improvement in fold recognition accuracy over single methods. In fact, adding more classifiers had a noticeably detrimental effect on almost all of the Boosting ensembles.

The relatively small amount of training data available not only limited the scope of the Bagging analysis, but also made it impractical as a means of building a working ensemble. During the few possible rounds of training, there was a small variation in the parameters produced by the simplexes, but no difference in accuracy when the individual classifiers were combined.

The B-DHIP algorithm had shown itself to be a relatively strong algorithm in the `Dynamic' benchmark; even adding secondary structure information provided relatively little improvement in its accuracy. As the Boosting algorithm kept producing the same parameter sets from consecutive simplexes, regardless of how heavily the harder training queries were weighted, this suggests that the algorithm's performance had peaked. The harder queries were simply being weighted more heavily without any noticeable effect. This, in turn, suggests that there are limitations to


how well the B-DHIP method can perform under the conditions set for the analysis (i.e. equivalent insertion and deletion penalties, affine gap set to 1/10 of the opening gap, using SCOP30 version 1.65 as a fold library, etc). If this is true, it also suggests that there are some training queries that are unlikely to be accurately recognised under the same conditions, regardless of how heavily they are weighted in a simplex.

Since the AdaBoost algorithm was developed to increase the accuracy of weak classifiers in an ensemble (see § 1.7.4, page 91), it was decided to perform a Boosting analysis on the sequence-sequence comparison method as the benchmarks had shown that this was significantly weaker than B-DHIP. As described, all parameter sets from the training rounds were almost identical, meaning that there would be no increase or decrease in accuracy when the classifiers were combined in an ensemble. These results confirm that AdaBoost is more robust when using a weaker classifier; however, they do not explain the failure of AdaBoost to show any improvement in accuracy.

As noted by Dietterich (2000), methods of building ensembles using training data manipulation (including Bagging and Boosting) work especially well for unstable learning algorithms, i.e. generic learning algorithms whose output classifiers undergo major changes in response to small changes in training data. These methods could potentially classify any given input perfectly, if they were subjected to the appropriate training. Since the training data for a recognition method help to determine its parameters, this particular definition of `unstable' means that the parameters should also influence the underlying mechanisms of the algorithm. For example, decision-tree and neural network algorithms are regarded as unstable, while linear regression and nearest-neighbour algorithms are generally regarded as stable. As mentioned in § 1.7 (page 83), changing the parameters of a neural network alters the fundamental workings of its search algorithm, giving it much greater coverage of the hypothesis space. The recognition methods used in this research do have changeable parameters; however, changing these parameters can never alter the


basic algorithm of each method. As a result, there is no guarantee that a sequence-based dynamic programming recognition algorithm will produce a perfect result for a given query when using a set of globally optimised parameters. This was particularly noticeable in the Boosting training for the B-DHIP algorithm, where weighting the harder training examples more heavily eventually caused a plateau, where no new parameter sets were found. From these results it can be concluded that the main reason for the failure of Bagging and Boosting, from the point of view of this research, is representational; some queries can never be classified by these dynamic programming recognition algorithms because the algorithms themselves do not possess the innate flexibility necessary for correct classification.

A more general conclusion that can be extracted from this analysis is that complementarity is what an ensemble needs in order to increase its accuracy over any individual classifier: firstly, an underlying complementarity (i.e. a tendency among all the classifiers to be right more often than wrong), and, secondly, a variation of complementarity between various subsets of the classifiers. This is illustrated in Figure 4.1 (page 183). The weight of a particular classifier in an ensemble should reflect not only its error rate, but also its similarity to the pool of classifiers as a whole. This also explains why AdaBoost works best for binary classifiers that have at least a 50% accuracy rate: two classifiers that are at least 50% accurate are guaranteed to be complementary for at least one input query, and are very likely to be different for many others.

The need for flexibility of classifiers in an ensemble explains why a typical Meta server may work without initial training, but the above analyses fail under controlled conditions. By combining many disparate methods into a Meta server, what is essentially being achieved is a form of pseudo-Bagging/Boosting, i.e. the necessary flexibility of classifiers needed in an ensemble is emulated by artificially using many fundamentally different algorithms. Therefore, greater success may be seen when an ensemble is built from the various algorithms used in the `Dynamic' benchmark


(e.g. correlation coefficients, vector dot-products, structure-specific penalties, etc.) because they are more likely to share the necessary variation of complementarity, even if their basal level of accuracy is relatively poor.

The failure of the Bagging and Boosting analyses shows how difficult it is to build an enhanced fold recognition ensemble in a logical and systematic way. However, the success of the Meta servers in the CASP evaluations proved that the ultimate goal of this research was achievable by simply combining a series of successful individual servers and analysing their outputs with some form of clustering algorithm. Since the problem of logical ensemble design is an ongoing issue (and far beyond the scope of this work), the remainder of this chapter will focus on methods used to build effective ensembles, purely for the purposes of enhanced recognition, rather than attempting to find an underlying rationale for how they work.

4.4 CASP-like Training and Testing

The training and testing data sets used in this research were constructed from the ASTRAL compendium using SCOP30 (version 1.65) with no pairs sharing more than 30% sequence identity (see § 2.4.1, page 109). There were 105 query training sequences (with a total of 1,124 QT relationships) and 50 query test sequences (with a total of 247 QT relationships).

By examining the nature of some of the more popular Meta servers, a consistent pattern emerges: most of them use some form of structure-based clustering of models from component classifiers in order to determine their final answer. This follows on from the assumption that `there are more ways to be wrong than there are to be right': structurally clustering similar correct results will synergistically improve their final scores, while any incorrect results are likely to be structurally different from every other model in the pool and, therefore, will not benefit from clustering.


[Figure 4.1: a grid of ten classifiers against six queries, with an `X' marking each correctly classified query.]

Figure 4.1: An illustrated example of complementarity between classifiers in an ensemble. Boxes marked with an `X' signify that a given classifier has classified a given query correctly. All incorrect classifications are assumed to be different from each other. The results for Query 1 and Query 2 illustrate the need for underlying complementarity between classifiers. Classifier subset {1, 4, 5, 7, 9} correctly classifies Query 1 but not Query 2. Classifier subset {2, 3, 6, 8, 10} correctly classifies Query 2 but not Query 1. For both Query 1 and Query 2, there is a correct majority vote from the ensemble of all ten classifiers that none of the individual classifiers would have found. The results for Query 3 to Query 6 illustrate the need for variation of complementarity between classifiers. For these Queries, the ensemble can never be better than classifier 10 because no classifier correctly classifies Query 3. In order for an ensemble to improve overall classification accuracy, each Query must be correctly classifiable by at least one classifier.
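The complementarity argument in the caption above can be made concrete with a short sketch (illustrative code, not part of this research; the fold labels are hypothetical):

```python
# Ten classifiers vote on each query; wrong answers are assumed to all differ,
# so any agreement among correct classifiers dominates the majority vote.
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among the classifiers."""
    return Counter(answers).most_common(1)[0][0]

correct = "fold_A"

# Query 1: classifiers {1, 4, 5, 7, 9} are correct; the rest are each wrong
# in a different way.
query1 = [correct if i in {1, 4, 5, 7, 9} else f"wrong_{i}" for i in range(1, 11)]
assert majority_vote(query1) == correct  # 5 agreeing votes beat any lone wrong answer

# Query 2: the complementary subset {2, 3, 6, 8, 10} is correct instead.
query2 = [correct if i in {2, 3, 6, 8, 10} else f"wrong_{i}" for i in range(1, 11)]
assert majority_vote(query2) == correct

# No individual classifier answers both queries correctly, but the ensemble does.
```

This is exactly the situation depicted for Query 1 and Query 2: the ensemble's majority vote is correct on both queries even though every individual classifier fails on one of them.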


A major problem when assessing the synergistic effect of a classifier ensemble is the issue of trivial answers. For example, a given query protein may be scanned against a template protein databank, which contains five correct homologous templates, using a particular fold recognition system from an ensemble. One homologous template (T1) may be trivial to find at high confidence, while the others (T2-T5) may be much harder to identify. In an ensemble that utilises structural clustering as part of its recognition algorithm, templates T2-T5 are now easily identifiable by virtue of their structural similarity to T1. As a result, the harder templates are found simply because there is an easy template, of similar structure, in the results of one of the constituent methods of the ensemble. It becomes impossible to distinguish between any improvement in recognition accuracy gained through the ensemble as a whole and any gained through the presence of trivial solutions in one of the constituent methods of the ensemble.

In order to solve this problem, and thereby build an ensemble that would be capable of recognising the correct folds for difficult queries (i.e. QT pairs that would otherwise not be found), the training and testing sets originally designed for the `Dynamic' benchmark (see § 2.4.2, page 113) were altered. In order to focus this research on the most remote homologies, and to examine how ensembles perform when none of their constituent methods can make a confident assignment, specific QT relationships found by individual recognition algorithms above an EP value of 0.95 (equivalent to 95% confidence) were ignored. This is not to suggest that the given QT relationships were ignored across all recognition algorithms; they were only ignored in the individual methods in which they could be confidently identified (see Figure 4.2, page 186). In order to achieve this, all of the correct training and testing results (for each of the 31 recognition methods) with EP values of 0.95 or above were reset to an EP value of 0. It is important to note that, even after all trivial correct homologous QT relationships were removed from across all individual recognition algorithms, all the 247 correct homologous relationships in the testing


set were still present within the results pool. Therefore, it was not necessary to remove any of the query proteins from the benchmark.
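The per-method filtering rule described above can be sketched as follows (a minimal sketch; the data layout and names are illustrative, not the code used in this research):

```python
# For each recognition method, correct QT relationships found at EP >= 0.95
# have their EP reset to 0; incorrect (non-homologous) hits are left untouched,
# and a QT pair ignored in one method may survive in another.

def make_casp_like(results, is_homologous, threshold=0.95):
    """results: {method: {(query, template): ep_value}} (hypothetical layout)."""
    filtered = {}
    for method, hits in results.items():
        filtered[method] = {
            qt: (0.0 if is_homologous(qt) and ep >= threshold else ep)
            for qt, ep in hits.items()
        }
    return filtered

homologs = {("Q1", "T1")}
results = {
    "M1": {("Q1", "T1"): 0.97, ("Q1", "T2"): 0.96},  # T2 is a false positive
    "M2": {("Q1", "T1"): 0.92},                      # below threshold: kept
}
out = make_casp_like(results, lambda qt: qt in homologs)
assert out["M1"][("Q1", "T1")] == 0.0   # trivial correct hit ignored in M1
assert out["M1"][("Q1", "T2")] == 0.96  # false positive retained
assert out["M2"][("Q1", "T1")] == 0.92  # same QT pair left unchanged in M2
```

The toy example mirrors Figure 4.2: the same homologous relationship is zeroed only in the method where it was confidently identified, while confident false positives remain.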

In order to provide a rigorous test for the ensemble systems, all false positives with EP values of 0.95 or above were allowed to remain in order to see if the ensemble would be able to distinguish them from true positives (see Figure 4.3, page 187). These CASP-like training and testing data (in homage to the CASP evaluation) allowed for a blind analysis of the use of structural information in ensembles, and a test of whether an ensemble is truly able to identify the more difficult QT pair relationships. This benchmark was designed to reflect the `real-world' situation of difficult structure prediction targets, where no individual system provides a confident answer. All ensembles were trained on the CASP-like training set, and then tested using the CASP-like testing set. As a result, any homologous relationships that were confidently detectable by the ensemble were identified solely because of the ensemble, i.e. due to the combined effect of multiple weak predictions and not because of an accurate individual method. The best performing ensembles were also tested using the full testing set (i.e. the standard test data still containing all the trivial answers), to see how well they performed in comparison to the individual fold recognition algorithms and PSI-BLAST.

4.5 Support Vector Machine Clustering

A popular and powerful method of generating an ensemble is to use support vector machines (SVMs; see § 1.7.5, page 93). The power of SVMs lies in their ability to learn complex associations with low risk of over-learning for a given training set. Their speed and efficacy make them an ideal means for combining fold recognition data into an ensemble. For these reasons it was decided to perform several analyses using SVMs to assess whether they could be used to build effective fold recognition ensembles using the data generated during the `Dynamic' benchmark (see Table 3.2, page 143).


[Figure 4.2: a grid of query-template pairs (Q1T11-Q1T13, Q2T21-Q2T23, and Q3T31-Q3T33) against recognition methods M1-M4, with each match labelled correct or incorrect.]

Figure 4.2: Illustration showing the construction of CASP-like data. Several queries (Q1, Q2, and Q3) are listed against four different recognition methods (M1, M2, M3, and M4), along with some of the templates to which the queries are matched. Table cells filled with black crosses represent correct homologous query-template (QT) relationships, which have been identified at empirical precision (EP) values of 0.95 (or above) by the respective recognition methods. These results are ignored (i.e. their EP values are reset to 0). Table cells filled with green circles represent correct homologous QT relationships (correct examples) that have been identified at EP values below 0.95 by the respective recognition methods. These results remain unchanged. Note that the same homologous QT relationships may be ignored in one recognition method, but left unchanged in another. Table cells filled with red circles represent non-homologous QT relationships (incorrect examples). These remain unchanged regardless of whether they were identified above or below an EP value of 0.95.



Figure 4.3: An illustration of CASP-like data. A single query has been scanned against a single database of different templates, using four different recognition methods (M1, M2, M3, and M4). All correct (i.e. homologous) query-template (QT) matches are shown in green, and all incorrect (i.e. non-homologous) QT matches are shown in red. In order for the data to be `CASP-like' (i.e. to make the learning task more difficult, so that the effect of the ensemble is not skewed by the presence of trivial answers), all homologous QT pairs above the 0.95 empirical precision (EP) threshold (95% confidence) are reset to 0 EP (signified by black crosses over the green circles). However, all non-homologous QT pairs above the 0.95 EP threshold are left unchanged. The purpose of this is to force an ensemble to learn how to distinguish homologous matches from non-homologous matches when the homologous templates are not easily found.


The typical format for a single instance of SVM input data (training or testing) is a fixed-length vector of features. A feature is usually a numerical value that represents a characteristic metric of some kind. For example, in a training set of vectors representing specific QT pairs, each vector is labelled as either correct or incorrect, and every vector must have the same fixed number of features. These features may include: the alignment score between the query and the template sequences for a particular algorithm and parameter set; the length of the query sequence; the length of the template sequence; the size of the alignment; or any other type of information that could potentially be useful. It should be noted that leaving a feature empty (or with a value of 0) is perfectly valid.
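Since SVMLight is the implementation used later in this chapter, its input convention is worth illustrating: one example per line, a ±1 label followed by index:value feature pairs, with zero-valued features simply omitted (a sketch with hypothetical feature values):

```python
# Serialise a fixed-length feature vector into an SVMLight-format example line.
# SVMLight (Joachims, 1999) reads one example per line: "<label> <index>:<value> ...".

def to_svmlight_line(label, features):
    """label: +1 (correct QT pair) or -1 (incorrect); features: list of floats."""
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0.0]
    return f"{label:+d} " + " ".join(pairs)

# Hypothetical vector: alignment score, query length, (empty), template length.
line = to_svmlight_line(+1, [0.85, 120.0, 0.0, 98.0])
assert line == "+1 1:0.85 2:120.0 4:98.0"  # feature 3 is omitted (value 0)
```

Omitting zero-valued features is consistent with the observation above that an empty feature is perfectly valid.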

An SVM is able to determine which training vectors best represent the boundaries between classification bins (i.e. correct or incorrect in this case), and uses these boundaries to classify any new vectors (usually unlabelled testing data). The result is that SVMs are very efficient classifiers, but essentially remain `black boxes' offering little (if any) insight into the reasons why a particular feature vector should be classified in a particular way.

All SVMs in this analysis were trained and tested on CASP-like data. In addition, they were also tested on the full testing data set for comparison (i.e. the standard test data still containing all the trivial answers). In this research, SVMLight (Joachims, 1999) was used with linear kernels and default parameters.

4.5.1 SVM 1

For the purposes of this research, the EP values determined for each of the optimised recognition algorithms from the `Dynamic' benchmark were ideally suited to be used as input features in an SVM. The simplest inputs used in this analysis consisted of vectors of 31 EP values. These vectors were constructed for every QT pair for the


queries from the training and testing sets. With a training set of 105 queries, each scanned against a fold library of 4,753 templates (not including self matches), this produced a list of 499,065 training vectors. Similarly, for a testing set of 50 queries, this produced a list of 237,650 testing vectors. As mentioned in § 4.4 (page 182), any EP values from a correct QT pair that were 0.95 or above were reset to 0. As a result, many input vectors for structurally similar QT pairs consisted purely of null values. Any false positive results were left unchanged. The way in which the input vectors were constructed is illustrated in Figure 4.4 (page 190).
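The arithmetic behind these vector counts, and the shape of a single input vector after the CASP-like reset, can be sketched as follows (illustrative code with a two-method toy case; names are hypothetical):

```python
# SVM 1 input data: one vector of EP values (one per recognition method) for
# every query-template pair in the training and testing sets.

n_methods = 31
n_templates = 4753            # fold library size, excluding self matches
train_queries, test_queries = 105, 50

train_vectors = train_queries * n_templates
test_vectors = test_queries * n_templates
assert train_vectors == 499065   # matches the count quoted above
assert test_vectors == 237650

# A toy two-method vector: a trivial correct hit (EP >= 0.95) has been zeroed
# by the CASP-like reset, while a weaker hit is kept as-is.
ep_by_method = {"M1": 0.0, "M2": 0.92}   # hypothetical EP values
vector = [ep_by_method.get(m, 0.0) for m in ("M1", "M2")]
assert vector == [0.0, 0.92]
```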

4.5.2 SVM 2

The second part of this SVM analysis consisted of using structural information in the training and testing vectors to see if it could provide any improvement in fold recognition. For a given query, a Meta server will often take the top ten (or some other fixed number of) QT pairs from each of its constituent servers, and cluster the models built from their alignments using a structural superposition algorithm (e.g. 3D-JURY). Unfortunately, simple SVMs cannot perform structural superpositions as part of their kernel function, and they cannot actively compare individual input vectors (i.e. QT pairs) against each other. This second point is important because the power of Meta servers is believed to stem from their ability to cluster models (for a given query) built by different methods and using different templates (see § 4.6.1, page 196). Since basic SVMs cannot perform these comparisons, it was necessary to perform all structural comparisons beforehand and assimilate the results into the input vectors themselves. In order to do this, it was necessary to devise a way of fixing the number of structural comparisons that were performed (so that the input vectors for the SVM would be of a fixed length). It was decided to use (for every query protein) the top template hits found by each of the recognition methods. These models would then be used as a series of background templates (essentially an absolute frame of reference) to which all models for the query could be compared (see below). This method would not only fix the number of structural comparisons,



Figure 4.4: A diagrammatic representation of the first fold recognition SVM. A single query (Q1) has been scanned against a single database of different templates, using four different recognition algorithms (M1, M2, M3, and M4). All correct (i.e. homologous) query-template (QT) matches are shown in green, and all incorrect (i.e. non-homologous) QT matches are shown in red. Template T1 is homologous to Q1, and template T2 is not homologous to Q1. The support vector machine (SVM) input vectors are constructed by taking the empirical precision (EP) value for each QT pair from each recognition method. For example, matching Q1 to template T1, for each recognition method, would form one input vector of EP values (since T1 is homologous to Q1, this would be a positive example); matching Q1 to template T2, for each recognition method, would form another vector of EP values (since T2 is not homologous to Q1, this would be a negative example). In order to make the training conditions more difficult, so that the effect of the ensemble is not skewed by the presence of trivial answers, all homologous QT pairs above the 0.95 EP threshold (95% confidence) are reset to an EP value of 0 (signified by black crosses over the green circles). However, all non-homologous QT pairs above the 0.95 EP threshold are left unchanged (CASP-like data; see § 4.4, page 182). The purpose of this is to force the SVM to learn how to distinguish homologous matches from non-homologous matches when the homologous templates are not easily found.


but would also allow the SVM to compare different input vectors indirectly, by extrapolation.

The construction of input vectors for this SVM analysis is illustrated in Figure 4.5 (page 193). Briefly, the input vectors were constructed as follows: if a QT pair was found within the results for a single recognition method (i.e. within the top 10, or above a given EP value threshold), then the model for the same QT pair from all 31 recognition methods was extracted. This set of 31 models was referred to as the result models for a given QT pair. This was done for every result from every recognition method, and QT pairs found by more than one classifier were only extracted once. In order to standardise the structural information, the top 10 models for each query, from each classifier, were also extracted (31 methods × 10 models = 310 in total per query). These were referred to as the background models for a given query. Each of the 31 result models, for a given QT pair, was compared against each of the 310 background models for the query, making a total of 9,610 structural comparisons per QT pair. The background models essentially acted as a standardised template to which all results for a given query were compared. As a result, it did not matter whether the background models were accurate or correct. This procedure has the distinct advantage that the same process can be recreated for testing data, and prior knowledge of correct and incorrect answers does not need to be assumed. Even though MaxSub (see § 1.6.2.1, page 71) was used in the original 3D-JURY algorithm, the method of structural comparison used for this analysis was the TM score (see § 1.6.2.1, page 71). The TM score was used because its scoring function rescales more efficiently than MaxSub as the query and template proteins increase in length, making it a metric that is independent of protein size. There is also evidence to suggest that the TM score shows a much stronger correlation to the quality of final full-length models than MaxSub does (Zhang & Skolnick, 2004).
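The vector construction just described can be sketched in a few lines (illustrative code; `tm_score` is a stand-in for a real TM-score implementation, and the model objects are hypothetical):

```python
# SVM 2 vectors: 31 result models per QT pair, each compared against 310
# background models (31 methods x top 10 per method), giving a fixed-length
# vector of 9,610 TM scores.

n_methods, top_n = 31, 10
n_result = n_methods                 # one result model per recognition method
n_background = n_methods * top_n     # 310 background models per query
assert n_result * n_background == 9610

def build_tm_vector(result_models, background_models, tm_score):
    """Flatten every result-vs-background TM score into one feature vector."""
    return [tm_score(r, b) for r in result_models for b in background_models]

# Toy check: 2 result models x 3 background models with a dummy scorer.
vec = build_tm_vector(["r1", "r2"], ["b1", "b2", "b3"], lambda r, b: 0.5)
assert len(vec) == 6
```

Because every query reuses the same 310 background models, two input vectors for different templates are implicitly comparable through this shared frame of reference, which is the point of the construction.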

Given the relatively large number of calculations required for this analysis, it


was considered computationally impractical to construct an input vector of 9,610 features for all 736,715 QT pairs covering all queries in the training and testing sets. Instead, if a given QT pair was found within specified limits for at least one of the individual recognition methods (e.g. within the top n results, or above a given threshold EP value), then it was included in the input data. For the training set queries, only QT pairs that had at least one EP value of 0.7 or above were included: a total of 2,920 input vectors. For the testing set queries, only QT pairs that had at least one EP value of 0.3 or above were included: a total of 3,556 input vectors (the maximum number that was considered computationally practical to produce). These testing data input vectors represented all 247 QT relationships in the testing set. In order to keep the training and testing data CASP-like, any true positive result models built from a QT alignment that had an associated EP value of 0.95 or above were ignored (i.e. their corresponding TM scores were reset to 0). As a result, many input vectors for structurally similar QT pairs consisted purely of null values -- similar to the data used in the first SVM analysis.

4.5.3 SVM 3 and 4

The third and fourth SVMs used in this analysis were designed to test the theory that fold recognition can be enhanced by including confidence values derived from a standardised scoring framework. If this theory proved correct, it would suggest that ensemble systems such as 3D-JURY can be enhanced by incorporating such information.

The third SVM was identical to the second, but included the EP values for each of the 310 background models used in the construction of the input vectors; this effectively gave the SVM a scale by means of which it could gauge how reliable the TM scores from the structure superpositions were. The fourth SVM was the same as the third, but also included the EP values from the first SVM.
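The input vector sizes for SVMs 2-4 follow directly from this construction, which can be checked with a few lines of arithmetic (a sketch; variable names are illustrative):

```python
# Vector sizes for the structural SVMs, built up feature group by feature group.
tm_features = 31 * (31 * 10)     # SVM 2: all result-vs-background TM scores
svm3 = tm_features + 310         # SVM 3: plus EP values of the 310 background models
svm4 = svm3 + 31                 # SVM 4: plus the 31 EP values used by SVM 1
assert (tm_features, svm3, svm4) == (9610, 9920, 9951)
```

These totals match the input vector sizes quoted in Table 4.1 (page 196).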



Figure 4.5: A diagrammatic representation of the second fold recognition SVM. A single query has been scanned against a single database of different templates using four different recognition algorithms (M1, M2, M3, and M4). All correct (i.e. homologous) query-template (QT) matches are shown in green, and all incorrect (i.e. non-homologous) QT matches are shown in red. Template T1 is homologous to the query. The result models for the query against template T1 are shown in the lower half of the diagram; each one represents the same QT pair taken from a different recognition algorithm. The background models are taken from the top n models from each method for the query (in this illustration, n = 10). Each result model is structurally compared to each background model using the TM score metric (see § 1.6.2.1, page 71). Since the support vector machine (SVM) cannot structurally compare the models produced for different templates, the background models represent an absolute frame of reference that the SVM can use for comparison, in order to extract additional structural information from the result models. As in Figure 4.4 (page 190), all homologous QT pairs above the 0.95 empirical precision (EP) threshold (95% confidence) are reset to 0 EP (signified by black crosses over the green circles). However, all non-homologous QT pairs above the 0.95 EP threshold are left unchanged (CASP-like data; see § 4.4, page 182). For this SVM, the TM scores that correspond to the ignored QT pairs are reset to a TM score of 0. All TM scores corresponding to non-homologous QT pairs above the 0.95 EP threshold are left unchanged. The purpose of this is to force the SVM to learn how to distinguish homologous matches from non-homologous matches when the homologous templates are not easily found.


4.5.4 Results and Discussion

The results for the four different SVMs are shown in Table 4.1 (page 196). The comparison between the CASP-like testing data and the full testing data shows some interesting results.

4.5.4.1 CASP-like Testing Results

Examining the CASP-like testing results, the SVMs appear to improve as more information is added to the input feature vectors. The first SVM purposely excluded any structural information, using just the 31 EP values for each QT pair, and was able to find 29 individual queries (out of 50) with 29.7% recall at 95% precision. The second, which used only structural information, showed an improvement over the first (31 individual queries correctly annotated, with 34.8% recall at 95% precision), though this is not a large improvement when one considers the relative size of the input vectors.

This benchmark showed that using structural data alone, with an implied structural clustering but with no indication of individual result confidence, improved fold recognition accuracy. This was perhaps the best method for using an SVM to emulate a standard 3D-JURY ensemble system. The third SVM used the same input vectors as the second, but added the individual EP values for each of the 310 background models; the number of individual queries correctly annotated at 95% precision rose to 33, and the recall at 95% precision rose to 40.0%. Finally, the fourth SVM merged the data from the first and third SVMs. The subsequent benchmark achieved 45.3% recall and correctly annotated 36 individual queries at 95% precision.

Since SVMs effectively produce classification `black boxes', it is very difficult to draw any definite, logical conclusions from these results. However, they do demonstrate (fairly conclusively) that the fold recognition accuracy of an ensemble can be enhanced by using a combination of structural information and standardised


confidence measures.

4.5.4.2 Full Testing Results

Interestingly, when the same SVM models were tested using the full testing set, there were noticeable changes in the results when compared to the results of the CASP-like testing. The first SVM appeared to perform slightly better when using the full testing data; however, as the size of the input feature vector increased, the accuracy of the ensemble not only fell, but fell by a disproportionately large amount. As a result, the SVMs that used smaller input vectors performed better than those that used larger input vectors.

These results were unexpected given the relative success of the SVMs on the CASP-like testing data. They suggested that, as more features were added to the input vectors, the input data from the full testing set and the input data from the CASP-like testing set became progressively less consistent; therefore, there was a greater chance that classification errors would occur. Since each SVM was capable of classifying a respectable proportion of the CASP-like testing data, it was clear that these results were not caused by over-learning (which is highly unlikely in SVMs). Instead, the results suggested that the SVMs were capable of efficiently learning relationships for specific types of data, but were not able to build a generalised model that could distinguish between easy examples and hard examples. Therefore, training an SVM on a CASP-like training set will build a model that can reliably classify similarly CASP-like testing examples. However, if one wishes to classify easy testing examples, a better classifier is more likely to be built using easy training data.


Description of SVM                             Input vector  % Recall at 95% Precision  Queries annotated above 95% Precision
                                               size          CASP-like    Full          CASP-like    Full
Result Model EP values (SVM 1)                 31            28.7         34.0          29           31
Result Model TM scores (SVM 2)                 9,610         34.8         26.3          31           27
Result Model EP values and TM scores (SVM 3)   9,920         40.0         23.5          33           24
Result Model EP values and TM scores and
  Background Model EP values (SVM 4)           9,951         45.3         21.1          36           22

Table 4.1: CASP-like SVM ensemble benchmarking results. All SVM analyses performed as part of this research used SVMLight (Joachims, 1999) with linear kernels and default parameters.

4.6 3D-JURY and 3D-COLONY Clustering

4.6.1 3D-JURY

One of the most common consensus methods used in fold recognition Meta servers is the 3D-JURY protocol (see § 1.6.3.2, page 76). In brief, a series of potential protein models are generated by a selection of fold recognition methods. These models are then structurally clustered and a single model, which is the most similar to all the other models, is chosen. This is analogous to a pairwise clustering procedure where the centroid of the largest (or densest) cluster is chosen as the final answer. The following describes the logic behind the protocol. Two different fold recognition systems may detect two distinct QT pairs (QT1 and QT2). T1 and T2 may be distinct homologues with highly similar structures; therefore, QT1 and QT2 will structurally superimpose well and contribute to the density of the correct cluster. The clustering procedure uses pairwise MaxSub comparisons between all models being assessed (see § 1.6.2.1, page 71); the final score for a given model is the sum of the number of Cα atom pairs that are within 3.5 Å of each other after optimal superposition of all the other models, including itself (see Figure 4.6, page 200). In selecting the models for assessment, the 3D-JURY system uses either the single top scoring model from a variety of methods, or the top n models (where n is usually


between one and ten):

Traditional 3D-JURY score = \sum_{i=1}^{n} \sum_{j=1}^{m} Sim(M_{i,j}, M)        (4.4)

where Sim(a, b) is the number of Cα atom pairs aligned by MaxSub between models a and b after optimal superposition (for greater detail see § 1.6.2.1, page 71), M is the model whose 3D-JURY score is being calculated, M_{i,j} is the i-th model from the j-th classifier, n is the number of models taken from each classifier, and m is the total number of classifiers.

The version of 3D-JURY used in this research is a modified version of the one developed by Ginalski et al. (2003); rather than using the sum of aligned Cα atom pairs as its final score, a normalised value between 0 and 1 is calculated by dividing this score by its highest possible value. The highest possible value for a given model is the number of residues in the query multiplied by the total number of models (including itself) in the 3D-JURY:

3D-JURY score = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(M_{i,j}, M)}{L_N \times n \times m}        (4.5)

where L_N is the length of the native structure of the query. Clustering with the 3D-JURY protocol has the extra advantage of taking into account multiple results for a given QT pair across many different fold recognition classifiers. When ranking the final scores, the highest individual score for a given QT pair is kept and other, lower scores for that QT pair are ignored. Using this method, not only is the occurrence of structurally similar templates found by the same fold recognition classifier taken into account when ranking them, but also the occurrence of structurally similar templates found in each of the different classifiers.
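The normalised score of Equation 4.5 can be sketched in a few lines (illustrative code; `sim` stands in for a MaxSub-style count of aligned Cα atom pairs after optimal superposition, and the data layout is hypothetical):

```python
# Normalised 3D-JURY score: the sum of similarities between a candidate model
# and every model in the pool, divided by its highest possible value
# (query length x number of models).

def jury_score(model, models_by_classifier, sim, query_length):
    """models_by_classifier: list of m lists, each holding n candidate models."""
    m = len(models_by_classifier)
    n = len(models_by_classifier[0])
    total = sum(sim(candidate, model)
                for classifier_models in models_by_classifier
                for candidate in classifier_models)
    return total / (query_length * n * m)   # normalised to lie between 0 and 1

# Toy example: 2 classifiers x 2 models, a 100-residue query, and a dummy
# similarity that aligns 80 residue pairs for every comparison.
score = jury_score("M", [["a", "b"], ["c", "d"]], lambda x, y: 80, 100)
assert score == 0.8   # 4 * 80 / (100 * 2 * 2)
```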

4.6.2 3D-COLONY

The above version of the 3D-JURY protocol is a refined method of clustering where the final score for a given QT pair is determined by its closeness to the centroid of all structurally clustered models produced by the ensemble. However, it still does not directly use the confidence measures provided by its constituent recognition methods. By its very nature, the 3D-JURY protocol ignores any measure of confidence given to any of its input models; the main reason for this is that, when used in a typical Meta server, it has to use models, from a variety of sources, that do not use comparable confidence measures. However, in a controlled system that uses a standardised scoring framework, such as EP values, these confidence measures are scaled and comparable; therefore, it is possible to take advantage of this additional information.

A simple and computationally tractable approach is to use EP scores both as a filter and as weighting terms for input to a 3D-JURY protocol. Thus, extremely low confidence matches are excluded from the ensemble to avoid pollution by noise, and those matches that are permitted into the system contribute to the structural clustering in accordance with their predicted confidence. Following on from this idea, a clustering technique, referred to in this work as 3D-COLONY, was developed. In their paper, Xiang et al. (2002) described an algorithm to account for the shape of the potential energy curve in the evaluation of conformational free energies of loops. They define the colony energy of a loop structure as the weighted sum of the energies of all the loops (including itself) that are nearby in conformational space. The weight of a given loop depends on its RMSD (root-mean-square deviation) from the loop whose colony energy is being calculated. The overall effect is to produce a smoothing function that favours conformations lying in broad energy basins and disfavours conformations in sharply defined troughs in energy space (which are unlikely to be found in nature); essentially it models favourable entropy. The idea is very similar to 3D-JURY, which calculates a score for a given model based on its relationship to other models in nearby conformational space, except that 3D-JURY only takes into account the proximity of surrounding models while ignoring their quality. As an extension of both these techniques, 3D-COLONY is defined as:

    3D-COLONY score = [ Σ_{i=1}^{n} Σ_{j=1}^{m} EP(M_{i,j}) × Sim(M_{i,j}, M) ] / (L_N × n × m)        (4.6)

where EP(a) is the EP value of model a. Essentially, 3D-COLONY can be viewed as a weighted 3D-JURY (see Figure 4.6, page 200). The MaxSub metric was used in the 3D-COLONY algorithm in order to make it comparable to the above implementation of 3D-JURY (see § 4.6.1, page 196).
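The EP-weighted scoring, together with the optional confidence filter used by the `EP-Threshold' ensembles, can be sketched as below. This is a hedged sketch: `sim`, `ep`, and the 0.7 cut-off mirror the quantities defined above, but the function names and interfaces are hypothetical.

```python
def colony_scores(models, sim, ep, query_length, ep_threshold=None):
    """3D-COLONY score (Eq. 4.6): an EP-weighted, normalised 3D-JURY.

    ep(a) returns the EP value of model a; setting ep_threshold
    (e.g. 0.7) gives the `EP-Threshold' variant, in which low-confidence
    models are excluded from the pool before clustering.
    """
    pool = [m for m in models if ep_threshold is None or ep(m) >= ep_threshold]
    if not pool:
        return {}
    max_value = query_length * len(pool)  # L_N * n * m
    return {m: sum(ep(o) * sim(m, o) for o in pool) / max_value
            for m in pool}
```

With `ep_threshold=None` and `ep` returning 1.0 for every model, this reduces exactly to the normalised 3D-JURY score of Equation 4.5.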

4.6.3  Constructing Ensembles

The greatest challenge in building an ensemble lies in finding the optimal combination of individual classifiers that will produce the most accurate results. This raises many important questions: how many classifiers are required? Is there sufficient variability between the classifiers? Is there a point at which the performance of an ensemble plateaus or degrades as more classifiers are added?

With 31 different recognition algorithms available from the `Dynamic' benchmark, it is computationally infeasible to perform a large-scale systematic analysis of every possible combination of classifiers that could be used to construct an ensemble. One systematic way to analyse an ensemble is to measure its accuracy incrementally as each individual classifier is included during training, and then to measure the accuracy of the results from the testing set using the same ensemble. There are three relatively simple schemes that could be used as approximations: the random approach, the rational approach, and the empirical or greedy approach.

It is important to understand the advantages and disadvantages of each of these approaches in order to select the most appropriate. The random approach is the simplest of the three; methods are selected from a pool and added to the ensemble in no particular order. The rational approach is based on the idea that an analysis of the correlation of errors between every pair of methods in the pool (using a given error statistic) can be performed, and, based on the results of the correlation analyses, the methods can be combined in such an order as maximises their orthogonality and (in theory) the performance of the ensemble. The final approach is the empirical or greedy method of combination. In this approach, the starting point is taken as the single best recognition method from a pool of n recognition methods. The remaining n − 1 recognition methods are combined, in turn, with the first and the best pair is chosen. This best pair is then combined with each of the remaining n − 2 recognition methods until the best triplet is found. This process continues until all methods have been included, or the performance deteriorates.

[Figure 4.6 appears here: panels (a) and (b) plot pseudo-energy against conformational space, with (b) contrasting the 3D-COLONY and 3D-JURY selections; panel (c) illustrates high, medium, and low structural similarity between models.]

Figure 4.6: Schematic illustration of the behaviour of the 3D-JURY and 3D-COLONY ensembles. Black circles represent models produced by the various algorithms in the ensemble. Red squares represent the highest scoring model, as judged by the algorithm in question, i.e. 3D-JURY or 3D-COLONY. (a) The general principles of structural clustering: the synergistic effect of multiple models allows the ensemble to avoid false positives -- i.e. there are more ways to be wrong than there are to be right. (b) 3D-JURY only takes into account the population of an area of conformational space, whereas 3D-COLONY also includes information regarding the energetics or, in this case, the confidence of the fold recognition match. (c) An alternative view of structural clustering in an ensemble: each model's final score is determined by the models that surround it in structural space. Surrounding models that are closer in structural space will contribute more to the final score than models that are further away.
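The empirical (greedy) combination procedure can be sketched as a forward selection loop. This is a sketch under stated assumptions: `evaluate(subset)` is a placeholder for a full 3D-JURY/3D-COLONY benchmark run returning the training-set recall at 95% precision for that subset of methods.

```python
def greedy_ensemble(methods, evaluate):
    """Empirical (greedy) ensemble construction.

    Start from the single best method, then repeatedly add whichever
    remaining method gives the best-scoring enlarged ensemble.  The full
    accuracy trace is returned so a training peak can be chosen later.
    """
    remaining = set(methods)
    ensemble, trace = [], []
    while remaining:
        best = max(remaining, key=lambda m: evaluate(ensemble + [m]))
        ensemble.append(best)
        remaining.remove(best)
        trace.append((tuple(ensemble), evaluate(ensemble)))
    return trace
```

Note that this sketch always runs to exhaustion and records every step, rather than stopping when performance deteriorates; the two are equivalent here because the peak is read off the trace afterwards.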

The problem with the random approach is that it fails to offer any major insight into the workings of the final ensemble; only by repeating the random selection is it possible to learn anything about how the individual recognition methods may contribute to the final combination. Given the limited potential of this approach, it was discarded as a means of analysis.

Superficially, the rational approach appears the best to take. However, as discussed in § 4.3.3 (page 179), rational design of ensembles is an ongoing problem. Diversity of methods has been recognised as an important characteristic in classifier combinations (Littlewood & Miller, 1989; Rosen, 1996). However, there is no strict definition of what is intuitively perceived as diversity, dependence, orthogonality, or complementarity of classifiers. A rigorous study by Kuncheva & Whitaker (2003) of 10 different diversity measures showed that there was no clear relationship between any of the diversity measures and the majority vote accuracy of an ensemble. This result can be interpreted as indicating: (i) there is only a weak relationship between diversity and accuracy; (ii) no sufficiently accurate measure of diversity is yet known; or (iii) diversity is a multivariate rather than a univariate concept (Whitaker & Kuncheva, 2003). Whichever interpretation is true, attempting to construct an ensemble by logical means was beyond the scope of this work, as its main aim was to build the best ensemble in the most practical way.

As a result, the empirical approach was the only viable option. The remainder of this chapter describes the results obtained from using the empirical approach to construct ensembles with the 3D-JURY and 3D-COLONY clustering algorithms. These processes were performed using the queries from the training set. The point at which the single best recall (at 95% precision) was achieved in the training ensemble will be referred to as the training peak. Since these analyses assume no prior knowledge of the testing set (as would be the case in a real-world scenario), the training peak was taken to be the point at which the result for the testing ensemble should also be read. If the same level of recall was achieved several times during training, the training peak was taken to be the point at which the ensemble was largest.
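The tie-breaking rule for the training peak can be made concrete with a small helper. This is a sketch: `trace` is assumed to be a per-step record of (ensemble, recall) pairs in order of increasing ensemble size, as produced during greedy training.

```python
def training_peak(trace):
    """Select the training peak from a trace of (ensemble, recall) pairs.

    The highest recall wins; if the same recall occurs several times,
    the tie is broken in favour of the largest (i.e. latest) ensemble.
    """
    peak_index, peak_recall = 0, float("-inf")
    for i, (_, recall) in enumerate(trace):
        if recall >= peak_recall:  # '>=' keeps the later, larger ensemble
            peak_index, peak_recall = i, recall
    return peak_index, peak_recall
```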

4.6.4  Results and Discussion

Since Meta servers usually cluster the top n results from each of their component servers, an analysis was done to test how well the 3D-JURY and 3D-COLONY clustering algorithms performed under the same conditions. Ensembles were built for both clustering algorithms for values of n ranging between one and ten. In addition, ensembles were built using only the models that had corresponding EP values of 0.7 or above (these are referred to as `EP-Threshold' ensembles). This set of tests was performed to see whether the amount and type of information used for each classifier altered the accuracy of the ensembles.

Table 4.2 (page 209) summarises the results from training and testing the ensembles using CASP-like data. The measurements are taken from each of the ensembles at their respective training peaks. Specifically, it shows the accuracy measurements at high precision (95% or above), both for the percentage recall and the number of individual query proteins with at least one correctly identified template. Figure 4.7 (page 213) shows the test results for each of the 3D-JURY and 3D-COLONY ensembles built; each subfigure shows the clustering method used for the testing results of a particular ensemble, indicating how the percentage recall at high precision, and the number of queries correctly annotated at high precision, change as the ensemble grows. Examining the graphical results shows that, for both 3D-JURY and 3D-COLONY, using the top one to top four input models from each classifier produces similar levels of performance. However, for the ensembles using the top five to top ten input models, 3D-COLONY noticeably outperforms 3D-JURY for both high precision recall and the number of individual queries annotated at high precision. This trend is summarised in Figure 4.8 (page 214). The ensembles with the highest recall at 95% precision were 3D-JURY Top Three (61.1%), 3D-JURY Top Four (59.9%), 3D-COLONY EP-Threshold (58.3%), and 3D-JURY EP-Threshold (58.3%). The ensembles with the highest number of individual queries (out of 50) found at 95% precision were 3D-JURY Top One (42), 3D-COLONY EP-Threshold (41), 3D-COLONY Top One (41), and 3D-JURY EP-Threshold (40).

The results from the various 3D-JURY and 3D-COLONY ensembles were particularly interesting. The sudden drop in accuracy for the 3D-JURY Top n ensembles when n is five or more, compared to the 3D-COLONY Top n ensembles (see Figure 4.8, page 214), seems to indicate a fundamental difference in the accuracy of the two algorithms. It is possible that the threshold for this change is dependent on the number of possible correct answers. The testing set queries were selected partly on the basis of each having at least two structurally similar superfamily members in the `Dynamic' fold library (with very few having more than two); therefore, it is possible that, for each of the ensembles clustering just the top n results (where n is between one and four), the correct results are always in the majority (or at least as frequent as the false positives) when the models are clustered. As soon as the top five results are clustered, the number of correct results is outnumbered by the total number of false positives by a ratio of approximately 2:3. Since the 3D-JURY algorithm fails to take into account the confidence values of the input models it clusters, the overall accuracy of the system falls as more false positives are added. However, the 3D-COLONY algorithm does take the empirical confidence of its input models into consideration, and so these ensembles are much more robust in the presence of large numbers of false positives. The anomalous result for the 3D-COLONY Top Nine ensemble can be explained by its failure to match the training peak to the testing peak (the training peak is indicated by the red arrow in Figure 4.7(r), page 212, and it clearly misses the point in the testing where the highest recall is achieved); otherwise, the relationship between the percentage recall, and the number of queries correctly annotated at high precision, for the 3D-COLONY ensembles would have been relatively consistent as the number of input models per classifier was increased.

This conclusion opens up the question of whether blind clustering of structures is fundamentally reliable. Evidently such clustering methods are capable of finding various QT relationships that would otherwise be missed by individual fold recognition algorithms. However, as shown by the 3D-COLONY results, existing Meta servers may not be achieving their full potential with regard to the total number of correct QT relationships that they are able to identify. The principle of `there are more ways to be wrong than there are to be right' still seems to be valid; however, potentially useful information that may help to distinguish between how right or wrong a model is should not be discarded. This conclusion is also supported by the similarities between the 3D-JURY EP-Threshold and 3D-COLONY EP-Threshold results. Even though 3D-JURY does not actively use the EP values of the input models in its algorithm, they are implicitly encoded in the fact that only high confidence models are included in the ensemble. Under these conditions, the 3D-JURY algorithm and the 3D-COLONY algorithm are virtually identical, which is reflected in the fact that the results are also virtually identical.

The key points of this analysis are:

· Use of a standardised scoring framework, to implicitly describe the confidence that a given fold recognition method has in its final models, noticeably increases the final accuracy of a clustering-based ensemble;

· Inclusion of additional low confidence input models into a blind clustering algorithm (such as 3D-JURY) can have a detrimental effect on the final accuracy of the ensemble.

4.7  Final Ensemble Analysis

From the previous analyses, it is clear that there are multifarious factors to be considered when constructing an ensemble: as shown in § 4.5 (page 185) an important characteristic of an effective ensemble should be the ability to work well on both difficult and easy input examples; and § 4.6 (page 196) showed that ensemble construction should be simple and make maximum use of limited resources. However, deciding on the single best ensemble method from a pool of more than 20 is largely subjective -- with two different quality metrics (i.e. percentage recall at high precision, and the number of queries correctly annotated at high precision), a judgement must be made as to which metric should be given priority.

4.7.1  Results and Discussion

4.7.1.1  CASP-like Testing Set

For the purposes of protein fold recognition, the most important metric is the number of queries that are correctly annotated: the number of individual correct homologous query-template (QT) pairs that are identified at high precision is unimportant provided that at least one correct relationship is found for each query. However, for other areas of bioinformatics that utilise structure prediction, a greater breadth of coverage may be crucial. For example, when predicting the function of a given query using remote structural homologies, it is better to have sampled templates from a wide range of sequence and structural space. This is because it is possible for remote templates to share similar structure but have different functions. Therefore, if a large number of possible templates are actively excluded from a fold recognition search, there is a greater chance that the templates that share the same function as the query protein will be missed. Fortunately, when tested on CASP-like data, the 3D-COLONY EP-Threshold ensemble produced large values for both recall and queries annotated at high precision, though for neither metric was it the highest. In order to gauge whether there was any significant difference between the ensembles that were tested using CASP-like data, a McNemar's test (McNemar, 1947) was performed on the results from the SVM ensembles, the 3D-JURY ensembles, and the 3D-COLONY ensembles. McNemar's test is a simple and standard approach for assessing statistical significance by evaluating the probability of the χ² statistic:

    χ² = (b − c)² / (b + c)        (4.7)

where b is the total number of times for which the prediction of an instance by the first method is wrong (i.e. it is a false positive or a false negative) and the prediction of the same instance by the second method is correct (i.e. it is a true positive or a true negative), and c is the total number of times for which the prediction of an instance by the first method is correct and the prediction of the same instance by the second method is wrong. For this research, Yates' continuity correction (Yates, 1934) was also used to correct for discontinuities and small data values:

    χ² = (|b − c| − 1)²  / (b + c)        (4.8)
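The corrected test can be implemented in a few lines. This is a minimal sketch of the standard test, not the exact code used in this work; the p-value uses the closed-form tail of the χ² distribution with one degree of freedom, P(X > x) = erfc(√(x/2)).

```python
import math

def mcnemar_yates(b, c):
    """McNemar's test with Yates' continuity correction (Eq. 4.8).

    b -- instances the first method gets wrong and the second gets right
    c -- instances the first method gets right and the second gets wrong
    Returns (chi-squared statistic, p-value with 1 degree of freedom).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Tail probability of a chi-squared variable with one d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Only the discordant counts b and c enter the statistic; instances on which both methods agree carry no information about which method is better.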

The test was performed for all CASP-like testing results: recall of correct homologous QT relationships at high precision, and correct annotation of individual queries at high precision.

The results of the McNemar's tests are shown in Figure 4.9 (page 215). They show that, when considering the number of queries correctly annotated at high precision, there was no significant difference between 3D-COLONY EP-Threshold and 3D-JURY Top One (the only ensemble to annotate a greater number of queries). Similarly, when considering the number of correct homologous QT relationships, there was no significant difference between the 3D-COLONY EP-Threshold ensemble and the 3D-JURY Top Three or 3D-JURY Top Four ensembles (the only two ensembles to identify a greater number of relationships). The results also show that there is no significant difference between the 3D-COLONY EP-Threshold ensemble and the 3D-JURY EP-Threshold ensemble; however, as discussed in § 4.6.4 (page 202), this is probably because, even though 3D-JURY does not actively use the empirical precision (EP) values of the input models in its algorithm, they are implicitly encoded in the fact that only high confidence models are included in the ensemble. When given the choice between implicitly or explicitly encoding model confidence into the ensemble algorithm, it is better to choose explicit encoding (i.e. 3D-COLONY).

4.7.1.2  Full Testing Set

When tested on the CASP-like testing set, the 3D-COLONY EP-Threshold ensemble produced a striking performance enhancement: identifying 58.0% of all correct homologous QT relationships, and correctly annotating 41 (out of a possible 50) testing queries. However, the results from the SVM benchmark (see § 4.5, page 185) conclusively illustrated the need for ensembles to be effective on both easy and difficult input examples. As the final part of the ensemble analysis, several of the 3D-JURY and 3D-COLONY ensembles were therefore retested on the full testing set. The results of this analysis are shown in Table 4.3 (page 216).

The final analysis was performed on the single best performing ensemble (3D-COLONY EP-Threshold), and the 3D-JURY and 3D-COLONY ensembles using both the top one and top ten results from each constituent method. The top one and top ten ensembles were chosen for this analysis because they represented the most commonly used types of selection procedure used by internet-based Meta servers. The 3D-COLONY EP-Threshold ensemble was able to detect 64.0% of all correct homologous QT relationships, and correctly annotate 42 queries, at 95% precision or above. In comparison to the improvement that the single best recognition algorithm (based on the training data -- method 025) has over PSI-BLAST, this represents a 29.6% increase in the number of correct homologous QT relationships, and a 46.2% increase in the number of queries correctly annotated, at high precision.

The above method is the correct approach for comparing algorithms, since results should be selected on the basis of unseen data. However, given the benefit of hindsight, the single best recognition method according to the testing results (method 030) was able to identify 56.7% of all QT relationships and correctly annotate 38 queries at high precision. Thus, in comparison to the improvement that this method has over PSI-BLAST, the ensemble provides a 20.5% increase in the number of correct homologous QT relationships, and a 26.7% increase in the number of correctly annotated queries.

Despite the high number of correctly annotated queries achieved by the 3D-COLONY EP-Threshold ensemble on the full testing set, it was still not as high as the 3D-COLONY Top One or the 3D-JURY Top One ensembles (though only by one and two queries respectively); however, a McNemar's test between the results from 3D-COLONY EP-Threshold and the other two ensembles showed that there was no statistically significant difference between the values (see Table 4.3, page 216). As mentioned previously, the clustering ensembles that use the single best result from each of their constituent recognition methods are intrinsically limited in their coverage of homologous relationships; therefore, even though they may be able to identify one or two correct homologies in order to annotate a given query, they cannot provide the breadth of coverage that is required for other bioinformatics tasks such as function prediction. The conclusion that can be drawn from this analysis is that the 3D-COLONY EP-Threshold ensemble is the overall best performer, since it performs well under both metrics and not significantly differently from the ensembles that produce the best individual results.


Description of Ensemble      % Recall at 95% Precision    Queries annotated above 95% Precision
EP-Weighted                  28.3                          6
3D-JURY Top 1                38.5                         42
3D-JURY Top 2                54.7                         37
3D-JURY Top 3                61.1                         37
3D-JURY Top 4                59.9                         37
3D-JURY Top 5                35.6                         17
3D-JURY Top 6                38.7                         16
3D-JURY Top 7                36.4                         14
3D-JURY Top 8                38.5                         15
3D-JURY Top 9                33.6                         15
3D-JURY Top 10               27.9                         11
3D-JURY EP-Threshold         58.3                         40
3D-COLONY Top 1              40.1                         41
3D-COLONY Top 2              44.1                         33
3D-COLONY Top 3              56.7                         38
3D-COLONY Top 4              55.9                         32
3D-COLONY Top 5              46.6                         29
3D-COLONY Top 6              56.3                         31
3D-COLONY Top 7              55.5                         31
3D-COLONY Top 8              50.2                         26
3D-COLONY Top 9              43.0                         16
3D-COLONY Top 10             47.8                         24
3D-COLONY EP-Threshold       58.3                         41

Table 4.2: CASP-like 3D-JURY and 3D-COLONY ensemble benchmark results. The table summarises the measurements taken from each of the ensembles, at their respective training peaks, when tested on CASP-like data. Specifically, it shows the accuracy measurements at high precision, both for the percentage recall and the number of individual query proteins with at least one correctly identified template. For a full description of each ensemble, see § 4.6.4 (page 202).

Development and Optimisation of an Enhanced Fold Recognition Ensemble

210

80 75 70 65 60 55 50

50 45 40 35

80 75 70 65 60 55

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 031 010 009 026 013 021 023 002 006 022 020 015 016 014 019 005 017 012 001 025 011 004 018 030 007 028 003 027 008 029 024

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 010 012 013 026 014 031 020 021 006 001 022 019 015 008 016 011 023 005 018 002 025 007 024 029 030 017 028 003 004 027

Added Methods

Added Methods

(a) 3D-JURY Top One

80 75 70 65 60 55 50 35 50 45 40 80 75 70 65 60 55

(b) 3D-COLONY Top One

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 013 010 021 025 014 020 022 006 005 015 016 004 011 018 002 026 001 031 008 012 017 003 028 019 024 023 027 029 007 030

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 010 009 013 021 026 011 005 018 022 020 014 016 025 019 006 012 031 029 002 001 024 008 017 015 023 028 004 003 030 027 007

Added Methods

Added Methods

(c) 3D-JURY Top Two

80 75 70 65 60 55 50 35 50 45 40 80 75 70 65 60 55

(d) 3D-COLONY Top Two

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 013 010 021 011 005 020 001 014 022 031 004 006 019 016 015 002 025 003 026 017 018 012 028 029 007 024 008 027 023 030

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 010 009 012 021 006 031 020 001 014 022 016 018 015 029 013 004 002 019 011 003 026 025 005 017 028 008 023 007 027 030 024

Added Methods

Added Methods

(e) 3D-JURY Top Three

(f) 3D-COLONY Top Three

Queries Found

Queries Found

Queries Found

Development and Optimisation of an Enhanced Fold Recognition Ensemble

211

80 75 70 65 60 55 50

50 45 40 35

80 75 70 65 60 55

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 010 021 013 025 020 006 014 016 002 001 004 022 031 017 015 019 018 011 003 005 008 029 027 028 026 007 024 030 023

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 010 021 006 013 022 002 014 020 018 026 019 001 016 004 031 025 005 011 003 015 017 028 029 024 023 008 007 030 027

Added Methods

Added Methods

(g) 3D-JURY Top Four

80 75 70 65 60 55 50 35 50 45 40 80 75 70 65 60 55

(h) 3D-COLONY Top Four

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 013 010 012 021 020 006 014 002 001 004 019 017 016 015 030 011 031 008 003 005 022 018 025 024 028 026 027 007 029 023

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 013 010 021 006 014 001 017 018 002 031 015 016 020 019 022 011 004 005 008 003 027 028 007 025 026 029 023 030 024

Added Methods

Added Methods

(i) 3D-JURY Top Five

80 75 70 65 60 55 50 35 50 45 40 80 75 70 65 60 55

(j) 3D-COLONY Top Five

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 013 010 021 011 006 002 020 019 031 015 016 014 004 001 012 017 003 008 022 005 018 030 027 025 007 028 026 029 024 023

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 013 010 025 001 006 002 004 020 018 019 015 016 021 026 031 017 014 003 008 011 005 022 028 029 027 007 030 024 023

Added Methods

Added Methods

(k) 3D-JURY Top Six

(l) 3D-COLONY Top Six

Queries Found

Queries Found

Queries Found

Development and Optimisation of an Enhanced Fold Recognition Ensemble

212

80 75 70 65 60 55 50

50 45 40 35

80 75 70 65 60 55

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 011 001 006 021 014 003 010 004 031 012 020 015 017 002 019 016 013 008 018 005 027 022 025 030 026 029 028 007 023 024

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 006 014 021 013 002 025 020 016 031 019 001 015 018 004 010 017 008 003 028 022 011 005 007 027 023 030 026 029 024

Added Methods

Added Methods

(m) 3D-JURY Top Seven

80 75 70 65 60 55 50 35 50 45 40 80 75 70 65 60 55

(n) 3D-COLONY Top Seven

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 021 013 005 002 003 020 015 016 001 011 019 004 017 014 012 006 031 018 010 027 008 030 022 028 025 026 024 029 023 007

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 006 014 020 013 021 002 011 005 015 016 019 018 001 031 017 008 003 004 010 022 027 028 025 007 026 030 023 024 029

Added Methods

Added Methods

(o) 3D-JURY Top Eight

80 75 70 65 60 55 50 35 50 45 40 80 75 70 65 60 55

(p) 3D-COLONY Top Eight

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 021 002 003 006 014 001 031 016 015 019 020 017 004 005 013 027 010 012 008 018 022 030 026 025 024 007 011 028 029 023

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 006 020 002 010 013 003 015 018 016 004 019 031 021 001 017 014 005 026 025 008 022 011 027 028 007 023 030 029 024

Added Methods

Added Methods

(q) 3D-JURY Top Nine

(r) 3D-COLONY Top Nine

Queries Found

Queries Found

Queries Found

Development and Optimisation of an Enhanced Fold Recognition Ensemble

213

80 75 70 65 60 55 50

50 45 40 35

80 75 70 65 60 55

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 021 010 006 004 014 020 016 003 019 015 031 001 002 008 027 017 005 022 013 026 018 012 009 023 007 025 011 024 028 030 029

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 002 021 013 003 001 004 017 020 015 018 016 008 031 014 019 010 006 011 026 005 022 027 007 028 024 025 023 030 029

Added Methods

Added Methods

(s) 3D-JURY Top 10

80 75 70 65 60 55 50 35 50 45 40 80 75 70 65 60 55

(t) 3D-COLONY Top 10

50 45 40 35 30 25 20 15 10 5 0

Queries Found

% Recall

45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 010 013 008 025 022 020 021 011 024 026 031 027 023 019 017 018 014 028 029 005 003 015 007 016 002 006 030 004 001

% Recall

30 25 20 15 10 5 0

50 45 40 35 30 25 20 15 10 5 0 Recall at 95% Precision Queries found at 95% Precision 009 012 010 008 031 013 025 022 021 011 020 016 014 019 030 024 023 026 018 015 028 029 027 017 003 005 004 002 006 007 001

Added Methods

Added Methods

(u) 3D-JURY EP-Threshold

(v) 3D-COLONY EP-Threshold

Figure 4.7: Benchmarking recall at 95% precision using CASP-like 3D-JURY and 3D-COLONY

ensembles. Each subfigure shows the clustering method used in the results for the particular ensemble. The results shown are for percentage recall when using the testing ensemble. The red solid lines show the percentage recall at 95% precision, and the green dotted lines show the number of individual queries correctly annotated at 95% precision as more classifier methods are added to each ensemble, in accordance with the training results. The training peaks are indicated by the vertical red arrows. See § 3.5 (page 142) for a full description of each classifier method. For a full description of each ensemble, see § 4.6.4 (page 202).


Development and Optimisation of an Enhanced Fold Recognition Ensemble

214

[Figure 4.8 panels -- plot data not recoverable from text extraction; x-axis: No. of Top Queries in 3D-JURY/3D-COLONY (1-10); series: 3D-JURY, 3D-JURY EP-Threshold, 3D-COLONY, 3D-COLONY EP-Threshold]

(a) Percentage recall found at 95% precision for all 3D-JURY and 3D-COLONY ensembles

(b) Queries correctly annotated at 95% precision for all 3D-JURY and 3D-COLONY ensembles

Figure 4.8: Percentage recall found and queries correctly annotated at 95% precision for all 3D-JURY and 3D-COLONY ensembles.


Figure 4.9: McNemar's test analyses of full testing results from all SVM, 3D-JURY, and 3D-COLONY ensembles. The top right half of the grid shows the results for McNemar's tests examining the difference between the number of correct homologous query-template (QT) relationships identified by the respective ensembles. The bottom left half of the grid shows the results for McNemar's tests examining the difference between the number of queries correctly annotated by the respective ensembles. White squares indicate that there is no statistically significant difference between the two ensembles. Images created with matrix2png (Pavlidis & Noble, 2003).


Description                                                              % Recall at       Queries annotated
                                                                         95% Precision     above 95% Precision

PSI-BLAST                                                                21.1 *            23 *
Single best recognition method from `Dynamic' benchmark training (025)   54.2 *            36
Single best recognition method from `Dynamic' benchmark testing (030)    56.7 *            38
3D-JURY Top 1                                                            25.5 *            44
3D-JURY Top 10                                                           23.0 *            41
3D-COLONY Top 1                                                          24.7 *            43
3D-COLONY Top 10                                                         55.9 *            34 *
3D-COLONY EP-Threshold                                                   64.0              42

Table 4.3: Ensemble and single-method benchmarking results using the full testing set. An overview of results using the full testing set with PSI-BLAST, the single best fold recognition methods from the training and testing benchmark (methods 025 and 030 respectively), the 3D-JURY Top 1 and Top 10 ensembles, the 3D-COLONY Top 1 and Top 10 ensembles, and the 3D-COLONY EP-Threshold ensemble. The figures represent the percentage of correct homologous query-template relationships detected at 95% precision (or above), and the number of query proteins correctly annotated at 95% precision (or above). Any value that is labelled with an asterisk is significantly different from the respective value for the 3D-COLONY EP-Threshold ensemble at the 5% significance level. The tests of statistical significance were McNemar's tests with Yates' continuity correction (see Equation 4.8, page 206).
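The corrected statistic of Equation 4.8 is not reproduced in this excerpt; the standard McNemar statistic with Yates' continuity correction can be sketched as follows (the example counts are hypothetical, not taken from the table above):

```python
from math import erf, sqrt

def mcnemar_yates(b, c):
    """McNemar's chi-squared statistic with Yates' continuity correction.

    b: items the first method got right and the second got wrong;
    c: the reverse. Concordant items do not enter the statistic.
    Returns (chi-squared, two-sided p-value at 1 degree of freedom).
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-squared with 1 d.o.f. via the normal CDF:
    # P(X > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(chi2) / sqrt(2.0))))
    return chi2, p

# e.g. 30 queries annotated only by ensemble A, 12 only by ensemble B
chi2, p = mcnemar_yates(b=30, c=12)
```

A p-value below 0.05 here corresponds to an asterisked entry in the table: the two ensembles disagree more often than chance alone would explain.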

Discussion

217

Chapter 5 Discussion

5.1 Summary

This chapter examines why the optimal ensemble produced using 3D-COLONY performs so well in comparison to the other ensembles; analyses reveal that the increase in accuracy is due partly to the underlying quality of the constituent algorithms, but mostly to the effect of filtering out incorrect answers as more algorithms are added to the ensemble. The analyses also show that the use of predicted secondary structure in fold recognition algorithms does not enable the system to detect remote homologues with more similar secondary structure distributions; instead, it helps by indirectly allowing the system to detect homologues that are more remote in sequence space.

5.2

Aims and Objectives

The primary objectives of this research were to build upon the lessons of the first five CASP evaluations to develop a new fold recognition server to succeed `3D-PSSM', and to attempt to understand the reasons why Meta predictors perform so well (compared to stand-alone methods) so as to be able to further improve their accuracy. The main focus of this research was the development of an enhanced method of determining the correct structural fold for a given query protein sequence, using an ensemble of optimised recognition algorithms. This research focused on developing a better way of identifying suitable template proteins from a databank of known structures; other features necessary to build a successful fold recognition server, such as domain boundary determination and model refinement, were not addressed.

5.3

Final Analysis

The final ensemble chosen for use in the `Phyre' server (3D-COLONY EP-Threshold) is capable of striking performance enhancements: when tested on CASP-like data, 41 (out of a possible 50) queries were correctly annotated, and 58.3% of all correct homologous query-template (QT) relationships identified, at 95% precision or above. When tested using the full testing set, the ensemble can detect 64.0% of all correct homologous QT relationships, and correctly annotate 42 query proteins. The single best recognition algorithm from the `Dynamic' benchmark (method 025; see § 3.5, page 142) was able to identify 54.2% of all correct homologous QT relationships, and annotated 36 query proteins, when tested on the full testing set at high precision. Under the same conditions, PSI-BLAST was able to identify 21.1% of all correct homologous QT relationships and correctly annotate 23 query proteins. In comparison to the improvement that the single best recognition algorithm has over PSI-BLAST, the `Phyre' ensemble identifies 29.6% more correct homologous QT relationships, and correctly annotates 46.2% more queries, at high precision.

5.3.1

Improved Precision

When trying to decipher the source of the performance gain achieved by using an ensemble, a clear pattern emerges from analysing the frequency with which different methods find the same individual QT pairs (see Figure 5.1, page 220). To perform this analysis, all QT pairs from the full testing set were examined, and the number of times they appeared above a given threshold (0.7 EP), across all 10 recognition methods within the `Phyre' ensemble, was recorded. Any QT pairs with a frequency of zero were ignored. When analysing the ensemble, it can be seen that, for true positive matches (i.e. correct homologous relationships of similar structure), a large number of QT relationships are found by all 10 methods, while only a small number of relationships are found by fewer methods. In contrast, for the false positive matches (i.e. erroneous hits), the vast majority of relationships are uniquely found by individual methods (i.e. each individual method finds different false positives). This illustrates one of the major factors contributing to the improvements seen in ensembles; it is not so much a case of different methods finding different homologies, but more that the mistakes made by individual methods are rarely made by the rest of the ensemble. Mistakes are often unique or rare, and, as a result, the ensemble can effectively filter out false positive results.
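The counting analysis described above can be illustrated with a toy sketch (the per-method hit lists are hypothetical; the real analysis ran over all QT pairs from the full testing set):

```python
from collections import Counter

# Hypothetical per-method hit lists: each method reports the QT pairs it
# scored above the confidence threshold (0.7 EP in the thesis analysis).
method_hits = [
    {("q1", "t1"), ("q2", "t2"), ("q3", "tX")},   # method A (q3-tX is a mistake)
    {("q1", "t1"), ("q2", "t2"), ("q4", "tY")},   # method B (q4-tY is a mistake)
    {("q1", "t1"), ("q2", "t2"), ("q5", "tZ")},   # method C (q5-tZ is a mistake)
]

# Count, for every QT pair, how many methods agree on it.
agreement = Counter(pair for hits in method_hits for pair in hits)

# Keep only pairs supported by more than one method: the unique mistakes
# vanish, while the relationships found by every method survive.
consensus = {pair for pair, n in agreement.items() if n >= 2}
```

Because each method's false positives are rarely shared, even this crude agreement filter discards all three erroneous hits while retaining both true relationships.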

5.3.2

Improved Model Quality

An additional area of interest is the difference in the quality of the final models produced by the optimal ensemble system, when compared to those produced by the single best recognition algorithm according to the testing data (method 030 from Table 3.3, page 151). To examine the differences between models produced by these two means, each of the models for the 50 testing set query proteins, produced at 95% confidence or above (for both the single best method and the optimal ensemble system), were compared to the true structure of the respective query protein. The 95% confidence point was used, instead of the 95% precision level, in order to make model selection as close to a `real world' scenario as possible. Figure 5.2 (page 221) shows the ensemble system produced more correct matches above 0.95 EP (155 compared to 124 from the single best method), and how most of the extra answers were between 0-2 Å and 2-4 Å RMSD to the actual answer. A Student's T-test performed on the distribution of both sets of models, split into 1 Å RMSD bins, shows a significant difference between the distributions at the 2% level. Thus


[Figure 5.1 -- histogram data not recoverable from text extraction; x-axis: Number of Methods that Share a Match (1-10); y-axis: Number of Occurrences; series: True Positives, False Positives]

Figure 5.1: Histogram illustrating the difference between the cumulative frequency distributions of true positives and false positives, shared across the 10 constituent algorithms in the best ensemble system, during testing. Query-template pairs with an Empirical Precision score of 0.7 or more were included in the analysis, i.e. reasonably confident matches.


[Figure 5.2 -- histogram data not recoverable from text extraction; x-axis: RMSD of Model (Angstroms), bins 0-2, 2-4, 4-6, 6-8, 8-10; y-axis: Number of Models; series: Single Best System, Best Ensemble]

Figure 5.2: Histogram showing the RMSD (root-mean-square deviation) of the models produced at 0.95 Empirical Precision (95% confidence) by the best ensemble system compared to those produced by the single best recognition algorithm. It can be seen that (overall) more models are produced by the ensemble (155 compared to 124 from the single best algorithm), and that these tend to be in the higher accuracy ranges of 0-2 Å and 2-4 Å. Using bin sizes of 1 Å, a Student's T-test shows a significant difference between the distribution of both sets of models at the 2% level.

the ensemble is able to enhance fold recognition to model remote homologies at a resolution that is high enough to be used in biological and structural analyses.

5.3.3

Sequence/Structural Features of Remote Homologies

A similar analysis was performed to see if part of the success of the `Phyre' ensemble could be explained by the nature of the 10 constituent algorithms. As before, every QT pair found by each of the 10 methods, at 0.7 EP (or above), was examined. It was expected that methods used by the ensemble, which included predicted secondary structure as part of their scoring function, would have a greater tendency to detect matches with greater secondary structure similarity than sequence-only methods. In testing, 64 QT pairs were uniquely found by methods using predicted secondary structure information; each of these QT pairs was structurally superimposed using MAMMOTH (Ortiz et al., 2002) and the percentage identity of secondary structure elements in these alignments was measured, i.e. the frequency with which helical residues were matched and strand residues were matched between a query and template protein. The average secondary structure percentage identity of these 64 pairs was 64% (standard deviation of 11%), and the average sequence identity was 12% (standard deviation of 4%). Interestingly, for the QT pairs that were uniquely found by the ensemble methods that use sequence information alone (20 QT pairs), the secondary structure percentage identity was remarkably similar at 66% (with a standard deviation of 12%), whereas the percentage sequence identity was far higher at 21% (with a standard deviation of 13%). A Student's T-test performed on both sets of QT pairs showed no significant difference between the percentage secondary structure identities, but it did show a significant difference between the percentage sequence identities at the 0.5% level. Therefore, it would appear that including predicted secondary structure information in a fold recognition system does not enable that system to detect remote homologues with more similar secondary structure distributions, as would have been expected. Instead, its power appears to lie in its indirect ability to detect homologues that are more remote in sequence space.

Secondary structure is represented in the ensemble system as a three-state vector of probabilities, and only predicted secondary structure is actually used. The secondary structure is predicted using PSIPRED (Jones, 1999a), which itself uses a window of 15 residue profile vectors as input to its neural network. Thus, one possible explanation for the observed results is that the three-state vector is capturing non-local properties of the surrounding sequence, and that these properties may be conserved, even when a particular amino acid position has very different mutational propensities to its aligned partner.

Conclusions

223

Chapter 6 Conclusions

6.1 Summary

This thesis has shown how the application of ensembles of fold recognition classifiers to remote homology detection can result in dramatic improvements in precision and recall. It is well known that simply pooling the results from a randomly selected set of classifiers is often far from optimal and sometimes produces inferior results to those of an optimised single method. One of the most difficult aspects of ensemble construction is the selection of the optimal subset of available methods to achieve maximum performance. This is because different methods are correlated to varying degrees in their output, which, in turn, has to be balanced with the individual baseline accuracies of each method. This complex compromise between accuracy and diversity in an ensemble has not been satisfactorily researched to date. However, this work demonstrates that the use of either greedy subset selection algorithms or SVMs (support vector machines) can help ameliorate this problem. SVMs are capable of capturing higher order features generated by combinations of methods and can automatically derive the appropriate weighting factors. However, they do not so readily produce models capable of distinguishing harder examples from easier ones. Greedy selection algorithms can give quick heuristic solutions to the combinatorial search of all possible subsets of methods so that accuracy and diversity are reasonably balanced.
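A greedy forward-selection loop of this kind can be sketched as follows (the scoring function, hit lists, and penalty weighting are hypothetical toys, not the benchmark used in this work):

```python
from collections import Counter

def greedy_ensemble_selection(methods, score_ensemble):
    """Greedy forward selection of ensemble members (illustrative sketch).

    At each step, add the candidate method that most improves the
    ensemble score on training data; stop as soon as no addition helps.
    `score_ensemble` is any callable that evaluates a subset of methods.
    """
    chosen, best_score = [], float("-inf")
    remaining = list(methods)
    while remaining:
        candidate = max(remaining, key=lambda m: score_ensemble(chosen + [m]))
        new_score = score_ensemble(chosen + [candidate])
        if new_score <= best_score:
            break
        chosen.append(candidate)
        remaining.remove(candidate)
        best_score = new_score
    return chosen, best_score

# Toy data: the query-template pairs each hypothetical method reports;
# pairs 1-5 are true homologies, 6-8 are method-specific mistakes.
true_pairs = {1, 2, 3, 4, 5}
hits = {"m1": {1, 2, 3, 4, 6}, "m2": {1, 2, 3, 5, 7}, "m3": {1, 2, 4, 5, 8}}

def score(subset):
    # Majority-vote consensus, rewarding true pairs, penalising false ones.
    counts = Counter(p for m in subset for p in hits[m])
    kept = [p for p, v in counts.items() if v > len(subset) / 2]
    true_kept = sum(1 for p in kept if p in true_pairs)
    return true_kept - 2 * (len(kept) - true_kept)

chosen, s = greedy_ensemble_selection(["m1", "m2", "m3"], score)
```

In this toy, the consensus of the diverse methods filters every unique mistake, so each greedy step improves the score and all three methods are selected; with highly correlated methods the loop would stop early instead.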

Structural clustering, such as that used in 3D-JURY and 3D-COLONY, is clearly a powerful tool: in comparison to the improvement that the single best fold recognition algorithm (method 025) had over PSI-BLAST, the final `Phyre' ensemble identifies 29.6% more correct homologous query-template (QT) relationships, and correctly annotates 46.2% more queries, when tested at 95% precision or above. It simultaneously harnesses the similarity of templates with the consistency of alignments. The analogy to structural entropy, in the 3D-COLONY energy view of structural clustering, helps to explain, albeit rather abstractly, why this is the case. A more intuitive and direct explanation is that `there are more ways to be wrong than there are to be right'; if two relatively diverse algorithms detect similar templates and generate similar alignments, then this strengthens the prediction. The fact that the component algorithms are selected using a greedy selection procedure means that it is possible to avoid some of the problems of reinforcing false positive predictions. This is also confirmed by the analysis in § 5.3 (page 218), which shows that many true positive matches are shared across methods, while false positives are either uniquely found by one method, or shared by a small subset of methods.

The controlled development of a stand-alone Meta server demonstrates a marked improvement over the ad hoc combinations often used by other Meta server systems. This is illustrated by the rise and fall of performance as methods are progressively added to an ensemble (see Figure 4.7(v), page 213). One conclusion that can be drawn from this is that it is vital to make careful choices in the selection of the subset of classifiers that are to be used in the ensemble, from the pool of classifiers available, or it is possible that the performance of the ensemble will be suboptimal.

The controlled development in this research permitted the use of a standardised scoring framework (the Empirical Precision measure), which, in turn, permitted the use of the 3D-COLONY approach. This approach is superior to the conventional 3D-JURY approach, as illustrated in Table 4.2 (page 209). The 3D-COLONY approach improves upon 3D-JURY by weighting models by their respective confidences; thus a small number of highly confident, self-similar models outweighs a large number of weakly predicted self-similar models. As shown in Table 4.2 (page 209), by simply restricting the input models used in the clustering algorithm to those produced at high confidence, an improvement can be gained over clustering algorithms that use an arbitrary selection of input models.
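The confidence-weighting idea can be illustrated with a deliberately simplified sketch (the function name, data, and similarity values are all hypothetical; this is not the actual 3D-COLONY implementation, which clusters full structural models):

```python
def colony_style_score(models, similarity):
    """Toy confidence-weighted clustering score.

    models: list of (model_id, confidence) pairs.
    similarity: dict mapping frozenset({id_a, id_b}) -> structural
    similarity in [0, 1] (1.0 for a model with itself).
    Each model is scored by summing the confidences of all models,
    weighted by how structurally similar they are to it, so a few
    confident, self-similar models can outweigh many weak ones.
    """
    scores = {}
    for mid, _ in models:
        scores[mid] = sum(
            conf * similarity[frozenset({mid, other})]
            for other, conf in models
        )
    return scores

models = [("a", 0.9), ("b", 0.85), ("c", 0.1)]
similarity = {
    frozenset({"a"}): 1.0, frozenset({"b"}): 1.0, frozenset({"c"}): 1.0,
    frozenset({"a", "b"}): 0.9,   # a and b share a fold
    frozenset({"a", "c"}): 0.1, frozenset({"b", "c"}): 0.1,
}
scores = colony_style_score(models, similarity)
best = max(scores, key=scores.get)
```

Model "a" wins because it is both confident itself and supported by another confident, structurally similar model, which is the intuition behind weighting 3D-JURY-style clustering by confidence.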

6.2

Outlook

The next test for the `Phyre' system is participation at the next CASP evaluation; this will provide a complete blind test of the system and allow a quantifiable comparison to take place with existing Meta servers.

Although the techniques described in this work are designed purely to identify protein folds, it is reasonable to expect that similar improvements in performance will be seen when techniques such as 3D-COLONY are applied to other areas of structure prediction, such as ab initio folding-derived decoy sets. Here, once again, a large pool of protein models is generated, each with its own associated energy. Initial trials of this method have suggested it may provide an improvement in decoy discrimination. Similarly, it may be possible to extend the use of these techniques to other fields of bioinformatic research, where ensembles could be used to combine the power of a number of high-quality, disparate computational methods. Initial experiments in relevant fields have already shown how the use of ensembles can help to increase overall accuracy; these include: identification of intron/exon boundaries (Wu & Chen, 2004), protein function prediction (Eisner et al., 2005), and protein-ligand docking (Wei et al., 2004). By using the principles derived from this research, it may be possible to enhance accuracy further through the application of methods for the systematic selection of constituent algorithms, coupled with the use of a standardised scoring framework (if the use of such a framework is possible).


It is particularly noteworthy that the explicit use of structural information, even in the form of predicted secondary structure, has been shown to significantly increase the ability of individual fold recognition algorithms to detect homologues that are more remote in sequence space (albeit at the expense of finding more closely related homologues; see § 5.3.3, page 221). The performance improvements observed using predicted secondary structures, taken together with the data indicating that secondary structure percentage identity does not differ significantly between the matches found with and without it, point to an interesting notion: non-local information about a sequence can be encoded by the use of an additional alphabet (helix, coil, and strand in this case). It may not be necessary to limit such encodings to secondary structures. Instead, it may be possible to use a larger alphabet, which encompasses information that correlates with the entire three-dimensional context of an amino acid. If predicted secondary structure performs well, yet the matches of secondary structures are not statistically significant, then this indicates that it may not be necessary to focus on structural features; instead, it may be possible to derive alphabets based on non-local sequence features.

Given the improvements in recognition that occur as a result of using higher level structural data, a natural extension of this research, as the subject of future work, would be to extend the recognition pipeline to include extra structural information, such as disorder prediction, coiled-coil prediction, and so on. Similarly, the use of this additional data could be the foundation for improving the system so that it could cope with potential quality issues, such as structure disorder or poorer quality crystal structures. Finally, there is the possibility that the system could be further improved by using a more diverse pool of (already existing) algorithms with a more sophisticated search of the space of possible ensemble components, such as simulated annealing.


6.3

Overall

Since 2000, the use of structure prediction Meta servers has threatened to stifle creative invention by removing any incentive to develop new, stand-alone recognition algorithms. Hopefully, the results described in this work will demonstrate the need for continued analysis of Meta techniques and emphasise the importance of using a variety of high quality individual constituent recognition methods. Therefore, far from signalling the obsolescence of the individual fold recognition technique, it is clear that Meta servers provide fresh motivation for developing high quality, varied, and (above all) novel algorithms.

6.4

Server Availability

The best ensemble system (the 3D-COLONY EP-Threshold ensemble) has been made available to the academic community as part of the `Phyre' protein structure prediction web server (http://www.imperial.ac.uk/phyre/).

Acknowledgements

228

Acknowledgements

I would like to express special thanks to Dr. Lawrence Kelley for his continuous support, patience, and mentorship; Alex Herbert for teaching me more about computing than I could have ever hoped to have learnt on my own; and Prof. Michael Sternberg for guidance and advice from the beginning.

In addition, I thank Dr. Keiran Fleming, Dr. Kate Brown, and Dr. Xiaodong Zhang for their help; everyone in the Imperial College Structural Bioinformatics Group for making my time here a most enjoyable experience; and the Medical Research Council for providing me the means and funding to carry out my research.

On a personal note, I would like to thank Louise Briggs, Anne Wright, Marcos Pedrini, Emma Casale, and Nina Callard for being the best friends anyone could hope to have; Ruth Loeffler and Tim Burness for being there when it really mattered (often on more than one occasion); the staff at the Gloucester Road Sandwich Shop and the High Street Kensington Piano Bar (especially Baz and Will); John Murrell for being the post-grad's guardian angel (in a rugby shirt); and every member of my family for collectively reminding me that if you don't stop to enjoy it, life might just pass you by.

Finally, I need to mention my special thanks to my girlfriend, Frances Housden, who (despite having a proper job and social life) has put her own life on hold (to wait for me to catch up), while continuously showing saint-like patience with the endless hours, the all-night coding benders, the non-stop working weekends, the sleep-deprivation, the deadline panics, my repeated habit of forgetting to eat for days at a time, the DIY jobs that ever remain unfinished, the extra laundry, and my mysterious inability to make Pasta alla Pescatore without making the kitchen smell like a fish market. Without her, all this simply would never have happened.


Appendices

Alignment Statistics

231

Appendix A Alignment Statistics

A.1 Derivation of E-values

When performing biological sequence alignments, a critical part of the process is being able to distinguish between alignments that are biologically significant and those that are biologically insignificant. Based on the assumed conservation of biologically important residues within homologous families, the simplest way of quantifying biological relevance is to calculate how likely a pair of sequences, of a given length and composition, is to produce a given alignment score purely by chance.

These statistical measurements can be represented as P-values or E-values. A P-value is the probability of seeing at least one score (T) greater than or equal to some score (x) in a database search of n sequences; P-values are represented as P(x, n). An E-value is the expected number of biologically insignificant sequence alignments with scores greater than or equal to a score x in a database search of n sequences; E-values are represented as E(x, n). These two measurements are easily obtained from the probability Pr(T ≥ x) that any single biologically unrelated sequence alignment scores better than or equal to x.

If T is the optimal ungapped subsequence (i.e. local) alignment score from comparing two random sequences of lengths m and n, then it can be shown that the distribution of T is well approximated by an Extreme Value Distribution (EVD), when m and n are sufficiently large (Karlin & Altschul, 1990; Dembo & Karlin, 1991a,b; Dembo et al., 1994). The cumulative distribution function of an EVD is given by:

Pr(T < x) = exp(-e^{-λ(x-µ)})   (A.1)

From Equation A.1, Pr(T ≥ x) can be easily calculated:

Pr(T ≥ x) = 1 - exp(-e^{-λ(x-µ)})   (A.2)

Using Equation A.2, the equation for E-values can be defined as:

E(x, n) = n Pr(T ≥ x)   (A.3)

Subsequently, if occurrences of biologically insignificant sequence alignments scoring at least x are regarded as rare, random events, then the P-value can be defined using the Poisson distribution with mean E(x, n): the probability of seeing at least one such alignment is one minus the Poisson probability of zero occurrences:

P(x, n) = 1 - e^{-n Pr(T ≥ x)} = 1 - e^{-E(x,n)}   (A.4)

Karlin & Altschul (1990) determined the appropriate EVD for local ungapped alignments analytically, using results described more fully by Dembo & Karlin (1991a). As seen in Equation A.1, the distribution has two parameters: the characteristic value µ, which can be thought of as the centre of the distribution, and the decay constant, or scale parameter (λ).

Analytical formulae are available to calculate λ and µ when considering the EVD formed by the distribution of T: λ is the unique positive solution for λ when:

Σ_{a,b} p_a p_b e^{λS(a,b)} = 1   (A.5)

where S(a, b) is the log-odds score for aligning amino acids a and b based upon the substitution matrix S, and p_a is the background probability of amino acid a. The value of µ is dependent on the lengths of the sequences being compared and is given by:

µ = (ln Kmn)/λ   (A.6)

where K is a constant given by a geometrically convergent series dependent on p_a and S(a, b) (Karlin & Altschul, 1990; Dembo & Karlin, 1991a,b; Dembo et al., 1994). It acts as a rescaling factor, which takes into account the non-independence of the number of local alignments between two sequences of lengths m and n. λ is essentially a natural scaling factor, which ensures that the probabilities contained in the substitution matrix sum to 1. If the values of the substitution matrix S were originally derived as log-odds scores, as in Equation 1.3 (page 30), then:

e^{λS(a,b)} = q_{a,b} / (p_a p_b)

where q_{a,b} is the probability of finding a and b paired together in a biologically significant alignment. When aligning two sequences the final score is the summation of the log-odds scores for each amino acid pairing, which effectively multiplies the combined probabilities of every amino acid match or substitution that takes place:

∏_{a,b} e^{λS(a,b)} = ∏_{a,b} q_{a,b} / (p_a p_b)

e^{λ Σ_{a,b} S(a,b)} = e^{λx} = ∏_{a,b} q_{a,b} / (p_a p_b)

which can be rearranged to give:

e^{-λx} = ∏_{a,b} p_a p_b / q_{a,b}

This value gives the probability of finding an alignment between two sequences with a final score of x, based on the substitution matrix S. Therefore, the number of biologically insignificant (i.e. unrelated) alignments, of total score x, between two random sequences, of lengths m and n, would be expected to be:

E[x] = Kmn e^{-λx}   (A.7)

where E[x] is the expected number of unrelated alignments. If the number of unrelated alignments with a final score greater than or equal to x is regarded as random, then the probability of seeing at least one score (T ) greater than or equal to x can be approximately represented as a Poisson distribution with a mean equal to the expected value in Equation A.7:

Pr(T ≥ x) = 1 - e^{-E[x]}   (A.8)

It should now be clear why the value of λ must be the solution to Equation A.5; it is only when this value is used that the probabilities used to construct Equation A.7 will be accurate. An alternative approach to deriving Equation A.8 is described below. For a sequence of length n, let M(n) denote the maximal segment pair (MSP) score, i.e. the highest score obtained from an ungapped Smith-Waterman alignment between two sequences (see § 1.4.4.2, page 36, for more on calculating MSPs). It can be proved that M(n) is of the order (ln n)/λ (Karlin et al., 1990). Subtracting this centering value from M(n), it becomes necessary to find the limiting probability distribution for:

M̃(n) = M(n) - (ln n)/λ

where M̃(n) is the maximal segment score for a sequence of length n, centered about its order.


Theorem 1 from Karlin & Altschul (1990) states that M̃(n) has the close approximating distribution:

Pr(M̃(n) > x) ≈ 1 - exp(-Ke^{-λx})

Again, K is a multiplicative factor that corrects for the non-independence of possible starting points for local alignments. The number of separate high-scoring segment pairs (HSPs; see § 1.4.4.2, page 36) -- i.e. ungapped alignments with scores exceeding ((ln n)/λ) + x, where x is a real parameter, that are sufficiently far apart -- is closely approximated by a Poisson distribution with a mean (i.e. expected) count of Ke^{-λx}. Thus, the probability of finding k or more distinct segments, with a score greater than or equal to x, is closely approximated by:

1 - e^{-y} Σ_{i=0}^{k-1} y^i / i!   (A.9)

where y = Kne^{-λx} (n is added to take into account every possible starting position along the sequence). For k = 1, Equation A.9 reduces to:

Pr(T ≥ x) = 1 - exp(-Kne^{-λx})

When considering the same situation for two sequences of lengths m and n, the formula becomes:

Pr(T ≥ x) = 1 - exp(-Kmne^{-λx})   (A.10)

Therefore, by equating Equations A.10 and A.2 the result is:

e^{-λ(x-µ)} = Kmne^{-λx}

the right-hand side of which has been previously derived in Equation A.7. Hence, after rearrangement, it is possible to derive Equation A.6:

µ = (ln Kmn)/λ

All the previous equations (and Equation 1.3, page 30) are only true for ungapped local alignments, since the probabilities of indels (insertions or deletions) are not included in the calculations. However, computational analyses have suggested that they can reliably model gapped local alignments as well (Karlin & Altschul, 1990, 1993; Altschul & Gish, 1996; Altschul et al., 2001). The scores from gapped local alignments of randomly generated sequences fit well to EVDs when using a standard substitution matrix (e.g. BLOSUM62) and affine gap penalties (Altschul & Gish, 1996; Altschul & Koonin, 1998). The scale parameters λ and µ cannot be derived by analytical means for gapped local alignments, but they can be estimated (Mott, 1992, 2000; Eddy, 1997).
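Under the analytic (ungapped) theory above, λ is the unique positive root of Equation A.5, and E-values and P-values then follow from Equations A.7 and A.8. A minimal numerical sketch, using a toy two-letter alphabet and an illustrative value of K (both hypothetical, not real amino acid statistics):

```python
from math import exp

def solve_lambda(S, p, lo=1e-6, hi=10.0, iters=100):
    """Bisection for the unique positive root of Equation A.5:
    sum over a, b of p_a * p_b * e^(lambda * S(a,b)) = 1.
    Requires a scoring matrix with negative expected score but at
    least one positive entry, which guarantees exactly one such root."""
    def f(lam):
        return sum(p[a] * p[b] * exp(lam * S[a][b]) for a in p for b in p) - 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def evalue(score, m, n, K, lam):
    """Equation A.7: expected number of unrelated alignments scoring >= score."""
    return K * m * n * exp(-lam * score)

def pvalue(score, m, n, K, lam):
    """Equation A.8: probability of at least one such alignment (Poisson)."""
    return 1.0 - exp(-evalue(score, m, n, K, lam))

# Toy two-letter alphabet: match +1, mismatch -2 (hypothetical scores).
p = {"A": 0.5, "B": 0.5}
S = {"A": {"A": 1, "B": -2}, "B": {"A": -2, "B": 1}}
lam = solve_lambda(S, p)   # analytic answer here: ln((1 + sqrt(5))/2)

# Significance of a score of 40 between two sequences of length 300,
# with an illustrative K of 0.1 (K is matrix-dependent in practice).
E = evalue(40, 300, 300, K=0.1, lam=lam)
P = pvalue(40, 300, 300, K=0.1, lam=lam)
```

For this toy matrix Equation A.5 reduces to 0.5e^λ + 0.5e^{-2λ} = 1, whose positive root is the logarithm of the golden ratio; for small E-values, P ≈ E, as Equation A.8 implies.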

A.2

Maximum Likelihood Fitting of Extreme Value Distributions

There are several ways of estimating the scale parameters λ and µ for an extreme value distribution (EVD); these usually require fitting an EVD to a data set. One of the quickest and most efficient ways of doing this is Maximum Likelihood Fitting, which uses distribution probabilities to empirically fit data to an EVD. An essay by Eddy (1997) gives a complete description of how to do this; this section will give just a brief overview.

The probability density function (PDF) of an extreme value distribution is given by:

Pr(x) = λ exp[-λ(x - µ) - e^{-λ(x-µ)}]   (A.11)

The likelihood of drawing n samples (x_i) from an extreme value distribution with parameters λ and µ is:

Pr(x_1, x_2, . . . , x_n | λ, µ) = ∏_{i=1}^{n} λ exp[-λ(x_i - µ) - e^{-λ(x_i - µ)}]

which is readily rearranged to give:

Pr(x_1, x_2, . . . , x_n | λ, µ) = λ^n exp[-Σ_{i=1}^{n} λ(x_i - µ) - Σ_{i=1}^{n} e^{-λ(x_i - µ)}]

The log likelihood log L(λ, µ) = log Pr(x_1, x_2, . . . , x_n | λ, µ) is:

log L(λ, µ) = n log λ - Σ_{i=1}^{n} λ(x_i - µ) - Σ_{i=1}^{n} e^{-λ(x_i - µ)}   (A.12)

^ In a maximum likelihood fitting approach, the goal is to find the values and µ that maximise the log likelihood log L(, µ). The most efficient way of doing this ^ is to take partial derivatives from log L(, µ) so that a directed optimisation can be done:

\[
\frac{\partial \log L}{\partial \mu} = n\lambda - \lambda \sum_{i=1}^{n} e^{-\lambda(x_i-\mu)} \tag{A.13}
\]

\[
\frac{\partial \log L}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n}(x_i - \mu) + \sum_{i=1}^{n}(x_i - \mu)e^{-\lambda(x_i-\mu)} \tag{A.14}
\]

The maximum likelihood estimates λ̂ and µ̂ are the solutions to \(\partial \log L / \partial \lambda = 0\) and \(\partial \log L / \partial \mu = 0\). By setting Equation A.13 to 0, it can be rearranged to give µ̂ in terms of λ̂:

\[
n\hat{\lambda} - \hat{\lambda}\sum_{i=1}^{n} e^{-\hat{\lambda}(x_i-\hat{\mu})} = 0
\]
\[
\sum_{i=1}^{n} e^{-\hat{\lambda}(x_i-\hat{\mu})} = n
\]
\[
e^{\hat{\lambda}\hat{\mu}} \sum_{i=1}^{n} e^{-\hat{\lambda}x_i} = n
\]
\[
e^{-\hat{\lambda}\hat{\mu}} = \frac{1}{n}\sum_{i=1}^{n} e^{-\hat{\lambda}x_i} \tag{A.15}
\]
\[
\hat{\mu} = -\frac{1}{\hat{\lambda}} \log\left(\frac{1}{n}\sum_{i=1}^{n} e^{-\hat{\lambda}x_i}\right) \tag{A.16}
\]

Equation A.15 can now be substituted into Equation A.14 and rearranged to give λ̂ in terms of the samples x1, x2, . . . , xn:

\[
\frac{n}{\hat{\lambda}} - \sum_{i=1}^{n}(x_i - \hat{\mu}) + \sum_{i=1}^{n}(x_i - \hat{\mu})e^{-\hat{\lambda}(x_i-\hat{\mu})} = 0
\]
\[
\frac{n}{\hat{\lambda}} - \sum_{i=1}^{n} x_i + n\hat{\mu} + e^{\hat{\lambda}\hat{\mu}} \sum_{i=1}^{n}(x_i - \hat{\mu})e^{-\hat{\lambda}x_i} = 0
\]

Substituting \(e^{\hat{\lambda}\hat{\mu}} = n / \sum_{i=1}^{n} e^{-\hat{\lambda}x_i}\) (from Equation A.15):

\[
\frac{n}{\hat{\lambda}} - \sum_{i=1}^{n} x_i + n\hat{\mu} + \frac{n\sum_{i=1}^{n}(x_i - \hat{\mu})e^{-\hat{\lambda}x_i}}{\sum_{i=1}^{n} e^{-\hat{\lambda}x_i}} = 0
\]
\[
\frac{n}{\hat{\lambda}} - \sum_{i=1}^{n} x_i + n\hat{\mu} + \frac{n\sum_{i=1}^{n} x_i e^{-\hat{\lambda}x_i}}{\sum_{i=1}^{n} e^{-\hat{\lambda}x_i}} - n\hat{\mu} = 0
\]

and dividing through by n gives the target equation:

\[
\frac{1}{\hat{\lambda}} - \frac{1}{n}\sum_{i=1}^{n} x_i + \frac{\sum_{i=1}^{n} x_i e^{-\hat{\lambda}x_i}}{\sum_{i=1}^{n} e^{-\hat{\lambda}x_i}} = 0 \tag{A.17}
\]

Equation A.17 is `well-behaved' in the vicinity of the root -- i.e. the derivative of the equation exists, so a plotted curve would be smoothly graded. Since the derivative exists, a rapid Newton-Raphson algorithm can be applied to find the root.

In order to use Newton-Raphson, the derivative of Equation A.17 must first be found. Recall the quotient rule from the algebra of derivatives:

\[
\left(\frac{f}{g}\right)' = \frac{g \cdot f' - f \cdot g'}{g^2}
\]

Starting from Equation A.17, let:

\[
f = \sum_{i=1}^{n} x_i e^{-\hat{\lambda}x_i}, \qquad g = \sum_{i=1}^{n} e^{-\hat{\lambda}x_i}
\]

and by differentiating f and g with respect to λ:

\[
f' = -\sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda}x_i}, \qquad g' = -\sum_{i=1}^{n} x_i e^{-\hat{\lambda}x_i}
\]

Equation A.17 can now be differentiated with respect to λ to give:

\[
\frac{d}{d\lambda} = \frac{\left(\sum_{i=1}^{n} x_i e^{-\hat{\lambda}x_i}\right)^2}{\left(\sum_{i=1}^{n} e^{-\hat{\lambda}x_i}\right)^2} - \frac{\sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda}x_i}}{\sum_{i=1}^{n} e^{-\hat{\lambda}x_i}} - \frac{1}{\hat{\lambda}^2} \tag{A.18}
\]

To implement the Newton-Raphson algorithm, the key equations are Equations A.16, A.17, and A.18. The algorithm is:

· Guess λ (because the function is smooth, even random guesses should work well).

· Apply Newton-Raphson to find the value λ̂ that satisfies Equation A.17:

  – Calculate the target function f and its first derivative f' at λ, using Equation A.17 to calculate f and Equation A.18 to calculate f'.

  – If f is within some absolute tolerance of 0 (e.g. 10⁻⁷), then stop; λ̂ has been found.

  – Else, estimate a new λ_new = λ_old − f/f', and do another iteration.

· Use the value of λ̂ to calculate µ̂ from Equation A.16.

A.3

Fitting Censored Data to Extreme Value Distributions

On occasion, it may be desirable to fit only a subset of sample data to an extreme value distribution. For example, the left (low scoring) tail of the distribution formed by the sample data may be contaminated with very poor-scoring sequences that do not conform to the true EVD; therefore, it would be counterproductive to include these data. If, a priori, any samples from the data set where xi < c are not included in the fit, then the data set is described as Type I Censored (Eddy, 1997).

If c represents the censoring value and z the number of censored samples, then z is equal to the number of samples xi where xi < c.

For the observed samples, their probability is still given by Equation A.11 (page 236). For the censored samples, their probability is given by Equation A.1 (page 232) -- the cumulative distribution at c gives the probability of xi < c. Therefore, the probability of a data set of n observed samples and z censored samples is:

\[
\Pr(x_1, x_2, \ldots, x_n, x_{n+1}, \ldots, x_{n+z} \mid \lambda, \mu) = \left\{\prod_{i=1}^{n} \lambda \exp\left[-\lambda(x_i - \mu) - e^{-\lambda(x_i-\mu)}\right]\right\} \left[\exp\left(-e^{-\lambda(c-\mu)}\right)\right]^z \tag{A.19}
\]

The log likelihood log L(λ, µ) is then:

\[
\log L(\lambda, \mu) = n \log \lambda - z e^{-\lambda(c-\mu)} - \lambda \sum_{i=1}^{n}(x_i - \mu) - \sum_{i=1}^{n} e^{-\lambda(x_i-\mu)} \tag{A.20}
\]

Again, the objective, as in Appendix A.2 (page 236), is to find the values λ̂ and µ̂ that maximise the likelihood. The form of Equation A.20 is almost identical to that of Equation A.12 (page 237); therefore, the procedure to follow is almost identical:

\[
\frac{\partial \log L}{\partial \mu} = n\lambda - z\lambda e^{-\lambda(c-\mu)} - \lambda \sum_{i=1}^{n} e^{-\lambda(x_i-\mu)} \tag{A.21}
\]

\[
\frac{\partial \log L}{\partial \lambda} = \frac{n}{\lambda} + z(c-\mu)e^{-\lambda(c-\mu)} - \sum_{i=1}^{n}(x_i - \mu) + \sum_{i=1}^{n}(x_i - \mu)e^{-\lambda(x_i-\mu)} \tag{A.22}
\]

Setting Equation A.21 to 0 and solving for µ̂ in terms of λ̂ gives:

\[
\hat{\mu} = -\frac{1}{\hat{\lambda}} \log\left[\frac{1}{n}\left(z e^{-\hat{\lambda}c} + \sum_{i=1}^{n} e^{-\hat{\lambda}x_i}\right)\right] \tag{A.23}
\]

Substituting Equation A.23 into Equation A.22, and setting Equation A.22 to 0, gives the target equation:

\[
\frac{1}{\hat{\lambda}} - \frac{1}{n}\sum_{i=1}^{n} x_i + \frac{zc\,e^{-\hat{\lambda}c} + \sum_{i=1}^{n} x_i e^{-\hat{\lambda}x_i}}{z e^{-\hat{\lambda}c} + \sum_{i=1}^{n} e^{-\hat{\lambda}x_i}} = 0 \tag{A.24}
\]

Finally, the first derivative of Equation A.24 with respect to λ is:

\[
\frac{d}{d\lambda} = \frac{\left(zc\,e^{-\hat{\lambda}c} + \sum_{i=1}^{n} x_i e^{-\hat{\lambda}x_i}\right)^2}{\left(z e^{-\hat{\lambda}c} + \sum_{i=1}^{n} e^{-\hat{\lambda}x_i}\right)^2} - \frac{zc^2 e^{-\hat{\lambda}c} + \sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda}x_i}}{z e^{-\hat{\lambda}c} + \sum_{i=1}^{n} e^{-\hat{\lambda}x_i}} - \frac{1}{\hat{\lambda}^2} \tag{A.25}
\]

Given n observed data samples x1, x2, . . . , xn, the number of censored samples z, and the censoring value c, the maximum likelihood estimates λ̂ and µ̂ for the censored data can be found using the same Newton-Raphson method used for the uncensored data. This can be done by simply substituting Equations A.23, A.24, and A.25 for Equations A.16, A.17, and A.18.


Appendix B Constructing a Profile with PSI-BLAST

B.1 Multiple Alignment Construction

The process of constructing a profile with PSI-BLAST (Altschul et al., 1997) begins with a multiple sequence alignment (M ) from a BLAST search. All sequences with an E-value better than or equal to a given E-value cut-off (default is 0.01) are potentially included. Sequences that are identical to the aligned portion of the query are removed from M completely, and, for template sequences with very high sequence identity (> 97% in PSI-BLAST version 2.0, and > 93% in PSI-BLAST version 2.1), only a single representative sequence is kept. Pairwise alignment columns that involve gap characters inserted into the query are simply ignored, so M is exactly the same length as the query itself. Since all the alignments in M are local alignments (see § 1.4.2.2, page 27), the columns of M may include varying numbers of sequences, and many columns may include nothing but the query (usually near the termini).

As will be discussed, the profile scores for a given column of the alignment M should not only depend on the residues appearing in the column itself, but also on those in other columns. The first step in calculating this dependency is to prune M to a simpler reduced alignment. This pruning is done independently for each column of M , so the reduced multiple alignment (MC ) will generally vary from one column C to the next. To construct MC , it is first necessary to identify the subset of sequences (R) that should appear in it. For a given column C, a sequence will be included in R if it contributes a residue to M in column C. The columns of MC are then defined to be those columns of M in which all the sequences of R are represented either by a residue or a gap character. The regions of a template sequence that occur outside of the local alignment to the query are not counted as being represented in M . The final, reduced multiple alignment MC has residues or gap characters in every row and column.
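The pruning rules can be sketched directly. The row and column conventions here are assumptions made for illustration ('-' for a gap within a local alignment, ' ' for regions outside it); this is not PSI-BLAST's actual code:

```python
def reduced_alignment(M, C):
    """Build the reduced alignment M_C for column C from a multiple alignment M,
    given as a list of equal-length strings over residues, '-' and ' '."""
    # R: sequences that contribute a residue (not a gap, not unaligned) at column C
    R = [row for row in M if row[C] not in ('-', ' ')]
    # keep only columns where every sequence in R shows a residue or a gap character
    cols = [j for j in range(len(M[0])) if all(row[j] != ' ' for row in R)]
    return [''.join(row[j] for j in cols) for row in R]
```

For example, with `M = ["ACDEF", "AC-E ", "  DEF"]`, column 2 excludes the second sequence (it has a gap there), and the surviving columns are those covered by both remaining sequences.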

B.2

Sequence Weights

When constructing a profile from a multiple alignment, it is a mistake to give all sequences an equal weight: a large set of closely related sequences provides little additional information compared to a single sequence, yet its sheer size may swamp the results and reduce the influence of a small number of more divergent sequences. One way to avoid this problem is to weight each sequence according to how much extra information it provides. PSI-BLAST uses the weighting scheme from Henikoff & Henikoff (1994) and applies it to MC . It treats gap characters as a 21st distinct character, and any columns consisting of identical residues are ignored in calculating weights. Each residue, in each column in MC , is assigned a weight equal to 1/(r × s), where r is the number of different residues in that particular column and s is the number of times that particular residue type appears in the column. For each sequence in MC , the contributions from each position are summed to give a sequence weight. So, if in a column of MC , r different residues are represented, a residue represented in only one sequence contributes a score of 1/r to that sequence, whereas a residue represented in s sequences contributes a score of 1/(r × s) to each of the s sequences. Finally, the sequence weights are normalised so that their sum across all sequences is equal to 1. It should be noted that, because MC is potentially different for every column in M , the sequence weights for each sequence in MC will depend on the value of C. Each column in M will now have a set of observed (i.e. weighted) residue frequencies (fi ), where i refers to the residue type, as well as a raw frequency.

In constructing a profile, not only are a column's observed (i.e. weighted) residue frequencies important, but also the effective number of independent observations it constitutes; a column consisting of a single valine and a single isoleucine carries different information than one consisting of five independently occurring instances of each. As a result, it is necessary to estimate the relative number of independent observations (NC ) constituted by the alignment MC . A simple count of the number of sequences in MC is a poor measure, as 10 identical sequences imply fewer independent observations than do 10 divergent sequences. PSI-BLAST uses a simple estimate for NC ; it takes the mean number of different residue types, including gap characters, observed in the columns of MC (i.e. the number of different residue types in each column of MC , averaged over the number of columns). This value of NC saturates at 21; however, it is a good enough approximation for the purposes of profile construction, as it is not the absolute value of NC that is important, but rather its relative value from one column to the next (see Equation B.2, page 246). NC is essentially the same measure of alignment variability as that proposed by Henikoff & Henikoff (1996).
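Both the position-based weights and the NC estimate are simple to state in code. This is a sketch under the conventions above (gaps as a 21st character, invariant columns skipped), with hypothetical function names, not PSI-BLAST's implementation:

```python
from collections import Counter

def sequence_weights(MC):
    """Position-based sequence weights (Henikoff & Henikoff, 1994) for a reduced
    alignment MC, given as a list of equal-length strings."""
    weights = [0.0] * len(MC)
    for col in zip(*MC):
        counts = Counter(col)
        r = len(counts)                  # number of distinct characters in column
        if r == 1:
            continue                     # columns of identical residues are ignored
        for k, ch in enumerate(col):
            weights[k] += 1.0 / (r * counts[ch])   # the 1/(r x s) contribution
    total = sum(weights)
    if total == 0.0:                     # fully invariant alignment: equal weights
        return [1.0 / len(MC)] * len(MC)
    return [w / total for w in weights]  # normalise so the weights sum to 1

def effective_observations(MC):
    """PSI-BLAST's estimate N_C: the mean number of distinct character types
    (including gaps) per column of MC; it saturates at 21."""
    cols = list(zip(*MC))
    return sum(len(set(col)) for col in cols) / len(cols)
```

For `MC = ["AAC", "AAC", "AGC"]`, only the middle column carries weighting signal, so the divergent third sequence receives twice the weight of each of the two identical ones.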

B.3

Target Frequency Estimation

The individual scores of the profile are calculated in a similar way to those in a substitution matrix (see Equation 1.3, page 30), using the formula log(Qi /pi ), where pi is the background probability for residue i, and Qi is the estimated probability for residue i, to be found in the column of interest. It only remains to define Qi .


Given a multiple sequence alignment involving a large number of independent sequences, the estimate of Qi for a specific column should simply be the observed frequency of residue i in that column. However, as well as sequence weighting, other issues that complicate estimating Qi include small sample size (Schneider et al., 1986), and prior knowledge of relationships among residues. PSI-BLAST uses the data-dependent pseudocount method introduced by Tatusov et al. (1994). This method uses the prior knowledge of amino acid relationships described by any given substitution matrix (e.g. BLOSUM62) to generate residue pseudocount frequencies gi , which are averaged with the observed frequencies fi to estimate Qi .

Specifically, for a given column C, gi is constructed using the formula:

\[
g_i = \sum_{j} \frac{f_j}{p_j} \, q_{ij} \tag{B.1}
\]

where j is every residue other than i, qij is the implicit probability of substituting residue i for residue j (and vice versa) according to the definition of the original substitution matrix, and pj is the background probability for residue j (see Equation 1.3, page 30). The effect of gi is to take into account the probability that every other residue type (j), that may occur in a given column, is there as a result of it being substituted in place of i. Intuitively, those residues favoured by the substitution matrix, to align with the residues actually observed, receive high pseudocount frequencies. Qi is then estimated by:

\[
Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta} \tag{B.2}
\]

where α and β are the relative weights given to observed and pseudocount residue frequencies. In order to reduce the scores in columns where nothing has been aligned to the query sequence, α = NC − 1. β remains an arbitrary pseudocount parameter; the larger its value, the greater the emphasis given to prior knowledge of residue relationships, compared with the observed residue frequencies. PSI-BLAST uses the empirically derived value of 10 for β.
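Equations B.1 and B.2 reduce to a few dictionary operations. In this sketch (hypothetical names, not PSI-BLAST's code) the sum in B.1 is taken over every residue type j, including i itself, following Tatusov et al. (1994):

```python
def pseudocounts(f, p, q):
    """Equation B.1: g_i = sum_j (f_j / p_j) * q_ij for one column. f maps
    residues to observed frequencies, p to background probabilities, and
    q to the substitution matrix's implicit target probabilities."""
    return {i: sum(f[j] / p[j] * q[i][j] for j in f) for i in f}

def target_frequencies(f, g, N_C, beta=10.0):
    """Equation B.2: blend observed and pseudocount frequencies, with
    alpha = N_C - 1 and the empirically derived beta = 10."""
    alpha = N_C - 1.0
    return {i: (alpha * f[i] + beta * g[i]) / (alpha + beta) for i in f}
```

On a toy two-letter alphabet with a column containing only 'A', the pseudocounts pull the estimated target frequency of 'A' well below the observed frequency of 1, by an amount governed by β relative to NC − 1.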


Appendix C Training and Testing Sets

C.1 Training Set

Table C.1: Training set SCOP Unique Identifiers, fold names, and superfamily names. These query proteins were selected from SCOP30, version 1.65, following the filtering procedure outlined in § 2.4.2 (page 113). SCOP Unique ID 15149 16485 18436 18516 Globin-like Histone-fold DEATH domain DBL homology domain (DHdomain) 18747 19127 21649 DNA-glycosylase alpha-alpha superhelix Immunoglobulin-like sandwich


Fold Name

Superfamily Name

Globin-like Histone-fold DEATH domain DBL homology domain (DHdomain) DNA-glycosylase ARM repeat betaImmunoglobulin



SCOP Unique ID 23157

Fold Name

Superfamily Name

C2 domain-like

C2

domain

(Calcium/lipid-

binding domain, CaLB) 24218 Concanavalin lectins/glucanases 24441 25331 25667 ISP domain OB-fold Reductase/isomerase/elongation factor common domain 26879 26926 27007 28052 Acid proteases Double psi beta-barrel PH domain-like Single-stranded left-handed betahelix 28214 31409 Barrel-sandwich hybrid Flavodoxin-like Single hybrid motif Class I glutamine Acid proteases ADC-like PH domain-like Trimeric LpxA-like enzymes A-like Concanavalin lectins/glucanases ISP domain Nucleic acid-binding proteins Riboflavin synthase domain-like A-like

amidotransferase-like 31553 Ferredoxin reductase-like, CFerredoxin reductase-like, C-

terminal NADP-linked domain 32717 Rhodanese/Cell phosphatase 33078 33726 34067 Thioredoxin fold Ribonuclease H-like motif PRTase-like cycle control

terminal NADP-linked domain Rhodanese/Cell phosphatase Thioredoxin-like Ribonuclease H-like PRTase-like


cycle

control



SCOP Unique ID 34179

Fold Name

Superfamily Name

S-adenosyl-L-methioninedependent methyltransferases

S-adenosyl-L-methioninedependent methyltransferases Nucleotide-diphospho-sugar transferases alpha/beta-Hydrolases Periplasmic binding protein-like I Thiolase-like

34505

Nucleotide-diphospho-sugar transferases

34638 35662 35986 37904

alpha/beta-Hydrolases Periplasmic binding protein-like I Thiolase-like FAD-linked reductases, C-

FAD-linked

reductases,

C-

terminal domain 38436 38547 FKBP-like Thioesterase/thiol dehydrase-isomerase 38889 40196 Enolase N-terminal domain-like CO dehydrogenase flavoprotein C-domain-like ester

terminal domain Chitinase insertion domain Thioesterase/thiol dehydrase-isomerase Enolase N-terminal domain-like FAD/NAD-linked dimerisation main reductases, doester

(C-terminal)

40528 41111

SH2-like ATPase domain of HSP90

SH2 domain ATPase domain of HSP90

chaperone/DNA II/histidine kinase 41388 41481 DNA clamp ATP-grasp

topoisomerase

chaperone/DNA II/histidine kinase DNA clamp Glutathione

topoisomerase

synthetase

ATP-

binding domain-like 41708 Protein kinase-like (PK-like) Protein kinase-like (PK-like)




SCOP Unique ID 41748 41878

Fold Name

Superfamily Name

FAD-binding domain Ntn hydrolase-like

FAD-binding domain N-terminal nucleophile aminohydrolases (Ntn hydrolases)

42064 43116 43788 44389 44810 44823

Metallo-dependent phosphatases DNA/RNA polymerases Transmembrane beta-barrels Snake toxin-like Cystine-knot cytokines Complement ule/SCR domain control mod-

Metallo-dependent phosphatases DNA/RNA polymerases Porins Snake toxin-like Cystine-knot cytokines Complement ule/SCR domain RING/U-box Sialidases (neuraminidases) FAD/NAD(P)-binding domain MHC antigen-recognition domain gamma-Crystallin-like SH3-domain Anticodon-binding Class II aaRS domain of control mod-

45325 59694 59869 59907 60317 60436 60624

RING/U-box 6-bladed beta-propeller FAD/NAD(P)-binding domain MHC antigen-recognition domain gamma-Crystallin-like SH3-like barrel Anticodon-binding domain-like

60939

Class II aaRS and biotin synthetases

Class II aaRS and biotin synthetases Spectrin repeat LDH C-terminal domain-like


60953 61401

Spectrin repeat-like LDH C-terminal domain-like



SCOP Unique ID 62453

Fold Name

Superfamily Name

UDPGlycosyltransferase/glycogen phosphorylase

UDPGlycosyltransferase/glycogen phosphorylase Serpins EF-hand Carbohydrate phosphatase EGF/Laminin

62591 62700 63223 65287

Serpins EF Hand-like Carbohydrate phosphatase Knottins (small inhibitors, toxins, lectins)

65714

Single-stranded beta-helix

right-handed

Pectin lyase-like

65738 65805 65828 65915 66808 66830

Phosphorylase/hydrolase-like Alkaline phosphatase-like Cupredoxin-like Gelsolin-like Nudix Leucine-rich repeat, LRR (righthanded beta-alpha superhelix)

Zn-dependent exopeptidases Alkaline phosphatase-like Cupredoxins Actin depolymerizing proteins Nudix L domain-like

68308 68341 68584 70251

7-bladed beta-propeller HMG-box Ferritin-like Composite domain of metallodependent hydrolases

WD40-repeat HMG-box Ferritin-like Composite domain of metallodependent hydrolases SNARE-like (Trans)glycosidases


70633 71425

Profilin-like TIM beta/alpha-barrel



SCOP Unique ID 71591 72887 73912

Fold Name

Superfamily Name

SIS domain Lipocalins Anticodon-binding domain of a subclass of class I aminoacyltRNA synthetases

SIS domain Lipocalins Anticodon-binding domain of a subclass of class I aminoacyltRNA synthetases 4-helical cytokines ADP-ribosylation Cytochrome P450 L30e-like Nucleotidylyl transferase

74192 76224 76364 76392 76643

4-helical cytokines ADP-ribosylation Cytochrome P450 Bacillus chorismate mutase-like Adenine nucleotide alpha

hydrolase-like 77177 77181 77643 77656 78482 79387 Homing endonuclease-like GroES-like Acyl carrier protein-like Ribokinase-like vWA-like beta-lactamase/transpeptidaselike 79778 80407 81194 83200 beta-hairpin-alpha-hairpin repeat Phosphoglycerate mutase-like ClpP/crotonase Glyoxalase/Bleomycin resisHoming endonucleases GroES-like ACP-like Ribokinase-like vWA-like beta-lactamase/transpeptidaselike Ankyrin repeat Phosphoglycerate mutase-like ClpP/crotonase Glyoxalase/Bleomycin resis-

tance protein/Dihydroxybiphenyl dioxygenase

tance protein/Dihydroxybiphenyl dioxygenase




SCOP Unique ID 83447 83555

Fold Name

Superfamily Name

alpha/alpha toroid DNA/RNA-binding bundle 3-helical

Six-hairpin glycosyltransferases "Winged helix" DNA-binding domain Galactose mutarotase-like Periplasmic binding protein-like II PDZ domain-like Sm-like ribonucleoproteins Serum albumin-like Thiamin diphosphate-binding

83619 83765

Supersandwich Periplasmic binding protein-like II

84534 84808 85341 85729

PDZ domain-like Sm-like fold Serum albumin-like Thiamin diphosphate-binding

fold (THDP-binding) 86306 87409 87455 Cystatin-like Galactose-binding domain-like Nuclear receptor ligand-binding domain 87517 87765 SAM domain-like DNA breaking-rejoining enzymes

fold (THDP-binding) NTF2-like Galactose-binding domain-like Nuclear receptor ligand-binding domain SAM/Pointed domain DNA breaking-rejoining enzymes


C.2

Testing Set

Table C.2: Testing set SCOP Unique Identifiers, fold names, and superfamily names. These query proteins were selected from SCOP30, version 1.65, following the filtering procedure outlined in § 2.4.2 (page 113).

SCOP Unique ID 16293 17087

Fold Name

Superfamily Name

S13-like H2TH domain lambda repressor-like DNA-

S13-like H2TH domain lambda repressor-like DNA-

binding domains 17397 CH domain-like

binding domains Calponin-homology domain, CHdomain

17460 18130 18389 19214 23648

Met repressor-like Retroviral matrix proteins Cyclin-like alpha-alpha superhelix Lipase/lipooxygenase (PLAT/LH2 domain) domain

Met repressor-like Retroviral matrix proteins Cyclin-like ENTH/VHS domain Lipase/lipooxygenase (PLAT/LH2 domain) PEBP-like FMN-binding split barrel dUTPase-like Formate/glycerate dehydrogedomain

23704 25747 28353 31348

PEBP-like FMN-binding split barrel beta-clip Flavodoxin-like

nase catalytic domain-like 33323 33865 Restriction endonuclease-like Aminoacid dehydrogenase-like, Restriction endonuclease-like Aminoacid dehydrogenase-like,

N-terminal domain 35311 Formate dehydrogenase/DMSO

N-terminal domain Formate dehydrogenase/DMSO

reductase, domains 1-3

reductase, domains 1-3




SCOP Unique ID 35600

Fold Name

Superfamily Name

Chelatase-like

"Helical backbone" metal receptor

37441 37499 38008

IL8-like HIT-like Cystatin-like

Interleukin 8-like chemokines HIT-like Copper amine oxidase, domains 1 and 2

38783 38806

dsRBD-like Eukaryotic type KH-domain

dsRNA-binding domain-like Eukaryotic type KH-domain

(KH-domain type I) 40255 40331 NADH oxidase/flavin reductase Zincin-like

(KH-domain type I) NADH oxidase/flavin reductase Metalloproteases ("zincins"), catalytic domain

40798

Acyl-CoA (Nat)

N-acyltransferases

Acyl-CoA (Nat) BPTI-like

N-acyltransferases

44566 44950 59161 59541 59771 59866

BPTI-like Fibronectin type I module Cytochrome c LuxS/MPP-like metallohydrolase FKBP-like Bifunctional inhibitor/lipid-

Fibronectin type I module Cytochrome c LuxS/MPP-like metallohydrolase FKBP-like Bifunctional inhibitor/lipid-

transfer protein/seed storage 2S albumin 61605 Inhibitor of apoptosis (IAP) repeat

transfer protein/seed storage 2S albumin Inhibitor of apoptosis (IAP) repeat




SCOP Unique ID 68065

Fold Name

Superfamily Name

Phosphoglucomutase, first 3 domains

Phosphoglucomutase, first 3 domains

68325

Adenine

nucleotide

alpha

Adenine

nucleotide

alpha

hydrolase-like 68857 6-phosphogluconate dehydroge-

hydrolases-like 6-phosphogluconate dehydroge-

nase C-terminal domain-like 70088 71729 Glycosyl hydrolase domain Glutathione S-transferase (GST), C-terminal domain 72771 73720 SH3-like barrel Family A G protein-coupled

nase C-terminal domain-like Glycosyl hydrolase domain Glutathione S-transferase (GST), C-terminal domain Chromo domain-like Family A G protein-coupled

receptor-like 75836 76391 76639 Nucleotidyltransferase alpha/beta knot Sigma2 domain of RNA polymerase sigma factors 78909 Glyceraldehyde-3-phosphate dehydrogenase-like, domain 79189 79211 81255 84568 86449 Cytochrome b5 SMAD/FHA domain RuvA C-terminal domain-like DNA glycosylase CUB-like C-terminal

receptor-like Nucleotidyltransferase alpha/beta knot Sigma2 domain of RNA polymerase sigma factors Glyceraldehyde-3-phosphate dehydrogenase-like, domain Cytochrome b5 SMAD/FHA domain UBA-like DNA glycosylase Spermadhesin, CUB domain


C-terminal



SCOP Unique ID 86629 87729

Fold Name

Superfamily Name

PIN domain-like MurD-like peptide ligases,

PIN domain-like MurD-like peptide ligases,

peptide-binding domain 88462 Ribonuclease Rh-like

peptide-binding domain Ribonuclease Rh-like


References

Aizerman, M., Braverman, E. & Rozonoer, L. (1964) Theoretical foundations of the potential function method in pattern recognition learning. Automat. Rem. Control, 25, 821–837.

Altschul, S. F., Bundschuh, R., Olsen, R. & Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucl. Acids Res., 29, 351–361.

Altschul, S. F. & Erickson, B. W. (1986) Optimal sequence alignment using affine gap costs. B. Math. Biol., 48, 603–616.

Altschul, S. F. & Gish, W. (1996) Local alignment statistics. Methods Enzymol., 266, 460–480.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Altschul, S. F. & Koonin, E. V. (1998) Iterated profile searches with PSI-BLAST: a tool for discovery in protein databases. Trends Biochem. Sci., 23, 444–447.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389–3402.

Barton, G. J. (1996) Protein Structure Prediction: A Practical Approach. Oxford, UK: Oxford University Press.


Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S. R., Griffiths-Jones, S., Howe, K. L., Marshall, M. & Sonnhammer, E. L. L. (2002) The Pfam protein families database. Nucl. Acids Res., 30, 276–280.

Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Finn, R. D. & Sonnhammer, E. L. (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucl. Acids Res., 27, 260–262.

Bateman, A., Lachlan, C., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L. L., Studholme, D. J., Yeats, C. & Eddy, S. R. (2004) The Pfam protein families database. Nucl. Acids Res., 32, D138–D141.

Bates, P. A., Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. E. (2001) Enhancement of protein modelling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins, 45 Suppl. 5, 39–46.

Bates, P. A. & Sternberg, M. J. E. (1999) Model building by comparison at CASP3: using expert knowledge and computer automation. Proteins, 37 Suppl. 3, 47–54.

Bauer, E. & Kohavi, R. (1999) An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, 36, 105–139.

Bellman, R. (1957) Dynamic Programming. Princeton University Press, Princeton, New Jersey, USA.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Wheeler, D. L. (2005) GenBank. Nucl. Acids Res., 33, D34–D38.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000) The Protein Data Bank. Nucl. Acids Res., 28, 235–242.


Bonneau, R., Strauss, C. E., Rohl, C. A., Chivian, D., Bradley, P., Malmstrom, L., Robertson, T. & Baker, D. (2002) De novo prediction of three-dimensional structures for major protein families. J. Mol. Biol., 322, 65–78.

Bonneau, R., Tsai, J., Ruczinski, I. & Baker, D. (2001a) Functional inferences from blind ab initio protein structure predictions. J. Struct. Biol., 134, 186–190.

Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C. E. M. & Baker, D. (2001b) Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins, 45 Suppl. 5, 119–126.

Bork, P. & Gibson, T. J. (1996) Applying motif and profile searches. Methods Enzymol., 266, 162–184.

Bowie, J. U., Lüthy, R. & Eisenberg, D. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170.

Bradley, P., Chivian, D., Meiler, J., Misura, K. M. S., Rohl, C. A., Schief, W. R., Wedemeyer, W. J., Schueler-Furman, O., Murphy, P., Schonbrun, J., Strauss, C. E. M. & Baker, D. (2003) Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins, 53 Suppl. 6, 457–468.

Brändén, C. I. & Jones, T. A. (1990) Between objectivity and subjectivity. Nature, 343, 687–689.

Breiman, L. (1994) Heuristics of instability in model selection. Technical report, Statistics Department, University of California, Berkeley, USA.

Breiman, L. (1996) Bagging predictors. Machine Learning, 24, 123–140.

Breiman, L. (1999) Combining Artificial Neural Nets. London, UK: Springer-Verlag.

Brenner, S. E., Koehl, P. & Levitt, M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucl. Acids Res., 28, 254–256.


Bryant, S. H. & Lawrence, C. E. (1993) An empirical energy function for threading protein sequence through the folding motif. Proteins, 16, 92–112.

Bujnicki, J. M., Elofsson, A., Fischer, D. & Rychlewski, L. (2001) Structure prediction meta server. Bioinformatics, 17, 750–751.

Capriotti, E., Fariselli, P., Rossi, I. & Casadio, R. (2004) A Shannon entropy-based filter detects high-quality profile-profile alignments in searches for remote homologues. Proteins, 54, 351–360.

Chandonia, J. M., Hon, G., Walker, N. S., Lo Conte, L., Koehl, P., Levitt, M. & Brenner, S. E. (2004) The ASTRAL compendium in 2004. Nucl. Acids Res., 32, D189–D192.

Chandonia, J. M., Walker, N. S., Lo Conte, L., Koehl, P., Levitt, M. & Brenner, S. E. (2002) ASTRAL compendium enhancements. Nucl. Acids Res., 30, 260–263.

Chen, Z. (2003) Assessing sequence comparison methods with the average precision criterion. Bioinformatics, 19, 2456–2460.

Chivian, D., Kim, D. E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C. M. E., Bonneau, R., Rohl, C. A. & Baker, D. (2003) Automated prediction of CASP-5 structures using the Robetta server. Proteins, 53 Suppl. 6, 524–533.

Chothia, C. & Lesk, A. M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823–826.

Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978) Atlas of Protein Sequence and Structure, vol. 5 Suppl. 3. Washington, DC, USA: Natl. Biomed. Res. Found. pp. 345–353.

Dembo, A. & Karlin, S. (1991a) Strong limit theorems of empirical functionals for large exceedances of partial sums of I.I.D. variables. Ann. Probab., 19, 1737–1755.


Dembo, A. & Karlin, S. (1991b) Strong limit theorems of empirical distributions for large segmental exceedances of partial sums of Markov variables. Ann. Probab., 19, 1756–1767.

Dembo, A., Karlin, S. & Zeitouni, O. (1994) Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Probab., 22, 2022–2039.

Devos, D. & Valencia, A. (2000) Practical limits of function prediction. Proteins, 41, 98–107.

Di Francesco, V., Geetha, V., Garnier, J. & Munson, P. J. (1997) Fold recognition using predicted secondary structure sequences and Hidden Markov models of protein folds. Proteins, 29 Suppl. 1, 123–128.

Dietterich, T. G. (2000) Ensemble methods in machine learning. In Proceedings of the 1st International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, pp. 1–15. Springer-Verlag, New York, USA.

Domingues, F. S., Koppensteiner, W. A., Jaritz, M., Prlic, A., Weichenberger, C., Wiederstein, M., Floeckner, H., Lackner, P. & Sippl, M. J. (1999) Sustained performance of knowledge-based potentials in fold recognition. Proteins, Suppl. 3, 112–120.

Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge, UK.

Eddy, S. R. (1997) Maximum likelihood fitting of extreme value distributions. http://www.genetics.wustl.edu/eddy/publications/.

Eddy, S. R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

Efron, B. & Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman and Hall.

Eisenberg, D., Lüthy, R. & Bowie, J. U. (1997) VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol., 277, 396–404.


Eisner, R., Poulin, B., Szafron, D., Lu, P. & Greiner, R. (2005) Improving protein function prediction using the hierarchical structure of the gene ontology. In Proceedings of 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 354–363. IEEE Computer Society Press, Los Alamitos, California, USA.

Fetrow, J. S., Godzik, A. & Skolnick, J. (1998) Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. J. Mol. Biol., 282, 703–711.

Fischer, D. (2000) Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pac. Symp. Biocomput., 5, 119–130.

Fischer, D. (2003) 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins, 51, 434–441.

Fischer, D. & Eisenberg, D. (1996) Protein fold recognition using sequence-derived predictions. Protein Sci., 5, 947–955.

Fischer, D., Rychlewski, L., Dunbrack, R. L., Ortiz, A. R. & Elofsson, A. (2003) CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins, 53 Suppl. 6, 503–516.

Fleming, K., Müller, A., MacCallum, R. M. & Sternberg, M. J. (2004) 3D-GENOMICS: a database to compare structural and functional annotations of proteins between sequenced genomes. Nucl. Acids Res., 32, D245–D250.

Flöckner, H., Braxenthaler, M., Lackner, P., Jaritz, M., Ortner, M. & Sippl, M. J. (1995) Progress in fold recognition. Proteins, 23, 376–386.

Flöckner, H., Domingues, F. S. & Sippl, M. J. (1997) Protein folds from pair interactions: a blind test in fold recognition. Proteins, 29 Suppl. 1, 129–133.

Freund, Y. (1990) Boosting a weak learning algorithm by majority. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory pp. 202–216 Morgan Kaufmann, Palo Alto, California, USA.
Freund, Y. (1996) Boosting a weak learning algorithm by majority. Information and Computation, 121, 256–285.
Freund, Y. & Schapire, R. E. (1995) A decision-theoretic generalization of online learning and an application to boosting. In Proceedings of the 2nd European Conference on Computational Learning Theory pp. 23–37 Springer-Verlag.
Frishman, D. & Argos, P. (1995) Knowledge-based secondary structure assignment. Proteins, 23, 566–579.
Ginalski, K., Elofsson, A., Fischer, D. & Rychlewski, L. (2003) 3D-JURY: a simple approach to improve protein structure predictions. Bioinformatics, 19, 1015–1018.
Ginalski, K., Grishin, N. V., Godzik, A. & Rychlewski, L. (2005) Practical lessons from protein structure prediction. Nucl. Acids Res., 33, 1874–1891.
Ginalski, K., Pas, J., Wyrwicz, L. S., von Grotthuss, M., Bujnicki, J. M. & Rychlewski, L. (2003) ORFeus: detection of distant homology using sequence profiles and predicted secondary structure. Nucl. Acids Res., 31, 3804–3807.
Ginalski, K. & Rychlewski, L. (2003) Protein structure prediction of CASP5 comparative modeling and fold recognition targets using consensus alignment approach and 3D assessment. Proteins, 53 Suppl. 6, 410–417.
Ginalski, K., von Grotthuss, M., Grishin, N. V. & Rychlewski, L. (2004) Detecting distant homology with Meta-BASIC. Nucl. Acids Res., 32, W576–W581.
Gonzalez, C., Langdon, G. M., Bruix, M., Galvez, A., Valdivia, E., Maqueda, M. & Rico, M. (2000) Bacteriocin AS-48, a microbial cyclic polypeptide structurally
and functionally related to mammalian NK-lysin. Proc. Natl. Acad. Sci. USA, 97, 11221–11226.
Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705–708.
Gough, J. & Chothia, C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucl. Acids Res., 30, 268–272.
Gribskov, M. (1994) Profile analysis. Meth. Mol. Biol., 25, 247–266.
Gribskov, M., McLachlan, A. D. & Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355–4358.
Gribskov, M. & Robinson, N. L. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20, 25–33.
Gutta, S., Huang, J., Takacs, B. & Wechsler, H. (1996) Face recognition using ensembles of networks. In Proceedings of the 13th International Conference on Pattern Recognition pp. 50–54 IEEE Computer Society Press, Los Alamitos, California, USA.
Han, S., Lee, B., Yu, S. T., Jeong, C., Lee, S. & Kim, D. (2005) Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics, 21, 2667–2673.
Hansen, L. & Salamon, P. (1990) Neural network ensembles. IEEE T. Pattern Anal., 12, 993–1001.
Hegyi, H. & Gerstein, M. (1999) The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol., 288, 147–164.

Hein, J. (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol. Biol. Evol., 6, 649–668.
Henikoff, J. G. & Henikoff, S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci., 12, 135–143.
Henikoff, S. & Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915–10919.
Henikoff, S. & Henikoff, J. G. (1993) Performance evaluation of amino acid substitution matrices. Proteins, 17, 49–61.
Henikoff, S. & Henikoff, J. G. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.
Henikoff, S., Henikoff, J. G. & Pietrokovski, S. (1999) Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, 15, 471–479.
Holm, L. & Park, J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics, 16, 566–567.
Holm, L. & Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138.
Holm, L. & Sander, C. (1996) Mapping the protein universe. Science, 273, 595–603.
Hooft, R. W. W., Vriend, G., Sander, C. & Abola, E. E. (1996) Errors in protein structures. Nature, 381, 272.
Hornik, K., Stinchcombe, M. & White, H. (1990) Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3, 551–560.

Huang, F. J., Zhou, Z. H., Zhang, H. J. & Chen, T. H. (2000) Pose invariant face recognition. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition pp. 245–250 IEEE Computer Society Press, Los Alamitos, California, USA.
Hubbard, T. J. P., Ailey, B., Brenner, S. E., Murzin, A. G. & Chothia, C. (1999) SCOP: a structural classification of proteins database. Nucl. Acids Res., 27, 254–256.
Hughey, R. & Krogh, A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS, 12, 95–107.
International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Jaroszewski, L., Li, W. & Godzik, A. (2002) In search for more accurate alignments in the twilight zone. Protein Sci., 11, 1702–1713.
Joachims, T. (1999) Advances in Kernel Methods – Support Vector Learning. MIT Press.
Jones, D. T. (1998) Computational Methods in Molecular Biology. Elsevier, New York, USA.
Jones, D. T. (1999a) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Jones, D. T. (1999b) GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol., 287, 797–815.
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992) A new approach to protein fold recognition. Nature, 358, 86–89.
Jones, D. T., Tress, M., Bryson, K. & Hadley, C. (1999) Successful recognition of protein folds using threading methods biased by sequence similarity and predicted secondary structure. Proteins, 37 Suppl. 3, 104–111.

Kabsch, W. & Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
Karlin, S. & Altschul, S. F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87, 2264–2268.
Karlin, S. & Altschul, S. F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90, 5873–5877.
Karlin, S., Dembo, A. & Kawabata, T. (1990) Statistical composition of high-scoring segments from molecular sequences. Ann. Stat., 18, 571–581.
Karplus, K., Barrett, C., Cline, M., Diekhans, M., Grate, L. & Hughey, R. (1999) Predicting protein structure using only sequence information. Proteins, 37 Suppl. 3, 121–125.
Karplus, K., Karchin, R., Barrett, C., Tu, S., Cline, M., Diekhans, M., Grate, L., Casper, J. & Hughey, R. (2001) What is the value added by human intervention in protein structure prediction? Proteins, 45 Suppl. 5, 86–91.
Karplus, K., Sjölander, K., Barrett, C., Cline, M., Haussler, D., Hughey, R., Holm, L. & Sander, C. (1997) Predicting protein structure using Hidden Markov models. Proteins, 29 Suppl. 1, 134–139.
Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. E. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol., 299, 501–522.
Kinch, L. N., Qi, Y., Hubbard, T. J. P. & Grishin, N. V. (2003a) CASP5 target classification. Proteins, 53 Suppl. 6, 340–351.

Kinch, L. N., Wrabl, J. O., Krishna, S. S., Majumdar, I., Sadreyev, R. I., Qi, Y., Pei, J., Cheng, H. & Grishin, N. V. (2003b) CASP5 assessment of fold recognition target predictions. Proteins, 53 Suppl. 6, 395–409.
Koretke, K. K., Russell, R. B., Copley, R. R. & Lupas, A. N. (1999) Fold recognition using sequence and secondary structure information. Proteins, 37 Suppl. 3, 141–148.
Koretke, K. K., Russell, R. B. & Lupas, A. N. (2001) Fold recognition from sequence comparisons. Proteins, 45 Suppl. 5, 68–75.
Kosinski, J., Cymerman, I. A., Feder, M., Kurowski, M. A., Sasin, J. M. & Bujnicki, J. M. (2003) A "FRankenstein's Monster" approach to comparative modeling: merging the finest fragments of fold-recognition models and iterative model refinement aided by 3D structure evaluation. Proteins, 53 Suppl. 6, 369–379.
Krogh, A., Brown, M., Mian, I. S., Sjölander, K. & Haussler, D. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol., 235, 1501–1531.
Kuncheva, L. I. & Whitaker, C. J. (2003) Measures of diversity in classifier ensembles. Mach. Learn., 51, 181–207.
Kuncheva, L. I., Whitaker, C. J., Shipp, C. A. & Duin, R. P. W. (2000) Is independence good for combining classifiers? In Proceedings of the 15th International Conference on Pattern Recognition pp. 169–171 IEEE Computer Society Press, Los Alamitos, California, USA.
Lagarias, J. C., Reeds, J. A., Wright, M. H. & Wright, P. E. (1998) Convergence properties of the Nelder-Mead Simplex method in low dimensions. SIAM J. Optim., 9, 112–147.
Lambert, C., Leonard, N., De Bolle, X. & Depiereux, E. (2002) ESyPred3D: prediction of protein 3D structures. Bioinformatics, 18, 1250–1256.

Lemer, C. M.-R., Rooman, M. J. & Wodak, S. J. (1995) Protein structure prediction by threading methods: evaluation of current techniques. Proteins, 23, 337–355.
Letunic, I., Goodstadt, L., Dickens, N. J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R. R., Ponting, C. P. & Bork, P. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucl. Acids Res., 30, 242–244.
Levitt, M. (1997) Competitive assessment of protein fold recognition and alignment accuracy. Proteins, 29 Suppl. 1, 92–104.
Levitt, M. & Gerstein, M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. USA, 95, 5913–5920.
Liepinsh, E., Andersson, M., Ruysschaert, J. M. & Otting, G. (1997) Saposin fold revealed by NMR structure of NK-lysin. Nature Struct. Biol., 4, 793–795.
Littlewood, B. & Miller, D. (1989) Conceptual modeling of coincident failures in multiversion software. IEEE T. Software Eng., 15, 1596–1614.
Lund, O., Nielsen, M., Lundegaard, C. & Worning, P. (2002). CPHmodels 2.0: X3M a computer program to extract 3D models. Abstract at CASP5 conference A102.
Lundström, J., Rychlewski, L., Bujnicki, J. & Elofsson, A. (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci., 10, 2354–2362.
Lyngso, R. B., Pedersen, C. N. S. & Nielsen, H. (1999) Metrics and similarity measures for hidden Markov models. In The Proceedings of ISMB 1999 pp. 178–186 AAAI Press.

Mao, J. (1998) A case study on bagging, boosting, and basic ensembles of neural networks for OCR. In Proceedings of the IEEE International Joint Conference on Neural Networks pp. 1828–1833 IEEE Computer Society Press, Los Alamitos, California, USA.
Marchler-Bauer, A. & Bryant, S. H. (1997) Measures of threading specificity and accuracy. Proteins, 29 Suppl. 1, 74–82.
McNemar, Q. (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153–157.
Morris, A. L., MacArthur, M. W., Hutchinson, E. G. & Thornton, J. M. (1992) Stereochemical quality of protein structure coordinates. Proteins, 12, 345–364.
Mott, R. (1992) Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. B. Math. Biol., 54, 59–75.
Mott, R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. J. Mol. Biol., 300, 649–659.
Moult, J., Fidelis, K., Zemla, A. & Hubbard, T. (2001) Critical assessment of methods of protein structure prediction (CASP): round IV. Proteins, 45 Suppl. 5, 2–7.
Moult, J., Hubbard, T., Fidelis, K. & Pedersen, J. T. (1999) Critical assessment of methods of protein structure prediction (CASP): round III. Proteins, 37 Suppl. 3, 2–6.
Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., Copley, R., Courcelle, E., Das, U., Durbin, R., Fleischmann, W., Gough, J., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McDowall, J., Mitchell, A., Nikolskaya, A. N., Orchard, S., Pagni, M., Ponting, C. P., Quevillon, E., Selengut, J., Sigrist, C. J., Silventoinen, V.,
Studholme, D. J., Vaughan, R. & Wu, C. H. (2005) InterPro, progress and status in 2005. Nucl. Acids Res., 33, D201–D205.
Müller, A. (2002). A protein structure based annotation of genomes. PhD thesis, Cancer Research UK and University College London.
Müller, A., MacCallum, R. M. & Sternberg, M. J. E. (2002) Structural characterization of the human proteome. Genome Res., 12, 1625–1641.
Murzin, A. (2001) Progress in protein structure prediction. Nat. Struct. Biol., 8, 110–112.
Murzin, A. G. (1999) Structure classification-based assessment of CASP3 predictions for the fold recognition targets. Proteins, 37 Suppl. 3, 88–103.
Murzin, A. G. & Bateman, A. (1997) Distant homology recognition using structural classification of proteins. Proteins, 29 Suppl. 1, 105–112.
Murzin, A. G. & Bateman, A. (2001) CASP2 knowledge-based approach to distant homology recognition and fold prediction in CASP4. Proteins, 45 Suppl. 5, 76–85.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Needleman, S. B. & Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Nelder, J. A. & Mead, R. (1965) A simplex method for function minimization. Comput. J., 7, 308–313.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997) CATH – a Hierarchical Classification of Protein Domain Structures. Structure, 5, 1093–1108.

Ortiz, A. R., Strauss, C. E. & Olmea, O. (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci., 11, 2606–2621.
Ota, M., Kawabata, T., Kinjo, A. R. & Nishikawa, K. (1999) Threading with explicit models for evolutionary conservation of structure and sequence. Proteins, 37 Suppl. 3, 126–132.
Overbeek, R., Larsen, N., Walunas, T., D'Souza, M., Pusch, G., Selkov, J., Liolios, K., Joukov, V., Kaznadzey, D., Anderson, I., Bhattacharyya, A., Burd, H., Gardner, W., Hanke, P., Kapatral, V., Mikhailova, N., Vasieva, O., Osterman, A., Vonstein, V., Fonstein, M., Ivanova, N. & Kyrpides, N. (2003) The ERGO genome analysis and discovery system. Nucl. Acids Res., 31, 164–171.
Panchenko, A., Marchler-Bauer, A. & Bryant, S. H. (1999) Threading with explicit models for evolutionary conservation of structure and sequence. Proteins, 37 Suppl. 3, 133–140.
Panchenko, A. R. (2003) Finding weak similarities between proteins by sequence profile comparison. Nucl. Acids Res., 31, 683–689.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. & Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210.
Pavlidis, P. & Noble, W. S. (2003) matrix2png: a utility for visualizing matrix data. Bioinformatics, 19, 295–296.
Pearson, W. R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol., 183, 63–98.
Pearson, W. R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160.

Pearson, W. R. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 71–84.
Pearson, W. R. & Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85, 2444–2448.
Pettitt, C. S., McGuffin, L. J. & Jones, D. T. (2005) Improving sequence-based fold recognition by using 3D model quality assessment. Bioinformatics, 21, 3509–3515.
Pietrokovski, S. (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucl. Acids Res., 24, 3836–3845.
Przybylski, D. & Rost, B. (2004) Improving fold recognition without folds. J. Mol. Biol., 341, 255–269.
Rice, D. W., Fischer, D., Weiss, R. & Eisenberg, D. (1997) Fold assignments for amino acid sequences of the CASP2 experiment. Proteins, 29 Suppl. 1, 113–122.
Rosen, B. (1996) Ensemble learning using decorrelated neural networks. Connect. Sci., 8, 373–383.
Rost, B. (2001) Review: protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204–218.
Russell, R. B., Saqi, M. A. S., Sayle, R. A., Bates, P. A. & Sternberg, M. J. E. (1997) Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J. Mol. Biol., 269, 423–439.
Rychlewski, L., Fischer, D. & Elofsson, A. (2003) LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins, 53 Suppl. 6, 542–547.
Rychlewski, L., Jaroszewski, L., Li, W. & Godzik, A. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci., 9, 232–241.

Rychlewski, L., Zhang, B. & Godzik, A. (1998) Fold and function predictions for Mycoplasma genitalium proteins. Fold Des., 3, 229–238.
Sadreyev, R. & Grishin, N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.
Salton, G. (1991) Developments in automatic text retrieval. Science, 253, 974–980.
Schäffer, A. A., Aravind, L., Madden, T. L., Shavirin, S., Spouge, J. L., Wolf, Y. I., Koonin, E. V. & Altschul, S. F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucl. Acids Res., 29, 2994–3005.
Schapire, R. E. (1990) The strength of weak learnability. Machine Learning, 5, 197–227.
Schneider, T., Stormo, G., Gold, L. & Ehrenfeucht, A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188, 415–431.
Schultz, J., Milpetz, F., Bork, P. & Ponting, C. P. (2004) SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. USA, 95, 5857–5864.
Schwede, T., Kopp, J., Guex, N. & Peitsch, M. C. (2003) SWISS-MODEL: an automated protein homology-modelling server. Nucl. Acids Res., 31, 3381–3385.
Shapiro, L. & Harris, T. (2000) Finding function through structural genomics. Curr. Opin. Biotechnol., 11, 31–35.
Shi, J., Blundell, T. L. & Mizuguchi, K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243–257.

Shindyalov, I. N. & Bourne, P. E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Siew, N., Elofsson, A., Rychlewski, L. & Fischer, D. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776–785.
Sigrist, C. J. A., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A. & Bucher, P. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform., 3, 265–274.
Simons, K. T., Bonneau, R., Ruczinski, I. & Baker, D. (1999) Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins, 37 Suppl. 3, 171–176.
Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol., 268, 209–225.
Simons, K. T., Ruczinski, I., Kooperberg, C., Fox, B. A., Bystroff, C. & Baker, D. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins, 34, 82–95.
Sippl, M. J. (1990) Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol., 213, 859–883.
Sippl, M. J., Lackner, P., Domingues, F. S., Prlic, A., Malik, R., Andreeva, A. & Wiederstein, M. (2001) Assessment of the CASP4 fold recognition category. Proteins, 45 Suppl. 5, 55–67.

Sippl, M. J. & Weitckus, S. (1992) Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a database of known protein conformations. Proteins, 13, 258–271.
Sjölander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I. S. & Haussler, D. (1996) Dirichlet mixtures: a method for improving detection of weak but significant protein sequence homology. Comput. Appl. Biosci., 12, 327–345.
Skolnick, J. & Kihara, D. (2001) Defrosting the frozen approximation: PROSPECTOR – a new approach to threading. Proteins, 42, 319–331.
Skolnick, J., Zhang, Y., Arakaki, A. K., Kolinski, A., Boniecki, M., Szilágyi, A. & Kihara, D. (2003) TOUCHSTONE: a unified approach to protein structure prediction. Proteins, 53 Suppl. 6, 469–479.
Smith, T. F. & Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
Tang, C. L., Xie, L., Koh, I. Y. Y., Posy, S., Alexov, E. & Honig, B. (2003) On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. J. Mol. Biol., 334, 1043–1062.
Tatusov, R. L., Altschul, S. F. & Koonin, E. V. (1994) Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl. Acad. Sci. USA, 91, 12091–12095.
Tomii, K. & Akiyama, Y. (2004) FORTE: a profile-profile comparison tool for protein fold recognition. Bioinformatics, 20, 594–595.
Tramontano, A. & Morea, V. (2003) Assessment of homology-based predictions in CASP5. Proteins, 53 Suppl. 6, 352–368.
Valencia, A. (2003) Meta, Meta N and Cyber servers. Bioinformatics, 19, 795.

Vapnik, V. (1995) The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA.
Vapnik, V. (1998) Statistical Learning Theory. John Wiley and Sons, Inc., New York, USA.
von Grotthuss, M., Pas, J., Wyrwicz, L., Ginalski, K. & Rychlewski, L. (2003) Application of 3D-Jury, GRDB and Verify3D in fold recognition. Proteins, 53 Suppl. 6, 418–423.
Wang, G. & Dunbrack, R. L. (2004) Scoring profile-to-profile sequence alignments. Protein Sci., 13, 1612–1626.
Wei, B. Q., Weaver, L. H., Ferrari, A. M., Matthews, B. W. & Shoichet, B. K. (2004) Testing a flexible-receptor docking algorithm in a model binding site. J. Mol. Biol., 337, 1161–1182.
Whitaker, C. J. & Kuncheva, L. I. (2003). Examining the relationship between majority vote accuracy and diversity in bagging and boosting. Technical report, School of Informatics, University of Wales.
Wilbur, W. J. & Lipman, D. J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA, 80, 726–730.
Williams, M. G., Shirai, H., Shi, J., Nagendra, H. G., Mueller, J., Mizuguchi, K., Miguel, R. N., Lovell, S. C., Innis, C. A., Deane, C. M., Chen, L., Campillo, N., Burke, D. F., Blundell, T. L. & de Bakker, P. I. W. (2001) Sequence-structure homology recognition by iterative alignment refinement and comparative modeling. Proteins, 45 Suppl. 5, 92–97.
Wootton, J. & Federhen, S. (1993) Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem., 17, 149–163.
Wootton, J. & Federhen, S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.

Wu, X. & Chen, Z. (2004) Recognition of exon/intron boundaries using dynamic ensembles. In Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference pp. 486–486 IEEE Computer Society Press, Washington, USA.
Xiang, Z., Soto, C. S. & Honig, B. (2002) Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc. Natl. Acad. Sci. USA, 99, 7432–7437.
Yates, F. (1934) Contingency tables involving small numbers and the χ2 test. J. R. Stat. Soc., Suppl. 1, 217–235.
Yona, G. & Levitt, M. (2002) Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol., 315, 1257–1275.
Zemla, A. (2003) LGA: a method for finding 3-D similarities in protein structures. Nucl. Acids Res., 31, 3370–3374.
Zemla, A., Venclovas, C., Fidelis, K. & Rost, B. (1999a) A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment. Proteins, 34, 220–223.
Zemla, A., Venclovas, C., Moult, J. & Fidelis, K. (1999b) Processing and analysis of CASP3 protein structure predictions. Proteins, 37 Suppl. 3, 22–29.
Zemla, A., Venclovas, C., Moult, J. & Fidelis, K. (2001) Processing and evaluation of predictions in CASP4. Proteins, 45 Suppl. 5, 13–21.
Zhang, Y. & Skolnick, J. (2004) Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702–710.
Zhang, Y. & Skolnick, J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucl. Acids Res., 33, 2302–2309.
Zhou, H. & Zhou, Y. (2004) Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins, 55, 1005–1013.

Zhou, Z. H., Jiang, Y., Yang, Y. B. & Chen, S. F. (2002) Lung cancer cell identification based on artificial neural network ensembles. Artificial Intelligence in Medicine, 24, 25–36.
