Electronic Dissertations Library

Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study, by Ruth V. Spriggs

Proteins continued

The Prediction of Protein Structure from Sequence

    As explained above, proteins adopt a specific 3D conformation determined by their amino acid sequence. There is, therefore, much interest in the possibility of predicting the folding pattern and 3D structure of a protein from its known primary structure. Because the amino acid sequence was known to determine the structure, it was suggested that if the chemical and physical properties of the amino acids and their behaviour in aqueous environments were known, then the structure adopted by a sequence could be predicted (Richards (1991)). This research problem, though, has proved far from simple. It has been dubbed the Protein Folding Problem (Richards (1991)), and is still not fully solved despite much research effort (Chandonia and Karplus (1999)).

    One of the problems hampering the process, according to Reimer and Fuellen (1997), is that protein folding is so complex that researchers cannot yet predict values for all the many variables that must be involved. Specifically, although the information is encoded in the amino acid sequence, it is still unknown which specific parts of this information control the process of protein folding. The code connecting a DNA sequence to an amino acid sequence was solved in a decade, but this ‘code’ is more difficult because it is more degenerate, and there is far more diversity in amino acid sequences (Garnier (1990)).

    The ability to predict the 3D structure of a protein from its amino acid sequence would benefit many areas of research. For example, it would enable educated guesses to be made about the function of newly discovered proteins without the time and expense (Reimer and Fuellen (1997)) required to perform x-ray crystallography and NMR. Indeed, if structure prediction became very sophisticated it might remove the need for x-ray crystallography or NMR in some cases. It would also be important for the progress of protein engineering, as it would enable changes to be made in the amino acid sequence with some expectation of how the change will affect the structure (Kaden et al. (1990)).

    The core of most globular proteins contains extensive regions of ß-sheet (Alberts et al. (1994)) and the stable domains implicated in the folding process above also have stable cores formed from ß-sheets (Alberts et al. (1994)). This, and the less specific theory that supersecondary structures are the first stable stage in folding, indicates that the detection of supersecondary motifs and ß-sheets is important in the study of protein folding (Stryer (1988)). The study that is the subject of this report identifies which ß-sheet motifs are frequently or never observed, and so aids the development of prediction methods by suggesting trends that other proteins are likely to follow. As ß-sheets inherently require the close interaction of regions separated in the amino acid sequence, an understanding of ß-sheets is central to understanding the organisation and folding of proteins, and therefore essential in prediction method development (Richardson (1977)).

    There are two general classes of protein structure prediction methods: theoretical methods, which make use of the physical and chemical properties of the residues such as hydropathy, size, and charge (Biou et al. (1988)), and empirical methods, which use the known sequences and structures of proteins (Clark et al. (1991)). Empirical methods can themselves be divided roughly into two types: those based on sequence homology between the protein under investigation and database proteins with known structures (Biou et al. (1988)), and those based on statistical analyses of proteins of known structure to deduce the probability of an amino acid appearing in a particular structural feature (Holley and Karplus (1989)).

Empirical Methods
Statistical

    Analyses of the relationships between known amino acid sequences and known 3D structures can help to uncover the rules that govern protein folding (Richards (1991)). For example, frequencies of occurrence of particular amino acids in each of the secondary structures can be used to predict structures that will, and will not, be created by an amino acid sequence (Stryer (1988)). However, according to Richards (1991), the frequency of occurrence of a particular amino acid in a particular secondary structure should be used with caution when predicting the structural location of an amino acid, because the statistical correlations found are usually not very strong.

    The two most widely used secondary structure prediction methods are the empirical statistical methods devised by Chou and Fasman (1978) and Garnier et al. (1978) (Garnier (1990)). These methods, described below, are not, however, the only techniques for secondary structure prediction that use statistical analyses of known structures and sequences.

Chou and Fasman (1978):
    The Chou and Fasman algorithm uses observed frequencies of amino acid residues in proteins of known structure to calculate conformational parameters for each of the 20 amino acids (Garnier (1990)). These parameters indicate the potential of each amino acid to be part of an alpha-helix or ß-sheet conformation, and were used to formulate a set of rules for determining the secondary structure fold of a protein (Yada et al. (1988)). The technique works by searching for a centre of nucleation of four to six amino acids for a secondary structure, and then investigating the extension of that structure by at least four residues in both directions (Chou and Fasman (1974)). The method can also be used to calculate the conformational potential of a section of a sequence as a whole by averaging the parameter values for the residues within the section.

    The theory of this method was introduced in 1974 (Chou and Fasman (1974)), with a disclaimer that the method relies on the assumption that short range effects are predominantly responsible for secondary structure formation. The advantage of this method is that it is very simple to use, and can locate alpha-helices and ß-sheets with 70-80% accuracy (Chou and Fasman (1978)). However, when the accuracy of the overall structure prediction is measured, the highest reported accuracy, in 1990, was 57% (Garnier (1990)). The ease of the Chou and Fasman method lay in the fact that it could be applied by hand (Argos and MohanaRao (1986)), although it has now been computerised, and computerisation also allows dipeptide and tripeptide interactions to be used to reinforce the accuracy of the prediction (Chou and Fasman (1978)).
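
    The averaging of conformational parameters over a section of sequence, and the search for candidate nucleation centres, can be sketched as follows. This is a minimal illustration only: the parameter values in TOY_HELIX_POTENTIAL, the threshold of 1.0, and the function names are assumptions made for the example, not the published Chou and Fasman parameters or rules, and a simple window average stands in for the full rule set.

```python
from typing import Dict, List, Tuple

# Hypothetical helix potentials for a few residues, invented for this sketch.
# They are NOT the published Chou and Fasman parameters.
TOY_HELIX_POTENTIAL: Dict[str, float] = {
    "A": 1.4, "E": 1.5, "L": 1.2, "K": 1.1, "G": 0.6, "P": 0.6, "S": 0.8,
}


def mean_potential(segment: str, potential: Dict[str, float]) -> float:
    """Average the per-residue parameter over a section of the sequence."""
    return sum(potential[residue] for residue in segment) / len(segment)


def nucleation_sites(
    sequence: str,
    potential: Dict[str, float],
    window: int = 6,
    threshold: float = 1.0,
) -> List[Tuple[int, float]]:
    """Return (start index, mean potential) for each window above the threshold."""
    sites = []
    for start in range(len(sequence) - window + 1):
        score = mean_potential(sequence[start:start + window], potential)
        if score > threshold:  # assumed cut-off, chosen for illustration
            sites.append((start, score))
    return sites


sites = nucleation_sites("AELKAEGPGSPG", TOY_HELIX_POTENTIAL)
print([start for start, _ in sites])
```

    In a fuller implementation each candidate site would then be extended in both directions, as described above, until the averaged potential falls away.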

Garnier et al. (1978):
    Garnier et al. (1978) claim that their method is the simplest possible statistical procedure to still yield good results, and is superior to the Chou and Fasman technique, even though the basic conformational parameters being used are very similar. They reported an accuracy of 63%, which is lower than the 70-80% claimed by Chou and Fasman, but higher than the 57% accuracy of the Chou and Fasman method seen in 1990 (Garnier (1990)).

    The Garnier et al. (1978) method, now known as GOR, after the initials of the three authors (Garnier (1990)), is a simple method for predicting the secondary structure of globular proteins. The GOR method is based on information theory, which it uses to analyse the relationship between amino acid sequence and secondary structure (Garnier (1990)). Each residue in the sequence is unambiguously assigned a conformational state using the estimated probability of that residue being in a particular conformation given the local sequences of the chain (Garnier (1990)). Observed relative abundances of conformations, in known protein structures, are also used to help increase the accuracy of prediction (Garnier et al. (1978)).
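
    The idea of assigning each residue the state best supported by its local sequence can be sketched as below. This is an illustration under stated assumptions rather than the GOR implementation: the information values in TOY_INFO are invented, the window half-width is left as a parameter, and the names predict_states and InfoTable are hypothetical.

```python
from typing import Dict, Tuple

STATES = ("C", "H", "E")  # coil, helix, extended (ß-strand); coil wins all-zero ties

# Key: (state, offset of neighbour from the residue being predicted, neighbour residue).
InfoTable = Dict[Tuple[str, int, str], float]

# Invented information values for illustration only.
TOY_INFO: InfoTable = {
    ("H", 0, "A"): 1.0, ("H", -1, "A"): 0.5, ("H", 1, "A"): 0.5,
    ("E", 0, "V"): 1.0, ("E", -1, "V"): 0.5, ("E", 1, "V"): 0.5,
    ("C", 0, "G"): 1.0,
}


def predict_states(sequence: str, info: InfoTable, half_width: int) -> str:
    """Assign one state per residue by summing information over a local window."""
    prediction = []
    for j in range(len(sequence)):
        totals = {}
        for state in STATES:
            total = 0.0
            for offset in range(-half_width, half_width + 1):
                k = j + offset
                if 0 <= k < len(sequence):  # ignore positions off the chain ends
                    total += info.get((state, offset, sequence[k]), 0.0)
            totals[state] = total
        prediction.append(max(STATES, key=lambda s: totals[s]))
    return "".join(prediction)


print(predict_states("AAGGVV", TOY_INFO, half_width=1))
```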

    Developments in GOR (Gibrat et al. (1987)) increased the accuracy of predictions by seven percent. The GOR II method uses an identical algorithm to GOR, but has new refined information values that were obtained from a larger database, therefore increasing the accuracy, and GOR III takes this further by attempting to use residue pair information as well (Garnier and Robson (1989)). According to Robson and Garnier (1993) “the GOR methods have survived for ... [so long] ... because of their formal correctness, ease of reproduction and various methods of objective testing” p.506.

    A website implementing the GOR method of protein prediction can be found at http://bmbsgi11.leeds.ac.uk/bmb5dp/gor.html. Here, the researcher can input an amino acid sequence and the secondary structure elements of the protein will be predicted.

Sequence Similarity Methods

    Structure can be predicted by detecting a similarity between the sequence of the protein under investigation and the sequence of a protein whose structure is already known. These methods use the assumption that sequence similarity leads to structural similarity and in general have an accuracy of 65% (Reimer and Fuellen (1997)).

    If a homologue of the sequence under investigation is found in a protein database it is relatively easy to then obtain a 3D picture of the unknown protein by homology modelling (Segovia (1997)). However, using sequence homology is not always useful; for example, no similar sequences may be found, or no structures may be known for the proteins with similar sequences. There is also the fact that the extent of differences in structure between two proteins is related to the extent of the amino acid sequence divergence (Chothia and Lesk (1986)), which could affect the accuracy of predictions.

    Levin et al. (1986) report an algorithm that predicts secondary structure using short sections of sequence similarity, even if the whole protein is not homologous. A similarity matrix is used to assign a sequence similarity score between two sequences. If the score is above a certain threshold then the sections are considered homologous and thus their structures are considered homologous. Structure prediction by sequence homology was improved further by Levin et al. (1993) by the use of multiple sequence alignments. This provides extra information about sequence to structure relationships by using the alignment of the sequences of a family of homologous proteins to understand which sections of sequence are important for a particular structural feature.
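
    The scoring of short sections against a similarity matrix and a threshold can be illustrated with the sketch below. The similarity values, the list of conservative pairs, the segment length, and the threshold are all assumptions made for the example; they are not taken from Levin et al. (1986).

```python
from typing import List, Tuple

# Hypothetical similarity scheme: identical residues score 2,
# the conservative pairs listed here score 1, everything else scores 0.
CONSERVATIVE = {
    frozenset("IL"), frozenset("IV"), frozenset("LV"),
    frozenset("DE"), frozenset("KR"),
}


def residue_similarity(x: str, y: str) -> int:
    """Look up the toy similarity score for a pair of residues."""
    if x == y:
        return 2
    if frozenset((x, y)) in CONSERVATIVE:
        return 1
    return 0


def segment_score(seg_a: str, seg_b: str) -> int:
    """Sum the residue similarities over two segments of equal length."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    return sum(residue_similarity(x, y) for x, y in zip(seg_a, seg_b))


def similar_segments(
    query: str, known: str, length: int, threshold: int
) -> List[Tuple[int, int, int]]:
    """Return (query start, known start, score) for segment pairs above threshold."""
    hits = []
    for i in range(len(query) - length + 1):
        for j in range(len(known) - length + 1):
            score = segment_score(query[i:i + length], known[j:j + length])
            if score > threshold:
                hits.append((i, j, score))
    return hits


print(similar_segments("GGILKDE", "LLRDEGG", length=5, threshold=7))
```

    Each hit would then allow the known secondary structure of the matched section to be transferred to the corresponding section of the query.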

Neural Networks

    Qian and Sejnowski (1988) attempted to use neural network models to predict secondary structure, but found the performance of the system disappointing. However, in the following year, Holley and Karplus (1989) found their neural network method to be equal to or better than other secondary structure prediction methods. In a training phase, the Holley and Karplus (1989) neural network uses a dataset of known sequences and structures to ‘learn’ the relationship between sequence and structure. The system can then use the ‘knowledge’ gained to predict the structure of unknown sequences. Kaden et al. (1990) also used techniques from artificial intelligence when they tackled protein structure prediction using graph theory to represent the structures. Their knowledge-based system ‘learnt’ from the PDB, generated folding rules, and predicted secondary and supersecondary structures, taking into account long and short range interactions.

    Rost and Sander (1994) improved on neural network prediction systems, reaching 72% accuracy by including evolutionary information in the form of multiple sequence alignments, and suggested that the power of neural networks should be fully exploited for protein structure prediction. Chandonia and Karplus (1999) agree, and go further, stating that the most accurate secondary structure prediction methods will be found using neural networks. The Chandonia and Karplus (1999) system uses two neural networks to predict secondary structure and structural class, to an accuracy of 75%, and a technique that estimates the probability of finding each type of secondary structure at each position in the sequence. This enables the user to know which sections of the structure are the most accurately predicted. The researcher can use this method at http://www.cmpharm.ucsf.edu/~jmc/pred2ary/.

Theoretical Methods

    Richards (1991) thinks that it is conceivable that steric hindrance and hydrophobic forces alone could direct the folding of a protein, and that other interactions are present to hold the conformation once formed. These ideas have been used to devise theoretical methods for protein secondary structure prediction.

    Knowledge about steric hindrance can be used to rule out certain conformational positions for each amino acid in a chain (Fasman (1989)). By predicting the amount of rotation possible about the bonds of the backbone, other than the rigid amide bond, for each amino acid, the structures that amino acid is more likely to occur in can be determined. The amount of rotation about the bond between nitrogen and alpha-carbon atoms is called phi, and the amount between alpha-carbon and carbonyl carbon atoms is termed psi. Certain combinations of phi and psi are not accessible due to steric hindrance, but allowed ranges can be readily predicted and visualised on Ramachandran plots (Ramachandran and Sasisekharan (1968)).
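
    The rotation angles described above are dihedral angles, each defined by four consecutive backbone atoms: phi by C (of the preceding residue), N, alpha-carbon and C, and psi by N, alpha-carbon, C and N (of the following residue). A minimal sketch of the calculation from atomic coordinates is given below; the function name and the coordinates in the usage lines are hypothetical and chosen only to make the geometry easy to check.

```python
import math
from typing import Tuple

Vector = Tuple[float, float, float]


def _sub(a: Vector, b: Vector) -> Vector:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vector, b: Vector) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vector, b: Vector) -> Vector:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0: Vector, p1: Vector, p2: Vector, p3: Vector) -> float:
    """Signed dihedral angle in degrees about the p1-p2 bond (cis = 0, trans = +/-180)."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane containing p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane containing p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four made-up points with the outer atoms at right angles about the central bond.
print(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
```

    Applying the function to every residue of a known structure gives the (phi, psi) pairs that are plotted on a Ramachandran plot.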

    The assignment of polarity to amino acids can help in determining which sections of the sequence are likely to be found in the interior of the folded protein (Stryer (1988)), although it is difficult to predict which non-polar residues will make up the small fraction of such residues that are not buried (Richards (1991)). Lim (1974) devised a method of assigning secondary structures to sequences using rules based on the hydrophobic or hydrophilic nature of the amino acid residues, but the most widely used method based on hydrophobicity is that of Kyte and Doolittle, devised in 1982 and implemented in a program called SOAP (Fasman (1989)). This method is used particularly for determining the structure of membrane proteins. The section of a protein that crosses the membrane must be in an alpha-helix or ß-sheet to enable it to be stable in the hydrophobic environment. The Kyte and Doolittle algorithm can assign transmembrane helices accurately and, according to Jahnig (1990), more accurately than it can predict structures in soluble proteins. Fasman and Gilbert (1990) disagreed, however, and advised that the prediction of transmembrane structures was not yet accurate enough to be useful.
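
    A hydropathy profile of the kind used in hydrophobicity-based methods can be sketched as a sliding-window average. The scale below is the Kyte and Doolittle hydropathy scale as it is commonly tabulated, and should be checked against the original paper before serious use; the window length, the threshold, and the function names are assumptions made for this example, and the code is not the SOAP program.

```python
from typing import List

# Kyte and Doolittle hydropathy values as commonly tabulated (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int) -> List[float]:
    """Mean hydropathy of each window of the given length along the sequence."""
    return [
        sum(KYTE_DOOLITTLE[residue] for residue in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


def hydrophobic_stretches(sequence: str, window: int, threshold: float) -> List[int]:
    """Start positions of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value > threshold]


# Short made-up sequence; a real search for membrane-spanning sections
# would use a much longer window than 3.
print(hydrophobic_stretches("ILVKRD", window=3, threshold=1.6))
```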

    The shape of a protein can in principle be determined by computer using a potential energy function which, given the values describing the interactions between each pair of atoms in the chain, can deduce the lowest energy conformation for those atoms. Energy minimisation techniques were seen to have the potential to predict tertiary structure directly from amino acid sequence, but required more research due to the very subtle differences in energy seen between the folded and unfolded states (Rawlings (1987)). By 1991, this approach had been used successfully to confirm, or improve, structures predicted by other methods (Richards (1991)).
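
    The idea of scoring a conformation by summing an interaction energy over every pair of atoms can be illustrated with the sketch below. It assumes a single Lennard-Jones 12-6 term with the same hypothetical epsilon and sigma for every pair; real potential energy functions contain further terms and atom-specific parameters, so this is a toy comparison of two arrangements rather than a minimisation method.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def pair_energy(r: float, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """Lennard-Jones 12-6 energy for two atoms separated by distance r."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def total_energy(
    atoms: Sequence[Point], epsilon: float = 1.0, sigma: float = 1.0
) -> float:
    """Sum the pair energy over every distinct pair of atoms."""
    energy = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            energy += pair_energy(math.dist(atoms[i], atoms[j]), epsilon, sigma)
    return energy


d = 2.0 ** (1.0 / 6.0)  # separation at which one pair has its lowest energy
compact = [(0.0, 0.0, 0.0), (d, 0.0, 0.0), (d / 2.0, d * math.sqrt(3.0) / 2.0, 0.0)]
extended = [(0.0, 0.0, 0.0), (d, 0.0, 0.0), (2.0 * d, 0.0, 0.0)]
print(total_energy(compact) < total_energy(extended))
```

    Under these assumptions the compact arrangement has the lower energy, because all three pairs sit at the favourable separation.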

    Robson (1999) states that if biotechnology is to produce new protein forms that are more robust or novel, there must first be a solution to the protein folding problem that uses only fundamental physicochemical principles. Even with faster and faster computers, searching for the lowest energy state alone is still not feasible without some navigational rules. Work on these rules continues today.

Secondary Structure Prediction

    The methods described above for predicting the secondary structure of a protein have accuracies ranging from 57% to 75%, although improvements in prediction accuracy can often be made by using more than one method (Garnier (1990)). Biou et al. (1988) suggested the combination of three complementary secondary structure prediction methods: GOR III (described above), Homologue (a sequence homology method), and a new method using bit patterns based on hydrophobicity. This ‘Combine’ method was designed to use the differences between the best predictions from each method to provide a single more accurate prediction. This system achieved an accuracy of 65.5% (Garnier (1990)).
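
    One simple way of combining several predictions, and of measuring accuracy, is sketched below. The majority vote used here is a stand-in chosen for illustration and is not the rule used by the ‘Combine’ method; likewise, accuracy is assumed to mean the percentage of residues assigned the observed state, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def consensus(predictions: Sequence[str]) -> str:
    """Majority vote at each position across equal-length state strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    # most_common orders equal counts by first appearance, so a complete
    # tie falls back on the first method listed.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*predictions)
    )


def accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions at which the predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Three made-up predictions (H helix, E strand, C coil) for a four-residue chain.
combined = consensus(["HHEC", "HEEC", "CHEC"])
print(combined, accuracy(combined, "HHCC"))
```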

Many of the programs for analysing the sequence of a protein can now be used over the web. Segovia (1997) recommends three sites for homologue searches:
http://protein.toulouse.inra.fr/prodom.html
http://www.sdsc.edu/MEME/meme/website/
http://www.blocks.fhcrc.org/

and two sites for secondary structure prediction by other methods:
http://dodo.cpmc.columbia.edu/pp/submit_adv.html
http://www.ibcp.fr/serv_pred.html

Tertiary Structure Prediction

    The majority of the prediction techniques described above were designed to predict the secondary structure of a protein from its amino acid sequence, and, although the prediction of secondary structure may be quite advanced, there is a lack of accurate techniques to predict tertiary structure (Richards (1991)). To be as useful as structure determination methods, prediction methods must be able to predict tertiary structure.

    The context in which a peptide segment folds is thought to be crucial to the structure created, i.e. the amino acid sequence further up the protein chain may play as large a part in determining the 3D structure of a segment of protein as the amino acids in the vicinity (Richards (1991)). Prediction techniques therefore need to take long range interactions into account when predicting tertiary structure (Holley and Karplus (1989)). The maximum accuracy achieved for secondary structure prediction above is 75%; this was achieved using local sequence information. Garnier (1990) believes that this could mean that local sequence controls 75% of structure, and that the rest is due to folding. On this view, it is not possible to get a 100% correct prediction of secondary structure without a global folding theory that takes tertiary structure interactions into account (Gibrat et al. (1987)).

    Finkelstein (1997) describes attempts to remove some of the errors in prediction caused by taking only short range interactions into account. One way is to use the information obtained from distantly homologous proteins, but then there is a chance that the prediction will be a generalised folding pattern rather than an exact fold. A recent answer to this problem is ‘threading’. Threading employs libraries of possible ‘target’ folds, from the PDB, and mounts the sequence under investigation onto these folds to recognise which of the folds fits the sequence. Threading can be ‘no-gap’ or ‘gapped’, where gapped threading allows gaps in the match of sequence to fold, and each alignment's compatibility is then given some form of ‘score of alignment’.
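
    The principle of ‘no-gap’ threading can be illustrated with the toy sketch below. It assumes that each fold in the library is reduced to a string of buried (‘b’) and exposed (‘e’) positions and that compatibility is scored by whether hydrophobic residues land in buried positions; the fold strings, the hydrophobic grouping, and the scoring rule are all invented for the example and are far simpler than the scores used by real threading programs.

```python
from typing import Dict, Tuple

HYDROPHOBIC = set("AVLIMFWC")  # assumed grouping, for this sketch only


def position_score(residue: str, environment: str) -> int:
    """+1 when the residue type suits the environment ('b' or 'e'), otherwise -1."""
    buried = environment == "b"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def thread_no_gap(sequence: str, fold: str) -> Tuple[int, int]:
    """Slide the sequence along one fold without gaps; return (best offset, best score).

    Returns an offset of -1 if the fold is shorter than the sequence.
    """
    best_offset, best_score = -1, -len(sequence) - 1
    for offset in range(len(fold) - len(sequence) + 1):
        score = sum(
            position_score(residue, fold[offset + i])
            for i, residue in enumerate(sequence)
        )
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


def best_fold(sequence: str, library: Dict[str, str]) -> Tuple[str, int, int]:
    """Return (fold name, offset, score) for the best-scoring fold in the library."""
    results = {name: thread_no_gap(sequence, fold) for name, fold in library.items()}
    name = max(results, key=lambda n: results[n][1])
    return (name, *results[name])


toy_library = {"foldA": "bbeebb", "foldB": "eebbee"}  # hypothetical folds
print(best_fold("LLKKVV", toy_library))
```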

    An alternative approach to tertiary structure prediction is reported by Di Francesco et al. (1997). This attempts to predict the tertiary structure from the order of the secondary structure elements in the sequence. Hidden Markov Model representations of protein topology families are trained using known secondary structure sequences and tertiary structures, and then applied to the recognition of protein topology in determined or predicted secondary structure elements.

There are some fold recognition and tertiary structure prediction systems available on the web, and those recommended by Segovia (1997) are:
http://www.doe-mbi.ucla.edu/people/frsvr/frsvr.html
http://dodo.cpmc.columbia.edu/pp/submit_adv.html
http://www-lmmb.ncifcrf.gov/~nicka/123D.html

Protein Structure Evolution

    New proteins generally evolve by alterations of old ones. Once a protein with useful surface properties has developed, its basic structure can be incorporated into many proteins by gene duplication, modification and recombination. This can explain why proteins with related functions in present day organisms often have similar amino acid sequences. Families of proteins are thought to have evolved from a single gene, through duplication to produce other genes, which then gradually accumulated mutations producing related proteins with related functions (Alberts et al. (1994)).

    Some proteins have diverged to such an extent that their sequence similarity has dropped below 25%, while still preserving their folding and function. These highly diverged sequences can lead to family relationships being unknown until the structures have been elucidated (Alberts et al. (1994)). In fact, structures are often more conserved in evolution than the amino acid sequences that form them (Taylor and Orengo (1989)). In contrast, some proteins have highly similar sequences, generally over short stretches, and yet are not homologous proteins and do not have similar 3D structure or function (Thonnard (1996)). In this case, convergence has occurred to form a common supersecondary structure (Rao and Rossmann (1973)).

    By comparing the sequences of proteins it is possible to build up genealogical trees with information gained from the comparison of the amino acid sequences of one particular protein from different species. Due to evolution, as described above, the protein has a slightly different amino acid sequence in each of the species (Reimer and Fuellen (1997)). In a similar fashion, the comparison of related structures can reveal the changes in amino acid sequence that have occurred during evolution (Chothia and Lesk (1986)). This would enable a study of what has caused the maintenance of that structure through evolution, and therefore may provide insights into the protein folding mechanism. Rao and Rossmann (1973) devised a method for comparing the secondary and tertiary conformations of different but structurally related proteins. This enabled the investigation into the changes in structure that occur during evolution to make good progress. Research continues in this area: for example, Taylor and Orengo (1989) were able to produce an alignment of the sequences of two proteins based on structural criteria, using distance plot analyses and dynamic programming techniques, and Vriend and Sander (1991) devised a structure alignment algorithm that found relationships between proteins that sequence alignment methods failed to find.
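
    Dynamic programming, mentioned above, can be illustrated with a standard global alignment score for two sequences. This sketch compares residues by simple identity rather than by the structural criteria used by Taylor and Orengo (1989), and the match, mismatch and gap values are assumptions made for the example.

```python
def global_alignment_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1
) -> int:
    """Best global alignment score of a and b, found by dynamic programming."""
    # previous[j] holds the best score for aligning the processed prefix of a
    # with the first j residues of b.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


# Made-up sequences: the best alignment has three matches and one gap.
print(global_alignment_score("ACGT", "AGT"))
```

    Replacing the identity comparison with a measure of how similar the structural environments of two residues are would move the sketch closer to a structure-based alignment.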

    The origin of similarities between protein structures is not always clear. As explained above, a common pattern of folding between two proteins may suggest that they evolved from the same precursor, or that the structure is dictated by functional requirements, or that that particular fold is energetically favoured and therefore tends to occur spontaneously (Remington and Matthews (1980)). After studying the distribution of actually occurring ß-sheet topologies, Richardson (1977) concluded that striking topological similarities between proteins can readily happen by chance, reinforcing the last of these reasons for structural similarity.

    The current study, identifying trends in the ß-sheet motifs found in proteins, may help in the study of protein structure evolution because motifs not previously known to occur frequently may be found to do so, and whether their presence is due to divergent or convergent evolution can then be investigated.

The Function of Proteins

    Proteins play crucial roles in nearly all biological processes and display a wide variety of functions both within the cell and extracellularly. For example, the majority of chemical reactions occurring in the cells of living organisms require protein catalysts called enzymes. Each enzyme is involved in the conversion of specific substrates into specific products; for example, thrombin splits a polypeptide chain into two at a specific point. Enzymes work by selectively stabilising the transition states of the reaction and by holding the reactants in the correct position and orientation. Proteins are also involved in the transport of molecules (within cells and out of cells), in the recognition of other cells, in the coordinated motion seen in muscle fibres, the excitability of nerve cells, and in the control of cell growth and differentiation (Alberts et al. (1994)).

    Globular proteins are compact structures with an array of amino acids on the surface which, in conjunction with the 3D conformation, confer functionality. The surface of each protein has a unique chemical reactivity determined by the amino acid residues exposed and their exact orientation to each other (Alberts et al. (1994)). The many different structures that proteins can adopt are therefore responsible for the vast array of functions that proteins exhibit, and for allowing the binding of so many different types of ligand. Ligands that are bound by the protein must fit snugly into the shape of the protein at the binding site (active site) to allow the weak interactions that hold them to occur (Stryer (1988)). The binding site is usually a cavity formed by the arrangement of amino acids on the surface. The rest of the structure of the protein is necessary to maintain the correct shape for the cavity to form, and to allow the binding of regulatory molecules at positions other than the active site (Alberts et al. (1994)). Drug design can take advantage of the fact that the conformation of proteins confers function by designing molecules that are the right shape to bind at specific points of the protein to affect the protein’s normal functioning.




MSc in Information Management, 1998/1999
© University of Sheffield - Department of Information Studies (All Rights Reserved)