Electronic Dissertations Library

Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study, by Ruth V. Spriggs

Literature Review continued

Searching Three-Dimensional Protein Structure Databases

Substructure Searching

    Substructures are sections of a larger structure, such as the ß-sheet substructures being investigated for this dissertation. The substructure is often a pharmacophore: a 3D fragment of a molecule that is thought to be responsible for the function of the molecule. Substructure searching is a matching process analogous to Boolean searching in that the database is divided into two sets; one set of molecules that do contain the substructure and one set of molecules that do not.

The Use of Graph Theory

    Work in Sheffield using graph matching techniques for the small molecules in the Cambridge Structural Database led onto the development of these techniques for searching the macromolecular structures in the Protein Data Bank (Willett (1990)). However, the use of distance matrices to search for substructures in the Protein Data Bank (PDB) had already been proposed by Richards and Kundrot (1988).

    An ongoing project reviewed in Artymiuk et al. (1992c) applies graph theoretical techniques to the 3D structures contained in the PDB, with particular interest in secondary structure motifs (patterns of ß-strand and alpha-helix secondary structure). The nodes of the graphs are vectors drawn along the major axis of the ß-strands and alpha-helices, and the edges describe the geometric relationships between pairs of these vectors in terms of the angle between two vectors, the distance of their closest approach, and the distance between their midpoints. A motif can be identified by the presence of the query graph representing it, within a larger graph using a subgraph isomorphism algorithm (Artymiuk et al. (1992c)). The Kabsch and Sander (1983) algorithm is used initially to assign the secondary structure elements to the protein structures in the PDB, unless only C alpha coordinates were available, in which case a separately devised program was used (Grindley et al. (1993)). The Kabsch-Sander algorithm assigns secondary structure to the x-ray coordinates obtained from x-ray crystallography (Gibrat et al. (1987)), using hydrogen-bonding patterns (Levin et al. 1986)).

    The subgraph isomorphism algorithm used was that due to Ullmann (1976); which was used as the basis for the POSSUM program (Protein Online Substructure Searching - Ullmann Method), developed at Sheffield (Artymiuk et al. (1992c)). Ullmann’s algorithm for subgraph isomorphism was described earlier. The development of the POSSUM program is described in Mitchell et al. (1989), including details of the three modifications to the Ullmann algorithm (as encoded by Brint and Willett (1987a)) that were necessary. Briefly, these were: to change the nodes and edges of the graph such that they represented the secondary structure elements, to adapt the tolerance input to refer to these new edge measurements, and to change the algorithm so that all possible matches are retrieved. POSSUM was tested using the ß-sheet structures found by Richardson (1977) and found to be effective. The work behind the development of POSSUM is also described in detail in Artymiuk et al. (1989) where the findings of a British Library-funded research project into the searching of 3D protein structures in the Protein Data Bank are reported.

    As described earlier, Brint and Willett (1987a) evaluated four geometric searching algorithms, using graphs based on the coordinates of the C alpha atoms of the molecule, rather than vectors of the secondary structures as described above, and structures from the Cambridge Crystallographic Data Bank. They found that Ullmann’s (1976) algorithm had a general superiority over the other algorithms for 3D chemical structure data. However, Brint et al. (1989) explain that the Ullmann algorithm alone is unsuitable for searching large macromolecules due to the computational storage required. The Lesk (1979) algorithm appeared suitable but is time consuming, therefore Brint et al. (1989) proposed a two stage procedure using the Lesk algorithm as a precursor to the subgraph isomorphism algorithm of Ullmann.

    A potential improvement to POSSUM is described in Willett (1990). This reports an algorithm devised to rank the output from POSSUM searches by approximating the overall shape of the motif. Ranking could be necessary because, to achieve complete recall, the required angles and distances are given large tolerances which, obviously, decreases the precision of the search, and could create large numbers of hits. The ranking procedure uses many measures for the distance between two vectors, rather than just the midpoint or closest approach distance as is usually used. The ranking is highly demanding of computational resources but could be implemented using parallel processing.

    Graph theoretical methods can also be used to represent and search three-dimensional patterns of side chains in protein structures (Artymiuk et al. (1994b)). The position of a side chain is denoted by a pseudoatom (node) and the edges of the graph are the distances between side chains. A user defined pattern of side chains is searched for, using the subgraph isomorphism algorithm due to Ullmann (1976), in a program called ASSAM.

    Subgraph isomorphism algorithms are generally very time consuming, due to the NP complete problem explained earlier, therefore Barnard (1993) discusses methods for increasing the efficiency. Parallel processing is discussed, as is the use of back tracking, and the use of partitioning and relaxing to reduce the number of possible mappings which must be investigated. The Ullmann algorithm uses a combination of back tracking and a relaxation-based refinement step (Barnard (1993)); this is the pruning technique described earlier (Ujah (1992)).

    Graph theory has also been used by Flower (1994), to develop a method for detecting ß-barrel secondary structures. The hydrogen bonded connectivity of ß-sheets was derived using pattern recognition techniques and expressed as a graph. Barrels correspond to rings in these graphs and were identified using ring perception algorithms. The techniques developed at Sheffield could theoretically be used to search for ß-barrels, but the method used here is more general as there are no restrictions on the size or geometry of the barrel identified.

    A new algorithm for subgraph isomorphism has recently been reported by El-Sonbaty and Ismail (1998). This decomposes graphs into smaller subgraphs and matching occurs at the level of these smaller graphs. This method is efficient and in many respects superior to the methods already in use, and in particular it overcomes most of the limitations of tree search algorithms, such as Ullmann (1976).

    An article that gives an overview of graph theory and its use in the comparison of 3D macromolecules has been written by Artymiuk et al. (1995).

The Use of Other Methods

    Many other approaches have been taken to research into 3D protein substructure searching. One approach to organising information about protein structure is to use database management systems. For example BIPED, a relational protein structure database established by Thornton and Gardner (1989) as the structure section of ISIS: a project to integrate sequence and structure databases (Sternberg and Ismail (1989). Using a query language the relational database management system can search for, for example, all the structures that a particular sequence is seen to adopt. Relational databases have their limitations though, so Thornton and Gardner (in Artymiuk et al. (1995)) devised an order-based database program called IDITIS which includes many improvements over BIPED, such as an ability to investigate side chain orientations.

    Although relational databases are fast when used for sequential and exhaustive searches, object-oriented databases are superior when used for answering complex protein structure queries, according to Gray et al. (1990). Gray et al. (1990) developed an object-oriented database to store protein structure data with objects representing primary, secondary and tertiary structures. The network structure of these objects allows complex queries to retrieve information by navigating through the data.

    The logic programming language PROLOG has also been used to represent and reason about protein structure (Rawlings et al. (1985). ß-structural motifs were defined using PROLOG rules, as a PROLOG database of simple facts about protein structure topology, and then the program can detect the presence of query motifs in PROLOG representations of database proteins.

Similarity Searching

    Structure searching involves searching for an exact match to an entire query structure. This is used, for example, before patenting a structure a researcher believes is newly discovered (Downs and Willett (1996)). Similarity searching is analogous to best match information retrieval systems as the outcome is a ranking of the database with structures most similar to the query at the top of the list.

The Use of Graph Theory

    The isomorphism algorithm used for similarity searching is the maximal common subgraph (MCS) isomorphism algorithm. A maximal common substructure is the largest set of nodes and edges common to two structures. The Crandell and Smith (1983) MCS isomorphism algorithm is iterative, gradually adding nodes and checking they still match, while the Bron and Kerbosch (1973) algorithm involves a clique detection algorithm. In a clique detection algorithm two graphs are compared to form a correspondence graph in which the node labels are the same for each pair of nodes. A clique is a subgraph of a graph in which every node is connected to every other node and that is not contained in any larger subgraph with this property (Artymiuk et al. (1992c)). The cliques of the correspondence graph are the maximal common subgraphs (Stoyell 1998). In large molecules, when the MCS is quite small, the Crandell and Smith (1983) algorithm is more appropriate than the one due to Bron and Kerbosch (1973), as used for 3D chemical similarity searching, due to limitations of storage and time.

    In further research at Sheffield a maximal common subgraph (MCS) isomorphism algorithm, based on the clique detection algorithm of Bron and Kerbosch (1973), is being used as the basis for PROTEP (PROtein Topographic Explorer Program) (Artymiuk et al. (1992c)). The same graph representation developed for POSSUM is used in PROTEP. PROTEP allows similarity searching, i.e. it identifies patterns of alpha-helices and ß-strands that a pair of proteins have in common, and can therefore find all the proteins in a database that have substructures in common with a query protein structure, or it can identify a new motif if several proteins are found to be similar in a way not seen before (Artymiuk et al. (1992c)). PROTEP is also referred to as the PROTE program (Willett (1990)), and its development is further described in Grindley et al. (1993), who show it to be both effective and efficient. Artymiuk et al. (1992a) describe using PROTEP for similarity searching in databases of 3D macromolecules.

    Gardiner et al. (1997) evaluated several different clique detection algorithms with respect to their effectiveness in MCS detection. They found that the most efficient was an algorithm described by Carraghan and Pardalos (1990) which was two to three times faster than the Bron-Kerbosch (1973) algorithm used in PROTEP. But the Bron-Kerbosch algorithm is more useful because it allows the identification of all substructures in common, not just the largest one as in the Carraghan and Pardalos algorithm. Gardiner et al. (1997) suggest combining the two to create an effective and efficient algorithm for MCS isomorphism. They developed a new version of PROTEP that used the Carraghan-Pardalos algorithm as an initial screening step before the Bron-Kerbosch MCS search, and found that it doubled the speed of searches.

    Much other work has also been done on searching 3D protein structures using graph representations. In 1984 Murthy (1984) published a ‘fast method of comparing protein structures’. Structures were represented as a set of secondary structure elements, an idea also used in the Sheffield work, but this method found the three rotational and three translational parameters which resulted in the maximum superposition of two structures. These six parameters are used to deduce the similarity between pairs of structures.

    Mizuguchi and Go (1995) report another method of comparing protein structures which uses the spatial arrangements of secondary structure elements. Each element is represented by a vector and common spatial relationships of vectors between proteins is detected. This method is very similar to the methods developed in Sheffield but whereas those methods use only two parameters, angle and distance, to specify the spatial relationship between to vectors, Mizuguchi and Go (1995) use six parameters to give a complete description of the relationship. Another difference is that in this method the nodes are not discretely connected or unconnected. Instead, continuous numbers are used, such that a similarity score or ranking concept can be used to allow more flexible querying.

    A further development of the Sheffield graph theory methods is described by Lesk (1995). This uses a tabular representation of protein folding patterns that includes information on the order of secondary structure elements along the chain, which elements interact, and their orientation with respect to each other. These tableaux make sense to people and computers which is an advantage over traditional graphs where atomic coordinates are used by the computer, which are then represented as pictures for people to understand. The similarity between structures expressed as tableaux is analysed by methods that detect the largest equal subtableaux, i.e. maximal common substructure.

    Subbarao and Haneef (1991) use graph theory, where the nodes are atoms and the edges inter atomic distances, to study protein structure. The Ullmann algorithm is used to determine the set of equivalent atoms in two molecules, retaining the order and direction of the two polypeptide chains. This method has been used in the structural alignment of protein sequences and secondary structures, and has been extended to be able to find the core structure of a family of homologous proteins. The common core of homologous proteins is a useful tool, as it enables the searching of databases for other possibly homologous proteins (Subbarao and Haneef (1991)).

    Kato and Takahashi (1997) describe a 3D motif search of protein structures using the SS3D-P program which uses a graph representation which is different again. The 3D structure of a protein is approximated using the coordinates of the alpha-carbon atoms of the main chain. This is represented using a distance matrix, which is then treated as a molecular graph. This method claims to be very flexible as the motif searched for can be a set of disconnected amino acid residues. Holm and Sander (1993) also compare protein structures by an alignment of C alpha to C alpha distance matrices.

The Use of Other Methods

    Methods have also been published that move away from the graph representation. Vriend and Sander (1991) report a fully automatic algorithm for detecting the substructures that two 3D protein structures have in common. The structures are aligned and common substructures and structural repeats are found using three steps: A diagonal plot is created from the superposition of local sequence structure fragments, a cluster analysis is performed on pairs of these fragments to identify larger structural units, and finally the set of equivalent residues are realigned and checked. Work was done much earlier than this, though, on comparing supersecondary structures in proteins, for example, Rao and Rossmann (1973) used rotation matrices and translation vectors to superimpose two structures to achieve a minimum sum of the squared distances between all ‘equivalent’ atoms.

    Orengo and Taylor (1993) report an approach for locally aligning 3D protein structure motifs which uses a modified double dynamic programming method originally designed to align global protein structures. It has been used to find the most similar substructures within a group of proteins to help in the matching of substructures to amino acid sequence (Taylor and Orengo (1989), see Protein Structure Evolution). The Taylor and Orengo (1989) method for structure alignment has been developed further to allow for similar whole structures to be searched for in the PDB (Orengo et al. (1992)). The technique uses secondary structure alignment followed by residue matching to compare protein structures. Only structures with similar secondary structures are analysed by residue alignment thereby filtering out structures that are definitely not similar, and increasing the speed of the search.

Examples of the Use of 3D Protein Searching Methods

    The first application of POSSUM to identify common folding motifs between proteins is reported in Artymiuk et al. (1990). The study searched the Protein Data Bank (PDB) for structural analogues of the CheY bacterial chemotaxis protein, and established a strong tertiary fold resemblance between the Salmonella typhimurium CheY chemotaxis protein and the GDP-binding domain of Escherichia coli elongation factor Tu.

    The study of the occurrence characteristics of ß-sheet motifs in the PDB, carried out by Ujah (1992), is reported as part of Artymiuk et al. (1994a), and described here. PROTEP was used to search for all possible ß-motifs containing three to 15 strands in the PDB. The study concluded that only just over 0.5 % of hypothetically possible ß-motifs actually occurred in the PDB, and certain ß-motifs appear to be favoured and disfavoured. The form of the connections between adjacent strands in the retrieved motifs were also assigned using Richardson’s (1977) notation, and many new connection motifs not seen by Richardson (1977) were identified. Richardson (1977) investigated the topology of ß-sheets in known protein structures and found that particular topologies occurred more frequently than they would by chance.

    PROTEP and POSSUM were used by Artymiuk et al. (1993) to discover the 3D structural resemblance between the ribonuclease H and connection domains of HIV reverse transcriptase and the ATPase fold. The PROTE program was also used to discover the 3D structural resemblance between leucine aminopeptidase and carboxypeptidase A (Artymiuk et al. (1992b)).

    Vriend and Sander (1991) report the discovery of an unexpected structural similarity between ubiquitin and ferredoxin, using their method as described above, even though ubiquitin and ferredoxin have no functional similarity.


Title Page    Next section


Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study.
MSc in Information Management, 1998/1999
Electronic Dissertations Library
© University of Sheffield - Department of Information Studies (All Rights Reserved)