Electronic Dissertations Library

Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study, by Ruth V. Spriggs

Methods

Overview

    The study has been approached from a deductive stance, beginning with the hypothesis that the trends identified in 1992 will still be seen in the 1999 data. The results of the data collection methods below are analysed in the Analysis of Results chapter to prove or disprove this hypothesis.

    A database constructed using PDB_SELECT, a representative subset of the Protein Data Bank, was searched for all occurrences of ß-sheets containing between three and 15 strands. Searches for three to 10 stranded ß-sheets were created and executed by query files, which were created by computer programs written by me for that purpose. These query files searched for each possible three to 10 stranded ß-sheet motif individually. When run, the query files created queries using a program called BETPAT and then searched the database for these queries using the POSSUM program.

    Searches for 11 to 15 stranded ß-sheets also used BETPAT and POSSUM, but without specific motifs, and therefore without the need for programs to enumerate all possible patterns. Motifs were assigned to the 11 to 15 stranded sheets after they had been retrieved. A motif, or pattern, is an arrangement of strands running parallel or antiparallel to each other, and can be a complete motif in itself or a section of a larger motif.

The Protein Data Bank and PDB_SELECT

    The Protein Data Bank (PDB) is a computerised depository for the 3D structures of macromolecules, of which the majority are proteins. The PDB was set up at Brookhaven in 1971 to collect, standardise and distribute the 3D structures of protein and nucleic acid molecules resulting from x-ray crystallography or diffraction studies ( Bernstein et al. (1977)). The database also contains bibliographic citations and assignments of primary and secondary structures (Pitchford and Taylor (1998)). The Protein Data Bank is now maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) (RCSB PDB home page: http://www.rcsb.org/pdb/), and can also be found as part of the EBI (European Bioinformatics Institute) web pages at http://www2.ebi.ac.uk/pdb/.

    The entire PDB contained 9593 protein structure entries on 20 July 1999, and is continually growing. The rapid growth is due to, amongst other factors, increased investment in protein crystallography due to the popularity of structure based drug design, new techniques to enable the production of proteins in large quantities, improvements in NMR techniques, and the requirement of some journal editors that authors of published articles send their structures to the PDB (Pitchford and Taylor (1998)). As the PDB does not extract structures from the literature and is, therefore, reliant on researchers depositing their own structures, the last factor is very important.

    PDB_SELECT is a database, maintained by Uwe Hobohm and Chris Sander, that contains lists of structures that are representative subsets of the 3D protein structures held in the Protein Data Bank (PDB). The PDB_SELECT home page is available at http://swift.embl-heidelberg.de/
pdbsel/
. These representative lists are intended to save time and effort (they are redistributed freely provided they are not altered) and are used for, amongst many other examples, the analysis of protein architecture, the development of prediction methods, and model building by modular construction (Hobohm and Sander (1992)).

    The PDB_SELECT subset list used in this study contains 3D structures where no two proteins have more than 25% sequence identity (the lowest percentage identity available through the web site). This list was generated using the pdb_select algorithm (#2), as described in Hobohm et al. (1992), which uses an all-against-all Smith + Waterman sequence alignment to compare the chains. The number of similarities, above the 25% threshold, between chains are counted, and the chain with the highest number of similarities is removed from the list. If several chains have the same number of similarities, the one with best resolution structure is selected to remain in the list. Several other criteria are also used to select chains when resolution can not be used, for example, the percentage content of ‘ALA plus GLY’, and chains with higher release numbers (i.e. the latest structure) and lower ID (e.g. chain A rather than chain B) are selected. This comparison of chains is done iteratively, until only chains with less than 25% sequence identity remain in the list. A 25% identity threshold is used because homologous proteins generally have a similarity higher than 25 % over their entire length and share similar 3D structure (Thonnard (1996)). Alternative methods for creating representative subsets of the PDB are also available, for example, Boberg et al. (1992) developed a technique that used a clustering algorithm based on sequence and secondary structure content, and Fischer et al. (1995) describe a method that also uses clustering, but is based only on structure.

    This non-redundant subset of the PDB was used, rather than the whole database, because searching the entire set of nearly 10000 structures would be time consuming, require extensive computing resources, and many duplicates of proteins would be searched. Searching a subset can also make results more obviously meaningful as repetition is removed, and can make analysis more straightforward. The sample, however, must be representative, to ensure that the scope of the entire database is maintained. The creation of a representative subset of the PDB is possible because it is common for a protein’s structure to appear more than once. The same protein structure can be elucidated and deposited by more than one team of researchers, using different techniques or refinement procedures, and structures can also be included of the protein attached to different ligands or with amino acid substitutions.

    The use of the PDB_SELECT subset could be viewed as a form of stratified sampling, that is, one member of each group is included where, here, members of a group have 25% or above sequence similarity. This means that the data structure (the relative proportions of groups in the population) of the PDB is taken into account and therefore results can be extended to the entire PDB with greater confidence. The PDB however, cannot be thought of as representative of the natural protein kingdom and therefore conclusions may not be extended to the entire protein population with such confidence. This is because there are classes of protein, for example, membrane proteins, for which very few structures are known. It is assumed for this research that the PDB_SELECT subset used reflects the scope of the whole PDB, in terms of the variety of types of protein structure present, and, therefore, that conclusions made about the variety of ß-sheet motifs can be extrapolated to the entire PDB. One potential problem of this study, therefore, could be that this assumption is not valid, however, the use of sequence data to create representative subsets is an accurate method.

    The latest PDB_SELECT list, produced on 25 June 1999 and containing 1106 structures, was downloaded from the PDB_SELECT web site (ftp://ftp.embl-heidelberg.de/pub/databases/ protein_extras/pdb_select/
recent.pdb_select
) and the unique entry codes identifying which protein structures should be included in a database of the subset were extracted using four.c, a program written by me for this purpose. Four.c can be viewed and is explained further in appendix A. The list of PDB entry codes was checked against the PDB structure files held in Sheffield and the 14 missing files were downloaded from the PDB at http://www2.ebi.ac.uk/pdb/. Entries were missing from the Sheffield database because, when new releases are received from the PDB, some entries are deleted if they are considered superfluous, for example if there are already many example structures of a particular protein. A database of the representative subset was then created using the 1106 PDB protein structure files. This database was created at the time of use, to ensure that the list used was the most up to date list available.

The Search

    A binary notation is used to describe the parallel and antiparallel nature of adjacent strands in a sheet. The first strand is assigned 1, and the following strands are assigned 1 or 0 depending on their orientation in relation to the strand before. For example, a sheet of six strands with relative orientations: parallel, parallel, antiparallel, parallel, antiparallel, would be described as 111001 using this notation Figure 8.

    Graph theory is used to represent the three-dimensional protein structures. A graph is “a set of points (vertices) in space which are interconnected by a set of lines (edges)” (Gibbons (1985:1)). Each pair of points (nodes) in a fully connected graph, such as that used here, are linked by an edge. The nodes of the graph are vectors drawn along the major axis of the ß-strands or alpha-helices, and the edges describe the geometry between pairs of these vectors. The edges have three sets of data: the angle between the pair of vectors, the distance between the pair of vectors at their closest point, and the distance between the midpoints of the vectors. The entries in the database are analysed using the algorithm due to Kabsch and Sander (1983) to allocate the secondary structural motifs and the structures identified are stored as labelled graphs. The query pattern is converted to a graph representation by BETPAT. The use of graph representations allows the structures to be compared using a subgraph isomorphism algorithm. The modified Ullmann algorithm is used through the use of a program called POSSUM.

    Only ß-sheets made up of between three and 15 strands are investigated in this study because two stranded sheets are extremely common and only have limited variety, and sheets of over 15 strands are rare and, therefore, are present in only specialised situations. To enable the results to be compared with those of Ujah (1992) the searches were run separately such that all occurrences of each pattern are picked out even if they are part of a larger motif. The drawback of this is that, for example, one incidence of a three stranded pattern may be picked out in all the searches if it is actually part of a 15 stranded sheet. The 1992 study found that smaller motifs occurred more frequently than larger ones, but states that many of these will be subsets of larger motifs (Artymiuk et al. (1994a)). One way of avoiding this problem is explained in the Discussion and Conclusions chapter.

    Each possible motif in sheets containing three to 10 strands was searched for individually, whereas any 11 to 15 stranded sheet was searched for and patterns assigned to the retrieved sheets. This method was employed because 11 to 15 strands allows for a very large number of possible motifs whose enumeration was not time effective for the number of sheets that were expected to be retrieved.

BETPAT

    The formal search queries for input into POSSUM were created using a program called BETPAT. This constructs a query that contains the angles and distances between neighbouring strands, and the distances between next nearest neighbour strands (Stoyell (1998)), for a motif entered in the binary notation. The BETPAT output is a labelled graph represented using a 2D matrix.

    The inter-strand torsion angle is set at -25 degrees for neighbouring parallel strands, the minus denotes that sheets generally twist in an anticlockwise direction, and +155 degrees for neighbouring antiparallel strands. The closest approach distance is set at 4.5 and 9.0 Angstroms for nearest and next nearest neighbour strands respectively. A midpoint distance is also required as two strands are not necessarily in a sheet when the closest approach distance has the right value Figure 9. The order of the strands in the protein chain is set as unimportant.

    Programs, written by me, produced the input needed, to create a query using BETPAT and search using POSSUM, for each of the possible three to 10 stranded motifs. These programs are rvs3c.c, rvs4c.c, rvs5c.c, rvs6c.c, rvs7c.c, rvs8c.c, rvs9c.c, and rvs10c.c (rvs3c.c, rvs4c.c, and rvs7c.c can be viewed, and are explained further, in appendix A). These programs created query files (3str.com, 4str.com, 5str.com, 6str.com, 7str.com, 8str.com, 9str.com, and 10str.com), which typed out the input needed into BETPAT and then POSSUM for each search to be carried out. To create queries for the 11 to 15 stranded sheets BETPAT, and POSSUM, were also used but these five searches were input by hand.

POSSUM

    POSSUM (Protein Online Substructure Searching - Ullmann Method) searches for user defined query motifs, using graph representations, in the 3D structures of the PDB. It uses a subgraph isomorphism algorithm in the form of a modified version of the algorithm due to Ullmann (1976), and is written in FORTRAN 77 (Artymiuk et al. (1992c)). The query graph and a subgraph of a protein structure are said to be isomorphic when there is a one-to-one correspondence between their vertices, and between their edges, with corresponding edges joining corresponding vertices (Ujah (1992)). The Ullmann (1976) algorithm is a backtracking tree-searching algorithm where the leaves of the tree are the terminal vertices of a structure, that is, those with valence one, and the root is the vertex furthest from the leaves (Ujah (1992)). These vertices are examined, by travelling along the edges, by a depth-first search, that is, the search travels down a path from one vertex to the next moving further down the tree towards the leaves until a mismatch or dead-end is reached. The Ullmann algorithm is efficient because it has a refinement procedure that identifies, and eliminates, branches leading to mismatches early on in the matching process (Ullmann (1976)).

    A slightly modified version of an algorithm by Kabsch and Sander (1983) is used to assign the regions of secondary structure to the PDB entries (Mitchell et al. (1989)). The major axis of the secondary structure elements are calculated using the position of the residues that make them up, and then the angles and distances between the axis are calculated. These data are entered in a connection table, or matrix, that is used to represent the protein structure. All the proteins in the PDB are encoded using this procedure and a similar representation is used for the query motif, as described above.

    As well as the user-defined motif encoded by BETPAT, POSSUM requires that the user input tolerances for the values of distances and angles to allow for the inherent twist found in most ß-sheets. The enumerated motif searches (*str.com query files) were assigned angular tolerances of 65 degrees and distance tolerances of 40%. The *str.com query files were run overnight to search POSSUM for all three to 10 stranded ß-sheet motifs. The output from these searches was saved at *str.log (3str.log, 4str.log, 5str.log, 6str.log, 7str.log, 8str.log, 9str.log, and 10str.log), and an edited version of 6str.log can be viewed in appendix B. These searches were successful apart from the final 24 motifs in the 10 strand file. Here BETPAT failed, due to an error in the four digit code required as an identifier for each motif (the procedure for assigning the code broke down). These remaining 24 patterns were therefore searched for individually, so that the rest of the data in the 10str.log file could still be used.

    The 11 to 15 stranded motifs were searched for by entering an all parallel motif of the right strand number and setting the angular tolerance to 180 degrees. This retrieved all possible motifs because the angle tolerance allowed antiparallel strands to be viewed as parallel. These searches were also assigned a 40% distance measurement tolerance. These searches, though, were performed on a modified version of POSSUM which generated output files that additionally included the angles and distances between the strands in the retrieved sheets. The inclusion of these figures allowed motifs to be assigned to the retrieved sheets. The searches for all 11 to 15 stranded sheets produced output files called st*.lp (st11.lp, st12.lp, st13.lp, st14.lp, and st15.lp), and an edited version of st15.lp can be viewed in appendix B.

Search Output

    The two types of search detailed above resulted in two sets of data files: *str.log and st*.lp. *str.log files contain the query input and an output consisting of the number of matches found for each pattern, the number of protein files those matches were found in, and lists of the names of those proteins and how many matches they each contain. st*.lp files contain the protein entry codes and the angles and distances between strands in sheets found in those proteins. One part of the results analysis involves allocating motifs to the sheets in these st*.lp files. In addition, the 24 missing 10 stranded searches each produced their own *.lp file; the data from these was incorporated into the other data in tables one-10.


Title Page    Next section


Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study.
MSc in Information Management, 1998/1999
Electronic Dissertations Library
© University of Sheffield - Department of Information Studies (All Rights Reserved)