Electronic Dissertations Library

Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study, by Ruth V. Spriggs

Discussion and Conclusions

A subset of the Protein Data Bank was searched for all occurrences of three to 15 stranded ß-sheets. The motifs of the retrieved sheets were analysed, and the observations made, and chi-squared tests performed on the data, are described in the previous chapter. This chapter will bring together the conclusions drawn from these analyses of the results, and will compare the current study, and its results, with the 1992 study. Suggestions for improvements to the current study and for future studies will also be described.

Results of Current (1999) Study
Comparisons to 1992 Study

Methods

Ujah (1992) used a subset of the February 1990 release of the PDB as the dataset to search. This subset was produced by choosing the highest resolution structure for each type of protein present. In contrast, the current study used a subset (PDB_SELECT, 25 June 1999) that was constructed, using sequence similarity, to be representative of the entire PDB. The two subsets therefore differ in potential scope: the PDB_SELECT subset is more likely to reflect the variety of types of protein structure found in the PDB than a subset that is not specially designed to be representative. Conclusions drawn from the current study should, therefore, be more likely to reflect the real situation in the Protein Data Bank, and so allow more realistic predictions to be made about the motif distribution in the entire protein kingdom.

The Ujah (1992) study reduced the February 1990 release of the PDB (554 structures) to 114 proteins, which were then searched. Ujah (1992) describes these 114 proteins as an exhaustive catalogue of the ß-motifs present in the PDB. This study reduced the approximately 9500 protein structures of the PDB (in 1999) to a subset of 1106 protein structures. Therefore, on the assumption that the PDB_SELECT subset was representative of the variety of motifs found in the PDB, the searches in this study also produced an exhaustive catalogue of the ß-motifs present in the PDB. The catalogue produced in this study could be considered more accurate than that produced in 1992, as the PDB_SELECT subset is more likely to be truly representative of the entire PDB.

The 1992 study enumerated all possible motifs for all three to 15 stranded ß-sheets and searched the dataset for each individually, whereas this study enumerated three to 10 stranded motifs, but searched for any 11 to 15 stranded sheet and allocated motifs afterwards.
These are, in effect, exactly the same searches, as all occurrences of three to 15 stranded sheets are retrieved; however, the current study required the additional step of assigning motifs to the retrieved 11 to 15 stranded sheets.

The Ujah (1992) study inspected each protein structure to ascertain what tertiary structure the ß-motif fitted into. This allowed further conclusions to be drawn about motif incidence. Time did not allow this stage of the analysis to be repeated in the current study. Ujah (1992) also produced lists detailing which ß-motifs had been retrieved in each of the 114 proteins analysed. The much larger number of protein structures in the current study, and the limited time available, precluded the production of a similar list in 1999.

Results

This study found many more of the possible motifs than the 1992 study but, as in the 1992 study, fewer and fewer of the possible motifs were found as the number of strands increased. Both studies found all the motifs possible with three, four and five strands, but Ujah (1992) found only 13 of the 20 motifs possible with six strands, whereas this study found them all, and also found 94% of those with seven strands compared to the 47% found in 1992. The current study also found many more of the motifs containing more than seven strands than the 1992 study, but the steep decline, seen in 1992, in the percentage of possible motifs retrieved after eight strands is still visible in 1999. Overall, this study found 2.2% of all possible motifs with three to 15 strands, whereas Ujah (1992) found only 0.53%. This is approximately a fourfold increase in the number of motifs found, but is still a very small percentage of all possible motifs. One possible reason for this increase is that the PDB is much larger than it was in 1992 and, therefore, a greater variety of types of motif is likely to be represented.
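The counts of possible motifs quoted above (for example, 20 motifs for six strands) can be reproduced with a short enumeration. The sketch below is illustrative only and rests on an assumption about the encoding that this chapter does not spell out: each strand is written as 0 or 1 for its direction (as in the motif written 101 later in the chapter), and two strings describe the same motif if one can be obtained from the other by reversing the strand order, inverting every direction, or both. The function names are hypothetical and are not taken from either study.

```python
from itertools import product

# Assumption: swapping every 0 and 1 describes the same sheet viewed the other way up.
_INVERT = str.maketrans("01", "10")


def canonical(motif):
    """Return one representative string for all equivalent writings of a motif."""
    inverted = motif.translate(_INVERT)
    # The four equivalent writings: as given, reversed, inverted, inverted and reversed.
    return min(motif, motif[::-1], inverted, inverted[::-1])


def possible_motifs(n_strands):
    """List the distinct motifs for a sheet with n_strands strands."""
    all_strings = ("".join(bits) for bits in product("01", repeat=n_strands))
    return sorted({canonical(s) for s in all_strings})


def percent_found(n_found, n_strands):
    """Percentage of the possible motifs of a given size that were retrieved."""
    return 100.0 * n_found / len(possible_motifs(n_strands))


if __name__ == "__main__":
    print(len(possible_motifs(6)))   # 20, the six-strand figure quoted in the text
    print(percent_found(13, 6))      # 13 of 20 six-stranded motifs
```

Under this assumed encoding the six-strand count agrees with the figure of 20 given in the text, which suggests, but does not prove, that the enumeration used in the two studies follows the same equivalence.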
This fourfold increase in the coverage of the retrieved motifs could allow trends to be predicted with more confidence than in 1992. The distributions of three to 15 stranded motifs retrieved in 1992 and 1999, in terms of entirely antiparallel and entirely parallel vs. mixed motifs, do not differ significantly from each other. This suggests that any true trends in motif incidence seen in 1992 are likely to have continued in the 1999 subset. Ujah (1992) found that, in general, the smaller motifs were found in larger numbers of proteins than the larger motifs. This is still the case and can, at least partly, be explained by the fact that many of the smaller motifs found are actually sections of larger motifs. In 1992, for sheets containing over eight strands, the number of unique motifs possible exceeded the number of proteins in the dataset. This could have been a factor in the non-discovery of some of the motifs, but in the current study the number of possible motifs for a particular number of strands never exceeds the number of proteins in the dataset. The finding in 1992 that the entirely antiparallel motif was the most frequently occurring motif for three to eight stranded sheets is repeated in 1999, and the trend continues in nine to 15 stranded sheets. Similarly, the 1992 study found that entirely parallel motifs also occurred very frequently in five to eight stranded sheets; this continues to be true in 1999 and is also seen in nine and 10 stranded sheets. The bias towards the formation of entirely parallel and entirely antiparallel motifs in 1999 was tested using the chi-squared test, and the trends predicted by observation were found to be statistically significant.

The 1992 study suggested that there were two disfavoured patterns in ß-sheet formation. These were ‘a parallel strand unit antiparallel to another parallel strand unit’ and ‘a three or more parallel strand unit adjacent to a three or more antiparallel strand unit’.
Statistical tests were carried out on the 1999 data to investigate whether these disfavoured motifs could still be seen and, as described earlier, the trend is still visible. Further tests were then carried out to try to discover the units of motif whose formation is favoured since, to balance the formation of mixed motifs, there must also be types of motif that are favoured. These tests led to the conclusion that the formation of purely parallel and purely antiparallel units of motif is disfavoured if the units are three stranded, less disfavoured if the units are four stranded, and becomes favoured when the units are five stranded. Therefore, two favoured units of motif are five stranded entirely parallel units and five stranded entirely antiparallel units.

Richardson (1977) concluded from her studies of ß-sheet topology that no parallel ß structure occurs in sheets of fewer than five strands. Her explanation for this is that considerable cooperativity is required to stabilise a parallel structure. It could, therefore, be the case that all the entirely parallel motifs of fewer than five strands retrieved in the current study are, in fact, sections of larger parallel sheets, and this would, at least partly, explain why three and four stranded parallel units appear disfavoured. Schulz and Schirmer (1979) agree that parallel sheets with fewer than four strands are rare.

The three stranded ‘all antiparallel’ unit of structure, i.e. 101, was suggested by Ujah (1992) to be a favourable unit of structure, but the above tests show this not to be the case. It appears that this suggestion was made because it was the most frequently found of the three stranded motifs in 1992, and the ‘all antiparallel’ motif continues to be the most frequently seen as the number of strands increases. The unit does, however, become favourable when incorporated into a larger ‘all antiparallel’ unit of structure.
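The comparisons described above rest on chi-squared tests of observed against expected counts. The chapter does not state which form of the test was applied, so the sketch below assumes a Pearson chi-squared test on a contingency table, for instance with one row per study and one column per motif category. The counts in the example are placeholders chosen for illustration and are not data from either study; the function name and table layout are likewise hypothetical.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts.

    `table` is a list of rows; every row must have the same number of columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Standard critical values of the chi-squared distribution at the 5% level.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815}

if __name__ == "__main__":
    # PLACEHOLDER counts only, not taken from the 1992 or 1999 results.
    # Rows: two studies. Columns: entirely parallel or antiparallel, mixed.
    placeholder = [[30, 20], [45, 30]]
    statistic, dof = chi_squared(placeholder)
    significant = statistic > CRITICAL_5_PERCENT[dof]
    print(statistic, dof, significant)
```

A statistic below the critical value for the relevant degrees of freedom corresponds to the "do not differ significantly" outcome reported for the 1992 and 1999 distributions; a statistic above it corresponds to a statistically significant bias.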
One explanation for the way the apparent bias against the formation of entirely parallel and entirely antiparallel units of motif diminishes, and then disappears, as the number of strands in the unit increases may be that longer units of motif are more stable and therefore, even when contained within a larger mixed motif, are stable enough to promote the formation of that mixed motif.

Conclusions and Discussion

More of the hypothetically possible motifs were retrieved in the current study than in 1992 (2.2% compared to 0.53%), but the pattern of retrieval as the number of strands increases was similar. The number of motifs found, though, is still a very small percentage of those possible. The most frequently retrieved motif is the entirely antiparallel motif, with the entirely parallel motif occurring next most often; in line with this, mixed motif formation is disfavoured. Schulz and Schirmer (1979) noted that entirely parallel and entirely antiparallel sheets are seen more often than mixed motifs, but believed that there was no preference between parallel and antiparallel sheets. According to Brandon and Tooze (1999), only 20% of strands found in sheets of known protein structures have parallel bonding on one side and antiparallel on the other. This type of strand is necessary for a mixed motif to form, and therefore the finding in this study that there is a strong bias against the formation of mixed ß-sheets is not unexpected. Two disfavoured mixed patterns in ß-sheet formation are ‘a three or four parallel strand unit antiparallel to another three or four parallel strand unit’ and ‘a three or four parallel strand unit adjacent to a three or four antiparallel strand unit’, and two favoured units of motif are the five stranded entirely parallel and entirely antiparallel units. There will, of course, be other types of motif whose formation is disfavoured or favoured.
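The terms used above (entirely parallel, entirely antiparallel, mixed, and parallel or antiparallel units within a motif) can be made concrete with a small sketch. It assumes, as an illustration only, that a motif is written as a string of 0s and 1s giving each strand's direction, so that 101 is the three stranded all antiparallel motif mentioned earlier. The function names, and the convention that the strand at a boundary is counted in both neighbouring units, are assumptions and may differ from the definitions used in the 1992 and 1999 analyses.

```python
def neighbour_pairs(motif):
    """True for each adjacent pair of strands that run in the same direction."""
    return [a == b for a, b in zip(motif, motif[1:])]


def classify(motif):
    """Label a motif as entirely parallel, entirely antiparallel, or mixed."""
    pairs = neighbour_pairs(motif)
    if all(pairs):
        return "entirely parallel"
    if not any(pairs):
        return "entirely antiparallel"
    return "mixed"


def units(motif):
    """Split a motif into maximal parallel and antiparallel units.

    Returns (kind, number_of_strands) tuples in sheet order. A strand on the
    boundary between two units is counted in both (an assumption of this sketch).
    """
    pairs = neighbour_pairs(motif)
    result = []
    start = 0
    for i in range(1, len(pairs) + 1):
        if i == len(pairs) or pairs[i] != pairs[start]:
            kind = "parallel" if pairs[start] else "antiparallel"
            result.append((kind, i - start + 1))
            start = i
    return result


def mixed_bonding_strands(motif):
    """Count strands with a parallel neighbour on one side and an antiparallel one on the other."""
    pairs = neighbour_pairs(motif)
    return sum(1 for left, right in zip(pairs, pairs[1:]) if left != right)


if __name__ == "__main__":
    print(classify("101"))                 # entirely antiparallel
    print(units("11101"))                  # [('parallel', 3), ('antiparallel', 3)]
    print(mixed_bonding_strands("11101"))  # 1
```

In this encoding the motif 11101 is an example of a three parallel strand unit adjacent to a three antiparallel strand unit, and its middle strand is the single strand with parallel bonding on one side and antiparallel bonding on the other.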
General observations of the data, and the statistical tests carried out, suggest that only a limited number of ß-motifs actually occur in the PDB and, therefore, that only a limited number may appear in nature. This conclusion was also drawn by Ujah (1992) and by Koch et al. (1992), and Ujah (1992) further predicted that the trends would still be seen when the PDB contained more structures. The current investigation has indeed found the same trends as those reported in 1992, and this is further supported by the fact that the distributions of motifs retrieved in 1992 and 1999 do not differ significantly.

Known properties of ß-sheet structures can be used to explain some of the trends seen in the occurrence of particular ß-motifs. Antiparallel strands within a ß-sheet only require small loops of protein chain between these strands (ß-turns) to allow the sheet structure to form, whereas parallel strands require more chain between the strands to cover the distance between the top of one strand and the bottom of the next (crossover connections) (Richardson (1977); Fasman (1989b)). This larger section of chain requires more organisation, for example, in the form of alpha-helices, and therefore requires more stabilisation, which may explain why the entirely antiparallel motif appears more frequently than the entirely parallel motif.

The frequency of incidence of entirely parallel and entirely antiparallel motifs can also be partly explained by the hydrogen bonds that hold the strands of the ß-sheet together. The angle between the hydrogen bond donors and acceptors in two strands in a sheet has an optimum position for stability. These optimum positions are best maintained in entirely parallel and entirely antiparallel sheets, making these more stable than mixed motifs and therefore more likely to form (Ujah (1992)).
Antiparallel strands have hydrogen bonds that are parallel to each other, as the donors and acceptors are lined up opposite each other in the two strands. The strongest hydrogen bonds are formed when the donor, acceptor and hydrogen are lined up in this way (Stryer (1988)). Parallel strands, on the other hand, require that the hydrogen bonds are at angles to each other, as the donors and acceptors are not lined up, and each donor and acceptor is thus bent inwards or outwards (Brandon and Tooze (1999)); parallel sheets therefore have less of a range in their optimal hydrogen bonding position (Salemme and Weatherford (1981); see Figure 3). This again may partly explain why entirely antiparallel motifs are retrieved more frequently than entirely parallel motifs. Salemme and Weatherford (1981) agree that antiparallel ß-sheets have greater conformational freedom than parallel sheets, which allows them to form more diverse ‘spatial configurations’, and therefore to be incorporated into more variable globular protein conformations.

In ‘all-ß’ proteins antiparallel strands are much more frequently observed than parallel strands. This is also the case in ‘alpha+ß’ proteins, but in ‘alpha/ß’ proteins parallel strands appear to occur more frequently (Levitt and Chothia (1976)). This can be explained in terms of the units of frequently occurring secondary structure elements described in the Proteins chapter (Levitt and Chothia (1976)): when the ß-strands are grouped together along the linear sequence of the chain, and not separated by alpha-helices (i.e. in the ‘all-ß’ and ‘alpha+ß’ classes), ßß and ßßß units of strands in interaction occur very often, leading to antiparallel ß-sheets. This, however, is not in agreement with the results of the current study, which suggest that three stranded antiparallel units of structure are not favoured.
The infrequent incidence of the larger motifs may be due to the fact that proteins containing motifs of this size would need to be very large to accommodate this supersecondary structural feature. Very large proteins occur more rarely than smaller ones, and therefore the larger motifs can be expected to be correspondingly rare, and the larger motifs that do occur can also be expected to be of the more favourable types.

The core of most globular proteins contains extensive regions of ß-sheet, and the stable domains implicated in the folding process also have stable cores formed from ß-sheets (Alberts et al. (1994)). This, and the fact that ß-sheets inherently require the close interaction of regions separated in the amino acid sequence (Richardson (1977)), indicates that the detection and study of ß-sheets is important in the study of protein folding (Stryer (1988)), and therefore essential in the development of protein structure prediction methods. The current study may be useful in this respect, as it suggests trends in ß-sheet formation which other proteins are likely to follow.

The current study may also help in research into protein structure evolution; motifs that were not previously known to occur frequently may be found to do so, and whether their presence is due to divergent or convergent evolution can then be investigated. The finding here that relatively few ß-motifs appear to occur naturally may also help in the study of protein evolution by explaining why different sequences can give rise to the same structure. It may become clear that the number of motifs that are stable enough to be energetically feasible is smaller than the number of different possible sequences.
The main aim of this study, as set out in the introduction, was to update a 1992 study and to discover whether a conclusion made by Ujah (1992) still holds: “...the observed distribution of the motifs’ frequencies of occurrence suggest that this will continue to be the case even when very large numbers of three-dimensional protein structures become available” (Artymiuk et al. (1994a:62)). The trends identified in 1992, in the February 1990 release of the PDB, can still be seen in the PDB_SELECT subset in 1999. Trends that were beginning to emerge in 1992 are now more extensive and other trends have continued, suggesting that these patterns in the distribution of motifs formed in protein structures are indicative of the trends seen in the entire PDB, and therefore possibly of the trends found in the entire protein kingdom. The study described above has thus shown that the distribution of motif incidence has continued to show the same trends as were seen in 1992, and therefore that the prediction made by Ujah (1992) has proved correct.

Suggestions for Improvements to the Current Study
Suggestions for Future Study
MSc in Information Management, 1998/1999 Electronic Dissertations Library © University of Sheffield - Department of Information Studies (All Rights Reserved)