Electronic Dissertations Library

Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study, by Ruth V. Spriggs

Analysis of Results

This analysis studies the ß-sheet motifs found in the PDB subset, comparing them with the theoretically possible motifs and with the results of the 1992 study. The motifs actually found in the protein structures are compared with all those hypothetically possible, to identify frequently occurring and unused motifs. The primary aim is to determine whether the trends identified in 1992 are still valid now that the PDB is larger and therefore likely to be more representative of the variety of ß-motifs found in nature.

Data Preparation

The raw data files retrieved from the searches were of two formats, as described in the previous section. The searches performed by enumerating all the possible motifs and searching for each individually produced results (3str.log, 4str.log, 5str.log, 6str.log, 7str.log, 8str.log, 9str.log, and 10str.log) which could simply be transferred to data tables one to 10. The other type of file (st11.lp, st12.lp, st13.lp, st14.lp, and st15.lp) was produced by the non-enumerated searches, and the actual motifs found were therefore not immediately evident. The analysis of these files required four programs written by me for this purpose: 15aga.c, 15tif.c, 15mobi.c, and larg5.c, each modified slightly for the number of strands being analysed (these programs can be viewed and are explained further in appendix A). Briefly, these programs converted the angles between the strands into a list of 'parallel' and 'antiparallel', and then into the binary notation. An angle of -180 degrees to -90 degrees or 90 degrees to 180 degrees was taken to indicate an antiparallel arrangement, whereas an angle of -90 degrees to 90 degrees indicated a parallel arrangement. The two different sets of data were then combined to produce tables one to 10.
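The conversion from angles to binary notation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code of 15aga.c and its companions (those are given in appendix A). It assumes that the angles are those between successive strands, that the first strand is written as 1 and the digit flips at each antiparallel neighbour (so that 111 is all parallel and 101 is all antiparallel), and that an angle of exactly 90 or -90 degrees counts as parallel. The function names are hypothetical.

```python
def classify_angle(angle_deg):
    """Label an inter-strand angle as 'parallel' or 'antiparallel'."""
    if not -180.0 <= angle_deg <= 180.0:
        raise ValueError("angle must lie between -180 and 180 degrees")
    # Assumption: exactly +/-90 degrees is treated as parallel; the text
    # does not say which arrangement the boundary belongs to.
    return "parallel" if -90.0 <= angle_deg <= 90.0 else "antiparallel"


def to_binary_pattern(angles_deg):
    """Turn the angles between successive strands into a 0/1 pattern.

    Assumption: the first strand is written as 1 and the digit flips at
    every antiparallel neighbour.
    """
    digits = [1]
    for angle in angles_deg:
        previous = digits[-1]
        if classify_angle(angle) == "antiparallel":
            digits.append(1 - previous)  # direction reverses
        else:
            digits.append(previous)      # direction is kept
    return "".join(str(d) for d in digits)


if __name__ == "__main__":
    # Two antiparallel junctions give the three stranded pattern 101.
    print(to_binary_pattern([170.0, -160.0]))
```

A sheet of n strands therefore yields n - 1 angles and an n digit pattern.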
It became evident that not all duplicates of patterns had been removed by the programs, but these were straightforward to remove at this stage. The pattern with the highest numerical value was kept as the motif identifier for each motif. For example, 101101010110, 011010101101, 100101010010 and 010010101001 are all the same motif and would be identified by the first pattern shown, as this is the highest number. The table heading 'Number of unique motifs possible' on table one therefore means the number of motifs once all these duplicates caused by inversion and notation are removed.

Factors Which May Affect the Validity of Results

Occurrence vs. Incidence

The occurrence of a motif is its total frequency, whereas the incidence is the number of protein files in which these occurrences are found. Often, especially with the lower strand numbers, the occurrence of a motif is larger than the incidence. This is explained by proteins containing more than one occurrence of a motif, or the same motif traced backwards, but also by the ß-barrels and split strands described below. This problem was investigated by viewing some of the results using the FRODO graphics package (Jones (1985)). The output from POSSUM can be requested in a form compatible with the FRODO graphics program, allowing protein structures to be viewed with the identified motif highlighted. By comparing the motifs found in a protein, and using the residue numbers at the end of each strand to identify individual strands, it could be seen that several matches were in fact the same match but with a split strand or traced backwards. The number of incidences of each motif found will be used in this analysis rather than the frequency of occurrence, even though there will be some occasions when a motif really does occur twice in one protein.

Split Strands

If a strand within a sheet is bent to a certain extent, the Kabsch-Sander algorithm will assign axes such that it is represented as two strands in succession. The same motif will then be retrieved twice by POSSUM, once with each half of the bent strand. Split strands can therefore cause two matches to be made in a protein structure where in reality there is only one. This can inflate the figures for the occurrence of a motif, but as this analysis uses the number of proteins in which the motif is found, rather than the number of matches overall, the problem should not affect any conclusions drawn. No attempt was made to remove split stranded motifs from the study, primarily because the 1992 study did not remove them, and comparisons can therefore still be made.

ß-Barrels

ß-barrels are created when a sheet is curved to such an extent that the first strand is hydrogen bonded to the last strand, creating a cylinder. ß-barrels create many matches for one protein because their cylindrical nature means that a sheet can start anywhere in the barrel and go in either direction, potentially creating twice as many matches as there are strands in the sheet. These matches may all be of the same motif type or may vary as the starting point for the sheet changes. As this study is looking at the number of incidences of each motif, rather than the occurrence, the presence of ß-barrels should not adversely affect the analysis. The study by Stoyell (1998), into ß-sheet topology, removed the structures containing ß-barrels before the analysis, but ß-barrels were not removed from the current study, again primarily to maintain comparability with Ujah (1992).

Search Order

Motifs with fewer strands are retrieved in higher numbers than motifs with more strands. This was expected, not least because the motifs with fewer strands will frequently be sections within a motif with more strands. This problem could have been removed from the study had the searches been performed in order of decreasing strand number and the matches found at each search excluded from the later searches, as discussed in Suggestions for Improvements to the Current Study in the Discussions and Conclusions chapter. The trends discovered in the 1992 study, however, were deduced from searches not performed in this way, and this study, in order to be comparable, was therefore performed in a similar way to the 1992 study. It should be kept in mind, though, when drawing conclusions from the analysis below, that the numbers of proteins retrieved containing the motifs with fewer strands could be misleading.

Analysis

Observations
These general observations appear to suggest that there may be a limited number of ß-motifs that actually occur in protein structures, although the absence of some motifs will, of course, be due to the small number of proteins whose structures appear in the PDB. The statistical analysis below will test this suggestion and will also investigate whether the formation of particular motifs is favoured over the formation of others.

Statistical Analysis

The collection of quantitative data for this study allows conclusions to be backed by statistical analysis. The data is on a nominal scale, with sets of categories and a frequency of observations within each category, so non-parametric statistical tests can be used, in particular the Chi2 test. The Chi2 test is an inferential statistical test that allows the conclusions to be extrapolated to the larger, entire PDB population. It is a form of hypothesis testing in which an objective decision can be made about whether a hypothesis is supported by the data, and thus whether to accept or reject that hypothesis. Generally, a hypothesis is generated, data is collected, the test is applied and the hypothesis is retained, refined or rejected. A null hypothesis (H0) is formulated, which is a hypothesis of no difference and is usually stated with the aim of rejecting it. If the null hypothesis can be rejected then an alternative hypothesis (H1) can be accepted. The null hypothesis can also be the one that you are trying to prove. The decision to accept or reject the null hypothesis is made at a 'level of significance'. The Chi2 value obtained is compared to values available in tables at various levels of significance. The level of significance is the probability that the null hypothesis will be mistakenly rejected when it is in fact true.
The Chi2 test can be used in two ways. It can be used as a test of association between sets of data, and is used in this way below to test for a difference between the pattern of data collected in 1999 and that collected in 1992. It can also be used to test 'goodness of fit', that is, to test whether the number of observations in each category is as would be expected if the null hypothesis were true. Both forms of the test require that the expected values can be calculated. If there is no model for calculating expected values then a contingency table is used, where the pattern of values observed is used to calculate expected values (see figure below). In the majority of the analyses below, expected values are calculated from the number of each type of motif possible and the total number of incidences observed, assuming that all motifs are equally likely to be formed. This model is explained in more detail below at the first example of its use. The contingency table formula:
(Row total/Overall total) x Column total, e.g. the expected figure for A in 1992 would be (1991/2597) x 1231 = 944.
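The two ways of obtaining expected values described above can be sketched as follows. This is an illustrative sketch rather than the procedure actually used to build the tables; the function names are hypothetical, and the only figures used are ones quoted in this chapter (1991, 2597 and 1231 for the contingency table, and 13 possible 'all parallel' motifs out of 18412 with 7259 incidences for the equal-likelihood model).

```python
def contingency_expected(row_total, column_total, overall_total):
    """Expected cell value: (row total / overall total) x column total."""
    return row_total / overall_total * column_total


def equal_likelihood_expected(motifs_in_category, motifs_possible,
                              total_incidences):
    """Expected incidences if every possible motif is equally likely.

    The fraction of possible motifs falling in the category is
    multiplied by the total number of incidences observed.
    """
    return motifs_in_category / motifs_possible * total_incidences


if __name__ == "__main__":
    # Category A in 1992, using the totals quoted in the text.
    print(round(contingency_expected(1991, 1231, 2597)))      # 944
    # 'All parallel' motifs under the equal-likelihood model.
    print(round(equal_likelihood_expected(13, 18412, 7259)))  # 5
```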
The Chi2 value obtained is compared to the table of values for the appropriate number of degrees of freedom, as well as the level of significance. The number of degrees of freedom is calculated by the formula:
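A minimal sketch of the test procedure, under stated assumptions, is given here. It uses the standard Pearson statistic and the standard textbook conventions for degrees of freedom (categories minus one for goodness of fit; rows minus one times columns minus one for a contingency table). These are assumed to be the conventions used in this study, and no continuity correction is applied. The function names are hypothetical; the critical values are those quoted from Chi2 tables in this chapter.

```python
# Critical values quoted in this chapter, keyed by
# (level of significance, degrees of freedom).
CRITICAL_VALUES = {
    (0.10, 2): 4.605,
    (0.01, 1): 6.635,
    (0.01, 2): 9.210,
    (0.01, 8): 20.090,
}


def pearson_chi2(observed, expected):
    """Standard Pearson statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have equal length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def df_goodness_of_fit(n_categories):
    """Degrees of freedom for a goodness-of-fit test (assumed convention)."""
    return n_categories - 1


def df_contingency(n_rows, n_columns):
    """Degrees of freedom for a contingency table (assumed convention)."""
    return (n_rows - 1) * (n_columns - 1)


def reject_h0(chi2_value, level, degrees_of_freedom):
    """H0 is rejected when the statistic exceeds the tabulated value."""
    return chi2_value > CRITICAL_VALUES[(level, degrees_of_freedom)]


if __name__ == "__main__":
    # A statistic of 3.494 at two degrees of freedom lies below the
    # 10% value of 4.605, so H0 is not rejected.
    print(reject_h0(3.494, 0.10, 2))
```

The lookup table is deliberately limited to the values that appear in the text; a fuller implementation would take critical values from a statistical library instead.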
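The formula for calculating Chi2 is, in its standard form: Chi2 = sum over all categories of (O - E)^2 / E, where O is the observed value and E the expected value in each category.

The Chi2 tests below use the figures displayed in tables one to 10.

Is the general pattern of results obtained in 1999 comparable to 1992?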
H0 = No difference between the distributions of motifs found in 1992 and 1999.
Expected: (calculated using contingency table)
Chi2 = 3.494
From Chi2 tables: Chi2 10% (2) = 4.605; Chi2 95% (2) = 0.103 (figure in brackets indicates degrees of freedom)
H0 can therefore be accepted: the calculated value lies below the 10% table value, so H0 cannot be rejected even at the 10% level of significance. This means that the distributions of three to 15 stranded motifs found in 1992 and 1999 do not differ significantly from each other, suggesting that any trends in motif incidence in 1992 are likely to have continued in the 1999 subset.

Is the formation of motifs biased towards 'all parallel' and 'all antiparallel'?
With the result above that the distribution of motifs is similar in 1992 and 1999, the 1999 observed figures are tested against the expected figures.
The expected values were calculated by multiplying the fraction of possible motifs that fall into each category by the total number of incidences of all motifs. For example, there are 13 totally parallel motifs (one for each strand number) and 18412 possible motifs in total. The total number of incidences is 7259, therefore multiplying 13/18412 by 7259 gives a total of 5 expected incidences (to the nearest whole number).

Chi2 = 1113816.7
Chi2 1% (2) = 9.210
H0 can, therefore, be rejected. The formation of three to 15 stranded motifs, in the 1999 study, is biased. More specifically, formation appears to be biased towards 'all parallel' / 'all antiparallel' motifs over the formation of the other possible motifs.

Is the formation of motifs in 1999 biased in relation to an Inversion Coefficient?
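The number of strand inversions occurring in each possible motif was counted, where an inversion is defined as a junction between a 1 and a 0. For example, 11101011 has four inversions. To enable comparison across the strand numbers, an inversion coefficient was devised where: inversion coefficient = number of inversions / number of strands. An 'all antiparallel' three stranded sheet (101), for instance, has two inversions and a coefficient of 2/3 = 0.67.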
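The inversion count and coefficient can be sketched as below. The coefficient is taken here to be the number of inversions divided by the number of strands, which is the definition consistent with the values this chapter quotes for 'all antiparallel' sheets (0.67 for three strands and 0.9 for 10 strands). The function names are hypothetical.

```python
def count_inversions(pattern):
    """Count the junctions between a 1 and a 0 in a binary motif pattern."""
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a != b)


def inversion_coefficient(pattern):
    """Inversions divided by number of strands (assumed definition)."""
    if not pattern:
        raise ValueError("pattern must contain at least one strand")
    return count_inversions(pattern) / len(pattern)


if __name__ == "__main__":
    print(count_inversions("11101011"))              # 4
    print(round(inversion_coefficient("101"), 2))    # 0.67
    print(inversion_coefficient("1010101010"))       # 0.9
    print(inversion_coefficient("11111"))            # 0.0
```

Because a sheet of n strands has at most n - 1 junctions, the coefficient under this definition never reaches 1.0, which is why the 'all antiparallel' sheets fall in the top band rather than at its upper limit.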
Chi2 = 13726
Chi2 1% (8) = 20.090
H0 can, therefore, be rejected and the conclusion drawn that the formation of motifs is biased.

The inversion coefficient is a measure of how many times the direction of the strands in a sheet changes, taking into account the number of strands present. If a greater number of inversions is taken to imply greater complexity, a low value for the inversion coefficient is indicative of a simple motif and a high value of a more complex motif. The above calculation and the actual figures in the table suggest that formation is biased towards the very simple (inversion coefficient = 0-0.1) motif and towards the motifs at the complex end of the scale (inversion coefficient = 0.61-1.0). The bias at the simple end can be explained by the bias towards 'all parallel' sheets seen above, where the inversion coefficient is 0. Similarly, the higher numbers can be explained by the bias towards 'all antiparallel' sheets, where the inversion coefficient ranges from 0.67 (in three stranded sheets) to 0.9 (in 10 stranded sheets).

Is the formation of particular types of motif disfavoured or favoured in 1999?
The Ujah (1992) study made the suggestion that there were two disfavoured types of ß-sheet motif that were formed less frequently than other motifs. These were 'a parallel strand unit antiparallel to another parallel strand unit', for example, 11110000, and 'a three or more parallel strand unit adjacent to a three or more antiparallel strand unit', for example, 11110101. The following tests are designed to test whether the trends that Ujah noted in 1992 are still present, and are statistically significant, today. Units of three and four strands were used, for analysing both disfavourable motifs, to ensure that the motifs were small enough to be a part of a reasonable number of retrieved motifs.
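One way to decide whether a retrieved motif contains one of these units is a substring search over the binary patterns. The sketch below is an assumption about how such a check could be made, not a description of how the figures in the tables were obtained. It treats a pattern, its reversal, its complement and its reversed complement as the same motif, following the duplicate example given under Data Preparation, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("01", "10")


def equivalent_forms(pattern):
    """All writings of one motif: reversed, complemented, or both."""
    complement = pattern.translate(_COMPLEMENT)
    return {pattern, pattern[::-1], complement, complement[::-1]}


def motif_identifier(pattern):
    """The equivalent form with the highest value is the identifier.

    All forms have equal length, so string order matches numeric order.
    """
    return max(equivalent_forms(pattern))


def contains_unit(motif, unit):
    """True if any equivalent writing of `unit` occurs within `motif`."""
    return any(form in motif for form in equivalent_forms(unit))


if __name__ == "__main__":
    print(motif_identifier("010010101001"))
    # 11110000 contains a three parallel strand unit antiparallel to
    # another three parallel strand unit (111000).
    print(contains_unit("11110000", "111000"))
```

Checking every equivalent writing of the unit means that a motif recorded under its identifier is still matched when the unit appears reversed or complemented within it.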
Chi2 = 355.9
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-motif formation appears to be biased away from the formation of the three parallel strand unit antiparallel to another three parallel strand unit motif.

Chi2 = 278.7
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-motif formation appears to also be biased away from the formation of the three parallel strand unit adjacent to a three antiparallel strand unit motif.

The disfavoured motifs as 'four and four' units, i.e. 11110000, 11110101, and 11111010:
Figures for eight to 10 stranded motifs:

Chi2 = 39.4
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-motif formation appears to be biased away from the formation of the four parallel strand unit antiparallel to another four parallel strand unit motif.

Chi2 = 44.9
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-sheet formation is also biased away from the formation of the four parallel strand unit adjacent to a four antiparallel strand unit motif.

The disfavoured motifs as 'three and four' units, i.e. 1110000, 1110101/1111010, and 1111010/1111101:
Figures for seven to 10 stranded motifs:

Chi2 = 96.6
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-motif formation appears to be biased away from the formation of the three stranded parallel unit antiparallel to a four stranded parallel unit motif.

Chi2 = 83.8
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-motif formation also appears to be biased away from the formation of the three stranded parallel unit adjacent to a four stranded antiparallel unit / four stranded parallel unit adjacent to a three stranded antiparallel unit motif.

The disfavoured motifs as 'three or four' units, i.e. 111000, 111010, and 111101; 11110000, 11110101, and 11111010; and 1110000, 1110101/1111010, and 1111010/1111101:
Figures for eight to 10 stranded motifs:

Chi2 = 167.5
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-motif formation appears to be biased away from the formation of the three or four stranded parallel unit antiparallel to a three or four stranded parallel unit motif.
Chi2 = 240.1
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected. ß-motif formation is also biased away from the formation of the three or four stranded parallel unit adjacent to a three or four stranded antiparallel unit motif.

The eight tests above indicate that the formation of both 'a three or four parallel strand unit antiparallel to another three or four parallel strand unit' and 'a three or four parallel strand unit adjacent to a three or four antiparallel strand unit' is disfavoured. This conclusion can be made because the Chi2 test indicates that the difference between the observed and expected values is statistically significant, and the actual values in the table show the direction of the bias. Which type of motif is the more disfavoured (that is, shows the greater bias) differs with the number of strands in the units. In three and three unit motifs, the parallel antiparallel to parallel appears more disfavoured, whereas in the four and four unit motifs, the parallel adjacent to antiparallel appears more disfavoured. In the three and four unit motifs, the parallel antiparallel to parallel is the more disfavoured. The final two tests suggest that, overall, the three or four strand parallel unit adjacent to a three or four strand antiparallel unit is more disfavoured than the three or four strand parallel unit antiparallel to a three or four strand parallel unit.
These tests were performed on these particular units of motif because the trend towards their being disfavoured was suggested in the Ujah (1992) study. There may be other types of motif whose formation is disfavoured, and there must also be, to balance the formation of mixed motifs, types of motif that are favoured. To try and answer these questions, and also to further analyse the extent of the control of entirely parallel and antiparallel units over sheet formation, the bias in formation of three, four and five stranded 'all parallel' and 'all antiparallel' units is investigated below.
Three stranded 'all antiparallel' units of motif, i.e. 101 / 010:
Chi2 = 1721.7
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of three stranded antiparallel units.

Three stranded 'all parallel' units of motif, i.e. 111:
Figures for three to 10 stranded motifs:

Chi2 = 1836.6
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of three stranded parallel units.

Four stranded 'all antiparallel' units of motif, i.e. 1010 / 0101:
Figures for four to 10 stranded motifs:

Chi2 = 15.2
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of four stranded antiparallel units, but the bias is not as great as for three stranded antiparallel units.

Four stranded 'all parallel' units of motif, i.e. 1111:
Figures for four to 10 stranded motifs:

Chi2 = 17.6
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of four stranded parallel units, but the bias is not as great as for three stranded parallel units.

Five stranded 'all antiparallel' units of motif, i.e. 10101 / 01010:
Figures for five to 10 stranded motifs:

Chi2 = 487.2
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, but the formation of motifs in this case appears to be biased towards the formation of five stranded antiparallel units.

Five stranded 'all parallel' units of motif, i.e. 11111:
Figures for five to 10 stranded motifs:
Chi2 = 211.7
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased towards the formation of five stranded parallel units.

The six tests above indicate that the formation of purely parallel and purely antiparallel units of motif is disfavoured if the units are three stranded, less disfavoured if the units are four stranded, and becomes favoured when the units are five stranded. One explanation of this may be that as the units of motif get longer they become more stable, and therefore even if contained within a larger mixed motif, the motif is made stable enough to favour its formation. The three stranded 'all antiparallel' unit of structure, i.e. 101, was suggested by Ujah (1992) to be a favourable unit of structure, but the above tests show this not to be the case. This unit does, however, become favourable when incorporated into a larger 'all antiparallel' unit of structure.

MSc in Information Management, 1998/1999. Electronic Dissertations Library copy; University of Sheffield - Department of Information Studies (All Rights Reserved)