Electronic Dissertations Library

Identification of ß-sheet motifs in three-dimensional protein structures, using a subgraph isomorphism algorithm: an update of a 1992 study, by Ruth V. Spriggs

Analysis of Results

    This analysis will study the ß-sheet motifs found in the PDB subset, comparing them with the theoretically possible motifs and with the results of the 1992 study. The actual motifs found in the protein structures will be compared to all those hypothetically possible to identify frequently occurring and unused motifs. The analysis will primarily determine if the trends identified in 1992 are still valid now that the PDB is larger and therefore likely to be more representative of the variety of ß-motifs found in nature.

Data Preparation

    The raw data files retrieved from the searches were in two formats, as described in the previous section. The searches performed by enumerating all the possible motifs and searching for each individually produced results (3str.log, 4str.log, 5str.log, 6str.log, 7str.log, 8str.log, 9str.log, and 10str.log) that could be transferred directly to data tables one to 10. The other files (st11.lp, st12.lp, st13.lp, st14.lp, and st15.lp) were produced by the non-enumerated searches, so the actual motifs found were not immediately evident. Analysing these files required four programs that I wrote for the purpose: 15aga.c, 15tif.c, 15mobi.c, and larg5.c, each modified slightly for the number of strands being analysed (these programs can be viewed and are explained further in appendix A). Briefly, these programs converted the angles between strands into a list of 'parallel' and 'antiparallel', and then into the binary notation. An angle of -180 degrees to -90 degrees or 90 degrees to 180 degrees was taken to indicate an antiparallel arrangement, whereas an angle of -90 degrees to 90 degrees indicated a parallel arrangement.
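
    The angle rule described above can be sketched as follows. This is a minimal illustration in Python, not the code of the C programs listed in appendix A. The function name classify_angle is hypothetical, and treating an angle of exactly 90 or -90 degrees as antiparallel is an assumption, since the two ranges given above share those boundary values.

```python
def classify_angle(angle_deg: float) -> str:
    """Label the relative orientation of two strands from the angle between them.

    Assumption: angles of exactly +/-90 degrees are labelled antiparallel;
    the text does not say which range the boundary belongs to.
    """
    if not -180.0 <= angle_deg <= 180.0:
        raise ValueError("angle must lie between -180 and 180 degrees")
    return "parallel" if -90.0 < angle_deg < 90.0 else "antiparallel"


# Made-up angles for illustration, not data from the study.
angles = [12.5, -170.0, 95.0, -40.0]
print([classify_angle(a) for a in angles])
```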

    The two different sets of data were then combined to produce tables one to 10. It became evident that not all duplicates of patterns had been removed by the programs, but these were straightforward to remove at this stage. The pattern with the highest number was kept as the identifier for each motif. For example, 101101010110, 011010101101, 100101010010 and 010010101001 are all the same motif, which would be identified by the first pattern shown, as this is the highest number. The table heading 'Number of unique motifs possible' in table one therefore means the number of motifs once all these duplicates caused by inversion and notation are removed.
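
    The choice of identifier can be illustrated with a short sketch. It assumes, from the four equivalent patterns quoted above, that the duplicates of a pattern are obtained by reading it backwards, by swapping every 1 and 0, or by both. The function names are hypothetical and this is not the code used in the study.

```python
def equivalent_patterns(pattern: str) -> set:
    """Return the pattern together with its assumed duplicates.

    Assumption: duplicates arise from reversal, from swapping 1s and 0s,
    or from both, as the example in the text suggests.
    """
    complement = pattern.translate(str.maketrans("01", "10"))
    return {pattern, pattern[::-1], complement, complement[::-1]}


def motif_identifier(pattern: str) -> str:
    """Pick the duplicate with the highest value as the identifier.

    All duplicates have the same length, so the lexicographic maximum
    is also the largest binary number.
    """
    return max(equivalent_patterns(pattern))


print(motif_identifier("010010101001"))
```

    Applied to any of the four patterns quoted above, the sketch returns 101101010110.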

Factors Which May Affect the Validity of Results

Occurrence vs. Incidence

    The occurrence of a motif is its total frequency, whereas the incidence is the number of protein files these occurrences are found in. Often, especially with the lower strand numbers, the occurrence of a motif is larger than the incidence. This is explained by proteins containing more than one occurrence of a motif, or the same motif traced backwards, but also by the ß-barrels and split strands described below. This problem was investigated by viewing some of the results using the FRODO graphics package (Jones (1985)). The output from POSSUM can be requested in a form compatible with the FRODO graphics program, allowing protein structures to be viewed with the identified motif highlighted. By comparing the motifs found in a protein, and using the residue numbers at the end of each strand to identify individual strands, it could be seen that several matches were in fact the same match but with a split strand or traced backwards. The number of incidences of each motif found will be used in this analysis rather than the frequency of occurrence, even though there will be some occasions when a motif really does occur twice in one protein.
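
    The distinction between the two counts can be shown with a small sketch. The input format (one protein identifier and one motif per retrieved match) and the function name are assumptions made for illustration; the sample matches are invented and are not results from the study.

```python
from collections import defaultdict


def occurrence_and_incidence(matches):
    """Return {motif: (occurrence, incidence)}.

    matches is an iterable of (protein_id, motif) pairs, one per match.
    Occurrence counts every match; incidence counts distinct proteins.
    """
    occurrence = defaultdict(int)
    proteins = defaultdict(set)
    for protein_id, motif in matches:
        occurrence[motif] += 1
        proteins[motif].add(protein_id)
    return {motif: (occurrence[motif], len(proteins[motif])) for motif in occurrence}


# Invented matches: prot_A yields the same motif twice (e.g. a split strand).
matches = [("prot_A", "101"), ("prot_A", "101"), ("prot_B", "101"), ("prot_B", "111")]
print(occurrence_and_incidence(matches))
```

    In this invented example the motif 101 has an occurrence of three but an incidence of two, because two of its matches come from the same protein.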

Split Strands

    If a strand within a sheet is bent to a certain extent the Kabsch-Sander algorithm will assign axes such that it is represented as two strands in succession. The same motif will then be retrieved twice by POSSUM, once with each half of the bent strand. This problem of split strands can cause two matches to be made in a protein structure where in reality there is only one. This can create inflated figures for the occurrence of a motif, but as this analysis is using the number of proteins in which the motif is found, rather than the number of matches overall, this problem should not affect any conclusions drawn. No attempt was made to remove split stranded motifs from the study, primarily because the 1992 study did not remove them, and hence comparisons can be made.

ß-Barrels

    ß-barrels are created when a sheet is curved to such an extent that the first strand is hydrogen bonded to the last strand, creating a cylinder. ß-barrels create many matches for one protein because their cylindrical nature means that a sheet can start anywhere in the barrel and go in any direction, potentially creating twice as many matches as there are strands in the sheet. These matches may all be of the same motif type or may vary as the starting point for the sheet changes. As this study is looking at the number of incidences of each motif, rather than the occurrence, the presence of ß-barrels should not adversely affect the analysis. The study by Stoyell (1998), into ß-sheet topology, removed the structures containing ß-barrels before the analysis, but ß-barrels were not removed from the current study, again primarily to maintain comparability to Ujah (1992).

Search Order

    Motifs with fewer strands are retrieved in greater numbers than motifs with more strands. This was expected, not least because a motif with fewer strands will frequently be a section of a motif with more strands. This problem could have been removed from the study had the searches been performed in order of decreasing strand number and the matches found at each search excluded from the later searches, as discussed in Suggestions for Improvements to the Current Study in the Discussions and Conclusions chapter. The trends discovered in the 1992 study, however, were deduced from searches not performed in this way, and therefore this study, in order to be comparable, was performed in a similar way to the 1992 study. It should be kept in mind, though, when drawing conclusions from the analysis below, that the numbers of proteins retrieved containing the smaller motifs could be misleading.

Analysis

Observations
  • All theoretically possible motifs are found in protein structures in the PDB subset for sheets containing up to and including six strands. 94 % of seven stranded sheet motifs are found, but then the percentage of motifs retrieved drops steadily to 30 % of 10 stranded, and drops steeply to 0.17 % of 15 stranded (Table 1).


  • The number of different motifs found for each strand number increases with strand number to a peak of 83 for 10 stranded sheets, before decreasing steadily as the number of strands continues to rise, despite the number of possible motifs continuing to increase (Table 1).
  • For all ß-sheets found, the motif composed of entirely antiparallel strands is the most frequently found motif with that number of strands (Table 3), i.e. the mode is consistently the 'all antiparallel' motif.
  • While the 'all antiparallel' is the most frequently retrieved motif for three to 15 stranded sheets, 'all parallel' motifs are also very frequently found for five to 10 stranded sheets (Table 3).
  • Only 2.2 % of all possible three to 15 stranded motifs are found in the PDB subset (Table 1).
  • Smaller motifs are retrieved from a greater number of proteins than larger motifs (Table 2).
  • The number of unique motifs possible for each number stranded sheet never exceeds the number of proteins in the dataset; however, the total number of possible motifs is 18412, which is more than the number of proteins in the dataset (i.e. 1106) (Table 1).

    These general observations appear to suggest that there may be a limited number of ß-motifs that actually occur in protein structures, although the absence of some motifs will, of course, be due to the small number of proteins whose structures appear in the PDB. The statistical analysis below will test this suggestion and will also investigate if the formation of particular motifs is favoured over the formation of others.

Statistical Analysis

    The collection of quantitative data for this study allows conclusions to be backed by statistical analysis. The data is on a nominal scale, with sets of categories and a frequency of observations within each category; non-parametric statistical tests can therefore be used, in particular the Chi2 test. The Chi2 test is an inferential statistical test that allows the conclusions to be extrapolated to the entire PDB population.

    The Chi2 test is a form of hypothesis testing where an objective decision can be made about whether a hypothesis is confirmed by the data, and thus whether to accept or reject that hypothesis. Generally, a hypothesis is generated, data is collected, the test is applied and the hypothesis is retained, refined or rejected. A null hypothesis (H0) is formulated, which is a hypothesis of no difference and is usually stated in order to be rejected. If the null hypothesis can be rejected then an alternative hypothesis (H1) can be accepted. The null hypothesis can also be the one that you are trying to prove. Whether the null hypothesis is accepted or rejected is decided at a 'level of significance'. The Chi2 value obtained is compared to values available in tables at various levels of significance. The level of significance is the probability that the null hypothesis will be mistakenly rejected when it is in fact true.

    The Chi2 test can be used as a test of association between sets of data, and is used in this way below to test for a difference between the pattern of data collected in 1999 and that collected in 1992. Alternatively, it can be used to test 'goodness of fit', that is, to test whether the number of observations in each category is as would be expected if the null hypothesis were true. Both forms of the test require that the expected values can be calculated. If there is no model for calculating expected values then a contingency table is used, where the pattern of values observed is used to calculate expected values (see the table below). In the majority of the analyses below, expected values are calculated using the number of each type of motif possible, the total number of incidences observed, and by assuming that all motifs are equally likely to be formed. This model is explained in more detail below at the first example of its use.

The contingency table formula:

           A      B    Total
1992     966   1025     1991
1999     265    341      606
Total   1231   1366     2597

The expected figure at each position can be worked out from the observed figures in the table by
(Row total / Overall total) x Column total,
e.g. the expected figure for A in 1992 would be (1991/2597) x 1231 = 944.
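
The same calculation can be applied to every cell at once. The following Python sketch assumes the observed figures are held as a list of rows without the totals; the function name expected_counts is hypothetical.

```python
def expected_counts(observed):
    """Expected value for each cell: (row total / overall total) x column total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    return [[r * c / overall for c in col_totals] for r in row_totals]


# Observed figures from the example table (rows: 1992, 1999; columns: A, B).
observed = [[966, 1025], [265, 341]]
expected = expected_counts(observed)
print(round(expected[0][0]))  # expected figure for A in 1992, i.e. 944
```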

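    The Chi2 value obtained is compared to the table of values for the appropriate number of degrees of freedom, as well as at the chosen level of significance. For a contingency table, the number of degrees of freedom is calculated by the formula:
Degrees of freedom = (columns - 1)(rows - 1).
For the goodness of fit tests below, the number of degrees of freedom is the number of categories minus one, for example 8 degrees of freedom for nine categories.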

The formula for calculating Chi2 is:

Chi2 = Sum of ((Oi - Ei)^2 / Ei)

where Oi is the observed frequency of category i and Ei is the expected frequency of category i. However, if the number of degrees of freedom is 1 then Yates' correction should be used (Cohen (1988)), which involves subtracting 0.5 from the 'Oi-Ei' figure, i.e.:

Chi2 = Sum of ((Oi - Ei - 0.5)^2 / Ei)
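
The two formulas can be written as a short Python sketch, assuming the observed and expected frequencies are supplied as equal-length lists; the function names are hypothetical. The first function applies the correction exactly as written above, subtracting 0.5 from the signed difference. The second shows the form in which Yates' correction is more usually stated, where 0.5 is subtracted from the absolute difference |Oi - Ei|. The two forms give slightly different values.

```python
def chi_squared(observed, expected, correction=0.0):
    """Sum of (O - E - correction)^2 / E, following the formulas as written."""
    return sum((o - e - correction) ** 2 / e for o, e in zip(observed, expected))


def chi_squared_yates_abs(observed, expected):
    """Usual statement of Yates' correction: (|O - E| - 0.5)^2 / E."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


# Figures from the 1992 vs 1999 comparison later in this section
# (2 degrees of freedom, so no correction; expected values as rounded there).
obs = [148, 248, 436, 1155, 2064, 4040]
exp = [134, 238, 460, 1169, 2074, 4016]
print(round(chi_squared(obs, exp), 3))

# Figures from the first 'three and three' test later in this section (1 degree of freedom).
print(round(chi_squared([23, 1619], [290, 2097], correction=0.5), 1))
print(round(chi_squared_yates_abs([23, 1619], [290, 2097]), 1))
```

With the correction applied as written, the sketch gives 3.494 and 355.9 for these two sets of figures, matching the values reported below; the absolute-value form gives about 353.6 for the second set.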

The Chi2 tests below use the figures displayed in tables one to 10.

Is the general pattern of results obtained in 1999 comparable to 1992?

H0 = No difference between the distributions of motifs found in 1992 and 1999.
Figures for all three to 15 stranded motifs retrieved:
Observed:
         Totally parallel   Totally antiparallel   Mixed   Total
1992                  148                    248     436     832
1999                 1155                   2064    4040    7259
Total                1303                   2312    4476    8091

Expected: (calculated using contingency table)
         Totally parallel   Totally antiparallel   Mixed
1992                  134                    238     460
1999                 1169                   2074    4016

Chi2 = 3.494
From Chi2 tables:
Chi2 10% (2) = 4.605
Chi2 95% (2) = 0.103
(figure in brackets indicates degrees of freedom)

The calculated value (3.494) is lower than the critical value at the 10% level of significance (4.605), so H0 cannot be rejected.
This means that the distributions of three to 15 stranded motifs found in 1992 and 1999 do not differ significantly from each other, suggesting that any trends in motif incidence in 1992 are likely to have continued in the 1999 subset.

Is the formation of motifs biased towards 'all parallel' and 'all antiparallel'?

Given the result above that the distribution of motifs is similar in 1992 and 1999, the 1999 observed figures are tested against the expected figures.
H0 = Formation of motifs is unbiased.
Figures for all three to 15 stranded motifs retrieved:
           Totally parallel   Totally antiparallel   Mixed
Observed               1155                   2064    4040
Expected                  5                      5    7249

The expected values were calculated by multiplying the fraction of possible motifs that fall into each category by the total number of incidences of all motifs. For example, there are 13 totally parallel motifs (one for each strand number) and 18412 possible motifs in total. The total number of incidences is 7259, therefore multiplying 13/18412 by 7259 gives a total of 5 expected incidences (to the nearest whole number).
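
This model can be sketched in a few lines of Python. The function name is hypothetical. The count of 13 totally antiparallel motifs is an assumption, inferred from the expected value of 5 shown in the table; the counts of 13 totally parallel motifs, 18412 possible motifs and 7259 incidences are those quoted above.

```python
def expected_incidences(possible_by_category, total_incidences):
    """Share out the incidences in proportion to the number of possible motifs,
    i.e. assuming every possible motif is equally likely to be formed."""
    total_possible = sum(possible_by_category.values())
    return {
        category: total_incidences * count / total_possible
        for category, count in possible_by_category.items()
    }


possible = {
    "totally parallel": 13,
    "totally antiparallel": 13,   # assumption: one per strand number, as for parallel
    "mixed": 18412 - 26,          # all remaining possible motifs
}
expected = expected_incidences(possible, 7259)
print({category: round(value) for category, value in expected.items()})
```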

Chi2 = 1113816.7

Chi2 1% (2) = 9.210
H0 can, therefore, be rejected. The formation of three to 15 stranded motifs, in the 1999 study, is biased. More specifically, formation appears to be biased towards 'all parallel' / 'all antiparallel' motifs over the formation of the other possible motifs.

Is the formation of motifs in 1999 biased in relation to an Inversion Coefficient?

The number of strand inversions occurring in each possible motif was counted, where an inversion can be defined as a junction between a 1 and a 0. For example, 11101011 has four inversions. To enable comparison across the strand numbers, an inversion coefficient was devised where:
Inversion coefficient = Number of inversions / Number of strands
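
As an illustration, the definition can be sketched in Python, assuming each motif is held as a string of 1s and 0s with one character per strand; the function names are hypothetical.

```python
def count_inversions(pattern: str) -> int:
    """Count the junctions between a 1 and a 0 in neighbouring positions."""
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a != b)


def inversion_coefficient(pattern: str) -> float:
    """Number of inversions divided by the number of strands."""
    return count_inversions(pattern) / len(pattern)


print(count_inversions("11101011"))                   # the example above: 4
print(round(inversion_coefficient("11101011"), 2))
print(round(inversion_coefficient("101"), 2))         # three stranded 'all antiparallel'
print(round(inversion_coefficient("1010101010"), 2))  # 10 stranded 'all antiparallel'
```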

H0 = Formation of motifs is unbiased.
Figures for three to 10 stranded motifs:
(Figures for 11 to 15 stranded motifs could not be included. These motifs had not been fully enumerated by the type of program used to create the query files for the three to 10 stranded searches, and therefore the expected values could not be calculated.)
Inversion coefficient (2 d.p.)   Observed   Expected
0-0.1                                1154        164
0.11-0.2                              448        454
0.21-0.3                              438       1047
0.31-0.4                              942       1577
0.41-0.5                              916       1767
0.51-0.6                              496       1073
0.61-0.7                             1017        644
0.71-0.8                             1131        227
0.81-0.9                              468         63

Chi2 = 13726

Chi2 1% (8) = 20.090
H0 can, therefore, be rejected and the conclusion drawn that the formation of motifs is biased.

    The inversion coefficient is a measure of how many times the direction of the strands in a sheet changes, taking into account the number of strands present. If a greater number of inversions is taken to imply greater complexity, a low value for the inversion coefficient is indicative of a simple motif and a high value of a more complex motif. The above calculation and the actual figures in the table suggest that formation is biased towards the very simple (inversion coefficient = 0-0.1) motif and towards the motifs at the complex end of the scale (inversion coefficient = 0.61-0.9). The bias at the simple end can be explained by the bias towards 'all parallel' sheets seen above, where the inversion coefficient is 0. Similarly, the higher numbers can be explained by the bias towards 'all antiparallel' sheets, where the inversion coefficient ranges from 0.67 (in three stranded sheets) to 0.9 (in 10 stranded sheets).

Is the formation of particular types of motif disfavoured or favoured in 1999?

    The Ujah (1992) study suggested that there were two disfavoured types of ß-sheet motif that were formed less frequently than other motifs. These were 'a parallel strand unit antiparallel to another parallel strand unit', for example, 11110000, and 'a three or more parallel strand unit adjacent to a three or more antiparallel strand unit', for example, 11110101. The following tests are designed to test whether the trends that Ujah noted in 1992 are still present, and are statistically significant, today. Units of three and four strands were used, for analysing both disfavoured motifs, to ensure that the units were small enough to be a part of a reasonable number of retrieved motifs.
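
    One way of checking whether a retrieved motif contains such a unit is sketched below. How the study itself counted these units is not specified here, so this is only an illustration: it assumes motifs are held as strings of 1s and 0s, and that a unit also counts when it appears reversed or with every 1 and 0 swapped. The function names are hypothetical.

```python
def unit_variants(unit: str) -> set:
    """The unit plus its reversed and 1/0-swapped forms (an assumption)."""
    complement = unit.translate(str.maketrans("01", "10"))
    return {unit, unit[::-1], complement, complement[::-1]}


def contains_unit(motif: str, unit: str) -> bool:
    """True if any variant of the unit appears as a run of adjacent strands."""
    return any(variant in motif for variant in unit_variants(unit))


print(contains_unit("1011110000", "11110000"))  # parallel unit antiparallel to parallel unit
print(contains_unit("1010101010", "111000"))    # 'all antiparallel' motif has no such unit
```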

H0 = Formation is unbiased, for all eight tests below.
The number of strands included in the calculations is determined by the fact that those over 10 strands have not been enumerated thus preventing the calculation of expected values, and by the minimum number of strands required to accommodate the motif being tested for.
The expected values are calculated, as described above, by multiplying the total number of incidences by the probability of a motif being the type of motif being investigated if formation were random.


The disfavoured motifs as 'three and three' units, i.e. 111000, 111010, and 111101:

Figures for six to 10 stranded motifs:
            '111000'   Other mixed
Observed          23          1619
Expected         290          2097

Chi2 = 355.9
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-motif formation appears to be biased away from the formation of the three parallel strand unit antiparallel to another three parallel strand unit motif.

            '111010/111101'   Other mixed
Observed                427          1215
Expected                851          1535

Chi2 = 278.7
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-motif formation appears to also be biased away from the formation of the three parallel strand unit adjacent to a three antiparallel strand unit motif.


The disfavoured motifs as 'four and four' units, i.e. 11110000, 11110101, and 11111010:

Figures for eight to 10 stranded motifs:
            '11110000'   Other mixed
Observed             6           620
Expected            15           781

Chi2 = 39.4
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-motif formation appears to be biased away from the formation of the four parallel strand unit antiparallel to another four parallel strand unit motif.

            '11110101/11111010'   Other mixed
Observed                     20           606
Expected                     49           748

Chi2 = 44.9
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-sheet formation is also biased away from the formation of the four parallel strand unit adjacent to a four antiparallel strand unit motif.


The disfavoured motifs as 'three and four' units, i.e. 1110000, 1110101/1111010, and 1111010/1111101:

Figures for seven to 10 stranded motifs:
            '1110000/1111000'   Other mixed
Observed                   16          1435
Expected                  118          1334

Chi2 = 96.6
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-motif formation appears to be biased away from the formation of the three stranded parallel unit antiparallel to a four stranded parallel unit motif.

            '1110101/1111010'   Other mixed
Observed                   99          1352
Expected                  225          1227

Chi2 = 83.8
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-motif formation also appears to be biased away from the formation of the three stranded parallel unit adjacent to a four stranded antiparallel unit / four stranded parallel unit adjacent to a three stranded antiparallel unit motif.


The disfavoured motifs as 'three or four' units, i.e. 111000, 111010, and 111101; 11110000, 11110101, and 11111010; and 1110000, 1110101/1111010, and 1111010/1111101:

Figures for eight to 10 stranded motifs:
            '111000/11110000/1110000'   Other mixed
Observed                           34           773
Expected                          190           618

Chi2 = 167.5
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-motif formation appears to be biased away from the formation of the three or four stranded parallel unit antiparallel to a three or four stranded parallel unit motif.

            '111010/11110101/1110101'   Other mixed
Observed                          263           544
Expected                          480           328

Chi2 = 240.1
Chi2 1% (1) = 6.635
H0 can, therefore, be rejected.
ß-motif formation is also biased away from the formation of the three or four stranded parallel unit adjacent to a three or four stranded antiparallel unit motif.

    The eight tests above indicate that the formation of both 'a three or four parallel strand unit antiparallel to another three or four parallel strand unit' and 'a three or four parallel strand unit adjacent to a three or four antiparallel strand unit' is disfavoured. This conclusion can be made because the Chi2 test indicates that the difference between the observed and expected values is statistically significant, and the actual values in the tables show the direction of the bias. Which type of motif is the more disfavoured, i.e. shows the greater bias, varies with the number of strands in the units. In the three and three unit motifs, the parallel antiparallel to parallel appears more disfavoured, whereas in the four and four unit motifs, the parallel adjacent to antiparallel appears more disfavoured. In the three and four unit motifs, the parallel antiparallel to parallel is the more disfavoured. The final two tests suggest that, overall, the three or four strand parallel unit adjacent to a three or four strand antiparallel unit is more disfavoured than the three or four strand parallel unit antiparallel to a three or four strand parallel unit.

    These tests were performed on these particular units of motif because the trend towards their being disfavoured was suggested in the Ujah (1992) study. There may be other types of motif whose formation is disfavoured, and there must also be, to balance the formation of mixed motifs, types of motif that are favoured. To try to answer these questions, and also to analyse further the extent to which entirely parallel and antiparallel units control sheet formation, the bias in formation of three, four and five stranded 'all parallel' and 'all antiparallel' units is investigated below.
H0 = Formation is unbiased, for all tests below.

Three stranded 'all antiparallel' units of motif, i.e. 101 / 010:

Figures for three to 10 stranded motifs:
            101 / 010   Other motifs
Observed         3789           3227
Expected         5287           1729

Chi2 = 1721.7

Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of three stranded antiparallel units.


Three stranded 'all parallel' units of motif, i.e. 111:

Figures for three to 10 stranded motifs:
             111   Other motifs
Observed    2430           4589
Expected    4189           2827

Chi2 = 1836.6

Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of three stranded parallel units.


Four stranded 'all antiparallel' units of motif, i.e. 1010 / 0101:

Figures for four to 10 stranded motifs:
            1010/0101   Other motifs
Observed         2350           3056
Expected         2209           3197

Chi2 = 15.2

Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of four stranded antiparallel units, but the bias is not as great as for three stranded antiparallel units.


Four stranded 'all parallel' units of motif, i.e. 1111:

Figures for four to 10 stranded motifs:
            1111   Other motifs
Observed    1510           3896
Expected    1652           3754

Chi2 = 17.6

Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased away from the formation of four stranded parallel units, but the bias is not as great as for three stranded parallel units.


Five stranded 'all antiparallel' units of motif, i.e. 10101/01010:

Figures for five to 10 stranded motifs:
            10101/01010   Other motifs
Observed           1227           2525
Expected            700           3052

Chi2 = 487.2

Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, but the formation of motifs in this case appears to be biased towards the formation of five stranded antiparallel units.


Five stranded 'all parallel' units of motif, i.e. 11111:

Figures for five to 10 stranded motifs:
            11111   Other motifs
Observed      847           2905
Expected      535           3217

Chi2 = 211.7

Chi2 1% (1) = 6.635
H0 can, therefore, be rejected, and the formation of motifs appears to be biased towards the formation of five stranded parallel units.

    The six tests above indicate that the formation of purely parallel and purely antiparallel units of motif is disfavoured if the units are three stranded, less disfavoured if the units are four stranded, and becomes favoured when the units are five stranded. One explanation of this may be that as the units of motif get longer they become more stable, and therefore, even if a unit is contained within a larger mixed motif, the motif is made stable enough to favour its formation. The three stranded 'all antiparallel' unit of structure, i.e. 101, was suggested by Ujah (1992) to be a favourable unit of structure, but the above tests show this not to be the case. This unit does, however, become favourable when incorporated into a larger 'all antiparallel' unit of structure.




MSc in Information Management, 1998/1999
© University of Sheffield - Department of Information Studies (All Rights Reserved)