Shape-basis for protein structures

(Guibas; Koehl, Kolodny, Levitt)


We are interested in developing efficient encodings of protein shapes for structural comparisons, decoy generation, etc. We have approached this problem by clustering the shapes of a select subset of protein fragments from the PDB and experimented with different fragment lengths, varying between four and seven residues. The distance or fit between two fragments is evaluated by using their cRMS distance. The goal is to use these shape clusters in order to produce a simple shape library for describing the shapes of all proteins, at a level above that of individual residues, but well below that of secondary structure units such as alpha-helices and beta-strands. To our
surprise, none of the classical clustering methods worked well for this
purpose. We had to invent a novel clustering algorithm we have called
"simulated annealing k-means.'' Repeatedly, after clustering the data using k-means, our algorithm considers a randomized improvement step in which two clusters are merged, and another cluster is split into two (thus conserving the total number of clusters). The two clusters to be merged are selected so that clusters close to each other are chosen with higher probability. Similarly, the cluster to be split is more likely to be chosen among those that are most spread out. This Monte-Carlo process is controlled by a temperature variable, in a standard annealing fashion. We assess the performance of a clustering algorithm by how well the resulting shape library fits fragments from a selected test set of proteins in the PDB (different from the one used in forming the clusters). Our simulated annealing k-means
performed far better than any other standard clustering method The
constructed libraries of fragments allow us to compute approximations for protein shapes, as fragments have the property that fixing the position of any three consecutive residues in the fragment fully determines its position in 3-space. Moreover, any three consecutive alpha-carbon atoms in one fragment can be superimposed on any three consecutive alpha-carbon atoms in another fragment. Consequently, we can construct a chain of fragments such that the"head'' of each fragment is superimposed on the "tail''of the fragment preceding it. For a given protein structure, we can
construct a greedy approximation of its shape by repeatedly finding the fragment in our library that best fits the rest of the protein, while overlapping by three residues with the previous fragment. A number of more sophisticated strategies are also possible. Such approximations allow us to encode protein shapes compactly as strings over the alphabet of shapes in our fragment library. Clearly how well we can fit a given protein depends on the length of the fragments and the size of the library we allow. We have defined a natural notion of complexity for a shape library and found experimentally that our libraries fit protein shapes with higher accuracy than all prior such attempts, across a broad spectrum of library complexities. This provides evidence that simulated annealing k-means is a useful approach towards clustering the shapes of protein fragments and describing protein
shapes using the resulting shape alphabet.