![]() |
| Similarity measures for protein structures |
| (Agarwal, Guibas, Latombe; Har-Peled, Kolodny, Linial, Lotan, Mustafa, Wang) |
| Automatic similarity measuring of protein structures is an important problem in computational biology because of its various applications, including shape complementarity in docking, nearest neighbor searching in a large collection of conformations, automatic classification of diverse proteins, and analyzing molecular trajectories. We have investigated several approaches to measuring similarity between proteins structures. We study the computational complexity of minimizing the Hausdorff distance and its variants between two molecules, each modeled as a family of balls. We first extend the notion of Hausdorff distance from point sets to sets of balls, so that each ball in one set is matched with the nearest ball in the other set. We also consider the problem of matching two sets of balls by computing the Hausdorff distance between the union of the two sets. We developed efficient algorithms for both notions of Hausdorff distance. The surface matching problem is computationally challenging, so we focus on measuring similarity between two proteins by aligning their backbones, a widely used approach. Even the problem of aligning two backbones so that a given score function is maximized is hard, so we rely on approximation techniques. The development of approximation schemes is also justified because of the noisy nature of the protein coordinate data. We developed an approximation scheme for a wide range of score functions by evaluation on a polynomial-size grid in the rotation/translation space. We propose another approximation scheme by observing
that the full C-alpha representation of a backbone contains redundant
information because of the inherent chain topology of proteins and a limit
on their compactness due to excluded volume. We use a wavelet analysis
of random chains to justify approximating subchains by their center of
mass. Our experimental results show that the similarity measure on compressed
chains are highly correlated to those on full chains, and that we obtain
a significant speed up. Experiments were conducted on multiple databases,
including the Park-Levitt decoy set. In a related work, we use other schemes,
such as Frechet and Hausdorff measures, to approximate chains. These methods
provide better compression for the same error. |