| Protein simulations often yield many (100,000's or more)
sampled conformations of a protein. An important problem is to find pairs
of conformations that are similar according to some metric, or to enumerate
all pairs of similar conformations. The direct approach of explicitly
evaluating the similarity measure for all pairs is quadratic in the number
of conformations, while most common similarity measures are quite costly
to compute even for a single pair of conformations. We have developed
a new approach that reduces the number of parameters to represent a conformation,
in order to both reduce the cost of evaluating similarity measures and
avoid the quadratic scaling of brute-force enumeration. First, we exploit
the fact that a protein is a long kinematic chain with relatively small
side-chains: we simplify the representation of the protein by breaking
its backbone of N atoms into M subsequences of length N/M each and replace
each subsequence by the mean of the atom centers. Hence, each conformation
maps into a much smaller space than the original conformational space.
We further compress the representation using standard SVD techniques.
These two simplification steps typically leave us with 10 to 20 parameters
per conformation (instead, of several 1000's). Finally, we apply classical
nearest-neighbor techniques (e.g., based on kd-trees) to this representation
to avoid the quadratic scaling. These techniques have been implemented
in a software module. Experimental results on various data sets of different
sizes have demonstrated speedup factors spanning several orders of magnitude,
with high correlation between the similarity measures on the exact and
simplified representations of the conformations.
|