Nearest-neighbor search among large sets of protein conformations
(Latombe; Lotan, Schwarzer)

Protein simulations often yield many (100,000's or more) sampled conformations of a protein. An important problem is to find pairs of conformations that are similar according to some metric, or to enumerate all pairs of similar conformations. The direct approach of explicitly evaluating the similarity measure for all pairs is quadratic in the number of conformations, while most common similarity measures are quite costly to compute even for a single pair of conformations. We have developed a new approach that reduces the number of parameters to represent a conformation, in order to both reduce the cost of evaluating similarity measures and avoid the quadratic scaling of brute-force enumeration. First, we exploit the fact that a protein is a long kinematic chain with relatively small side-chains: we simplify the representation of the protein by breaking its backbone of N atoms into M subsequences of length N/M each and replace each subsequence by the mean of the atom centers. Hence, each conformation maps into a much smaller space than the original conformational space. We further compress the representation using standard SVD techniques. These two simplification steps typically leave us with 10 to 20 parameters per conformation (instead, of several 1000's). Finally, we apply classical nearest-neighbor techniques (e.g., based on kd-trees) to this representation to avoid the quadratic scaling. These techniques have been implemented in a software module. Experimental results on various data sets of different sizes have demonstrated speedup factors spanning several orders of magnitude, with high correlation between the similarity measures on the exact and simplified representations of the conformations.