| Clustering is an important technique for classifying
and indexing protein structures and their substructures. Many practical
methods proposed in the last few years give equal importance to all the
dimensions when computing the distance between two points. While such
approaches have proven successful for low-dimensional datasets, their
accuracy and/or efficiency decrease significantly in higher dimensional
spaces. The reason for this performance deterioration is the so-called
dimensionality curse. Recent research shows that for moderate-to-high
dimensional spaces (tens or hundreds of dimensions), a full-dimensional
distance is often irrelevant. Recognizing the need for increased flexibility
in reducing the data dimensionality, recent database research has proposed
computing projective clusters, in which points that are closely correlated
in some subspace are grouped together. Instead of projecting the entire
dataset on a single subspace, these methods project each cluster on its
associated subspace, which is generally different from the subspace associated
with another cluster.
We propose a mathematical formulation for the notion
of optimal projective cluster, starting from natural requirements on the
density of points in subspaces. This allows us to develop a Monte Carlo
algorithm for iteratively computing projective clusters. We prove that
the computed clusters are good with high probability. We implemented a
modified version of the algorithm, using heuristics to speed up computation.
Our extensive experiments show that our method is significantly more accurate
than previous approaches.
|