Identification of substructural motifs characteristic of protein structural families
(Tropsha, Snoeyink; Prins, Wang)

In collaboration with Wang, Prins, and Snoeyink, we have begun developing an efficient subgraph mining technique and applying it to find characteristic substructural patterns within protein structural families. In our method, each protein structure is represented as a graph in which the nodes are residues and the edges connect residues found within a certain distance of each other. Applying subgraph mining to proteins is challenging for a number of reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs could be significant in the context of protein classification. We have conducted an experimental study in which all subgraphs were identified in several protein structural families annotated in the SCOP database. The Support Vector Machine algorithm was then used to classify proteins from different families under a binary classification scheme. We found that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families. Additional studies in Year 5 will concentrate on expanding this approach to all major structural families in the Protein Data Bank.
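
The graph representation described above can be illustrated with a minimal sketch. It is not the project's implementation: the function name contact_graph, the use of one representative coordinate per residue, the cutoff values, and the toy coordinates are all assumptions made for illustration, since the report does not specify the distance threshold or the atoms used.

```python
from itertools import combinations
from math import dist


def contact_graph(residues, cutoff=8.0):
    """Build a residue contact graph.

    residues: list of (residue_label, (x, y, z)) tuples, one per residue.
    cutoff:   distance threshold in angstroms (assumed value, not from the report).
    Returns (nodes, edges): nodes maps index -> label, edges is a set of (i, j), i < j.
    """
    nodes = {i: label for i, (label, _) in enumerate(residues)}
    edges = set()
    # Compare every unordered pair of residues once.
    for (i, (_, a)), (j, (_, b)) in combinations(enumerate(residues), 2):
        if dist(a, b) <= cutoff:
            edges.add((i, j))
    return nodes, edges


# Toy coordinates (hypothetical, not taken from any real structure).
toy = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.8, 0.0, 0.0)),
    ("SER", (7.6, 0.0, 0.0)),
    ("LEU", (20.0, 0.0, 0.0)),
]
nodes, edges = contact_graph(toy, cutoff=5.0)
print(sorted(edges))
```

With a 5.0 cutoff, only consecutive residues in the toy chain are connected and the distant fourth residue is isolated. The all-pairs comparison is quadratic in the number of residues; a spatial index would likely be preferable for full-size proteins.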