|
|
This concept was described by Tatusov, et. al. as a technique for identifying groups of genes in completely sequenced genomes that may be related by evolution and have proven a popular tool for comparative genome studies, including function prediction, phylogenetic classification, and ancestor studies. The online COG database is a useful, and growing, resource.
In a graph whose edges represent mutually best-fit genes (BeTs), one forms COGs by merging 3-cliques that share edges, and reporting the resulting sets of vertices. This original clustering procedure was run on 7 genomes from 5 clades, families of related organisms. As more genomes are added, however, it is natural to ask for COGs formed from m-cliques that share (n<m)-cliques in order to find highly-conserved orthologs. Montague and Hutchinson (2000) defined these "higher stringency" COGs to find highly-conserved orthologs in the herpesvirus.
As the number of genomes increases, computing high stringency COGs poses a computational challenge. In fact, the problem is NP-hard. We observe that the chief obstruction in practice is related to paralogs, which produce subgraphs containing many cliques. In fact, these subgraphs resemble Moon-Moser graphs [1], which contain the largest possible number of cliques per node. By recognizing and exploiting these extremal graphs, however, we obtain an algorithm to compute COGs that is orders of magnitude faster than clique enumeration. Computation on 20 genomes required 3 hours and 58 minutes by clique enumeration, but only 11 seconds with the algorithm we have developed.
[1] Moon, J.W., and Moser, L. On cliques in graphs. Israel J. Math. 3 (1965), 23-28.