Past Projects

Prediction of domain-domain interactions from protein-protein interactions

The vast majority of proteins must interact with other proteins to perform their functions. Proteins are made of functional modules known as domains, which form the interface of an interaction through highly specific recognition events. Knowledge of domain-domain interactions (DDIs) is therefore important for understanding the nature and significance of protein-protein interactions (PPIs). Currently, the number of experimentally known DDIs is very small, which warrants the development of computational inference methods for predicting functionally significant DDIs.

We created a comprehensive, non-redundant dataset of 209,165 experimentally derived PPIs by combining datasets from five major interaction databases. We introduced an integrated scoring system that uses a novel combination of five orthogonal scoring features covering the probabilistic, evolutionary, evidence-based, spatial and functional properties of interacting domains, so that the interaction propensity of two domains is assessed along several dimensions. This method outperforms similar existing methods both in prediction accuracy and in coverage of the domain interaction space. We predicted a set of 52,492 high-confidence DDIs and used it for a cross-species comparison of DDI conservation in eight model species: human, mouse, Drosophila, C. elegans, yeast, Plasmodium, E. coli and Arabidopsis. Our results show that only 23% of these DDIs are conserved in at least two species and only 3.8% in at least four species, indicating rather low conservation across species. Pair-wise analysis of DDI conservation revealed a 'sliding conservation' pattern between evolutionarily neighboring species. Our methodology and the high-confidence DDI predictions generated in this study can help to better understand the functional significance of PPIs at the modular level and can thus inform further experimental investigations in systems biology research.
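The conservation figures above have the form "fraction of DDIs found in at least k species". A minimal sketch of that tally follows, assuming each predicted DDI is stored with the set of species in which it was observed. The function name, the data layout and the toy domain labels are illustrative assumptions, not the study's actual data or code.

```python
from collections import Counter

def conservation_fractions(ddi_species):
    """Map each k to the fraction of DDIs present in at least k species.

    ddi_species: dict mapping a domain pair to the set of species in which
    that DDI was predicted (hypothetical layout, assumed for illustration).
    """
    total = len(ddi_species)
    # How many DDIs occur in exactly 1, 2, 3, ... species.
    counts = Counter(len(species) for species in ddi_species.values())
    max_k = max(counts, default=0)
    fractions = {}
    for k in range(1, max_k + 1):
        n_at_least_k = sum(n for size, n in counts.items() if size >= k)
        fractions[k] = n_at_least_k / total
    return fractions

# Toy data with made-up domain labels; not taken from the study.
toy_ddis = {
    ("domA", "domB"): {"human", "mouse"},
    ("domC", "domD"): {"human"},
    ("domE", "domF"): {"yeast", "human", "mouse", "Drosophila"},
    ("domG", "domH"): {"E. coli"},
}
fractions = conservation_fractions(toy_ddis)
print(fractions)
```

Here `fractions[2]` is the share of toy DDIs conserved in at least two species, which is the same kind of quantity as the 23% reported above.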


Cumulative distribution of the positive and negative test datasets over the entire range of prediction scores. The iPfam and single-domain datasets are positive domain-domain interaction datasets. About 85% of the positive data score above 8, and about the same proportion of the negative data score below 8.
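As an illustration of how such a cutoff can be read from score distributions, the sketch below computes the fraction of positive scores above a threshold and of negative scores below it, together with an empirical cumulative distribution. The scores are invented toy values and the function names are assumptions. Only the threshold of 8 comes from the caption.

```python
def fraction_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

def fraction_below(scores, threshold):
    """Fraction of scores strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)

def cumulative_distribution(scores):
    """Return (score, cumulative fraction) pairs in ascending score order."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(s, (i + 1) / n) for i, s in enumerate(ordered)]

# Invented toy scores, for illustration only.
positive_scores = [9.1, 8.5, 7.0, 10.2]
negative_scores = [2.0, 7.9, 8.4, 5.5]
THRESHOLD = 8  # cutoff mentioned in the figure caption

positives_above = fraction_above(positive_scores, THRESHOLD)
negatives_below = fraction_below(negative_scores, THRESHOLD)
print(positives_above, negatives_below)
```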

Published articles related to this project:


Tracing the evolutionary origin of functional modules in the human proteome

The functional repertoire of the human proteome is an incremental collection of functions carried out by protein domains that evolved along the Homo sapiens lineage. Knowledge of the origin of these functions therefore provides a better understanding of domain and protein evolution in humans. This study reports a unique approach for understanding the evolution of the human proteome by tracing the origin of its constituent domains hierarchically along the Homo sapiens lineage. The uniqueness of this method lies in the subtractive searching of functional and conserved domains in the human proteome, which makes detection of their origins more efficient. From these analyses, the nature of protein evolution and trends in domain evolution can be observed in the context of the entire human proteome. The method also helps delineate the degree of divergence of functional families that occurred during the course of evolution.

This approach to tracing the evolutionary origin of functional domains in the human proteome facilitates a better understanding of their functional versatility and provides insights into the functionality of hypothetical proteins in the human proteome. This work elucidates the origin of functional and conserved domains in human proteins, their distribution along the Homo sapiens lineage, the occurrence frequency of different domain combinations and the proteome-wide patterns of their distribution, providing insights into the evolutionary solution to the increased complexity of the human proteome.
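One way to picture subtractive searching is as a loop over lineage nodes in which every domain that finds a homolog at a node is assigned that node as its origin and removed from the search set for later nodes. The sketch below rests on two assumptions that the description does not state: that nodes are visited from the most ancient to the most recent, and that a homology search can be reduced to a yes/no predicate. The `has_homolog` callback and the toy hit table are hypothetical stand-ins for a real sequence search.

```python
# Node codes along the lineage, ordered here from most ancient to most recent
# (the ordering is an assumption made for this sketch).
LINEAGE = ["B", "E", "T", "C", "M", "P", "H"]

def assign_origins(domains, has_homolog, lineage=LINEAGE):
    """Assign each domain the first lineage node at which a homolog is found.

    has_homolog(domain, node) -> bool is a hypothetical stand-in for a
    sequence-similarity search against the proteomes at that node.
    Returns (origin mapping, set of domains with no homolog at any node).
    """
    remaining = set(domains)
    origin = {}
    for node in lineage:
        found = {d for d in remaining if has_homolog(d, node)}
        for d in found:
            origin[d] = node
        # Subtractive step: domains already placed are not searched again.
        remaining -= found
    return origin, remaining

# Toy hit table with made-up domain names; not real results.
toy_hits = {
    "dom_a": {"B", "E", "T", "C", "M", "P", "H"},
    "dom_b": {"T", "C", "M", "P", "H"},
    "dom_c": {"P", "H"},
    "dom_d": set(),
}
origins, unplaced = assign_origins(
    toy_hits, lambda domain, node: node in toy_hits[domain]
)
print(origins, unplaced)
```

Because placed domains drop out of the search set, each later node is searched with fewer queries, which is one plausible reading of the efficiency gain described above.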


Cartoon diagram of representative proteins containing the Pfam-A family EGF (epidermal growth factor), with remote homologs found at different nodes along the lineage using the subtractive searching method. For each sequence, the SWISS-PROT identifier is given, and the EGF domain is shown along with the name of the node at which its remote homolog was found. The codes for the nodes are: B, bacteria; E, eukaryota; T, metazoa; C, chordata; M, mammalia; P, primates; H, Homo sapiens. Other functionally significant domains in the protein sequences are named in the legend.

Published articles on this project:


Motif recognition in voltage-gated ion channel proteins

Voltage-gated ion channels (VGC) mediate selective diffusion of ions across cell membranes to enable many vital cellular processes. Three-dimensional structure data are virtually lacking for VGC proteins because these mostly hydrophobic transmembrane proteins are difficult to crystallize. Therefore, to better understand their function, conserved patterns need to be identified using sequence analysis methods. VGC proteins assemble as functional tetramers from four monomer subunits in K+ ion channels, or from four repeats of a single polypeptide in the Ca2+ and Na+ channel sub-families. For Ca2+ and Na+ channel proteins, we generated profiles for each repeat and created profile-to-profile alignments for all repeats using a phylogenetic guide tree built from the consensus sequences of the repeats. In this study, we identified several new conserved patterns specific to each transmembrane segment (TMS) of the voltage-sensing and pore-forming modules in each sub-family. For Ca2+ and Na+ channels, the functional theme of pattern conservation is similar in almost all segments, whereas it differs from that of the K+ channel proteins except in the S4 segment of the voltage-sensing module. For each sub-family, we also identified the residues conserved at 50% or more in each TMS, along with their biological significance and disease associations in humans.
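The 50% conservation criterion can be illustrated with a per-column count over a multiple sequence alignment. This is a minimal sketch under stated assumptions: the alignment is a list of equal-length strings, gaps are written as "-", and the conserved fraction is measured against all sequences, gapped ones included. The toy alignment is invented and is not real channel sequence data.

```python
from collections import Counter

def conserved_positions(alignment, min_fraction=0.5, gap="-"):
    """List (column index, residue, fraction) for columns whose most common
    residue occurs in at least min_fraction of all aligned sequences."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != gap]
        if not residues:
            continue  # column contains only gaps
        residue, count = Counter(residues).most_common(1)[0]
        fraction = count / len(alignment)
        if fraction >= min_fraction:
            conserved.append((col, residue, fraction))
    return conserved

# Invented toy alignment of one short segment.
toy_alignment = ["RLVRV", "RLARI", "KLGRV", "R-SKV"]
conserved = conserved_positions(toy_alignment)
for col, residue, fraction in conserved:
    print(col, residue, fraction)
```

Whether gapped sequences should count toward the denominator is a design choice. Counting them, as here, makes the criterion stricter for columns with many gaps.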


Conserved motifs in the voltage-sensing module of calcium, sodium and potassium ion channel proteins. S1-S4 are the transmembrane segments in the voltage-sensing module of VGC proteins.

Published articles on this project:


Alignment of multiple protein structures using Monte Carlo optimization

A global and comprehensive study of protein structures is possible only by comparing multiple structures and investigating their folding similarities and evolutionary relationships. With the availability of vast amounts of structural information, accurate and fully automated structural alignment algorithms are needed for a better understanding of sequence-structure-function relationships in proteins. Here, we present a new algorithm for the alignment of multiple protein structures using a Monte Carlo optimization method. The algorithm uses pair-wise structural alignments as a starting point. Four different types of moves were designed to generate random changes in the alignment. A distance-based score is calculated for each trial move, and moves are accepted or rejected based on the improvement in the alignment score until the alignment has converged. Initial tests on 66 protein structural families show promising results: the score increases by 69% on average. The increase in score is accompanied by a 12% increase in the number of residue positions incorporated into the alignment. Two specific families, protein kinases and aspartic proteinases, were tested and compared against curated alignments from HOMSTRAD and manual alignments. The algorithm improved the overall number of aligned residues while preserving key catalytic residues.
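The accept-or-reject loop described above can be sketched generically. The version below is a simplification under stated assumptions: it accepts a trial move only when the score strictly improves, and it treats a long run of rejected moves as convergence. The four move types and the distance-based score are not specified here, so the example substitutes a toy problem with two hypothetical moves and a made-up score. It is not the published algorithm.

```python
import random

def monte_carlo_optimize(state, moves, score, max_stale=200, seed=0):
    """Greedy Monte Carlo search: propose random moves and keep improvements.

    moves: list of functions (state, rng) -> new state.
    Stops after max_stale consecutive rejected moves (assumed convergence rule).
    """
    rng = random.Random(seed)
    best = score(state)
    stale = 0
    while stale < max_stale:
        move = rng.choice(moves)
        trial = move(state, rng)
        trial_score = score(trial)
        if trial_score > best:
            state, best = trial, trial_score
            stale = 0
        else:
            stale += 1
    return state, best

# Toy problem: each "structure" has an integer offset, and the score rewards
# offsets that agree with each other (a stand-in for a distance-based score).
def toy_score(offsets):
    return -sum((a - b) ** 2 for i, a in enumerate(offsets) for b in offsets[i + 1:])

def shift_up(offsets, rng):
    i = rng.randrange(len(offsets))
    return offsets[:i] + (offsets[i] + 1,) + offsets[i + 1:]

def shift_down(offsets, rng):
    i = rng.randrange(len(offsets))
    return offsets[:i] + (offsets[i] - 1,) + offsets[i + 1:]

start = (3, -2, 5)
final, final_score = monte_carlo_optimize(start, [shift_up, shift_down], toy_score)
print(final, final_score)
```

A fixed seed keeps the toy run reproducible. A full implementation might also accept some non-improving moves with a temperature-dependent probability, but the description above does not say whether the method does so.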


Published articles on this project:


Comparative genome analysis

We developed computational methods for the comparative analysis of complete chloroplast genomes of solanaceous crop species and grass species. Specifically, we analyzed the intergenic spacer regions of these genomes in an all-against-all fashion to identify their similarities and differences.


Gene maps of the tomato and potato chloroplast genomes: comparative analysis

Published articles on this project:


Phylogenetic analysis of proteins based on domain structure

Rho-family small GTPases are important regulators of multiple cellular activities, most notably reorganization of the actin cytoskeleton. Dbl-homology (DH)-domain-containing proteins are the classical guanine nucleotide exchange factors (GEFs) responsible for activation of Rho GTPases. However, members of a newly discovered family can also act as Rho-GEFs. These CZH proteins include CDM (Ced-5, Dock180 and Myoblast city) proteins, which activate Rac, and zizimin proteins, which activate Cdc42. The family contains 11 mammalian proteins and has members in many other eukaryotes. The GEF activity is carried out by a novel, DH-unrelated domain named the DOCKER, CZH2 or DHR2 domain. CZH proteins have been implicated in cell migration, phagocytosis of apoptotic cells, T-cell activation and neurite outgrowth, and probably arose relatively early in eukaryotic evolution.


Phylogenetic analysis of the CZH1 and CZH2 domains. Multiple alignments were built using the CLUSTALW program, and the distances between all pairs of sequences in each multiple alignment were determined. Phylogenetic trees were generated using the neighbor-joining method and drawn using the TREEVIEW program. The scale bar represents 0.1 nucleotide substitutions per site.

Published articles on this project:
