Metaclusters

Metaclusters (MCs) are the final output of the DPCfam algorithm. Each MC is composed of a collection of homologous protein sequence regions found in the UHGP-50 database (v. 1.0). The protein regions belonging to an MC are also called seeds because they serve as a starting point for further analyses and experiments. In particular, they are used to build a profile-HMM for each MC, which is the current standard representation of a protein family. The profiles were built as follows: we first pruned the seeds of each MC using CD-HIT v4.7 at 60% sequence identity, then automatically built multiple sequence alignments (MSAs) with MUSCLE, and finally used the MSAs to build the profile-HMMs with hmmbuild from HMMER v3.1b2. The original seeds, the CD-HIT-pruned seeds, the MSAs, and the HMMs are all available for download.
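
The three steps above can be sketched as a small Python helper that assembles the command lines for one MC. This is a minimal sketch under stated assumptions, not the exact invocation used to produce the released files: the function names, file names, and extensions are hypothetical, the MUSCLE call assumes the v3.x command-line syntax (`-in`/`-out`; the MUSCLE version is not stated above), and the CD-HIT word size `-n 4` is the value CD-HIT recommends for thresholds between 0.6 and 0.7.

```python
import subprocess
from pathlib import Path


def build_profile_commands(mc_id, seeds_fasta, workdir="."):
    """Return the three command lines (as argv lists) for one MC.

    All output file names are hypothetical and chosen for illustration.
    """
    work = Path(workdir)
    pruned = work / f"{mc_id}.cdhit.fasta"  # seeds after redundancy pruning
    msa = work / f"{mc_id}.afa"             # aligned FASTA from MUSCLE
    hmm = work / f"{mc_id}.hmm"             # profile-HMM from hmmbuild
    return [
        # 1. Prune seeds at 60% identity; -n 4 is the word size suggested
        #    by CD-HIT for identity thresholds in the 0.6-0.7 range.
        ["cd-hit", "-i", str(seeds_fasta), "-o", str(pruned),
         "-c", "0.6", "-n", "4"],
        # 2. Align the pruned seeds (assumes MUSCLE v3.x syntax).
        ["muscle", "-in", str(pruned), "-out", str(msa)],
        # 3. Build the profile-HMM; -n sets the name stored in the HMM file.
        ["hmmbuild", "-n", mc_id, str(hmm), str(msa)],
    ]


def run_profile_pipeline(mc_id, seeds_fasta, workdir="."):
    """Run the three tools in order, stopping at the first failure."""
    for cmd in build_profile_commands(mc_id, seeds_fasta, workdir):
        subprocess.run(cmd, check=True)
```

Calling `run_profile_pipeline("MC1", "MC1_seeds.fasta", "out")` would run the three tools in sequence, provided they are installed and on the `PATH`. Keeping command construction separate from execution makes it easy to inspect or log the commands before running them.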

In addition to the HMMs, we provide two sets of measures that help characterize each MC. The first set (left side of the table) describes the content of the MC itself. The second set (right side of the table) compares the content of each MC with Pfam; with it, we wanted to assess how well Pfam maps to our Metaclusters.

Name | Size UHGP-50 | Avg. Len | % LC | % CC | % DIS | TM | Pfam DA | % DA | Avg. Overlap | Overlap type

List of abbreviations:

  • Name: Metacluster id.
  • Size UHGP-50: number of sequences belonging to MC in UHGP-50.
  • Avg. Len: average length of sequences (number of amino acids).
  • % LC: percentage of low-complexity domains.
  • % CC: percentage of coiled-coil domains.
  • % DIS: percentage of disordered domains.
  • TM: average number of transmembrane regions in an MC.
  • Pfam DA: Pfam architecture that best overlaps with the MC. We refer to it as the dominant architecture (DA). If there is no overlap, the MC is tagged as unknown (UNK).
  • % DA: percentage of sequences from the MC in common with the DA.
  • Avg. Overlap: average overlap between the common sequences of the MC and the corresponding Pfam DA.
  • Overlap type: classification of the overlap between MC and DA based on how well the boundaries of their sequences align.
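
To make the % DA and Avg. Overlap columns more concrete, the following sketch computes both for a toy MC. It rests on assumptions that are not confirmed by the definitions above: each region is represented as a 1-based inclusive `(start, end)` pair keyed by a protein id, "common sequences" are taken to be the proteins present in both the MC and the DA, and the overlap of two regions is measured as intersection length divided by union length. The function names and the example data are hypothetical.

```python
def overlap_fraction(a, b):
    """Intersection over union of two 1-based inclusive intervals.

    ASSUMPTION: this overlap definition is illustrative only.
    """
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return inter / union


def da_stats(mc_regions, da_regions):
    """Return (% DA, Avg. Overlap) for one MC under the assumptions above.

    mc_regions: protein id -> (start, end) of the MC region
    da_regions: protein id -> (start, end) spanned by the dominant architecture
    """
    if not mc_regions:
        return 0.0, 0.0
    common = mc_regions.keys() & da_regions.keys()
    pct_da = 100.0 * len(common) / len(mc_regions)
    if not common:
        return pct_da, 0.0
    avg = sum(overlap_fraction(mc_regions[p], da_regions[p])
              for p in common) / len(common)
    return pct_da, avg


# Toy data: two of the three MC proteins also carry the DA.
mc = {"P1": (1, 100), "P2": (1, 50), "P3": (5, 50)}
da = {"P1": (1, 100), "P2": (26, 75)}
pct_da, avg_overlap = da_stats(mc, da)
```

In this toy case, P1 overlaps completely and P2 shares 25 of 75 covered positions, so two thirds of the MC sequences are in common with the DA and the average overlap is also two thirds.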