Metaclusters

Metaclusters (MCs) are the final output of the DPCfam algorithm. Each MC is a collection of homologous protein sequence regions found in the UniRef50 database (v. 201707). The protein regions belonging to an MC are also called seeds, because they are a starting point for further analyses and experiments. In particular, they serve as seeds for building a profile-HMM for each MC, the current standard representation of a protein family. The profiles were built as follows: we first pruned the seeds of each MC with CD-HIT v4.7 at 60% sequence identity, then automatically built multiple sequence alignments (MSAs) with MUSCLE, and finally built the profile-HMMs from the MSAs with hmmbuild from HMMER v3.1b2. The original seeds, the CD-HIT-filtered seeds, the MSAs, and the HMMs are all available for download.
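The three-step profile construction described above can be sketched as a small Python helper that assembles the command lines for one MC. This is an illustrative sketch, not the actual DPCfam scripts: the file naming scheme, the function names, and the example id "MC1" are hypothetical; the CD-HIT word size (-n 4) is the value CD-HIT recommends for thresholds between 0.6 and 0.7, not a setting stated here; and the MUSCLE options assume the v3-style command line, since the MUSCLE version is not specified.

```python
import subprocess
from pathlib import Path


def build_pipeline_commands(mc_id: str, workdir: str = ".") -> list[list[str]]:
    """Return the CD-HIT, MUSCLE and hmmbuild commands for one metacluster."""
    base = Path(workdir)
    # Hypothetical file names, chosen only for this illustration.
    seeds = base / f"{mc_id}.seeds.fasta"
    pruned = base / f"{mc_id}.cdhit.fasta"
    msa = base / f"{mc_id}.msa.fasta"
    hmm = base / f"{mc_id}.hmm"
    return [
        # Step 1: prune seeds at 60% identity (word size 4 is an assumption).
        ["cd-hit", "-i", str(seeds), "-o", str(pruned), "-c", "0.6", "-n", "4"],
        # Step 2: align the pruned seeds (MUSCLE v3-style options, assumed).
        ["muscle", "-in", str(pruned), "-out", str(msa)],
        # Step 3: hmmbuild takes the output HMM first, then the alignment.
        ["hmmbuild", "-n", mc_id, str(hmm), str(msa)],
    ]


def run_pipeline(mc_id: str, workdir: str = ".") -> None:
    """Run the three steps in order, stopping at the first failure."""
    for cmd in build_pipeline_commands(mc_id, workdir):
        subprocess.run(cmd, check=True)
```

Calling build_pipeline_commands("MC1") only returns the commands, so they can be inspected before anything is executed; run_pipeline("MC1") requires cd-hit, muscle and hmmbuild to be installed and on the PATH.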

Aside from the HMMs, we also provide two sets of measures that help characterize each MC. The first set (left side of the table) describes the content of the MC itself. The second set (right side of the table) relates the content of each MC to Pfam; with it, we wanted to assess how well Pfam maps onto our metaclusters.

Name Size Uni50 Avg. Len % LC % CC % DIS TM Pfam DA % DA Size Uni50-UniKB Overlap type

List of abbreviations:

  • Name: Metacluster id.
  • Size Uni50: number of sequences belonging to the MC in UniRef50.
  • Avg. Len: average length of the sequences (in amino acids).
  • % LC: percentage of low-complexity domains.
  • % CC: percentage of coiled-coil domains.
  • % DIS: percentage of disordered domains.
  • TM: average number of transmembrane regions in an MC.
  • Pfam DA: Pfam architecture that best overlaps with the MC. We refer to it as the dominant architecture (DA). If there is no overlap, the MC is tagged as unknown (UNK).
  • % DA: percentage of sequences from the MC in common with the DA (computed over the sequences in the intersection of UniRef50 and UniProtKB).
  • Size Uni50-UniKB: number of sequences belonging to the MC in the intersection of UniRef50 and UniProtKB.
  • Overlap type: classification of the overlap between MC and DA based on how well the boundaries of their sequences align.
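
As an illustration of how % DA relates to Size Uni50-UniKB, the following Python sketch computes both from sets of sequence identifiers. It rests on one reading of the definitions above, namely that % DA is the share of the MC sequences in the UniRef50 and UniProtKB intersection that also carry the dominant architecture; the function name and the example identifiers are hypothetical.

```python
def percent_da(mc_sequences: set[str], da_sequences: set[str]) -> float:
    """Percentage of MC sequences that also carry the dominant architecture.

    mc_sequences: ids of the MC sequences in the UniRef50/UniProtKB
                  intersection (its length is Size Uni50-UniKB).
    da_sequences: ids of the sequences annotated with the DA.
    """
    if not mc_sequences:
        # Empty intersection: report 0 rather than dividing by zero.
        return 0.0
    shared = mc_sequences & da_sequences
    return 100.0 * len(shared) / len(mc_sequences)


# Made-up identifiers, for illustration only.
mc = {"seqA", "seqB", "seqC", "seqD"}
da = {"seqB", "seqC", "seqX"}
size_inter = len(mc)           # Size Uni50-UniKB
share = percent_da(mc, da)     # % DA
```

In this toy case two of the four MC sequences carry the DA, so the share is 50.0.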