About DPCfam

Characterizing the function of proteins is an essential step in understanding biological processes. In principle, this characterization can be performed by dedicated experiments, either in a wet lab or through computational simulations, but both are expensive. The number of proteins that have been characterized with such methods is still relatively small. By contrast, over the last decades the number of known protein sequences has grown exponentially, largely thanks to high-throughput genomic sequencing experiments.

Proteins that are known only at the sequence level outnumber those with an experimental characterization by orders of magnitude, and we expect this gap to widen over the coming years. In this context, we introduce DPCfam [1][2], a new unsupervised procedure that uses sequence alignments and Density Peak Clustering [3] to automatically classify homologous protein regions. DPCfam shows potential both for assisting manual annotation efforts (domain discovery, detection of classification inconsistencies, improvement of family coverage and boosting of clan membership) and as a stand-alone tool for unsupervised classification of sparsely annotated protein datasets such as those from environmental metagenomics studies (domain discovery, analysis of domain diversity).
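To give an idea of the Density Peak Clustering step, here is a minimal sketch of its two core quantities: the local density rho of each point and the distance delta to the nearest point of higher density. Cluster centers are points where both are large. This is a toy illustration on one-dimensional integer coordinates, not the DPCfam pipeline: the function names, the cutoff `d_c`, and the use of plain absolute differences as distances are assumptions made for the example, whereas DPCfam works with distances derived from sequence alignments.

```python
def density_peaks(dist, d_c):
    """Return (rho, delta) for each point, given a symmetric distance matrix.

    rho[i]   = number of other points closer than the cutoff d_c
    delta[i] = distance to the nearest point with strictly higher rho
               (points with no higher-density neighbor get their maximum
               distance to any other point, a common convention)
    """
    n = len(dist)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def pick_centers(rho, delta, k):
    """Pick the k points with the largest rho * delta as cluster centers."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i], reverse=True)
    return sorted(order[:k])


# Toy data (hypothetical): two well-separated groups on a line.
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in points] for a in points]

rho, delta = density_peaks(dist, d_c=1.5)
centers = pick_centers(rho, delta, k=2)
print(rho, delta, centers)
```

In this toy case the middle point of each group has the highest density and a large delta, so the two centers fall at the middle of each group.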

In its current version, we ran the DPCfam algorithm to generate clusters on the UniRef50 database, which contains representatives of all known protein sequences (sharing less than 50% similarity). The classification is evolutionarily accurate and covers a significant fraction of known homologs annotated in Pfam [4]. Moreover, DPCfam suggests classifications for previously unknown regions, tagged as UNK, some of which have already been added to the latest version of the Pfam database.

We publish metaclusters (MCs) with at least 50 seed sequence regions and an average sequence length larger than 50 amino acids. In the Download section we also publish the DPCfam-B database, containing smaller (and less reliable) metaclusters with at least 25 and up to 49 seed sequence regions and an average sequence length larger than 50 amino acids.
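The publication thresholds above can be sketched as a simple filter. This is only an illustration of the stated criteria: the `Metacluster` class, its fields, and the function name are hypothetical and do not correspond to any DPCfam file format or code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Metacluster:
    """Hypothetical container: a name plus the lengths (in amino acids)
    of its seed sequence regions."""
    name: str
    seed_lengths: List[int]


def release_for(mc: Metacluster) -> Optional[str]:
    """Return which database a metacluster would be published in, or None."""
    n_seeds = len(mc.seed_lengths)
    if n_seeds == 0:
        return None
    avg_len = sum(mc.seed_lengths) / n_seeds
    if avg_len <= 50:            # average length must be larger than 50
        return None
    if n_seeds >= 50:            # main release: at least 50 seed regions
        return "DPCfam"
    if 25 <= n_seeds <= 49:      # DPCfam-B: 25 to 49 seed regions
        return "DPCfam-B"
    return None


# Illustrative inputs (made up for the example).
large = Metacluster("MC_large", [60] * 50)
medium = Metacluster("MC_medium", [60] * 30)
short = Metacluster("MC_short", [40] * 100)
tiny = Metacluster("MC_tiny", [60] * 10)
```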

References

  1. Russo ET, Barone F, Bateman A, Cozzini S, Punta M, Laio A. DPCfam: Unsupervised protein family classification by density peak clustering of large sequence datasets. PLOS Comput. Biol. 18, 1–29, 2022.
  2. Russo ET, Laio A, Punta M. Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation. BMC Bioinformatics. 2021 Mar 12;22(1):121. doi: 10.1186/s12859-021-04013-x. PMID: 33711918; PMCID: PMC7955657.
  3. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014;344(6191):1492–1496. doi: 10.1126/science.1242072.
  4. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. Pfam: The protein families database in 2021. Nucleic Acids Research. 2021 Jan 8;49(D1):D412–D419.