E2P2 2.0 - Ensemble Enzyme Prediction Pipeline 2.0

Functional annotation of protein sequences was performed with the in-house Ensemble Enzyme Prediction Pipeline (E2P2, version 2.0). E2P2 annotates protein sequences by homology transfer, integrating single-sequence (BLAST, E-value <= 1e-30) and multiple-sequence (PRIAM) models of enzymatic function. The ensemble algorithm uses a weighted-average integration scheme in which the weight of each model was determined by a 5-by-3 nested cross-validation routine. The training of E2P2 and the reference databases used in the annotation process are based on the Reference Protein Sequence Dataset (RPSD) 2.0. Data for RPSD were compiled from protein sequences in SwissProt, MetaCyc, and BRENDA that have experimental evidence of existence.
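
The weighted integration described above can be illustrated with a minimal Python sketch. This is not the E2P2 implementation: the function names, the example weights, the 0.5 acceptance threshold, and the EC-style labels are all hypothetical and chosen only for illustration. Only the BLAST E-value cutoff of 1e-30 comes from the description above. In the real pipeline the weights come from the nested cross-validation routine.

```python
from collections import defaultdict

# E-value cutoff stated in the description above.
BLAST_EVALUE_CUTOFF = 1e-30


def filter_blast_hits(hits, cutoff=BLAST_EVALUE_CUTOFF):
    """Keep (label, evalue) pairs whose E-value is at or below the cutoff."""
    return [(label, evalue) for label, evalue in hits if evalue <= cutoff]


def weighted_vote(predictions, weights, threshold=0.5):
    """Combine per-classifier label sets into a weighted-average score.

    predictions: dict mapping classifier name -> iterable of predicted labels
    weights:     dict mapping classifier name -> weight (hypothetical values)
    threshold:   hypothetical minimum score for a label to be accepted
    """
    total = sum(weights[name] for name in predictions)
    if total == 0:
        return {}
    scores = defaultdict(float)
    for name, labels in predictions.items():
        for label in set(labels):  # count each label once per classifier
            scores[label] += weights[name]
    # Normalize by the total weight and keep labels that reach the threshold.
    return {
        label: score / total
        for label, score in scores.items()
        if score / total >= threshold
    }


# Hypothetical example: two classifiers with made-up weights.
example_weights = {"blast": 0.75, "priam": 0.25}
example_predictions = {
    "blast": {"EC-1.1.1.1"},
    "priam": {"EC-1.1.1.1", "EC-2.7.1.1"},
}
accepted = weighted_vote(example_predictions, example_weights)
```

In this example both classifiers support the first label, so it receives the full weight, while the second label is supported only by the lower-weighted classifier and falls below the assumed threshold.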