E2P2 1.0 - Ensemble Enzyme Prediction Pipeline 1.0

Protein sequences were functionally annotated with the in-house Ensemble Enzyme Prediction Pipeline (E2P2, version 1.0). E2P2 systematically integrates the results of three molecular function annotation algorithms using an ensemble classification scheme. For a given genome, every protein sequence is submitted as an individual query to each of the base-level annotation methods. These methods rely on homology transfer to annotate protein sequences, using single-sequence models (BLAST, E-value cutoff <= 1e-30, against a subset of SwissProt 15.3) and multiple-sequence models (PRIAM, November 2010; CatFam, version 2.0, 1% FDR profile library) of enzymatic function. The base-level predictions are then integrated into a final set of annotations using a weighted-average integration algorithm, in which the weight of each prediction from each method was determined by a 0.632 bootstrap procedure over 1000 rounds of testing. The training and testing data for E2P2 1.0, as well as the BLAST reference database, were drawn from the Reference Protein Sequence Database (RPSD), version 1.0. These highly trusted protein sequences were obtained from SwissProt release 15.3, and the dataset was restricted, where possible, to proteins whose existence is supported by experimental evidence.
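
The integration step described above can be illustrated with a minimal sketch. This is not the E2P2 source code: the function names, the per-(method, label) weight table, the 0.5 acceptance threshold, and the example weights are all assumptions made for illustration. The text does not specify how E2P2 stores its weights or which cutoff it applies. The second function shows the standard 0.632 bootstrap error estimate, which is one plausible ingredient of a weight-fitting procedure, not necessarily the one E2P2 uses.

```python
from collections import defaultdict


def integrate_predictions(predictions, weights, threshold=0.5):
    """Combine per-method predictions for a single protein.

    predictions: dict mapping method name -> set of predicted labels
                 (for example EC numbers).
    weights:     dict mapping (method, label) -> weight; hypothetical
                 layout, missing entries count as 0.
    threshold:   hypothetical cutoff on the averaged score.

    Returns a dict of accepted labels and their averaged scores.
    """
    n_methods = len(predictions)
    if n_methods == 0:
        return {}

    # Sum the weights of every method that voted for a label.
    totals = defaultdict(float)
    for method, labels in predictions.items():
        for label in labels:
            totals[label] += weights.get((method, label), 0.0)

    # Average over all methods, so a label missed by a method is penalized.
    averaged = {label: total / n_methods for label, total in totals.items()}
    return {label: s for label, s in averaged.items() if s >= threshold}


def bootstrap_632_error(train_error, oob_error):
    """Standard 0.632 bootstrap estimate of prediction error.

    train_error: error measured on the bootstrap training sample.
    oob_error:   error measured on the held-out (out-of-bag) sample.
    """
    return 0.368 * train_error + 0.632 * oob_error


# Illustrative inputs only; these weights are made up, not E2P2 values.
example_predictions = {
    "blast": {"1.1.1.1"},
    "priam": {"1.1.1.1", "2.7.1.1"},
    "catfam": {"1.1.1.1"},
}
example_weights = {
    ("blast", "1.1.1.1"): 0.9,
    ("priam", "1.1.1.1"): 0.8,
    ("catfam", "1.1.1.1"): 0.7,
    ("priam", "2.7.1.1"): 0.6,
}
accepted = integrate_predictions(example_predictions, example_weights)
```

With these made-up weights, the label supported by all three methods is kept, while the label predicted by a single method falls below the assumed threshold and is dropped. Averaging over all methods rather than only the voting ones is a design choice of this sketch: it rewards agreement between methods.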