Active and guided learning of enzyme function
Manual annotation cannot keep pace with the discovery of new enzyme sequences. In this work, we modelled the use of active and guided learning to support the curation of enzyme function. We evaluated nine strategies for ordering instances for curation on 5,750 E. coli proteins. We found that selecting sets of InterPro features in order of frequency of occurrence can cut the curation effort by almost two thirds while maintaining very high accuracy and recall. The method can be applied to real-life datasets of millions of proteins thanks to its limited computational requirements, its amenability to parallelisation, its good coverage of rare classes and its flexibility in selecting instances for annotation.
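To make the frequency-ordered selection concrete, the following is a minimal sketch rather than the implementation evaluated in this work. It assumes that each protein is described by a set of InterPro features, that proteins sharing an identical feature set can be curated together, and that the curator works through the distinct feature sets from most to least frequent. The function name `order_feature_sets`, the protein identifiers and the feature identifiers (`IPR_A`, `IPR_B`, ...) are hypothetical placeholders, and the toy data are invented for illustration only.

```python
from collections import Counter


def order_feature_sets(proteins):
    """Return (feature_set, count) pairs, most frequent feature set first.

    `proteins` maps a protein id to an iterable of feature ids.
    Assumption: identical feature sets are curated as one unit.
    """
    counts = Counter(frozenset(features) for features in proteins.values())
    # Descending frequency; ties broken by sorted feature names for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))


# Toy data: hypothetical protein ids and placeholder feature ids.
proteins = {
    "P1": {"IPR_A", "IPR_B"},
    "P2": {"IPR_A", "IPR_B"},
    "P3": {"IPR_C"},
    "P4": {"IPR_A", "IPR_B"},
    "P5": {"IPR_C"},
    "P6": {"IPR_D"},
}

ranked = order_feature_sets(proteins)

# Under the assumption above, one curation decision is needed per distinct
# feature set, so the fraction of decisions saved on this toy data is:
effort_saved = 1 - len(ranked) / len(proteins)

for feature_set, count in ranked:
    print(sorted(feature_set), count)
print(f"fraction of curation decisions saved: {effort_saved:.2f}")
```

Counting and sorting feature sets needs only a single pass over the data plus a sort over the distinct sets, which is one reason an ordering of this kind can stay cheap on large collections. The saving on the toy data says nothing about the saving on real proteins.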