Removing unwanted variation in machine learning for personalized medicine

Machine learning for personalized medicine will inevitably build on large omics datasets. These are often collected over months or years, and sometimes involve multiple labs. Unwanted variation (UV) can arise from technical sources such as batches, platforms or laboratories, or from biological sources such as heterogeneity in age, ethnicity or cellular composition that are unrelated to the factor of interest in the study. Similar issues arise when the goal is to combine several smaller studies. A key task is to remove these UV factors without also removing the factors of interest. Some years ago we proposed a general framework, called RUV, for removing UV in microarray data using negative control genes, that is, genes assumed to be unaffected by the factor of interest. It performed very well for differential expression analysis (i.e., with a known factor of interest) when applied to several datasets. Our objective in this talk is to describe our recent results on extending this approach to a machine learning context, specifically classification.
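
As a rough illustration of the negative-control idea, the following Python sketch estimates unwanted factors from the control genes by a singular value decomposition and regresses them out of every gene. It is a simplified sketch under stated assumptions, not the RUV implementation: the function name `remove_unwanted_variation`, the helper `batch_gap`, the simulated data and the choice of `k` are all hypothetical and introduced only for this example.

```python
import numpy as np


def remove_unwanted_variation(Y, control_idx, k):
    """Naive negative-control correction (illustrative, not the published RUV code).

    Y           : samples x genes matrix of log expression values
    control_idx : column indices of genes assumed unaffected by the factor of interest
    k           : assumed number of unwanted factors
    """
    Yc = Y - Y.mean(axis=0)  # center each gene
    # Variation in the control genes is assumed to be unwanted variation only.
    U, s, _ = np.linalg.svd(Yc[:, control_idx], full_matrices=False)
    W = U[:, :k] * s[:k]  # estimated unwanted factors, samples x k
    # Regress every gene on the estimated factors and subtract the fit.
    alpha, *_ = np.linalg.lstsq(W, Yc, rcond=None)
    return Y - W @ alpha


def batch_gap(M, batch):
    """Mean absolute difference between the two batch means, averaged over genes."""
    return np.abs(M[batch == 1].mean(axis=0) - M[batch == 0].mean(axis=0)).mean()


# Simulated data (an assumption for this sketch): 40 samples, 200 genes, 2 batches.
rng = np.random.default_rng(0)
n, p = 40, 200
batch = np.repeat([0, 1], n // 2)
gamma = rng.normal(0.0, 2.0, size=p)  # per-gene batch effect
Y = np.outer(batch, gamma) + rng.normal(0.0, 0.5, size=(n, p))

control_idx = np.arange(100, 200)  # genes treated as negative controls
corrected = remove_unwanted_variation(Y, control_idx, k=1)

gap_before = batch_gap(Y, batch)
gap_after = batch_gap(corrected, batch)
print(f"batch gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

This naive regression step can also remove signal of interest whenever that signal is correlated with the estimated factors, which is one reason the choice of control genes and of `k` matters in practice.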