Benchmarks and Process Management in Data Science: Will We Ever Get Over this Mess?

This panel aims to address areas that are widely acknowledged to be of critical importance to the success of Data Science projects and to the healthy growth of KDD/Data Science as a field of scientific research. However, despite this acknowledgement of their criticality, these areas receive insufficient attention in the major conferences in the field. Furthermore, there is a lack of actual actions and tools to address these areas in actual practice. These areas are summarized as follows: 1. Ask any data scientist or machine learning practitioner what they spend the majority of their time working on, and you will most likely get an answer that indicates that 80% to 90% of their time is spent on "Data Chasing", "Data Sourcing", "Data Wrangling", "Data Cleaning" and generally what researchers would refer to-often dismissively-as "Data Preparation". The process of producing statistical or data mining models from data is typically "messy" and certainly lacks management tools to help manage, replicate, reconstruct, and capture all the knowledge that goes in 90% of activities of a Data Scientists. The intensive Data Engineering work that goes into exploring and determining the representation of problem and the significant amount of "data cleaning" that ensues creates a plethora of extracts, files, and many artifacts that are only meaningful to the data scientist. 2. The severe lack of Benchmarks in the field, especially ones at big data scale is an impediment to true, objective, measurable progress on performance. The results of each paper are highly dependent on the large degree of freedom an author has on configuring competitive models and on determining which data sets to use (often Data that is not available to others to replicate results on) 3. Monitoring the health of models in production, and deploying models into production environments efficiently and effectively is a black art and often an ignored area. Many models are effectively "orphans" with no means of getting appropriate health monitoring. The task of deploying a built model to production is frequently beyond the capabilities of a Data Scientists and the understanding of the IT team.