The Expense of Poorly Labeled Data. What Causes ML Models to Break?

Abstract: 

In data science projects, data are often simply provided and their quality is taken for granted. The degree to which poor-quality data degrades a model is frequently not quantified, even though data and label generation are a critical part of the overall machine learning process.

In this talk we discuss and quantify the effects of building a machine learning model with incorrectly labeled data. We dive into a real-world example, using a publicly available health care data set, and explore what happens to a model’s predictive power when its training labels are distorted. Specifically, two types of distortion are explored. First, we randomly switch labels across the entire data set. Then we systematically switch labels in a biased way, based on features of the data set. As the proportion of incorrect labels increases, we visualize and quantify the effect on the model’s accuracy. We also observe how feature importance and significance are affected when the labels are distorted.
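The two kinds of label distortion described above can be sketched in a few lines of Python. This is a minimal illustration and not the speaker's code: it assumes binary 0/1 labels, and the function names (`flip_random`, `flip_biased`, `disagreement_rate`), the single-feature threshold rule, and the toy data are all hypothetical choices made for the example.

```python
import random


def flip_random(labels, proportion, seed=0):
    """Flip a given proportion of binary labels, chosen uniformly at random."""
    rng = random.Random(seed)
    n_flip = round(proportion * len(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def flip_biased(labels, feature, threshold, proportion, seed=0):
    """Flip labels only among rows whose feature value exceeds the threshold.

    Assumption: the bias depends on one numeric feature; the talk does not
    say which features drive its systematic distortion.
    """
    rng = random.Random(seed)
    eligible = [i for i, value in enumerate(feature) if value > threshold]
    n_flip = round(proportion * len(eligible))
    noisy = list(labels)
    for i in rng.sample(eligible, n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def disagreement_rate(original, noisy):
    """Fraction of labels that differ between the two lists."""
    return sum(a != b for a, b in zip(original, noisy)) / len(original)


# Toy data (hypothetical): 100 alternating labels and one numeric feature.
labels = [0, 1] * 50
feature = list(range(100))

random_noise = flip_random(labels, proportion=0.2)
biased_noise = flip_biased(labels, feature, threshold=49, proportion=0.5)
```

To reproduce the kind of experiment the abstract describes, one would retrain the same model on the distorted labels at increasing values of `proportion` and compare accuracy on a clean held-out set. That step is omitted here because it depends on the model and data set used.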

This talk is relevant to anyone wanting to see and quantify how bad data can greatly affect a model’s usefulness.

Bio: 

Nikhil Kumar is currently the senior data scientist in the research division at Alegion. Nikhil’s work includes building machine learning models to support Alegion’s customers’ use cases, managing A/B testing of product features and analytics around use case performance, and researching new methodologies to implement and improve business growth. Prior to Alegion, Nikhil worked at various consulting firms as well as other tech start-ups, building enterprise data science solutions. As a researcher, his interests include applying reinforcement learning to facilitate how experimentation is done, developing methods to measure worker scoring and reputation, and identifying how machine learning modeling can be used to predict key business metrics, a topic he is passionate about. He also co-founded and co-hosts “No Bias,” a podcast on machine learning and AI. Nikhil holds a master’s degree in statistics from Carnegie Mellon University, where he also received his bachelor’s degree.