Explaining XGBoost Models – Tools and Methods

Abstract: There is a widespread belief that the twin modeling goals of prediction and explanation are in conflict. That is, if one desires superior predictive power, one must accept having little insight into how the model makes its predictions. Conversely, if one desires explanations, one must use only "highly interpretable" methods like linear and logistic regression. In reality, this tradeoff is by no means a given. Methods with high predictive power, when examined properly with sophisticated tooling, can yield practical insights that could never be realized by high-bias methods like linear and logistic regression. Furthermore, the insights gained by carefully examining a model can suggest better features, thereby improving model performance. Thus, the twin goals of prediction and understanding can form a virtuous cycle rather than remaining in conflict.

In this workshop, we will work hands-on using XGBoost with real-world data sets to demonstrate how to pursue prediction and understanding together, so that improvements in one area yield improvements in the other. Using modern tooling such as Individual Conditional Expectation (ICE) plots and SHAP, as well as a sense of curiosity, we will extract powerful insights that could not be gained from simpler methods. In particular, we will focus on how to approach a data set with the goal of understanding as well as prediction.
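To make the idea of an ICE plot concrete, here is a minimal sketch in plain Python. It is an illustration only, not the workshop's material: `toy_predict` is a hypothetical stand-in for a fitted model's prediction function (in practice, something like a trained XGBoost model), and `ice_curves` and `partial_dependence` are helper names invented for this example. An ICE curve holds one row fixed, sweeps a single feature across a grid of values, and records the model's prediction at each value.

```python
# Hypothetical stand-in for a fitted model's prediction function.
# Assumption: any callable mapping a feature row to a number would work here.
def toy_predict(row):
    x0, x1 = row
    # The effect of x0 flips sign depending on x1 (an interaction).
    return x0 * (1.0 if x1 > 0 else -1.0)


def ice_curves(predict, rows, feature_index, grid):
    """Return one ICE curve per row: predictions as one feature sweeps the grid."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)              # copy so the original row is untouched
            modified[feature_index] = value   # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(curves):
    """Average the ICE curves pointwise to obtain the partial dependence curve."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]


# Two rows that differ only in the sign of the second feature.
rows = [[0.0, 1.0], [0.0, -1.0]]
grid = [-1.0, 0.0, 1.0]

curves = ice_curves(toy_predict, rows, feature_index=0, grid=grid)
pd_curve = partial_dependence(curves)
```

In this toy setup the two individual curves slope in opposite directions, while their average is flat. That is the kind of structure an averaged summary can hide and a per-row view can reveal.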

Bio: Brian Lucena is Principal at Lucena Consulting and a consulting Data Scientist at Agentero. An applied mathematician in every sense, he is passionate about applying modern machine learning techniques to understand the world and act upon it. In previous roles he has served as SVP of Analytics at PCCI, Principal Data Scientist at Clover Health, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC-Berkeley, Brown, USF, and the Metis Data Science Bootcamp.