When Your Big Data Seems Too Small: Accurate Inferences Beyond the Empirical Distribution
When Your Big Data Seems Too Small: Accurate Inferences Beyond the Empirical Distribution

Abstract: 

We will discuss several problems related to the challenge of making accurate inferences about a complex phenomenon, in the regime in which the amount of available data (i.e the sample size) is too small for the empirical distribution of the data to be an accurate representation of the phenomenon in question. This challenge arises in many settings involving high-dimensional data, or setting for which there are a large number of rarely observed elements (e.g. a significant fraction of words encountered in a typical text corpus occur only once or twice in the corpus, similarly for rare genetic mutations in genomic datasets, etc.). We show that for several fundamental and practically relevant settings, it is possible to ``denoise'' the empirical distribution of the data significantly. As a component of this approach, we describe how one can make accurate inferences about the "unseen" portion of the distribution, corresponding to events that were never observed in the given dataset. These techniques can be directly leveraged to predict the number of novel (previously unobserved) events that will likely be encountered in new batches of data. Finally, we will also discuss the problem of estimating the ``learnability'' of a dataset: given too little data to train an accurate model, it is often possible to estimate the extent to which a good model exists. Framed differently, even in the regime in which there is insufficient data to learn, one can still estimate the value of collecting or purchasing additional data. These problems have a number of practical applications across the sciences, and we will discuss several concrete applications to genomic and medical settings.

Bio: 

Gregory is an Assistant Professor in Stanford's Computer Science Department, working at the intersection of Algorithms, Machine Learning, Statistics, and Information Theory. One of the main themes of his my work is the design of efficient algorithms for accurately inferring information about complex distributions, given limited amounts of data, or limits on other resources such as the computation time, available memory or communication, or the quality of the available date. Prior to joining Stanford, I was a postdoc at Microsoft Research, New England, and received my PhD from Berkeley in Computer Science.