Document Clustering with Open Source Tools

Abstract: 

Document clustering is a powerful application of Natural Language Processing (NLP) techniques, useful in topic modeling, topic extraction, and information retrieval. A central challenge in document clustering is the scale of the data involved, which makes it a prime candidate for development with Apache Spark, the open source distributed processing framework. At the same time, tremendous strides have been made in the past few years in the development of other open source tools useful in NLP applications. This workshop will make heavy use of Apache Spark, but will also look at integrating these other tools (Gensim, Scikit-Learn) into the development process. Additionally, particular attention will be paid to NLP and document clustering as an application of unsupervised learning.

In this workshop, we will build a document clustering model over a corpus of documents and discuss the how and the why of the steps involved in building the model. The model will be built using Apache Spark and other tools such as Gensim. We will explore the latest in preprocessing techniques with open source extensions to Apache Spark, Scikit-Learn, and Gensim. We will explore topic modeling and how topic modeling can inform and be informed by document clustering. We will discuss useful metrics in building document clustering and topic models. Finally, we will show how to use a document clustering model to perform simple information retrieval.
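As a rough illustration of the pipeline described above (vectorize, cluster, then retrieve), here is a minimal sketch using Scikit-Learn alone. The toy corpus, the `retrieve` helper, and the choice of TF-IDF with k-means are assumptions made for illustration; they are not necessarily the methods used in the workshop, which works at larger scale with Apache Spark and Gensim.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; a real corpus would be far larger.
docs = [
    "spark cluster executors shuffle partitions",
    "spark executors cache partitions memory",
    "sourdough bread flour yeast oven",
    "bread oven flour crust baking",
]

# Preprocessing: turn each document into a TF-IDF weighted term vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Unsupervised step: group the document vectors into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)


def retrieve(query, top_n=2):
    """Hypothetical helper: rank documents from the query's nearest cluster."""
    q = vectorizer.transform([query])
    cluster = kmeans.predict(q)[0]
    # Only documents in the same cluster as the query are scored.
    members = [i for i, label in enumerate(labels) if label == cluster]
    scores = cosine_similarity(q, X[members]).ravel()
    ranked = sorted(zip(members, scores), key=lambda pair: pair[1], reverse=True)
    return [docs[i] for i, _ in ranked[:top_n]]


results = retrieve("flour and yeast for bread")
```

Restricting the similarity search to a single cluster is what makes this a clustering-based retrieval: the query is compared against a fraction of the corpus rather than every document.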

The workshop will use both Python and Scala, but practitioners who know only Python are welcome and should have no trouble following along. The workshop will implicitly cover Amazon Web Services, especially S3, and JupyterLab, since these are used to perform the work. Some discussion may take place on applying Spark best practices (caching, view/table registration) in this work.

Bio: 

Joshua Cook has been teaching in one capacity or another for nearly fifteen years. He currently works as a Curriculum Developer for Databricks. Most recently, he taught Data Science for UCLA Extension. Prior to this, he taught Data Science for General Assembly, taught in the Master of Education program at UCLA, taught high school mathematics at Crenshaw and Jefferson High Schools in Los Angeles, and taught early childhood literacy in West Oakland. Additionally, Joshua is trained as a computational mathematician. He has production experience with model prediction and deployment using the Python numerical stack and Amazon Web Services. He is the author of the book Docker for Data Science, published by Apress Media.