Abstract: Building an efficient data pipeline to apply machine learning models in production has long been a challenge for data science practitioners and software engineers. While model formats have been largely standardized, there is a great variety of data input sources that almost always require customized processing. Streaming data inputs add further complexity: when data has to be consumed continuously, the data pipeline architecture must be carefully implemented for reliable production deployment. In TensorFlow 2.0, tf.data was introduced as the canonical way to process data for training and inference with tf.keras models. It simplifies data processing for both static and streaming data sources, greatly easing the production deployment of machine learning models.
In this tutorial, we will guide you through hands-on examples of integrating tf.keras models with different data input sources through tf.data in production environments, from simple CSV/JSON files to SQL databases and cloud data warehouse services such as Google Cloud BigQuery. We will also cover Apache Kafka as a data input to illustrate the streaming data pipeline architecture for continuous data processing with machine learning. As a bonus, attendees will learn the basics of distributed machine learning and its production usage.
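As a flavor of what the tutorial covers, here is a minimal sketch of the CSV case: building a tf.data pipeline from a local file and feeding it directly to a tf.keras model. The file name, column layout, and model are illustrative assumptions, not material from the tutorial itself.

```python
import tensorflow as tf

# Illustrative CSV with two feature columns and one integer label column.
with open("train.csv", "w") as f:
    f.write("0.1,0.2,1\n0.3,0.4,0\n")

# tf.data reads the CSV records and yields (features, label) pairs.
dataset = tf.data.experimental.CsvDataset(
    "train.csv", record_defaults=[tf.float32, tf.float32, tf.int32])
dataset = dataset.map(lambda x1, x2, y: (tf.stack([x1, x2]), y)).batch(2)

# A toy tf.keras model consuming the dataset directly, no manual feed loop.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=1, verbose=0)
```

The same `model.fit(dataset, ...)` call works unchanged whether the dataset wraps a static file or a streaming source, which is the key simplification tf.data brings.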
Bio: Yong Tang is the Director of Engineering at MobileIron. His most recent focus is on data processing in machine learning. He is a maintainer and the SIG I/O lead of the TensorFlow project. He received the Open Source Peer Bonus Award from Google for his contributions to TensorFlow and is the author of the Kafka Dataset module in TensorFlow. Beyond TensorFlow, Yong Tang contributes to many other open-source projects and is a maintainer of Docker, CoreDNS, and SwarmKit. He received his PhD in Computer Science & Engineering from the University of Florida.