Abstract: Many industry segments have been grappling with fast data (high-volume, high-velocity data). Enterprises in these segments need to process fast data just in time to derive insights and act on them quickly. Such tasks include, but are not limited to, enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. To realize these results, an enterprise needs to build an end-to-end data processing system that spans data acquisition, data ingestion, data processing, and model building through to serving and sharing the results. This presents a significant challenge because of the multiple messaging frameworks, streaming computing frameworks, and storage frameworks that exist for real-time data. In the proposed tutorial, we shall lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We shall also share case studies from IoT, gaming, and healthcare, as well as our experience operating these systems at internet scale at Twitter and Yahoo. We shall conclude by offering perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large set of data streams (potentially on the order of hundreds of millions).
Bio: Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix (acquired by Twitter), a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL; briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents and publications and of one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.