Training & Workshop Sessions

– Taught by World-Class Data Scientists –

Learn the latest data science concepts, tools, and techniques from the best. Forge a connection with these rock stars from industry and academia, who are passionate about molding the next generation of data scientists.

Get hands-on training from leading data science instructors

Train with the best of the best. Our training instructors are highly experienced in machine learning, deep learning, and other data science topic areas, and are drawn from industry and academia.

Confirmed Sessions for West 2019 Include:

  • Understanding the PyTorch Framework with Applications to Deep Learning
  • Reinforcement Learning with TF-Agents & TensorFlow 2.0: Hands on
  • Data Storytelling Workshop
  • Network Analysis Made Simple
  • Advanced Machine Learning with scikit-learn
  • Deciphering the Black Box: Latest Tools and Techniques for Interpretability
  • Advanced Methods for Explaining XGBoost Models
  • Causal Inference for Data Science
  • Introduction to Machine Learning
  • Advanced Methods for Working with Missing Data in Supervised Machine Learning
  • Intermediate Machine Learning with scikit-learn
  • Intermediate Machine Learning in R
  • Introduction to RMarkdown in Shiny
  • Healthcare NLP with a doctor’s bag of notes
  • Fast and flexible probabilistic modeling in Python
  • Introduction to Machine Learning in R

Training & Workshop Sessions

ODSC West 2019 will host training and workshop sessions on some of the latest in-demand techniques, models, and frameworks, including:

Training Focus Areas

  • Deep Learning and Reinforcement Learning

  • Machine Learning, Transfer Learning, and Adversarial Learning

  • Computer Vision

  • NLP, Speech, and Text Analytics

  • Data Visualization

Quick Facts

  • Choose from 40 training sessions

  • Choose from 50 workshops

  • Hands-on training sessions are 4 hours in duration

  • Workshops and tutorials are 2 hours in duration

Frameworks

  • TensorFlow, PyTorch, and MXNet

  • Scikit-learn, PyMC3, Pandas, Theano, NLTK, NumPy, SciPy

  • Keras, Apache Spark, Apache Storm, Airflow, Apache Kafka

  • Kubernetes, Kubeflow, Apache Ignite, Hadoop

West 2019 Confirmed Instructors

Training Sessions

More sessions added weekly

Training: Apache Spark & Your Favorite Python Tools: Working Together for Fast Data Science at Scale

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, scoring (inference), and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark…more details

Instructor's Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s 20 years of engineering experience include streaming analytics, machine learning systems, and cluster management schedulers for some of the world’s largest banks, along with web, mobile, and embedded device apps for startups. His first full-time job in tech was on a neural-net-based fraud detection system for debit transactions, back in the bad old days when some neural nets were patented (!), and he’s much happier living in the age of amazing open-source data and ML tools today.

Adam Breindel

Apache Spark Expert, Data Science Instructor and Consultant

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, by using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
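
The input-output workflow described above, fitting on training pairs and then evaluating on held-out pairs, looks roughly like this in scikit-learn (the iris dataset and logistic regression are illustrative choices, not the talk's materials):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Input-output pairs: X holds the inputs x, y the outputs to predict
X, y = load_iris(return_X_y=True)

# Hold out part of the data so evaluation uses examples the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Learn" a function mapping x to y from the training pairs
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the learned function on the held-out pairs
acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

The same fit/predict/score pattern carries over to nearly every supervised problem; only the data and the estimator change.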

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Developer of scikit-learn at Columbia Data Science Institute

Training: Synthesizing Data Visualization and User Experience

The wealth of data available offers unprecedented opportunities for discovery and insight. How do we design a more intuitive and useful data experience? This workshop focuses on approaches to turn data into actionable insights by combining principles from data visualization and user experience design. Participants will be asked to think holistically about data visualizations and the people they serve. Through presentations and hands-on exercises, participants will learn how to choose and create data visualizations driven by user-oriented objectives.

Instructor's Bio

Mark Schindler is co-founder and Managing Director of GroupVisual.io. For over 15 years, he has designed user-interfaces for analytic software products and mobile apps for clients ranging from Fortune 50 companies to early-stage startups. In addition to design services, Mark and his team mentor startup companies and conduct workshops on data visualization, analytics and user-experience design.

Mark Schindler

Co-founder, Managing Director at GroupVisual.io

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together, starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate the practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation, and so on. This class is super hands-on and practical, and requires no math or prior experience with deep learning.

Instructor's Bio

Lukas Biewald is a co-founder and CEO of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. Lukas also co-founded Figure Eight (formerly CrowdFlower), a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized high-quality training data, in December 2007 with Chris Van Pelt. Prior to co-founding Weights and Biases and CrowdFlower, Biewald was a Senior Scientist and Manager within the Ranking and Management Team at Powerset, a natural language search technology company later acquired by Microsoft. From 2005 to 2006, Lukas also led the Search Relevance Team for Yahoo! Japan.

Lukas Biewald

Founder at Weights & Biases

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together, starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate the practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation, and so on. This class is super hands-on and practical, and requires no math or prior experience with deep learning.

Instructor's Bio

Chris Van Pelt is a co-founder of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. Chris also co-founded Figure Eight (formerly CrowdFlower), a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized high-quality training data, in December 2007 with Lukas Biewald.

Chris Van Pelt

Co-founder at Weights & Biases

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together, starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate the practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation, and so on. This class is super hands-on and practical, and requires no math or prior experience with deep learning.

Instructor's Bio

Stacey Svetlichnaya is a deep learning engineer at Weights & Biases in San Francisco, CA, helping develop effective tools and patterns for deep learning. Previously, she was a senior research engineer with Yahoo Vision & Machine Learning, working on image aesthetic quality and style classification, object recognition, photo caption generation, and emoji modeling. She has worked extensively on Flickr image search and data pipelines, as well as automating content discovery and recommendation. Prior to Flickr, she helped build a visual similarity search engine with LookFlow, which Yahoo acquired in 2013. Stacey holds a BS ‘11 and MS ’12 in Symbolic Systems from Stanford University.

Stacey Svetlichnaya

Deep Learning Engineer at Weights & Biases

Training: Machine Learning in R, Part I

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages…more details

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s from Columbia University in statistics and a bachelor’s from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.
He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Chief Data Scientist, Author of R for Everyone, Professor at Lander Analytics, Columbia Business School

Training: Machine Learning in R, Part II

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages…more details

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s from Columbia University in statistics and a bachelor’s from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.
He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Chief Data Scientist, Author of R for Everyone, Professor at Lander Analytics, Columbia Business School

Training: Intermediate Machine Learning with Scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity, and how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
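
The pipeline-plus-parameter-search pattern the session covers can be sketched as follows (the dataset, estimator, and parameter grid here are illustrative stand-ins, not the session's actual materials):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain preprocessing and model so cross-validation refits both per fold,
# preventing information from validation folds leaking into the scaler
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Parameters of pipeline steps are addressed with the "<step>__<param>" scheme
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.score(X_test, y_test))
```

Swapping `GridSearchCV` for `RandomizedSearchCV` changes only the search strategy; the pipeline and the `step__param` naming stay the same.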

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Developer of scikit-learn at Columbia Data Science Institute

Training: All The Cool Things You Can Do With PostgreSQL To Next Level Your Data Analysis

The intention of this VERY hands-on workshop is to get you introduced to, and playing with, some of the great features you never knew about in PostgreSQL. You know, and probably already love, PostgreSQL as your relational database. We will show you how you can forget about using ElasticSearch, MongoDB, and Redis for a broad array of use cases. We will add in some nice statistical work with R embedded in PostgreSQL. Finally, we will bring this all together using the gold standard in spatial databases, PostGIS. Unless you have a specialized use case, PostgreSQL is the answer. The session will be very hands-on, with plenty of interactive exercises.

By the end of the workshop, participants will leave with hands-on experience doing:

  • Spatial analysis
  • JSON search
  • Full text search
  • Using R for stored procedures and functions

All in PostgreSQL.

Instructor's Bio

Steve is a Partner and Director of Developer Relations for Crunchy Data (PostgreSQL people). He goes around and shows off all the great work the PostgreSQL community and Crunchy Committers do. He can teach you about Data Analysis with PostgreSQL, Java, Python, MongoDB, and some JavaScript. He has deep subject area expertise in GIS/Spatial, Statistics, and Ecology. He has spoken at over 75 conferences and done over 50 workshops including Monktoberfest, MongoNY, JavaOne, FOSS4G, CTIA, AjaxWorld, GeoWeb, Where2.0, and OSCON. Before Crunchy Data, Steve was a developer evangelist for DigitalGlobe, Red Hat, LinkedIn, deCarta, and ESRI. Steve has a Ph.D. in Ecology.

Steven Pousty, PhD

Director of Developer Relations at Crunchy Data

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS

Deep learning is the newest area of machine learning and has become ubiquitous in predictive modeling. The complex, brainlike structure of deep learning models is used to find intricate patterns in large volumes of data. These models have heavily improved the performance of general supervised models, time series, speech recognition, object detection and classification, and sentiment analysis.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. For example, factorization machines can be used to recommend your next Netflix binge based on how you and other streamers rate content…more details
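
The user-item interaction model behind factorization machines can be sketched in a few lines of NumPy: a bias, linear weights, and a latent vector per feature whose pairwise dot products capture interactions. Everything below (the `fm_predict` helper, dimensions, and data) is hypothetical, for illustration only:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization-machine score: bias + linear terms + pairwise
    interactions, where feature i has latent vector V[i]."""
    linear = w0 + w @ x
    # O(n*k) identity for the sum over all pairwise <V[i], V[j]> * x_i * x_j
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

rng = np.random.default_rng(0)
n_features, k = 6, 3          # e.g. one-hot user ids plus one-hot item ids
x = np.zeros(n_features)
x[1] = x[4] = 1.0             # sparse input: "user 1 rated item 4"
w0 = 0.1
w = rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))

print(fm_predict(x, w0, w, V))
```

With one-hot user and item features, the interaction term reduces to the dot product of that user's and that item's latent vectors, which is what makes the model work as a recommender.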

Instructor's Bio

Jordan Bakerman holds a Ph.D. in statistics from North Carolina State University. His dissertation centered on using social media to forecast real world events, such as civil unrest and influenza rates. As an intern at SAS, Jordan wrote the SAS Programming for R Users course for students to efficiently transition from the R to SAS using a cookbook style approach. As an employee, Jordan has developed courses demonstrating how to integrate open source software within SAS products. He is passionate about statistics, programming, and helping others become better statisticians.

Jordan Bakerman, PhD

Sr. Analytical Training Consultant at SAS

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS

Deep learning is the newest area of machine learning and has become ubiquitous in predictive modeling. The complex, brainlike structure of deep learning models is used to find intricate patterns in large volumes of data. These models have heavily improved the performance of general supervised models, time series, speech recognition, object detection and classification, and sentiment analysis.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. For example, factorization machines can be used to recommend your next Netflix binge based on how you and other streamers rate content…more details

Instructor's Bio

Ari holds bachelor’s degrees in both physics and mathematics from UNC-Chapel Hill. His research focused on collecting and analyzing low energy physics data to better understand the neutrino. Ari taught introductory and advanced physics and scientific programming courses at UC-Berkeley while working on a master’s in physics with a focus on nonlinear dynamics. While at SAS, Ari has worked to develop courses that teach how to use Python code to control SAS analytical procedures.

Ari Zitin

Analytical Training Consultant at SAS

Training: Designing Modern Streaming Data Applications

Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results…more details

Instructor's Bio

Until recently, Arun Kejariwal was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Arun Kejariwal

Independent

Training: Designing Modern Streaming Data Applications

Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results…more details

Instructor's Bio

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Karthik Ramasamy, PhD

Co-founder and CEO at Streamlio

Training: Understanding the PyTorch Framework with Applications to Deep Learning

Over the past couple of years, PyTorch has been increasing in popularity in the Deep Learning community. What was initially a tool for Deep Learning researchers has been making headway in industry settings.

In this session, we will cover how to create Deep Neural Networks using the PyTorch framework on a variety of examples. The material will range from beginner – understanding what is going on “under the hood”, coding the layers of our networks, and implementing backpropagation – to more advanced material on RNNs, CNNs, LSTMs, & GANs.

Attendees will leave with a better understanding of the PyTorch framework, in particular how it differs from Keras and TensorFlow. Furthermore, a link to a clean, documented GitHub repo with the solutions of the examples covered will be provided.
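
The beginner portion, coding layers and implementing backpropagation by hand, amounts to something like the following NumPy sketch of one training step for a tiny two-layer network (shapes, data, and learning rate are illustrative; PyTorch's autograd and optimizers automate the backward pass and update):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))             # a tiny batch of inputs
y = rng.normal(size=(8, 1))             # regression targets
W1 = rng.normal(size=(4, 16))           # first linear layer
W2 = rng.normal(size=(16, 1))           # second linear layer

# Forward pass: linear -> ReLU -> linear, scored with mean squared error
h = np.maximum(0, X @ W1)
pred = h @ W2
loss = np.mean((pred - y) ** 2)

# Backward pass: the chain rule applied layer by layer, back to front
grad_pred = 2 * (pred - y) / len(y)     # dLoss/dpred
grad_W2 = h.T @ grad_pred               # dLoss/dW2
grad_h = grad_pred @ W2.T               # dLoss/dh
grad_W1 = X.T @ (grad_h * (h > 0))      # ReLU passes gradient only where h > 0

# One gradient-descent step using the hand-computed gradients
lr = 1e-3
W1_new, W2_new = W1 - lr * grad_W1, W2 - lr * grad_W2
loss_after = np.mean((np.maximum(0, X @ W1_new) @ W2_new - y) ** 2)
print(loss, "->", loss_after)
```

In PyTorch the backward section collapses to `loss.backward()` plus an optimizer step, but writing it out once makes clear what the framework is doing.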

Instructor's Bio

Robert loves to break deep technical concepts down to be as simple as possible, but no simpler.

Robert has data science experience in companies both large and small. He is currently Head of Data Science for Podium Education, where he builds models to improve student outcomes, and an Adjunct Professor at Santa Clara University’s Leavey School of Business. Prior to Podium Education, he was a Senior Data Scientist at Metis teaching Data Science and Machine Learning. At Intel, he tackled problems in data center optimization using cluster analysis, enriched market sizing models by implementing sentiment analysis from social media feeds, and improved data-driven decision making in one of the top 5 global supply chains. At Tamr, he built models to unify large amounts of messy data across multiple silos for some of the largest corporations in the world. He earned a PhD in Applied Mathematics from Arizona State University where his research spanned image reconstruction, dynamical systems, mathematical epidemiology and oncology.

Robert Alvarez, PhD

Head of Data Science at Podium Education

Training: Introduction to RMarkdown in Shiny

  • Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Formatting Text, Creating Ordered and Unordered Lists, Making Links, Number Sections, Include Table of Contents
  • Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching
  • Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode
  • Develop Flexdashboards (30 minutes): Start with the Flexdashboard Layout, Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code…more details

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s from Columbia University in statistics and a bachelor’s from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.
He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Chief Data Scientist, Author of R for Everyone, Professor at Lander Analytics, Columbia Business School

Training: Reinforcement Learning with TF-Agents & TensorFlow 2.0: Hands On

In this workshop you will discover how machines can learn complex behaviors and anticipatory actions. Using this approach, autonomous helicopters fly aerobatic maneuvers, and even the Go world champion was beaten with it. A training dataset containing the “right” answers is not needed, nor is “hard-coded” knowledge. The approach is called “reinforcement learning” and is almost magical.

Using TF-Agents on top of TensorFlow 2.0 we will see how a real-life problem can be turned into a reinforcement learning task. In an accompanying Python notebook, we implement – step by step – all solution elements, highlight the design of Google’s newest reinforcement learning library, point out the role of neural networks and look at optimization opportunities…more details
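
The core idea, learning from reward signals rather than labeled answers, can be shown in miniature with tabular Q-learning on a made-up chain environment (the environment, rewards, and hyperparameters below are purely illustrative; TF-Agents runs the same loop with neural networks in place of the table):

```python
import numpy as np

# Made-up environment: 4 states in a row, actions 0 (left) and 1 (right);
# only reaching the rightmost state yields a reward.
n_states, n_actions = 4, 2

def step(state, action):
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward, nxt == n_states - 1   # next state, reward, done

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))           # learned action values
alpha, gamma, epsilon = 0.5, 0.9, 0.3

for _ in range(500):                          # episodes of trial and error
    s, done = 0, False
    for _ in range(100):                      # cap episode length
        # epsilon-greedy: mostly exploit what was learned, sometimes explore
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # nudge Q[s, a] toward observed reward plus discounted future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        s = s2
        if done:
            break

print("greedy action per state:", np.argmax(Q, axis=1))
```

No “right answers” appear anywhere: the agent discovers the head-right policy purely from the reward at the end of the chain.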

Instructor's Bio

Oliver Zeigermann is a developer and consultant from Hamburg, Germany. He has been involved with AI since his studies in the 90s, has written several books, and recently published the “Deep Learning Crash Course” with Manning. More at http://zeigermann.eu/

Oliver Zeigermann

Consultant at embarc / bSquare

Training: Reinforcement Learning with TF-Agents & TensorFlow 2.0: Hands On

In this workshop you will discover how machines can learn complex behaviors and anticipatory actions. Using this approach, autonomous helicopters fly aerobatic maneuvers, and even the Go world champion was beaten with it. A training dataset containing the “right” answers is not needed, nor is “hard-coded” knowledge. The approach is called “reinforcement learning” and is almost magical.

Using TF-Agents on top of TensorFlow 2.0 we will see how a real-life problem can be turned into a reinforcement learning task. In an accompanying Python notebook, we implement – step by step – all solution elements, highlight the design of Google’s newest reinforcement learning library, point out the role of neural networks and look at optimization opportunities…more details

Instructor's Bio

Christian is a consultant at bSquare with a focus on machine learning and .NET development. He has a PhD in computer algebra from ETH Zurich and did a postdoc at UC Berkeley, where he researched online data mining algorithms. Currently he applies reinforcement learning to industrial hydraulics simulations.

Christian Hidber, PhD

Consultant at bSquare

Training: Advanced Machine Learning with Scikit-learn, Part I

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, advanced model evaluation, feature engineering, and working with imbalanced datasets. We will also work with text data using the bag-of-words method for classification.

This workshop assumes familiarity with Jupyter notebooks and basics of pandas, matplotlib and numpy. It also assumes some familiarity with the API of scikit-learn and how to do cross-validations and grid-search with scikit-learn.

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Developer of scikit-learn at Columbia Data Science Institute

Training: Advanced Machine Learning with Scikit-learn, Part II

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some advanced topics in using scikit-learn, such as how to perform out-of-core learning with scikit-learn and how to speed up parameter search. We’ll also cover how to build your own models or feature extraction methods that are compatible with scikit-learn, which is important for feature extraction in many domains. We will see how we can customize scikit-learn even further, using custom methods for cross-validation or model evaluation.
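A custom component only has to follow scikit-learn's estimator conventions (`fit` returns `self`, learned attributes get a trailing underscore). A minimal sketch, with an invented percentile-clipping transformer standing in for whatever domain-specific feature extraction you need:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Clip each feature to a percentile range learned during fit."""

    def __init__(self, low=5.0, high=95.0):
        self.low = low      # store params unchanged: keeps grid search happy
        self.high = high

    def fit(self, X, y=None):
        # Trailing-underscore attributes mark state learned from data.
        self.low_, self.high_ = np.percentile(X, [self.low, self.high], axis=0)
        return self         # fit must return self

    def transform(self, X):
        return np.clip(X, self.low_, self.high_)

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # one outlier
pipe = make_pipeline(ClipTransformer(), LinearRegression())
pipe.fit(X, [1, 2, 3, 4])                      # drops straight into a pipeline
```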

This workshop assumes familiarity with Jupyter notebooks and basics of pandas, matplotlib and numpy. It also assumes experience using scikit-learn and familiarity with the API.

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Developer of scikit-learn at Columbia Data Science Institute

Training: From Numbers to Narrative: Turning Raw Data into Compelling Stories with Impact

Humans evolved to tell stories. In fact, we evolved to rely on story for our most important learning. Some argue story was more important for the survival of the species than opposable thumbs.

In this half-day workshop, learn how to take a step back from your data and think like a storyteller. Learn some key ideas and techniques to turn your numbers into a narrative – to make a compelling story that will have an impact on your audience. We will cover practical, actionable ideas that will make your next effort at communicating with data much more powerful.

Instructor's Bio

Bill is an information designer, helping clients turn their data into compelling visual and often interactive experiences. Project and workshop clients include the World Bank, United Nations, International Monetary Fund, Starbucks, American Express, PricewaterhouseCoopers, Facebook, and the City of Boston. He is the founder of Beehive Media, a Boston-based data visualization and information design consultancy. Bill teaches data storytelling, information design and data visualization on LinkedIn Learning & Lynda.com and in workshops around the world. Bill has a new keynote talk about how you can leverage outsider attributes to wield influence beyond your role within your organization. Ask him about it!

Bill Shander

Founder at Beehive Media

Training: Intermediate RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Format Text, Create Ordered and Unordered Lists, Make Links, Number Sections, Include a Table of Contents
Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching
Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode
Develop Flexdashboards (30 minutes): Start with the Flexdashboard Layout, Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code.

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.
He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Chief Data Scientist, Author of R for Everyone, Professor at Lander Analytics, Columbia Business School

Training: Network Analysis Made Simple

Have you ever wondered about how those data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. In this tutorial, we will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems – as in, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment, and will become comfortable in visualizing large networks using Circos plots. Other plots will be introduced as well.

Instructor's Bio

Eric is an Investigator at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering, MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught Network Analysis at a variety of data science venues, including PyCon USA, SciPy, PyData and ODSC, and has also co-developed the Python Network Analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib and bokeh. He has also led the development of the graph visualization package nxviz, and a data cleaning package pyjanitor (a Python port of the R package).

Eric Ma, PhD

Author of nxviz Package

Training: Adapting Machine Learning Algorithms to Novel Use Cases

How can an idea from an 18th century Presbyterian minister be used to estimate the mass density function of galaxies across the Universe? How can a marketing segmentation algorithm protect astronauts traveling to Mars from certain death? How does a Formula 1 race from the 1950s inspire one of the greatest data science use cases for the Internet of Things? How can a violation of the triangle inequality theorem in mathematics lead to a cure for cancer? This workshop will answer these questions, and more, by presenting several examples of one of the key aptitudes of successful data science practice: adaptability. In particular, I will present several well-known algorithms (including some that we would not even call “algorithms”) that have been adopted for specific use cases or applied in specific business domains, and then show how each one can be adapted to a novel, less obvious use case, perhaps producing surprising results in some other domain.

Instructor's Bio

Coming Soon!

Dr. Kirk Borne

Principal Data Scientist at Booz Allen Hamilton

Training: Deep Learning (with TensorFlow 2.0)

Relatively obscure a few short years ago, Deep Learning is ubiquitous today across data-driven applications as diverse as machine vision, natural language processing, and super-human game-playing.

This Deep Learning primer brings the revolutionary machine-learning approach behind contemporary artificial intelligence to life with interactive demos featuring TensorFlow 2.0, the major, cutting-edge revision of the world’s most popular Deep Learning library.

Instructor's Bio

Jon Krohn is Chief Data Scientist at the machine learning company untapt. He presents an acclaimed series of tutorials published by Addison-Wesley, including Deep Learning with TensorFlow and Deep Learning for Natural Language Processing. Jon teaches his deep learning curriculum in-classroom at the New York City Data Science Academy and guest lectures at Columbia University. He holds a doctorate in neuroscience from the University of Oxford and, since 2010, has been publishing on machine learning in leading peer-reviewed journals. His book, Deep Learning Illustrated, is being published by Pearson in 2019.

Dr. Jon Krohn

Chief Data Scientist at Untapt, Author of Deep Learning Illustrated

Training: Hands-On Introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a text generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore related architectures such as GRUs and bidirectional LSTMs to see how well they work.
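Keras hides the cell internals, but the gating arithmetic a single LSTM step performs can be written out in a few lines of numpy; the random weights below exist only to show the shapes and equations, not to do anything useful:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16                      # input and hidden sizes

# One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) * 0.1 for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    z = np.concatenate([x, h])           # the cell sees input + previous hidden
    i = sigmoid(W["i"] @ z + b["i"])     # how much new info to let in
    f = sigmoid(W["f"] @ z + b["f"])     # how much old state to keep
    o = sigmoid(W["o"] @ z + b["o"])     # how much state to expose
    g = np.tanh(W["g"] @ z + b["g"])     # candidate new content
    c_new = f * c + i * g                # updated cell state
    h_new = o * np.tanh(c_new)           # new hidden state / output
    return h_new, c_new

h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):     # run a length-5 input sequence
    h, c = lstm_step(x, h, c)
```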

Instructor's Bio

Lukas Biewald is a co-founder and CEO of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. Lukas also co-founded Figure Eight (formerly CrowdFlower), a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized, high-quality training data, with Chris Van Pelt in December 2007. Prior to co-founding Weights and Biases and CrowdFlower, Biewald was a Senior Scientist and Manager within the Ranking and Management Team at Powerset, a natural language search technology company later acquired by Microsoft. From 2005 to 2006, Lukas also led the Search Relevance Team for Yahoo! Japan.

Lukas Biewald

Founder at Weights & Biases

Training: Hands-On Introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a text generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore related architectures such as GRUs and bidirectional LSTMs to see how well they work.

Instructor's Bio

Chris Van Pelt is a co-founder of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. Chris also co-founded Figure Eight (formerly CrowdFlower), a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized, high-quality training data, with Lukas Biewald in December 2007.

Chris Van Pelt

Co-founder at Weights & Biases

From Stored Data To Data Stories: Building Data Narratives With Open-source Tools

Literate computing weaves a narrative directly into an interactive computation. Text, code, and results are combined into a narrative that relies equally on textual explanations and computational components. Insights are extracted from data using computational tools. These insights are communicated to an audience in the form of a narrative that resonates with the audience. Literate computing lends itself to the practice of reproducible research. One may re-run the analyses; run the analyses with new data sets; modify the code for other purposes.

This workshop will take one through the steps associated with literate computing: data retrieval; data curation; model construction, evaluation, and selection; and reporting. Particular attention will be paid to reporting, i.e., building a narrative. Examples will be presented demonstrating how one might generate multiple output formats (e.g., HTML pages, presentation slides, PDF documents) starting with the same code base.

Instructor's Bio

Paul Kowalczyk is a Senior Data Scientist at Solvay. There, Paul uses a variety of toolchains and machine learning workflows to visualize, analyze, mine, and report data; to generate actionable insights from data. Paul is particularly interested in democratizing data science, working to put data products into the hands of his colleagues. His experience includes using computational chemistry, cheminformatics, and data science in the biopharmaceutical and agrochemical industries. Paul received his PhD from Rensselaer Polytechnic Institute, and was a Postdoctoral Research Fellow with IBM’s Data Systems Division.

Paul Kowalczyk, PhD

Senior Data Scientist at Solvay

Training: Hands-On Introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a text generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore related architectures such as GRUs and bidirectional LSTMs to see how well they work.

Instructor's Bio

Stacey Svetlichnaya is a deep learning engineer at Weights & Biases in San Francisco, CA, helping develop effective tools and patterns for deep learning. She was previously a senior research engineer with Yahoo Vision & Machine Learning, working on image aesthetic quality and style classification, object recognition, photo caption generation, and emoji modeling. She has worked extensively on Flickr image search and data pipelines, as well as automating content discovery and recommendation. Prior to Flickr, she helped build a visual similarity search engine with LookFlow, which Yahoo acquired in 2013. Stacey holds a BS ‘11 and MS ’12 in Symbolic Systems from Stanford University.

Stacey Svetlichnaya

Deep Learning Engineer at Weights & Biases

Training: Building Web Applications in R Using Shiny

Shiny is an R package that can be used to build interactive web pages with R. This might sound strange or scary, but you don’t need to have any web knowledge – it’s just R! If you’ve ever written an analysis in R and you want to make it interactive, you can use Shiny. If you’ve ever written a function or model that you want to share with others who don’t know how to use R, you can use Shiny. Shiny has many use cases, and this course will help you see how you can leverage it in your own work. In this workshop, you’ll learn how to take a Shiny app from start to finish – we’ll start by building a simple Shiny app to interactively visualize a dataset, and deploy it online to make it accessible to the world.

Instructor's Bio

Dean Attali is the founder of AttaliTech Ltd, a consulting firm that specializes in providing world-class expertise and training in R-Shiny. Dean studied Computer Science at the University of Waterloo, Canada, and has 10+ years of experience as a software engineer in both large companies (Google, IBM) and small startups (Wish.com, Tagged.com). After getting a good taste of the San Francisco tech life, Dean was curious to see what academia had to offer and went on to pursue a graduate degree in Bioinformatics at the University of British Columbia in Vancouver, Canada. He became involved with Shiny in its early days and quickly developed a deep passion for R-Shiny. Dean is the author of many popular Shiny packages (e.g., shinyjs, timevis, colourpicker, shinyalert), Shiny tutorials, and a Shiny online video course on DataCamp.

Dean Attali

Founder & Lead R-Shiny Consultant at AttaliTech Ltd

Training: Programming with Data: Foundation of Python & Pandas

Whether in R, MATLAB, Stata, or Python, modern data analysis, for many researchers, requires some kind of programming. The preponderance of tools and specialized languages for data analysis suggests that general purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.

In this training, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality, specifically, loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.
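The loading, filtering, grouping, and transforming steps above fit in a few lines (the sales table is invented; in practice the data would come from `pd.read_csv` or similar):

```python
import pandas as pd

# Loading: pd.read_csv would be typical; an inline frame keeps this self-contained.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "units":  [10, 15, 7, 12, 9],
    "price":  [2.0, 2.0, 3.0, 3.0, 3.0],
})

df["revenue"] = df["units"] * df["price"]           # transforming
big = df[df["revenue"] > 20]                        # filtering with a boolean mask
per_region = df.groupby("region")["revenue"].sum()  # grouping + aggregating
```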

Instructor's Bio

Daniel Gerlanc has worked as a data scientist for more than a decade and written software professionally for 15 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors. At Enplus, he works with clients on data science and custom software development, with a particular focus on projects requiring expertise in both areas. He teaches data science and software development from introductory through advanced levels. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups.

Daniel Gerlanc

President at Enplus Advisors Inc.

Training: Deploying Deep Learning models as Microservices

Powering your application with deep learning is no walk in the park, but it is certainly attainable with some tricks and good practice. Serving a deep learning model on a production system demands that the model be stable, reproducible, isolatable, and able to behave as a stand-alone package. One possible solution is a containerized microservice.

Ideally, serving deep learning microservices should be quick and efficient, without having to dive deep into the underlying algorithms and their implementation. Too good to be true? Not anymore! Together, we will demystify the process of developing, training, and deploying deep learning models as a web microservice.

We will kick off with an overview of how deep learning models are best published as Docker images on DockerHub, and are best prepared for deployment in local or cloud environments using Kubernetes or Docker.
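As a rough sketch of the serving side, here is a minimal Flask endpoint with a stub standing in for the trained model; the route name and stub logic are invented, and a real deployment would wrap something like this in a Docker image as the description above outlines:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_stub(features):
    """Stand-in for a real deep learning model's predict() call."""
    return {"label": "positive" if sum(features) > 0 else "negative"}

@app.route("/predict", methods=["POST"])
def predict():
    # Accept a JSON payload like {"features": [1.0, 2.0]} and return JSON back.
    payload = request.get_json()
    return jsonify(predict_stub(payload["features"]))

if __name__ == "__main__":
    # In a container, this is roughly what the Docker CMD would launch.
    app.run(host="0.0.0.0", port=8080)
```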

Instructor's Bio

Gabriela de Queiroz is a Sr. Engineering & Data Science Manager/Sr. Developer Advocate at IBM, where she leads and manages a team of data scientists and software engineers contributing to open source and artificial intelligence projects. She works on a variety of open source projects and is actively involved with several organizations that foster an inclusive community. She is the founder of R-Ladies, a worldwide organization for promoting diversity in the R community with more than 175 chapters in 45+ countries. She is now working to make AI more diverse and inclusive through her new organization, AI Inclusive. She has worked in several startups where she built teams, developed statistical and machine learning models, and employed a variety of techniques to derive insights and drive data-centric decisions.

Gabriela de Queiroz

Sr. Engineering and Data Science Manager at IBM, Founder of R-Ladies

Training: Deploying Deep Learning models as Microservices

Powering your application with deep learning is no walk in the park, but it is certainly attainable with some tricks and good practice. Serving a deep learning model on a production system demands that the model be stable, reproducible, isolatable, and able to behave as a stand-alone package. One possible solution is a containerized microservice.

Ideally, serving deep learning microservices should be quick and efficient, without having to dive deep into the underlying algorithms and their implementation. Too good to be true? Not anymore! Together, we will demystify the process of developing, training, and deploying deep learning models as a web microservice.

We will kick off with an overview of how deep learning models are best published as Docker images on DockerHub, and are best prepared for deployment in local or cloud environments using Kubernetes or Docker.

Instructor's Bio

Simon is a Developer Advocate at the Center for Open-Source Data & AI Technologies. Previously, he worked as a machine learning consultant in Europe, and was with UC San Francisco before that. Simon holds a Master’s degree in Bioinformatics engineering, and a Bachelor’s degree in molecular biology.

Simon Plovyt

Developer Advocate at Center for Open-Source Data & AI Technologies

Training: Deploying Deep Learning models as Microservices

Powering your application with deep learning is no walk in the park, but it is certainly attainable with some tricks and good practice. Serving a deep learning model on a production system demands that the model be stable, reproducible, isolatable, and able to behave as a stand-alone package. One possible solution is a containerized microservice.

Ideally, serving deep learning microservices should be quick and efficient, without having to dive deep into the underlying algorithms and their implementation. Too good to be true? Not anymore! Together, we will demystify the process of developing, training, and deploying deep learning models as a web microservice.

We will kick off with an overview of how deep learning models are best published as Docker images on DockerHub, and are best prepared for deployment in local or cloud environments using Kubernetes or Docker.

Instructor's Bio

Saishruthi Swaminathan is a developer advocate and data scientist on the IBM CODAIT team, whose main focus is to democratize data and AI through open source technologies. Her passion is to dive deep into the ocean of data, extract insights, and use AI for social good. Previously, she worked as a software developer. She is on a mission to spread the knowledge and experience she has acquired along the way. She also leads an education initiative for rural children and organizes meetups focused on women’s empowerment. She has a master’s in electrical engineering, specializing in data science, and a bachelor’s degree in electronics and instrumentation. She can be found on LinkedIn (https://www.linkedin.com/in/saishruthi-swaminathan/) and Medium (https://medium.com/@saishruthi.tn).

Saishruthi Swaminathan

Developer Advocate and Data Scientist at IBM CODAIT

Training: Reveal Predictive Patterns with Neo4j Graph Algorithms

Data scientists are using graphs to help improve predictions and enhance their machine learning models. Application developers are extending their use of Neo4j’s native graph platform to incorporate native graph algorithms to build more intelligent applications. Using the two together shows us how connected data can impact our understanding of data structure and improve our analysis and predictions. Graph data analytics provides capabilities and insight that other tools and stores cannot uncover.
In this session, you will learn how to combine Neo4j, the Cypher query language, and graph algorithms for data science and analytics uses. The presenters will discuss the Neo4j graph database through an analytics lens and explain how Neo4j’s graph algorithms library is architected. We’ll cover exploratory analysis as well as how to extract more predictive patterns and structures in your data using graph algorithms.

Instructor's Bio

Amy E. Hodler is a network science devotee and AI and graph analytics program manager at Neo4j. She promotes the use of graph analytics to reveal structures within real-world networks and predict dynamic behavior. Amy helps teams apply novel approaches to generate new opportunities at companies such as EDS, Microsoft, Hewlett-Packard (HP), Hitachi IoT, and Cray. Amy has a love for science and art with a fascination for complexity studies and graph theory. She tweets as @amyhodler.

Amy E. Hodler

Graph Analytics & AI Program Manager, Author at Neo4j

Training: Reveal Predictive Patterns with Neo4j Graph Algorithms

Data scientists are using graphs to help improve predictions and enhance their machine learning models. Application developers are extending their use of Neo4j’s native graph platform to incorporate native graph algorithms to build more intelligent applications. Using the two together shows us how connected data can impact our understanding of data structure and improve our analysis and predictions. Graph data analytics provides capabilities and insight that other tools and stores cannot uncover.
In this session, you will learn how to combine Neo4j, the Cypher query language, and graph algorithms for data science and analytics uses. The presenters will discuss the Neo4j graph database through an analytics lens and explain how Neo4j’s graph algorithms library is architected. We’ll cover exploratory analysis as well as how to extract more predictive patterns and structures in your data using graph algorithms.

Instructor's Bio

Coming Soon!

Jennifer Reif

Developer Relations Engineer at Neo4j

Training: Building Sentence Similarity Applications at Scale

Comparing the similarity of two sentences is an integral part of many natural language processing scenarios, ranging from search and retrieval, nearest-neighbor and kernel-based classification methods, to recommendation and ranking tasks. Building state-of-the-art models at production scale can be difficult when you’re on a small team and not both an NLP and a DevOps expert. In this workshop, we will walk through the Natural Language Processing Best Practices GitHub repo (https://github.com/microsoft/nlp) provided by Microsoft and show how to create baseline representation models for sentence similarity scenarios from popular open source technologies like gensim and scikit-learn. We will then use Microsoft’s Automated Machine Learning to create a competitive model with popular sentence encoders from Google and create reusable machine learning pipelines deployed at scale on Azure Kubernetes Service.
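A baseline sentence representation of the kind the session starts from can be built from scikit-learn alone, with TF-IDF vectors plus cosine similarity (the sentences are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply today",
]

vect = TfidfVectorizer()
X = vect.fit_transform(sentences)  # one TF-IDF vector per sentence
sim = cosine_similarity(X)         # pairwise similarity matrix

# The two cat sentences should score closer to each other than either
# does to the finance sentence.
```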

Instructor's Bio

Courtney Cochrane is a Data Scientist in the Microsoft Artificial Intelligence Development Acceleration Program (MAIDAP). In her current role, she is responsible for accelerating the integration and development of AI models across Microsoft’s different product teams. She has previously partnered with Office, and most recently has worked with Microsoft’s Azure Machine Learning team to create a repo that showcases best practices for NLP scenarios (https://github.com/microsoft/nlp). Prior to joining Microsoft, she earned a degree in Mathematics and Computer Science from Davidson College and a Master’s in Computational Science and Engineering from Harvard University.

Courtney Cochrane

Data Scientist at Microsoft

Training: Building Sentence Similarity Applications at Scale

Comparing the similarity of two sentences is an integral part of many natural language processing scenarios, ranging from search and retrieval, nearest-neighbor and kernel-based classification methods, to recommendation and ranking tasks. Building state-of-the-art models at production scale can be difficult when you’re on a small team and not both an NLP and a DevOps expert. In this workshop, we will walk through the Natural Language Processing Best Practices GitHub repo (https://github.com/microsoft/nlp) provided by Microsoft and show how to create baseline representation models for sentence similarity scenarios from popular open source technologies like gensim and scikit-learn. We will then use Microsoft’s Automated Machine Learning to create a competitive model with popular sentence encoders from Google and create reusable machine learning pipelines deployed at scale on Azure Kubernetes Service.

Instructor's Bio

Abhiram is a Software Engineer in Microsoft’s AI Acceleration and Development Program. His duties are similar to those of a machine learning engineer working at the intersection of systems and AI. Over the last year at Microsoft, he has built machine learning pipelines, both applying ML to systems problems and applying systems engineering to make ML pipelines operationalizable. Before his time at Microsoft, he was a Software Developer at InMobi, then returned to school for a master’s at the University of Massachusetts, Amherst.

Abhiram Eswaran

Software Engineer at Microsoft

Training: Building Sentence Similarity Applications at Scale

Comparing the similarity of two sentences is an integral part of many natural language processing scenarios, ranging from search and retrieval, nearest-neighbor and kernel-based classification methods, to recommendation and ranking tasks. Building state-of-the-art models at production scale can be difficult when you’re on a small team and not both an NLP and a DevOps expert. In this workshop, we will walk through the Natural Language Processing Best Practices GitHub repo (https://github.com/microsoft/nlp) provided by Microsoft and show how to create baseline representation models for sentence similarity scenarios from popular open source technologies like gensim and scikit-learn. We will then use Microsoft’s Automated Machine Learning to create a competitive model with popular sentence encoders from Google and create reusable machine learning pipelines deployed at scale on Azure Kubernetes Service.

Instructor's Bio

Janhavi started working for Microsoft within a few months of graduation. She has a Master’s in Computer Science from Northeastern University and an undergraduate degree from the University of Mumbai. After her undergraduate studies, she worked for two years at JPMorgan Chase & Co. in India and then moved to Boston for graduate school.

Janhavi Mahajan

Software Engineer at Microsoft

Training: Building Sentence Similarity Applications at Scale

Comparing the similarity of two sentences is an integral part of many natural language processing scenarios, ranging from search and retrieval, nearest-neighbor and kernel-based classification methods, to recommendation and ranking tasks. Building state-of-the-art models at production scale can be difficult when you’re on a small team and not both an NLP and a DevOps expert. In this workshop, we will walk through the Natural Language Processing Best Practices GitHub repo (https://github.com/microsoft/nlp) provided by Microsoft and show how to create baseline representation models for sentence similarity scenarios from popular open source technologies like gensim and scikit-learn. We will then use Microsoft’s Automated Machine Learning to create a competitive model with popular sentence encoders from Google and create reusable machine learning pipelines deployed at scale on Azure Kubernetes Service.

Instructor's Bio

Heather Spetalnick (formerly Shapiro) is a Program Manager from Boston on the Azure Machine Learning team. She works to ensure successful user experiences across the Azure Machine Learning platform for professional data scientists. She is also building a set of best practices across areas such as recommendation systems, computer vision, and NLP, to provide fast-tracked education that enables users to learn and apply the right scenario for the problem they are facing. Prior to becoming a Program Manager, Heather was a Technical Evangelist in New York, where she worked closely with partners, developer communities, and students to help them learn, adopt, and use preferred data science and machine learning practices and tools from Microsoft. Heather completed her undergraduate degree at Duke University, graduating in the Class of 2015. She received a Bachelor of Science in Computer Science and Statistical Science and completed an honors thesis on employing Bayesian approaches to understanding music popularity. Heather tweets at http://twitter.com/hspetalnick.

Heather Spetalnick

Program Manager at Microsoft

Training: Managing the Complete Machine Learning Lifecycle with MLflow

ML development brings many new complexities beyond the traditional software development lifecycle. Unlike traditional software developers, ML developers try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce their work. In addition, developers need to use many distinct systems to productionize models.

Jules Damji walks you through MLflow, an open source project that simplifies the entire ML lifecycle, to solve this problem. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size…more details
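The tracking problem described above can be sketched in a few lines. The toy `RunTracker` below is purely illustrative and is not MLflow's API; MLflow exposes this functionality through calls such as `mlflow.log_param` and `mlflow.log_metric`, plus storage, a UI, and model packaging.

```python
import json, time, uuid

class RunTracker:
    """Toy illustration of experiment tracking: record params and
    metrics per run so results can be compared and reproduced."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        run = {"run_id": uuid.uuid4().hex, "time": time.time(),
               "params": params, "metrics": metrics}
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric, maximize=True):
        # Pick the run with the best value of the given metric.
        sign = 1 if maximize else -1
        return max(self.runs, key=lambda r: sign * r["metrics"][metric])

tracker = RunTracker()
tracker.log_run({"alpha": 0.1}, {"accuracy": 0.81})
tracker.log_run({"alpha": 0.01}, {"accuracy": 0.87})
print(json.dumps(tracker.best_run("accuracy")["params"]))  # → {"alpha": 0.01}
```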

Instructor's Bio

Jules S. Damji is an Apache Spark community and developer advocate at Databricks. He’s a hands-on developer with over 20 years of experience. Previously, he worked at leading companies such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, Verisign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a BSc and MSc in computer science and MA in political advocacy and communication from Oregon State University, the California State University, and Johns Hopkins University, respectively.

Jules Damji

Apache Spark Community & Developer Advocate at Databricks

Training: Building a Natural Language Question and Answer Search Engine

Using articles published on the NASDAQ website as a data source, attendees will build a search engine that gives direct answers to naturally phrased questions. An example query might be, ‘What market does FitBit compete in?’, and the expected output would be something akin to ‘wearables’.

To accomplish this task, Python and the open-source libraries Whoosh and DeepPavlov will be used as the main tools to build the search engine. Whoosh will be used to build a more traditional text-based search engine, and a pre-trained Deep Learning model from DeepPavlov will be used to perform natural language Q&A with the documents returned by Whoosh.
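Under the hood, a text search engine like Whoosh is built on an inverted index. A minimal stdlib sketch of that idea follows; the example documents and the term-overlap scoring are invented for illustration, whereas Whoosh adds analyzers, stemming, and proper relevance ranking.

```python
from collections import defaultdict

# Toy corpus: doc id -> text.
docs = {
    0: "fitbit competes in the wearables market",
    1: "nasdaq lists many technology companies",
    2: "the wearables market is growing quickly",
}

# Inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return doc ids ranked by number of matching query terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("wearables market"))  # docs 0 and 2 match both terms
```

In the workshop's pipeline, the documents returned by this retrieval step are what the DeepPavlov reading-comprehension model then scans for an exact answer span.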

While covering Whoosh there will be a general discussion of information retrieval and deeper dives into the NLP tasks being performed by Whoosh…more details

Instructor's Bio

Adam Spannbauer is a machine learning engineer at Eastman Chemical Company in East Tennessee. His work history has a focus on NLP projects using open source tools such as Python, R, and Shiny. Adam has instructor experience through his DataCamp course: “Software Engineering for Data Scientists in Python.” Outside of work, Adam stays active in the open source community on GitHub, mostly working on side-projects involving computer vision. He holds degrees from Maryville College and the University of Tennessee.

Adam Spannbauer

Machine Learning Engineer at Eastman Chemical Company

Training: Deep Learning for Healthcare

In this tutorial, we present deep learning methods and their applications in computational healthcare, specifically focusing on clinical representation learning, predictive modeling, clinical trial modeling and drug development.
We will also introduce different types of data in healthcare, including structured electronic health records, unstructured clinical notes, medical images, clinical trial descriptions, chemical compounds, and medical knowledge bases.
This tutorial is intended for data scientists, engineers, and researchers who are interested in applying deep learning methods to healthcare; the only prerequisite is basic machine learning knowledge. The first half will be spent introducing the nature of health data, basic deep learning methods, and their application in healthcare…more details

Instructor's Bio

Cao (Danica) Xiao is the Director of Machine Learning at Analytics Center of Excellence of IQVIA. She is leading IQVIA’s North America machine learning team to drive next generation healthcare AI. Her team works on various projects on disease prediction, clinical trial enrollment modeling, and in silico drug modeling (e.g., adverse drug reaction detection, drug repositioning and de novo design). Her research focuses on using machine learning and data mining approaches to solve diverse real world healthcare challenges. Particularly, she is interested in phenotyping on electronic health records, data mining for in-silico drug modeling, biomarker discovery and patient segmentation for neuro-degenerative diseases. Her research has been published in leading AI conferences including KDD, NIPS, ICLR, AAAI, IJCAI, SDM, ICDM, WWW and top health informatics journals such as Nature Scientific Reports and JAMIA. Prior to IQVIA, she was a research staff member in the AI for Healthcare team at IBM Research from 2017 to 2019 and served as member of the IBM Global Technology Outlook Committee from 2018 to 2019. She acquired her Ph.D. degree from University of Washington, Seattle in 2016.

Cao (Danica) Xiao, PhD

Director of Machine Learning
at IQVIA

Training: Deep Learning for Healthcare

In this tutorial, we present deep learning methods and their applications in computational healthcare, specifically focusing on clinical representation learning, predictive modeling, clinical trial modeling and drug development.
We will also introduce different types of data in healthcare, including structured electronic health records, unstructured clinical notes, medical images, clinical trial descriptions, chemical compounds, and medical knowledge bases.
This tutorial is intended for data scientists, engineers, and researchers who are interested in applying deep learning methods to healthcare; the only prerequisite is basic machine learning knowledge. The first half will be spent introducing the nature of health data, basic deep learning methods, and their application in healthcare…more details

Instructor's Bio

Jimeng Sun is an Associate Professor of College of Computing at Georgia Tech. Prior to Georgia Tech, he was a researcher at IBM TJ Watson Research Center. His research focuses on health analytics and data mining, especially in designing tensor factorizations, deep learning methods, and large-scale predictive modeling systems. Dr. Sun has been collaborating with many healthcare organizations.
He has published over 120 papers and filed over 20 patents (5 granted). He received the SDM/IBM early career research award in 2017, the ICDM best research paper award in 2008, the SDM best research paper award in 2007, and the KDD Dissertation runner-up award in 2008. Dr. Sun received his B.S. and M.Phil. in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003, and his PhD in Computer Science from Carnegie Mellon University in 2007, advised by Christos Faloutsos.

Jimeng Sun, PhD

Associate Professor at Georgia Tech

Training: Interpretable Knowledge Discovery Reinforced by Visual Methods

This tutorial will cover state-of-the-art research, development, and applications in the KDD area of interpretable knowledge discovery reinforced by visual methods, to stimulate and facilitate future work. It serves the KDD mission of gaining insight from data. The topic is interdisciplinary, bridging scientific research and applied communities in KDD, Visual Analytics, Information Visualization, and HCI. This is a novel and fast-growing area with significant applications and potential.

First, in KDD, these studies have grown under the name of visual data mining. The recent growth under the names of deep visualization and visual knowledge discovery is motivated considerably by deep learning's success in predictive accuracy and its failure to explain the produced models without special interpretation efforts…more details

Instructor's Bio

Dr. Boris Kovalerchuk is a professor of Computer Science at Central Washington University, USA. His publications include three books “Data Mining in Finance ” (Springer, 2000), “Visual and Spatial Analysis” (Springer, 2005), and “Visual Knowledge Discovery and Machine Learning” (Springer, 2018), a chapter in the Data Mining Handbook and over 170 other publications. His research interests are in data mining, machine learning, visual analytics, uncertainty modeling, data fusion, relationships between probability theory and fuzzy logic, image and signal processing. Dr. Kovalerchuk has been a principal investigator of research projects in these areas supported by the US Government agencies. He served as a senior visiting scientist at the US Air Force Research Laboratory and as a member of expert panels at the international conferences and panels organized by the US Government bodies.

Boris Kovalerchuk, PhD 

Professor at Central Washington University

Training: Five World-Class Visualizations and What We Can Learn From Them

What separates the world’s best data visualizations from the rest? Join Metis Senior Data Scientist Jonathan Balaban as he highlights amazing visuals and dissects best practices in visualization and storytelling. Then, he will guide you through open-source platforms and code that enable similar, compelling visuals we can use to share our work.
Outline:
1. Introduce five impactful and dynamic visualizations
 - Discuss use, design choices, and pros/cons of our visualizations
 - Review which domains or data are best suited for our visualizations
 - Get familiar with the dataset and build a quick baseline visualization
2. Introduce packages for the build, and walk through each visualization’s codeset
 - Review default arguments and method choices
 - Consider alternatives and when they would be appropriate…more details

Instructor's Bio

Jonathan is an instructor at Metis’s San Francisco bootcamp, but he has crossed the country with Metis as a previous instructor at the Seattle and Chicago campuses as well. He enjoys teaching the art of impact-focused, practical data science and helping students find amazing careers with top-tier companies like Apple, Tesla, and Amazon. As a data scientist, he has worked at McKinsey and Booz Allen Hamilton and consulted for numerous companies. He has led teams to design bespoke data science solutions that have driven revolutionary changes in client operations. Jonathan – sometimes successfully – leverages data science solutions in his personal life: on friends, racing, and training.

Jonathan Balaban

Senior Data Scientist & Instructor at Metis

Training: Practical Data Ethics with Deon

Our goal is for every data scientist to practice data ethics. In this workshop, you will learn how to make data ethics actionable as a data scientist.

For this workshop, we’ll use the resources collected in deon (http://deon.drivendata.org), an open source command line tool that integrates an ethics checklist into your existing data science workflow. The goal of deon is to enable teams to flexibly carry out the ethical discussions most relevant to them, and to preemptively address issues they may overlook. Instead of relying on an “Ultimately True” philosophy or oath, deon encourages an upfront and ongoing dialogue about the different ethical aspects of your project. This dialogue will span a broad set of ethical issues that frequently arise in machine learning and data science contexts. Specific solutions often vary with the task at hand, but deon helps cultivate ethical intentionality for the first line of defense: the engineers who influence how data science actually gets done…more details

Instructor's Bio

Casey Fitzpatrick is a Machine Learning Engineer at DrivenData, where he consults mission-driven organizations across a variety of domains concerning machine learning, deep learning, and data science education. Previously, Casey obtained his PhD in Electrical Engineering from Boston University, where his research and publications focused on applying high-dimensional quantum optics to problems in quantum computing.

Casey Fitzpatrick, PhD

Machine Learning Engineer at DrivenData

Training: Practical Data Ethics with Deon

Our goal is for every data scientist to practice data ethics. In this workshop, you will learn how to make data ethics actionable as a data scientist.

For this workshop, we’ll use the resources collected in deon (http://deon.drivendata.org), an open source command line tool that integrates an ethics checklist into your existing data science workflow. The goal of deon is to enable teams to flexibly carry out the ethical discussions most relevant to them, and to preemptively address issues they may overlook. Instead of relying on an “Ultimately True” philosophy or oath, deon encourages an upfront and ongoing dialogue about the different ethical aspects of your project. This dialogue will span a broad set of ethical issues that frequently arise in machine learning and data science contexts. Specific solutions often vary with the task at hand, but deon helps cultivate ethical intentionality for the first line of defense: the engineers who influence how data science actually gets done…more details

Instructor's Bio

Jay Qi is a Senior Data Scientist at DrivenData, where he uses data science for social good and helps mission-driven organizations leverage data to maximize their impact. Previously, Jay was a Lead Data Scientist at Uptake, where he used machine learning with streaming sensor data to predict failures on industrial machines like locomotives and heavy equipment.

Jay Qi

Senior Data Scientist at DrivenData

Training: Good, Fast, Cheap: How to Do Data Science with Missing Data

If you’ve never heard of the “good, fast, cheap” dilemma, it goes something like this: You can have something good and fast, but it won’t be cheap. You can have something good and cheap, but it won’t be fast. You can have something fast and cheap, but it won’t be good. In short, you can pick two of the three but you can’t have all three.

If you’ve done a data science problem before, I can all but guarantee that you’ve run into missing data. How do we handle it? Well, we can avoid, ignore, or try to account for missing data. The problem is, none of these strategies are good, fast, *and* cheap…more details
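A tiny, hedged illustration of why neither dropping nor naive imputation is free (stdlib only; the numbers are invented): deletion shrinks the sample, while mean imputation keeps all rows but artificially understates the column's spread.

```python
import statistics

# Toy column with missing entries (None).
values = [2.0, 4.0, None, 6.0, None, 8.0]

# Strategy 1: avoid/ignore -- drop missing rows (loses information).
dropped = [v for v in values if v is not None]

# Strategy 2: account -- mean-impute (cheap, but shrinks variance).
mean = statistics.mean(dropped)
imputed = [v if v is not None else mean for v in values]

print(len(dropped), mean)  # only 4 of 6 observations survive deletion
print(statistics.pstdev(dropped) > statistics.pstdev(imputed))  # True
```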

Instructor's Bio

Coming Soon!

Joseph Nelson

Data Scientist & Developer Advocate at General Assembly

Training: Data Visualization: From Square One to Interactivity

As data scientists, we are expected to be experts in machine learning, programming, and statistics. However, our audiences might not be! Whether we’re working with peers in the office, trying to convince our bosses to take some sort of action, or communicating results to clients, there’s nothing more clear or compelling than an effective visual to make our point. Let’s leverage the Python libraries Matplotlib and Bokeh along with visual design principles to make our point as clearly and as compellingly as possible! This talk is designed for a wide audience. If you haven’t worked with Matplotlib or Bokeh before or if you (like me!) don’t have a natural eye for visual design, that’s OK! This will be a hands-on training designed to make visualizations that best communicate what you want to communicate…more details

Instructor's Bio

Coming Soon!

Joseph Nelson

Data Scientist & Developer Advocate at General Assembly

Training: Introduction to Flink via Flink SQL

As data processing becomes more real time, stream processing is becoming more important. Apache Flink makes it easier to build and manage stream processing applications. Flink’s new SQL interface is a great way to get started with Flink—and to build and maintain production applications.

Seth Wiesman offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink’s various modes of use. Then you’ll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API…more details

Instructor's Bio

Coming Soon!

Seth Wiesman

Solutions Architect at Ververica

Training: Applied Natural Language Processing in EdTech

Understanding the questions posed by instructors and students alike plays an important role in the development of educational technology applications. In this intermediate level workshop, you will learn to apply NLP to one piece of this real-world problem by building a model to predict the type of answer (e.g. entity, description, number, etc.) a question elicits. Specifically, you will learn to:
1. Perform preprocessing, normalization, and exploratory analysis on a question dataset,
2. Identify salient linguistic features of natural language questions, and
3. Experiment with different feature sets and models to predict the answer type.

The concepts will be taught using popular NLP and ML packages like SpaCy, Scikit Learn, and Tensorflow…more details
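As a hedged baseline for the answer-type prediction task, the leading wh-word alone is already a strong linguistic feature. The rule table below is invented purely for illustration; the workshop trains real models with spaCy, scikit-learn, and TensorFlow instead.

```python
def predict_answer_type(question):
    """Toy rule-based baseline: map the leading wh-word to an
    expected answer type. A real model would learn this from data."""
    first = question.lower().split()[0]
    rules = {
        "who": "entity",
        "which": "entity",
        "where": "location",
        "when": "date",
        "how": "number",        # "how many/much" is most often numeric
        "why": "description",
        "what": "description",
    }
    return rules.get(first, "description")

print(predict_answer_type("Who founded FitBit?"))             # entity
print(predict_answer_type("How many employees work there?"))  # number
```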

Instructor's Bio

Coming Soon!

Vaidy Venkitasubramanian

Director of Applied Machine Learning
at Course Hero

Training: Applied Natural Language Processing in EdTech

Understanding the questions posed by instructors and students alike plays an important role in the development of educational technology applications. In this intermediate level workshop, you will learn to apply NLP to one piece of this real-world problem by building a model to predict the type of answer (e.g. entity, description, number, etc.) a question elicits. Specifically, you will learn to:
1. Perform preprocessing, normalization, and exploratory analysis on a question dataset,
2. Identify salient linguistic features of natural language questions, and
3. Experiment with different feature sets and models to predict the answer type.

The concepts will be taught using popular NLP and ML packages like SpaCy, Scikit Learn, and Tensorflow…more details

Instructor's Bio

Coming Soon!

John D’Souza

Senior Machine Learning Engineer at Course Hero

Training: Applied Natural Language Processing in EdTech

Understanding the questions posed by instructors and students alike plays an important role in the development of educational technology applications. In this intermediate level workshop, you will learn to apply NLP to one piece of this real-world problem by building a model to predict the type of answer (e.g. entity, description, number, etc.) a question elicits. Specifically, you will learn to:
1. Perform preprocessing, normalization, and exploratory analysis on a question dataset,
2. Identify salient linguistic features of natural language questions, and
3. Experiment with different feature sets and models to predict the answer type.

The concepts will be taught using popular NLP and ML packages like SpaCy, Scikit Learn, and Tensorflow…more details

Instructor's Bio

Coming Soon!

Emmanuel Matthews

Technical Program Manager Data Products at Course Hero

Workshop Sessions

More sessions added weekly

Workshop: Tutorial on Deep Reinforcement Learning

Reinforcement learning considers the problem of learning to act and is poised to power next-generation AI systems, which will need to go beyond input-output pattern recognition (even if such simpler AI has sufficed for speech, vision, and machine translation) and generate intelligent behavior. Example application domains include robotics, marketing, dialogue, HVAC, and optimizing healthcare and supply chains.

In this tutorial we will cover the foundations of Deep RL (including, but not limited to: CEM, DQN, TRPO, PPO, SAC) as well as dive into the specifics of some of the main success stories and provide perspective on where the field is headed.
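Of the methods listed, the cross-entropy method (CEM) is the simplest to sketch: sample candidates from a distribution, keep the top performers, and refit the distribution to them. The stdlib toy below optimizes a one-dimensional objective and is illustrative only, not the tutorial's code.

```python
import random, statistics

def cem_maximize(f, mu=0.0, sigma=5.0, iters=30, pop=50, elite=10):
    """Cross-entropy method: sample candidates, keep the elite,
    refit the sampling distribution to them, and repeat."""
    for _ in range(iters):
        samples = [random.gauss(mu, sigma) for _ in range(pop)]
        samples.sort(key=f, reverse=True)
        best = samples[:elite]
        mu = statistics.mean(best)
        sigma = statistics.pstdev(best) + 1e-6  # floor to avoid collapse
    return mu

random.seed(0)
print(round(cem_maximize(lambda x: -(x - 3.0) ** 2), 3))  # converges near x = 3
```

In deep RL the same loop runs over neural-network parameter vectors, with the objective being episodic return rather than a closed-form function.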

To get the most out of this tutorial, the audience is assumed to have basic familiarity with neural networks, optimization, probability…more details

Instructor's Bio

Professor Pieter Abbeel is Director of the Berkeley Robot Learning Lab and Co-Director of the Berkeley Artificial Intelligence (BAIR) Lab. Abbeel’s research strives to build ever more intelligent systems, leading his lab to push the frontiers of deep reinforcement learning, deep imitation learning, deep unsupervised learning, transfer learning, meta-learning, and learning to learn, as well as study the influence of AI on society. His lab also investigates how AI could advance other science and engineering disciplines. Abbeel’s Intro to AI class has been taken by over 100K students through edX, and his Deep RL and Deep Unsupervised Learning materials are standard references for AI researchers. Abbeel has founded three companies: Gradescope (AI to help teachers with grading homework and exams), Covariant (AI for robotic automation of warehouses and factories), and Berkeley Open Arms (low-cost, highly capable 7-dof robot arms). He advises many AI and robotics start-ups, and is a frequently sought-after speaker worldwide for C-suite sessions on AI future and strategy. Abbeel has received many awards and honors, including the PECASE, NSF-CAREER, ONR-YIP, Darpa-YFA, and TR35. His work is frequently featured in the press, including the New York Times, Wall Street Journal, BBC, Rolling Stone, Wired, and Tech Review.

Pieter Abbeel, PhD

Professor & Director of the Robot Learning Lab, Co-Founder, Advisor | UC Berkeley, BAIR, covariant.ai, Gradescope, OpenAI

Workshop: Training Gradient Boosting Models on Large Datasets with CatBoost

Gradient boosting is a machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For a number of years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others.

CatBoost (http://catboost.ai) is one of the three most popular gradient boosting libraries. It has a set of advantages that differentiate it from other libraries…more details
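The core boosting loop is easy to state: fit a weak learner to the current residuals, add a damped copy of it to the ensemble, and repeat. The stdlib toy below uses one-feature regression stumps on invented data; it illustrates the technique only, while CatBoost's actual algorithm adds ordered boosting, native categorical-feature handling, GPU training, and much more.

```python
def fit_stump(xs, ys):
    """Best single-split regression stump: (threshold, left mean, right mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=50, lr=0.1):
    """Gradient boosting for squared loss: repeatedly fit stumps to residuals."""
    pred = [sum(ys) / len(ys)] * len(xs)   # start from the mean
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        t, lm, rm = fit_stump(xs, resid)
        # Add a learning-rate-damped copy of the stump to the ensemble.
        pred = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, pred)]
    return pred

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 4.1, 3.9, 4.0]
print([round(p, 1) for p in boost(xs, ys)])
```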

Instructor's Bio

Anna Veronika Dorogush graduated from the Faculty of Computational Mathematics and Cybernetics of Lomonosov Moscow State University and from the Yandex School of Data Analysis. She previously worked at ABBYY, Microsoft, Bing, and Google, and has been working at Yandex since 2015, where she currently heads the Machine Learning Systems group and leads development of the CatBoost library.

Anna Veronika Dorogush

CatBoost Team Lead at Yandex

Workshop: Integrating Elasticsearch with Analytics Workflows

As larger quantities of data are being stored and managed by enterprises of all kinds, NoSQL storage solutions are becoming more popular. Elasticsearch is a popular, high-performance NoSQL data storage option, but it is often unfamiliar to end users and difficult to navigate for day to day analytic tasks.

This presentation will briefly discuss the structure and benefits/drawbacks of Elasticsearch data storage, and describe in detail, with examples, how your end users can get data out of Elasticsearch confidently and effectively. Attendees will be introduced to three packages designed for this work, elastic (R), elasticsearch-py (Python), and uptasticsearch (R and Python), and will see hands-on examples of how to use them.
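For orientation, an Elasticsearch query is itself just a JSON document. The body below is a sketch using a hypothetical index and invented field names; client packages like elasticsearch-py submit such bodies for you (e.g. via `es.search(index=..., body=query)`).

```python
import json

# A typical "bool" query body: filter clauses narrow the documents,
# the "must" clause contributes full-text relevance scoring.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"machine_id": "engine-42"}},            # hypothetical field
                {"range": {"timestamp": {"gte": "2019-01-01"}}},  # hypothetical field
            ],
            "must": [{"match": {"status_message": "fuel pressure"}}],
        }
    },
    "size": 100,
}
print(json.dumps(query, indent=2))
```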

Instructor's Bio

Stephanie Kirmer is a Senior Data Scientist at Journera, an early stage startup that helps companies in the travel industry use data efficiently and securely to create better travel experiences for customers.
Previously she worked as a Senior Data Scientist at Uptake, where she developed tools for analyzing diesel engine fuel efficiency, and constructed predictive models for diagnosing and preventing mechanical failure. Before joining Uptake, she worked on data science for social policy research at the University of Chicago and taught sociology and health policy at DePaul University.

Stephanie Kirmer

Senior Data Scientist at Journera

Workshop: Scalable Machine Learning with Kubernetes and Kubeflow

Kubeflow seeks to simplify the process of building, deploying, and scaling machine learning workflows on Kubernetes. Originally an internal Google project for streamlining TensorFlow jobs on Kubernetes, it has since been open sourced and now supports a variety of frameworks and workflows. Kubeflow takes care of many common pain points, allowing data science teams to focus on building and deploying models instead of managing a brittle network of systems held together with glue code. This workshop will provide an overview of the benefits of using containers, Kubernetes, and Kubeflow to build portable and scalable machine learning pipelines, as well as hands-on exercises building, deploying, and consuming Kubeflow-based models on local and public cloud-based infrastructure…more details
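For a sense of what Kubeflow manages on your behalf, a distributed training job is declared as a Kubernetes custom resource. The manifest below is an illustrative sketch only: the job name and container image are hypothetical, while the `TFJob` kind and `tfReplicaSpecs` layout follow Kubeflow's training-operator conventions.

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train               # hypothetical job name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                 # scale out by editing one line
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow    # TFJob expects this container name
              image: registry.example.com/mnist:latest   # hypothetical image
              command: ["python", "train.py"]
```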

Instructor's Bio

As a Senior Data Scientist and Instructor at Metis, John Tate leads immersive data science programs training students in areas including machine learning, math & statistics, and the Python data science toolkit. Previously, John worked as a consultant developing end-to-end data science solutions for clients in industries such as healthcare, pharma, finance, automotive, and marketing technology.

John Tate

Sr. Data Science Instructor at Metis

Workshop: Practical Deep Learning for Images, Sensor and Text

Deep Learning became known for beating human performance on image classification, and most applications and published examples for deep learning still focus on image processing. However, deep learning nowadays can also be applied successfully to time-series (or sensor) data and text. In this hands-on MATLAB workshop, you will see how easy it is to get started applying deep learning to various data types, including images, text, and time-series data. You’ll use an online MATLAB instance to perform the following tasks:
1. Train deep neural networks on GPUs in the cloud
2. Access and explore pretrained models
3. Build a CNN to solve an image classification problem
4. Use LSTM networks to solve a time-series and text analytics problem.

Instructor's Bio

Renee is an Application Engineer supporting the Medical Devices Industry in Data Analytics and Technical Computing applications. She works closely with engineers and researchers in the biomedical community to understand and address the unique challenges and needs in this industry. Renee graduated from Northwestern University with an M.S. in Biomedical Engineering. Her research was in medical imaging, focusing on quantitative cerebrovascular perfusion MRI of the brain for stroke prevention. She joined MathWorks in 2012, helping customers with MATLAB, analysis, and graphics challenges, and later transferred to Application Engineering, where she specialized in Test and Measurement applications before transitioning to her current role.

Renee Qian

Application Engineer at MathWorks

Workshop: Spark NLP: State of the Art Natural Language Processing at Scale

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, and entity extraction. This talk introduces the open-source Spark NLP library, which within two years has become the most widely used NLP library in the enterprise, by implementing state-of-the-art deep learning NLP research as a production-grade, fast, and scalable library for Python, Java, and Scala…more details

Instructor's Bio

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a Ph.D. in computer science and master’s degrees in both computer science and business administration.

David Talby, PhD

CTO at Pacific AI

Workshop: Declarative Data Visualization with Vega-Lite & Altair

In this workshop, we will introduce the concepts of declarative data visualization, which are widely used by the Jupyter and Observable data science communities and by companies such as Airbnb, Apple, Elastic, Google, Microsoft, Netflix, Twitter, and Uber. You will learn the basic vocabulary of a grammar of data visualization and how to use this vocabulary to author interactive plots via declarative visualization libraries, including Vega-Lite (in JavaScript) and Altair (in Python). With these libraries, users can rapidly and concisely create rich interactive visualizations. For example, brushing & linking among scatterplots and interactive cross-filtering require only a few lines of code in Vega-Lite, versus hundreds in D3…more details
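To make "declarative" concrete: a Vega-Lite chart is a JSON specification, and Altair is essentially a Python API that emits such JSON (e.g. `alt.Chart(df).mark_point().encode(...)`). A minimal scatterplot spec, written here as a plain dict with made-up data:

```python
import json

# A minimal Vega-Lite specification: you declare data, mark, and
# encodings; the library decides how to draw axes, scales, and marks.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"values": [{"hp": 130, "mpg": 18}, {"hp": 90, "mpg": 30}]},
    "mark": "point",
    "encoding": {
        "x": {"field": "hp", "type": "quantitative"},
        "y": {"field": "mpg", "type": "quantitative"},
    },
}
print(json.dumps(spec, indent=2))
```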

Instructor's Bio

Kanit is a researcher at Apple, working on data visualization and interactive systems for artificial intelligence. Prior to Apple, he was a PhD student in Computer Science and a Human-Computer Interaction researcher at the University of Washington (UW), working with Prof. Jeffrey Heer and the UW Interactive Data Lab.
Systems he has developed have won awards at premier academic venues, and are used by the Jupyter/Python data science communities and leading tech companies including Apple, Google, Microsoft, Netflix, Twitter, and Uber.

Kanit Wongsuphasawat, PhD

Research Scientist at Apple

Workshop: Declarative Data Visualization with Vega-Lite & Altair

In this workshop, we will introduce the concepts of declarative data visualization, which are widely used by the Jupyter and Observable data science communities and by companies such as Airbnb, Apple, Elastic, Google, Microsoft, Netflix, Twitter, and Uber. You will learn the basic vocabulary of a grammar of data visualization and how to use this vocabulary to author interactive plots via declarative visualization libraries, including Vega-Lite (in JavaScript) and Altair (in Python). With these libraries, users can rapidly and concisely create rich interactive visualizations. For example, brushing & linking among scatterplots and interactive cross-filtering require only a few lines of code in Vega-Lite, versus hundreds in D3…more details

Instructor's Bio

Dominik is a researcher at Apple working on data visualization and interactive systems for artificial intelligence. In 2020, he will start as an assistant professor at Carnegie Mellon University.
Dominik received his PhD from the Paul G. Allen School at the University of Washington where he worked with Jeffrey Heer and Bill Howe. His thesis work was on scalable interactive systems for visualization and analysis.
Dominik is a co-author of various libraries and tools in the Vega stack, including Vega-Lite, Voyager, and Polestar. His systems have won awards at premier academic venues and are used by the Python and JavaScript data science communities.
When he is not working on research or coding, Dominik likes to travel, sail, hike in the mountains around Seattle, or bake bread.

Dominik Moritz

ML Researcher, Research Assistant at Apple, University of Washington

Workshop: Missing Data in Supervised Machine Learning

Most implementations of supervised machine learning algorithms are designed to work with complete datasets, but datasets are rarely complete. This dichotomy is usually addressed by either deleting points with missing elements and losing potentially valuable information or imputing (trying to guess the values of the missing elements), which can lead to increased bias and false conclusions. I will quickly review the three types of missing data (missing completely at random, missing at random, missing not at random) and a couple of simple but often misleading ways to impute…more details
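To make the imputation pitfall concrete, here is a minimal sketch (not from the workshop, assuming a scikit-learn environment) using `SimpleImputer` for mean imputation — one of the simple but potentially misleading strategies the session revisits:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (np.nan).
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 8.0]])

# Mean imputation: simple, but it shrinks variance and can distort
# correlations between features -- exactly the kind of bias to watch for.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)  # missing cells replaced by their column means
```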

Instructor's Bio

Andras Zsom is a Lead Data Scientist at the Center for Computation and Visualization at Brown University. He manages a small but dedicated team of data scientists with the mission of helping high-level university administrators make better data-driven decisions through data analysis and predictive modeling. The team also collaborates with faculty members on various data-intensive academic projects and trains data science interns.
Andras is passionate about using machine learning and predictive modeling for good. He is an astrophysicist by training and he has been fascinated with all fields of the natural and life sciences since childhood. He was a postdoctoral researcher at MIT for 3.5 years before coming to Brown. He obtained his PhD from the Max Planck Institute of Astronomy at Heidelberg, Germany; and he was born and raised in Hungary.

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science at Center for Computation and Visualization, Brown University

Workshop: Deciphering the Black Box: Latest Tools and Techniques for Interpretability

This workshop shows how interpretability tools can give you not only more confidence in a model, but also help to improve model performance. Through this interactive workshop, you will learn how to better understand the models you build, along with the latest techniques and many tricks of the trade around interpretability. The workshop will largely focus on interpretability techniques such as feature importance and partial dependence, and on explanation approaches such as LIME and SHAP.
The workshop will demonstrate interpretability techniques with notebooks, some in R and some in Python. Along the way, the workshop will consider issues like spurious correlation, random effects, multicollinearity, and reproducibility that may affect model interpretation and performance. To illustrate the points, the workshop will use easy-to-understand examples and open source tools.
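As a generic taste of model-agnostic feature importance — a sketch assuming a scikit-learn workflow, not the workshop's own notebooks:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only the informative features should matter.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: shuffle one column at a time and measure the
# drop in score -- model-agnostic, so it works for any fitted estimator.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```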

Instructor's Bio

Rajiv Shah is a data scientist at DataRobot, where his primary focus is helping customers improve their ability to make and implement predictions. Previously, Rajiv was part of data science teams at Caterpillar and State Farm. He has worked on a variety of projects across a wide-ranging set of areas including supply chain, sensor data, actuarial ratings, and security. He has a PhD from the University of Illinois at Urbana-Champaign.

Rajiv Shah, PhD

Data Scientist at DataRobot

Workshop: ML Engineering Best Practices from the Trenches

DevOps tools for getting code reliably to production have proven to be effective in the software engineering world. Today, ML Engineers are working at the intersection of data science and software engineering, and can leverage DevOps best practices to streamline their workflow and delivery process. This is what MLOps is all about.

At Manifold, we’ve developed processes to help ML engineers work as an integrated part of your development and production teams, helping you to be deliberate, disciplined, and coordinated in your deployment process. In this workshop, Sourav and Alex will walk you through some key learnings from using this Lean AI process…more details

Instructor's Bio

As CTO for Manifold, Sourav is responsible for the overall delivery of data science and data product services to make clients successful. Before Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google / Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He earned his PhD, MS, and BS degrees from MIT in Electrical Engineering and Computer Science.

Sourav Dey, PhD

CTO at Manifold

Workshop: ML Engineering Best Practices from the Trenches

DevOps tools for getting code reliably to production have proven to be effective in the software engineering world. Today, ML Engineers are working at the intersection of data science and software engineering, and can leverage DevOps best practices to streamline their workflow and delivery process. This is what MLOps is all about.

At Manifold, we’ve developed processes to help ML engineers work as an integrated part of your development and production teams, helping you to be deliberate, disciplined, and coordinated in your deployment process. In this workshop, Sourav and Alex will walk you through some key learnings from using this Lean AI process…more details

Instructor's Bio

Alexander Ng is a Director, Infrastructure & DevOps at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Alex served as both a Sales Engineering Tech Lead and a DevOps Tech Lead for Kyruus, a startup that built SaaS products for enterprise healthcare organizations. Alex got his start as a Software Systems Engineer at the MITRE Corporation and the Naval Undersea Warfare Center in Newport, RI. His recent projects at the intersection of systems and machine learning continue to combine a deep understanding of the entire development lifecycle with cutting-edge tools and techniques. Alex earned his Bachelor of Science degree in Electrical Engineering from Boston University, and is an AWS Certified Solutions Architect.

Alexander Ng

Director, Infrastructure & DevOps at Manifold

Workshop: Machine Learning Interpretability Toolkit

With the recent popularity of machine learning algorithms such as neural networks and ensemble methods, machine learning models have become more of a ‘black box’, harder to understand and interpret. To gain the end user’s trust, there is a strong need to develop tools and methodologies to help the user understand and explain how predictions are made. Data scientists also need the necessary insights to learn how the model can be improved. Much research has gone into model interpretability, and recently several open source tools, including LIME, SHAP, and GAMs, have been published on GitHub. In this talk, we present Microsoft’s brand new Machine Learning Interpretability toolkit, which incorporates cutting-edge technologies developed by Microsoft and leverages proven third-party libraries. It creates a common API and data structure across the integrated libraries and integrates with Azure Machine Learning services. Using this toolkit, data scientists can explain machine learning models using state-of-the-art technologies in an easy-to-use and scalable fashion at training and inference time.
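Toolkits like this wrap model-agnostic techniques behind a common API. As a neutral illustration of one such technique, here is partial dependence computed by hand — a sketch with scikit-learn standing in; this is not the Microsoft toolkit's API:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Partial dependence of the prediction on feature 0, computed directly:
# sweep feature 0 over a grid, hold it fixed for every row, and average
# the model's predictions over the dataset.
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, 0] = v
    pd_curve.append(model.predict(X_mod).mean())
pd_curve = np.array(pd_curve)

print(pd_curve.shape)  # (20,): one averaged prediction per grid value
```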

Instructor's Bio

Mehrnoosh Sameki is a technical program manager at Microsoft responsible for leading the product efforts on machine learning transparency within the Azure Machine Learning platform. Prior to Microsoft, she was a data scientist in an eCommerce company, Rue Gilt Groupe, incorporating data science and machine learning in retail space to drive revenue and enhance personalized shopping experiences of customers and prior to that, she completed a PhD degree in computer science at Boston University. In her spare time, she enjoys trying new food recipes, watching classic movies and documentaries, and reading about interior design and house decoration.

Mehrnoosh Sameki, PhD

Technical Program Manager at Microsoft

Workshop: Pomegranate: Fast and Flexible Probabilistic Modeling in Python

Pomegranate is a Python package for probabilistic modeling that emphasizes both ease of use and speed. In keeping with the first emphasis, pomegranate has a simple sklearn-like API for training models and performing inference, and a convenient “lego API” that allows complex models to be specified out of simple components. In keeping with the second emphasis, the computationally intensive parts of pomegranate are written in efficient Cython code; all models support multithreaded parallelism and out-of-core computation, and some models support GPU calculations. In this talk I will give an overview of the features in pomegranate, such as missing value support, demonstrate how the flexibility provided by pomegranate can yield more accurate models, and draw examples from “popular culture” to inadvertently prove how out of touch I am with today’s youth. I will also demonstrate how one can use the recently added custom distribution support to make neural probabilistic models, such as neural HMMs, using whatever your favorite neural network package is.
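Pomegranate itself may not be installed in every environment, so here is the sklearn-like fit/predict pattern the description refers to, with scikit-learn's `GaussianMixture` standing in for pomegranate's mixture models (an illustrative sketch, not pomegranate's own API):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # cluster around (0, 0)
               rng.normal(8, 1, (200, 2))])   # cluster around (8, 8)

# The same fit/predict idiom pomegranate mirrors in its sklearn-like API.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

print(sorted(set(int(l) for l in labels)))  # both components are used
```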

Instructor's Bio

Jacob Schreiber is a fifth-year Ph.D. student and NSF IGERT big data fellow in the Computer Science and Engineering department at the University of Washington. His primary research focus is the application of machine learning methods, primarily deep learning ones, to the massive amount of data being generated in the field of genome science. His research projects have involved using convolutional neural networks to predict the three-dimensional structure of the genome and using deep tensor factorization to learn a latent representation of the human epigenome. He routinely contributes to the Python open source community, currently as the core developer of the pomegranate package for flexible probabilistic modeling, and in the past as a developer for the scikit-learn project. Future projects include graduating.

Jacob Schreiber

PhD Candidate at University of Washington

Workshop: Causal Inference for Data Science

I will present an overview of causal inference techniques that are a good addition to the toolbox of any data scientist, especially in circumstances where experimentation is limited. These techniques can extract additional value from historical data, helping to understand the drivers of key metrics and yielding other valuable insights. The session will be practically focused, covering both the theory and how to perform the techniques in R. It will close with recent advances from combining machine learning with causal inference techniques to do things such as speed up A/B testing.
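A minimal, self-contained illustration of why these techniques matter (a simulated example in Python, not from the session, which uses R): a confounder biases the naive difference in means, and backdoor adjustment recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.binomial(1, 0.5, n)                   # confounder
t = rng.binomial(1, 0.2 + 0.6 * z)            # treatment depends on z
y = 2.0 * t + 3.0 * z + rng.normal(0, 1, n)   # true treatment effect = 2

# Naive difference in means is biased upward by the confounder.
naive = y[t == 1].mean() - y[t == 0].mean()

# Backdoor adjustment: estimate the effect within each stratum of z, then
# average over the strata (here P(z=0) = P(z=1) = 0.5, so a plain mean).
adjusted = np.mean([
    y[(t == 1) & (z == v)].mean() - y[(t == 0) & (z == v)].mean()
    for v in (0, 1)
])

print(round(naive, 2), round(adjusted, 2))  # naive overestimates; adjusted ~ 2
```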

Instructor's Bio

Vinod Bakthavachalam is a Data Scientist working with the Content Strategy and Enterprise teams at Coursera, focusing on using Coursera’s data to understand the most valuable skills across roles, industries, and geographies. Prior to Coursera, he worked in quantitative finance and studied Economics, Statistics, and Molecular & Cellular Biology at UC Berkeley.

Vinod Bakthavachalam

Data Scientist at Coursera

Workshop: Advanced Methods for Explaining XGBoost Models

Gradient Boosted Trees have become a widely used method for prediction using structured data. They generally provide the best predictive power, but are sometimes criticized for being “difficult to interpret”. However, to some degree, this criticism is misdirected — rather than being uninterpretable, they simply have more complicated interpretations, reflecting a more sophisticated understanding of the underlying dynamics of the variables.

In this workshop, we will work hands-on using XGBoost with real-world data sets to demonstrate how to approach data sets with the twin goals of prediction and understanding in a manner such that improvements in one area yield improvements in the other. Using modern tooling such as Individual Conditional Expectation (ICE) plots and SHAP, as well as a sense of curiosity, we will extract powerful insights that could not be gained from simpler methods. In particular, attention will be placed on how to approach a data set with the goal of understanding as well as prediction.
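Individual Conditional Expectation curves are easy to compute by hand; the sketch below uses scikit-learn's gradient boosting as a stand-in for XGBoost (illustrative, not the workshop's code):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=3, random_state=1)
model = GradientBoostingRegressor(random_state=1).fit(X, y)

# ICE: one curve per observation, sweeping a single feature over a grid
# while holding that row's other features fixed.
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
ice = np.empty((X.shape[0], grid.size))
for j, v in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = v
    ice[:, j] = model.predict(X_mod)

print(ice.shape)               # (200, 25): one curve per instance
print(ice.mean(axis=0).shape)  # averaging ICE curves gives partial dependence
```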

Instructor's Bio

Brian Lucena is Principal at Lucena Consulting and a consulting Data Scientist at Agentero. An applied mathematician in every sense, he is passionate about applying modern machine learning techniques to understand the world and act upon it. In previous roles he has served as SVP of Analytics at PCCI, Principal Data Scientist at Clover Health, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC-Berkeley, Brown, USF, and the Metis Data Science Bootcamp.

Brian Lucena, PhD

Consulting Data Scientist at Agentero

Workshop: Healthcare NLP with a Doctor's Bag of Notes

Nausea, vomiting, and diarrhea are words you would not frequently find in a natural language processing (NLP) project for tweets or product reviews. However, these words are common in healthcare. In fact, many clinical signs and patient symptoms (e.g. shortness of breath, fever, or chest pain) are only present in free-text notes and are not captured with structured numerical data. As a result, it is important for healthcare data scientists to be able to extract insight from unstructured clinical notes in electronic medical records.

In this hands-on workshop, the audience will have the opportunity to complete a Python NLP project with doctors’ discharge summaries to predict unplanned hospital readmission…more details
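A bag-of-words baseline for this kind of task can be sketched in a few lines. The note snippets below are fabricated for illustration only; real projects use de-identified corpora, not toy strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up discharge-summary snippets with readmission labels.
notes = [
    "patient reports nausea vomiting and diarrhea after discharge",
    "chest pain and shortness of breath, admitted for observation",
    "routine follow up, no fever, recovering well",
    "stable at discharge, no complaints, ambulating independently",
]
readmitted = [1, 1, 0, 0]

# TF-IDF features into logistic regression: the classic free-text baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(notes, readmitted)

pred = clf.predict(["fever and vomiting since discharge"])
print(pred)
```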

Instructor's Bio

Andrew Long is a Senior Data Scientist at Fresenius Medical Care North America (FMCNA). Andrew holds a PhD in biomedical engineering from Johns Hopkins University and a Master’s degree in mechanical engineering from Northwestern University. Andrew joined FMCNA in 2017 after participating in the Insight Health Data Fellows Program. At FMCNA, he is responsible for building, piloting, and deploying predictive models using machine learning to improve the quality of life of every patient who receives dialysis from FMCNA. He currently has multiple models in production to predict which patients are at the highest risk of negative outcomes.

Andrew Long, PhD

Data Scientist at Fresenius Medical Care

Workshop: Visualizing Complexity: Dimensionality Reduction and Network Science

Working with mathematicians, data scientists, and domain experts at the University of Vermont Complex Systems Center, data visualization artist Jane Adams has developed strategies for prototyping exploratory graphs of high-dimensional data. In this 90-minute workshop, Adams shares some of these methods for data discovery and interaction, navigating a creative workflow from paper prototypes of visual hypotheses through web-based interactive slices, offering critical insight for clustering, interpolation, and feature engineering.
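Dimensionality reduction is typically the first step in prototyping such exploratory graphs; a minimal sketch, assuming a scikit-learn environment:

```python
import numpy as np
from sklearn.decomposition import PCA

# High-dimensional observations projected to 2-D, ready for a scatterplot.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 50))   # 300 observations, 50 dimensions

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)  # (300, 2): plot these two columns to start exploring
```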

Instructor's Bio

Jane Adams is the resident Data Visualization Artist at the University of Vermont Complex Systems Center in Burlington, VT, in partnership with the Data Science team at MassMutual Life Insurance. Adams collaborates with fellow researchers to make complex, temporally dynamic networks comprehensible through engaging, interactive visualizations. In her personal time, she builds interactive aquaponic ecosystems, generates digital data paintings of musical scores, and illustrates cartoon graphs inspired by the world around her. She is a community organizer with Vermont Women in Machine Learning & Data Science (VT WiMLDS) and an advocate for extradisciplinary inquiry. Stay in touch on Twitter @artistjaneadams

Jane Adams

Data Visualization Artist at University of Vermont Complex Systems Center

Workshop: Opening the Pod Bay Doors: Building Intelligent Agents That Can Interpret, Generate and Learn from Natural Language

Thanks to advances in imitation and reinforcement learning techniques, we can now train intelligent agents to accomplish a diverse range of goals. But if we want to create household robots or personal assistants that can take advantage of this diversity, we need to give users some way to tell them what to do! This tutorial will focus on humans’ favorite tool for communicating goals and plans: natural language. We’ll assume basic familiarity with supervised learning and RL, and begin with a review of core machine learning techniques useful for natural language instruction following problems. The body of the talk will focus on modeling techniques for instruction following problems in different kinds of environments and data conditions. We’ll conclude with a survey of other applications for the tools we’ve built, including instruction generation, interpretability, and machine teaching.

Jacob Andreas, PhD

Assistant Professor at MIT CSAIL

Workshop: Real-ish Time Predictive Analytics with Spark Structured Streaming

In this workshop we will dive deep into what it takes to build and deliver an always-on “real-ish time” predictive analytics pipeline with Spark Structured Streaming.

The core focus of the workshop material will be on how to solve a common complex problem in which we have no labeled data in an unbounded timeseries dataset and need to understand the substructure of said chaos in order to apply common supervised and statistical modeling techniques to our data in a streaming fashion…more details
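The unlabeled-stream problem can be sketched framework-agnostically: the toy sequential k-means below discovers substructure online, one event at a time. This is illustrative only; the workshop does this at scale with Spark Structured Streaming.

```python
import numpy as np

rng = np.random.default_rng(7)

# Naive fixed initialization for the sketch; real pipelines seed centers
# from the first stream elements or a warm-up batch.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
counts = np.zeros(len(centers))

def observe(x):
    """Process one stream event: move the nearest center's running mean."""
    i = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
    counts[i] += 1
    centers[i] += (x - centers[i]) / counts[i]
    return i

# Simulate an unbounded timeseries alternating between two latent modes.
for step in range(2000):
    loc = (0.0, 0.0) if step % 2 == 0 else (6.0, 6.0)
    observe(rng.normal(loc=loc, scale=0.5))

print(np.round(centers, 1))  # the centers drift toward the latent modes
```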

Instructor's Bio

Scott Haines is a Principal Software Engineer / Tech Lead on the Voice Insights team at Twilio. His focus has been on the architecture and development of a real-time (sub-250ms), highly available, trustworthy analytics system. His team provides near real-time analytics that processes, aggregates, and analyzes multiple terabytes of global sensor data daily. Scott helped drive Apache Spark adoption at Twilio and actively teaches and consults teams internally. Previously, at Yahoo!, Scott built a real-time recommendation engine and targeted ranking / ratings analytics that helped serve personalized page content for millions of customers of Yahoo Games. He also built a real-time click / install tracking system that helped deliver customized push marketing and ad attribution for Yahoo Sports, and he finished his tenure at Yahoo working for Flurry Analytics, where he wrote an auto-regressive smart alerting and notification system integrated into the Flurry mobile app for iOS/Android.

Scott J Haines

Principal Software Engineer at Twilio

Workshop: Model Fairness in Practice

In this workshop we will focus on fairness and removing bias from your models. More specifically, we will look at how to apply fairness principles in practice. We will define model bias and typical examples of it. We will talk about the benefits of unbiased models (or, more realistically, less biased models), consider the risks and costs (in reduced accuracy) that you need to incur, and generally cover the things to look out for when implementing fair models in real-world scenarios…more details
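One of the simplest fairness checks in practice is demographic parity: compare positive-prediction rates across groups. A minimal sketch with toy numbers, not from the workshop:

```python
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # model decisions
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # protected attribute

# Demographic parity gap: difference in positive rates between groups.
rate_0 = y_pred[group == 0].mean()   # 3/5 positive for group 0
rate_1 = y_pred[group == 1].mean()   # 2/5 positive for group 1
dp_gap = abs(rate_0 - rate_1)

print(rate_0, rate_1, dp_gap)  # a nonzero gap signals potential disparity
```

Reducing such a gap (e.g., by reweighing training data or adjusting thresholds per group) is exactly the accuracy-versus-fairness trade-off the session discusses.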

Instructor's Bio

Jakub is a Senior Data Scientist at neptune.ml. He graduated in Physics from the University of Silesia in Katowice and in Finance from the University of Economics in Wroclaw. He has worked on various data science projects involving facial recognition, optical character recognition, cancer detection and classification, satellite image segmentation, text mining of labor market data, and many more. He was a member of the teams that won the MICCAI Munich 2015 „Combined Imaging and Digital Pathology Classification Challenge”, the MICCAI Athens 2016 „PET segmentation challenge using a data management and processing infrastructure”, and the crowdAI “Mapping Challenge” competition in 2018.

Jakub Czakon

Senior Data Scientist at neptune.ml

WORKSHOP: Document Clustering with Open Source Tools

Document clustering is a powerful application of Natural Language Processing techniques, useful in topic modeling, topic extraction, and information retrieval. A central challenge in document clustering is the scale of the data involved, making document clustering a prime candidate for development with the open source, distributed processing framework Apache Spark. At the same time, tremendous strides have been made in the past few years in the development of other open source tools useful in NLP applications. This workshop will make heavy use of Apache Spark, but will look at integrating these other tools (Gensim, Scikit-Learn) into the development process. Additionally, particular attention will be paid to NLP and document clustering as an application of Unsupervised Learning…more details
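At small scale, the pipeline the workshop builds on Spark can be sketched with Scikit-Learn (shown here for brevity; in `pyspark.ml` the analogous pieces are HashingTF/IDF and KMeans):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Four toy documents with two obvious topics.
docs = [
    "spark cluster distributed computing",
    "distributed data processing with spark",
    "gradient descent neural network training",
    "training deep neural networks",
]

# Vectorize, then cluster -- unsupervised: no labels anywhere.
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(labels)  # the two Spark docs and the two neural-net docs pair up
```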

Instructor's Bio

Joshua Cook has been teaching in one capacity or another for nearly fifteen years. He currently works as a Curriculum Developer for Databricks. Most recently, he taught Data Science for UCLA Extension. Prior to this he taught Data Science for General Assembly, in the Master of Education program at UCLA, high school mathematics at Crenshaw and Jefferson High Schools in Los Angeles, and early childhood literacy in West Oakland. Additionally, Joshua is trained as a computational mathematician. He has production experience with model prediction and deployment using the Python numerical stack and Amazon Web Services. He is the author of the book, Docker for Data Science, published by Apress Media.

Joshua Cook, PhD

Data Architect at Databricks

Workshop: Mapping Geographic Data in R

In this hands-on workshop, we will use R to take public data from various sources and combine them to find statistically interesting patterns and display them in static and dynamic, web-ready maps. This session will cover topics including geojson and shapefiles, how to munge Census Bureau data, geocoding street addresses, transforming latitude and longitude to the containing polygon, and data visualization principles.

Participants will leave this workshop with a publication-quality data product and the skills to apply what they’ve learned to data in their field or area of interest.
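One of the listed steps — mapping a latitude/longitude to its containing polygon — reduces to a point-in-polygon test. The workshop does this in R (packages like sf handle it at scale); below is a ray-casting sketch in Python for intuition, with a made-up square "tract" as the polygon:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside polygon (list of vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A crude square "census tract" around the origin (hypothetical data).
tract = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
print(point_in_polygon(0.2, 0.3, tract))   # True
print(point_in_polygon(2.0, 0.0, tract))   # False
```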

Instructor's Bio

Joy Payton is a data scientist and data educator at the Children’s Hospital of Philadelphia (CHOP), where she helps biomedical researchers learn the reproducible computational methods that will speed time to science and improve the quality and quantity of research conducted at CHOP. A longtime open source evangelist, Joy develops and delivers data science instruction on topics related to R, Python, and git to an audience that includes physicians, nurses, researchers, analysts, developers, and other staff. Her personal research interests include using natural language processing to identify linguistic differences in a neurodiverse population as well as the use of government open data portals to conduct citizen science that draws attention to issues affecting vulnerable groups. Joy holds a degree in philosophy and math from Agnes Scott College, a divinity degree from the Universidad Pontificia de Comillas (Madrid), and a data science Masters from the City University of New York (CUNY).

Joy Payton

Supervisor, Data Education at Children’s Hospital of Philadelphia

Workshop: How to Build a Recommendation Engine That Isn’t MovieLens

Recommendation engines are pretty simple. Or at least, they are made to seem simple by an uncountable number of online tutorials. The only problem: it’s hard to find a tutorial that doesn’t use the ready-made and pre-baked MovieLens dataset. Fine. But perhaps you’ve followed one of these tutorials and have struggled to imagine how to implement your own recommendation engine on your own data. In this workshop, I’ll show you how to use industry-leading open source tools to build your own engine and how to structure your own data so that it might be “recommendation-compatible”. Note: this workshop will be heavily tilted towards the applied side of things. Hope you’re ready to get your hands dirty.
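To see what "recommendation-compatible" data looks like, here is a tiny item-item collaborative filter over an interaction matrix — a from-scratch sketch, not one of the workshop's tools:

```python
import numpy as np

# User-item interaction matrix (rows: users, columns: items). In practice,
# "recommendation-compatible" data boils down to exactly this: who touched what.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Item-item cosine similarity.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

# Recommend for user 0: score items by similarity to the user's history,
# then mask out items the user has already interacted with.
user = R[0]
scores = sim @ user
scores[user > 0] = -np.inf

top_item = int(np.argmax(scores))
print(top_item)  # 2: item 2 is most similar to user 0's history
```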

Instructor's Bio

Max is a Lead Instructor at General Assembly and an Apress Author. He likes climbing, making pottery, and fantasy sports. This will be his fifth ODSC!

Max Humber

Lead Instructor at General Assembly

Workshop: Data Harmonization for Generalizable Deep Learning Models: from Theory to Hands-on Tutorial

Integration of data from multiple sources, with and without labels, is a fundamental problem in transfer learning when models must be trained on a source data distribution that differs from one or more target data distributions. For example, in healthcare, models must flexibly inter-operate on large scale medical data gathered across multiple hospitals, each with confounding biases. Domain adaptation is a method for enabling this form of transfer learning by
simultaneously identifying deep feature representations that are invariant across domains (data sources), thereby enabling transfer learning to unseen data distributions…more details
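Deep domain adaptation learns invariant representations end-to-end; as a point of comparison, the classic shallow baseline CORAL (correlation alignment) aligns second-order feature statistics directly. A numpy sketch with synthetic source/target data — illustrative, not the workshop's method:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "domains" with different per-feature scales (synthetic stand-ins).
source = rng.normal(0, 1, (1000, 3)) * np.array([1.0, 2.0, 0.5])
target = rng.normal(0, 1, (1000, 3)) * np.array([2.0, 0.5, 1.0])

def coral(Xs, Xt, eps=1e-5):
    """Align source second-order statistics to the target's (CORAL)."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    # Whiten the source features, then re-color with target statistics.
    Ws = np.linalg.cholesky(np.linalg.inv(Cs))
    Wt = np.linalg.cholesky(Ct)
    return (Xs - Xs.mean(0)) @ Ws @ Wt.T + Xt.mean(0)

aligned = coral(source, target)
# After alignment, the source covariance matches the target covariance,
# so a model trained on `aligned` transfers better to the target domain.
```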

Instructor's Bio

Gerald Quon is an Assistant Professor in the Department of Molecular and Cellular Biology at the University of California at Davis. He obtained his Ph.D. in Computer Science from the University of Toronto, M.Sc. in Biochemistry from the University of Toronto, and B. Math in Computer Science from the University of Waterloo. He also completed postdoctoral research training at MIT. His lab focuses on applications of machine learning to human genetics, genomics and health, and is funded by the National Science Foundation, National Institutes of Health, the Chan Zuckerberg Initiative, and the American Cancer Society.

Gerald Quon, PhD

Assistant Professor at UC Davis Machine Learning & AI Group

Workshop: Data Harmonization for Generalizable Deep Learning Models: from Theory to Hands-on Tutorial

Integration of data from multiple sources, with and without labels, is a fundamental problem in transfer learning when models must be trained on a source data distribution that differs from one or more target data distributions. For example, in healthcare, models must flexibly inter-operate on large scale medical data gathered across multiple hospitals, each with confounding biases. Domain adaptation is a method for enabling this form of transfer learning by
simultaneously identifying deep feature representations that are invariant across domains (data sources), thereby enabling transfer learning to unseen data distributions…more details

Instructor's Bio

Coming soon!

Nelson Johansen

PhD Candidate at UC Davis

Workshop: Imagination Inspired Vision

Imagination is one of the key properties of human intelligence that enables us not only to learn new concepts quickly and efficiently, but also to generate creative products like art and music. Dr. Mohamed Elhoseiny’s research focuses on developing imagination-inspired techniques that empower AI machines to see the world (computer vision) or to create novel products (e.g., fashion and art); “Imagine to See” and “Imagine to Create”. In this talk, he will cover some of his works on these two directions and will show how they are connected by the developed techniques and how they circle back to benefit each other...more details

Instructor's Bio

Dr. Mohamed Elhoseiny is Assistant Professor of Computer Science at the Visual Computing Center at KAUST (King Abdullah University of Science and Technology) and an AI Research consultant at Baidu Research at Silicon Valley AI Lab (SVAIL). He received his PhD from Rutgers University under Prof. Ahmed Elgammal in October 2016, then spent more than two years at Facebook AI Research (FAIR) until January 2019 as a Postdoc Researcher. His primary research interests are in computer vision, especially learning about the unseen or the least seen by recognition (e.g., zero-shot learning) or by generation (creative art and fashion generation). Under the umbrella of how AI may benefit biodiversity, Dr. Elhoseiny’s six-year-long development of the zero-shot learning problem was featured at the United Nations biodiversity conference in November 2018 (~10,000 attendees from >192 countries). His creative AI research projects were recognized at the ECCV18 workshop on Fashion and Art with the best paper award, media coverage in New Scientist Magazine and MIT Tech Review (2017, 2018), a 20-minute speech at the Facebook F8 conference (2018), the official FAIR video (2018), and coverage in the HBO Silicon Valley TV series (2018).

Mohamed Elhoseiny, PhD

Assistant Professor, Visiting Faculty Scholar at KAUST, Stanford University

Workshop: Optuna: A Define-by-Run Hyperparameter Optimization Framework

In this workshop, we introduce Optuna, a next-generation hyperparameter optimization framework with new design criteria: (1) a define-by-run API that allows users to concisely construct dynamic, nested, or conditional search spaces, (2) efficient implementation of both sampling and early-stopping strategies, and (3) an easy-to-setup, versatile architecture that can be deployed for various purposes, ranging from scalable distributed computing to lightweight experiments conducted on a local laptop. Our software is available under the MIT license...more details
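The define-by-run idea — parameters declared inside the objective rather than in a fixed up-front search-space dict — can be illustrated with a tiny hypothetical stand-in. This is not Optuna's implementation, just the shape of the pattern, driven here by naive random search:

```python
import random

class Trial:
    """Hypothetical stand-in for a define-by-run trial object: the search
    space is declared inside the objective, so it can be conditional."""
    def __init__(self):
        self.params = {}

    def suggest_float(self, name, low, high):
        self.params[name] = random.uniform(low, high)
        return self.params[name]

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    if x > 0:                                # conditional search space:
        y = trial.suggest_float("y", 0, 1)   # "y" only exists on this branch
        return (x - 2) ** 2 + y
    return (x - 2) ** 2

# A naive random-search "study" over 200 trials.
random.seed(0)
best_value, best_params = float("inf"), None
for _ in range(200):
    trial = Trial()
    value = objective(trial)
    if value < best_value:
        best_value, best_params = value, trial.params

print(best_value, best_params)  # the best trial should land near x = 2
```

Optuna layers smarter samplers and pruning on top of exactly this trial/objective structure.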

Instructor's Bio

Since his mathematics degree, Crissman has devoted himself to the study of languages, including Spanish, Javascript, German, Python, and Japanese. Previously, Crissman worked on open source projects for automation of game playing systems, including MMORPGs, web-based games, and Pokemon. After finding the limits of rule-based systems, he worked on Deep Learning programs at Preferred Networks, the company that created the AI Python framework Chainer.

Crissman Loomis

AI Engineer at Preferred Networks

Workshop: Deepfakes: Commodification, Consequences And Countermeasures

AI-generated audiovisual media has gathered significant attention from the news, social media platforms, security experts, and governments alike. Since 2017, these realistic-seeming videos and audio have been referred to as “deepfakes”. The significance of deepfakes is explained by three elements. First, we overview the capabilities and accessibility of the technology and its trend towards commodification. Second, we discuss real-world cases of weaponization and other near-term consequences. Third and finally, we consider the technological countermeasures currently being investigated by the community. Throughout the workshop, we will go into the details of several state-of-the-art techniques for the creation and detection of deepfakes.

Instructor's Bio

Giorgio Patrini is CEO and Chief Scientist at Deeptrace, an Amsterdam-based cybersecurity startup building deep learning technology for detecting and understanding fake videos. Previously, he was a postdoctoral researcher at the University of Amsterdam, working on deep generative models; and earlier at CSIRO Data61 in Sydney, Australia, building privacy-preserving learning systems with homomorphic encryption. He obtained his PhD in machine learning at the Australian National University. In 2012 he cofounded Waynaut, an Internet mobility startup acquired by lastminute.com in 2017.

Giorgio Patrini, PhD

CEO and Chief Scientist at Deeptrace

Workshop: Machine Learning Workflows For Software Engineers

The capabilities of intelligent applications often seem like magic to users, but the machine learning and artificial intelligence techniques that enable these features are more accessible than you might think. Developing intelligent features doesn’t require esoteric math or high-performance hardware, but it does require you to start with data rather than with code and to adapt your existing engineering practice to build and manage predictive models in addition to conventional software artifacts.
This hands-on tutorial will introduce machine learning workflows and concepts in the context of a concrete problem and show you how to integrate them into the application development work you’re already doing, focusing on the habits and processes that will help you to get meaningful results from predictive models…more details
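The "models as artifacts in addition to code" habit can be shown in miniature: train from data, serialize the model, and load it in a separate serving step. A generic sketch, assuming a scikit-learn workflow:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Build step: start from data and produce a model *artifact* that gets
# versioned and deployed alongside (not inside) the application code.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)
artifact = pickle.dumps(model)      # in practice: write to object storage

# Serving step: a separate process loads the artifact and predicts.
served = pickle.loads(artifact)
print((served.predict(X) == model.predict(X)).all())  # True: same behavior
```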

Instructor's Bio

William Benton leads a team of data scientists and engineers at Red Hat, where he has focused on enabling machine learning workflows and data processing pipelines in cloud-native environments while solving some fun problems with data.

Will Benton

Manager, Software engineering and Senior Principal Software Engineer at RedHat

Workshop: Machine Learning Workflows For Software Engineers

The capabilities of intelligent applications often seem like magic to users, but the machine learning and artificial intelligence techniques that enable these features are more accessible than you might think. Developing intelligent features doesn’t require esoteric math or high-performance hardware, but it does require you to start with data rather than with code and to adapt your existing engineering practice to build and manage predictive models in addition to conventional software artifacts.
This hands-on tutorial will introduce machine learning workflows and concepts in the context of a concrete problem and show you how to integrate them into the application development work you’re already doing, focusing on the habits and processes that will help you to get meaningful results from predictive models…more details

Instructor's Bio

Sophie Watson is a data scientist in an Emerging Technology Group at Red Hat, where she applies her data science and statistics skills to solving business problems and informing next-generation infrastructure for intelligent application development.

Sophie Watson

Senior Data Scientist at RedHat

Workshop: Building an Industry classifier with the latest scraping, NLP and deployment tools

For BlueVine, and indeed for any fintech company, figuring out a client’s industry is a critical factor in making precise financial decisions. Traditional sources are often pricey, inaccurate, or simply unavailable, and as such leave an opening for an ML-based solution. We met that challenge by building a service that predicts the industry from a business’s publicly available web data. By employing the latest innovations in NLP (BERT) and some of the most powerful scraping and deployment tools available (Scrapy and Amazon SageMaker), we were able to dramatically surpass the performance achieved by any other such tool in the space…more details
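The production system described here relies on BERT, Scrapy, and SageMaker; purely as an illustration of the task's basic shape, below is a deliberately tiny keyword-counting baseline. The lexicon, categories, and `predict_industry` helper are all hypothetical.

```python
# A toy keyword-count baseline for industry classification -- NOT the
# BERT-based production model described above; the lexicon is invented.
LEXICON = {
    "restaurant": {"menu", "dinner", "reservations", "chef"},
    "construction": {"contractor", "remodeling", "roofing", "permits"},
    "retail": {"shop", "sale", "checkout", "shipping"},
}

def predict_industry(page_text):
    """Score each industry by how many of its keywords appear in the text."""
    tokens = set(page_text.lower().split())
    scores = {industry: len(keywords & tokens)
              for industry, keywords in LEXICON.items()}
    return max(scores, key=scores.get)

print(predict_industry("Book dinner reservations and view our chef's menu"))
# prints "restaurant"
```

A baseline like this is mainly useful as a sanity check; the gap between it and a fine-tuned language model is exactly what the workshop quantifies.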

Instructor's Bio

Ido Shlomo is the head of BlueVine’s data science team in the US, where he works on applying machine learning and other automation solutions for risk management, fraud detection and marketing purposes. Recent work is focused on implementing complex NLP tasks in production systems, and specifically on dealing with the challenge of consuming unstructured data. Previously Ido worked in the Economics department at Tel Aviv University as a researcher in structural macroeconomic modeling. Ido holds a dual BA in mathematics and philosophy and an MA in economics, both from Tel Aviv University.

Ido Shlomo

Senior Data Science Manager at

BlueVine

Workshop: The Power of Workflows

Data Science is a field with immense breadth and depth. As our toolkit grows and we learn how to chain models and pipelines together, there’s no limit to the time and complexity we can devote to over-engineering a solution. Many data scientists seem to be much better at learning algorithms and software libraries than they are at identifying good business cases to solve, and designing workflows that will enable them to work productively from start to finish.
In this workshop we look into the planning and design process of data science projects and explore strategies for building efficient workflows that lead to fast prototyping and seamless iterations of machine learning models. We also consider the human side of data science research and look at how increased understanding of our own cognitive processes can limit the impact that our biases and assumptions may have on our current and future work.

Instructor's Bio

Cliff is a Senior Data Scientist and Instructor at Metis, where he teaches a 12-week immersive data science bootcamp covering machine learning, math and statistics, and the Python data science ecosystem. Previously he has worked on user segmentation analysis for Microsoft Office, trained natural language processing models for Cortana, built email classification and event extraction algorithms for Outlook, and automated demand forecasting for Amazon Fresh. He has also spent four years as a hedge fund quant in Chicago, after obtaining Master’s degrees in Statistics and Economics from the University of Chicago.

Cliff Clive

Senior Data Scientist at Metis

Workshop: GPU-Accelerating Data Exploration & Machine Learning with RAPIDS

RAPIDS is an open source initiative to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several projects that expose familiar interfaces, making it easy to accelerate the entire data science pipeline – from ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.

This presentation targets data scientists familiar with the Python data science ecosystem, which includes Pandas, Numpy, and Scikit-learn. A very brief overview of the RAPIDS ecosystem will get us kicked off, followed by an in-depth overview of cuML, the RAPIDS machine learning library…more details

Instructor's Bio

Coming Soon!

John Zedlewski

Director, GPU-accelerated machine learning at NVIDIA

Workshop: 10 Things You Didn't Know About TensorFlow in Production

In this talk, I highlight 10 powerful yet little-known aspects of TensorFlow, based on years of profiling, tuning, and debugging CPU-, GPU-, and TPU-based TensorFlow in production.

Key takeaways include techniques for profiling and tuning TensorFlow Core, TensorFlow Serving, and TensorFlow Lite.

In addition to model training optimizations such as batch normalization and XLA, I demonstrate post-training optimizations including 8-bit quantization and layer-fusing…more details
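To give a feel for what 8-bit post-training quantization does, here is a minimal sketch of affine (asymmetric) quantization in plain Python. This illustrates the general idea only; it is not TensorFlow's actual implementation or API.

```python
def quantize(values, bits=8):
    """Affine (asymmetric) quantization of floats to unsigned integers."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # guard against constant input
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize(q, scale, lo):
    return [lo + qi * scale for qi in q]

weights = [-0.51, 0.0, 0.27, 1.3]         # hypothetical float weights
q, scale, zero_point = quantize(weights)  # ints in [0, 255], plus metadata
restored = dequantize(q, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"quantized: {q}, max round-trip error: {max_err:.4f}")
```

Each weight is stored in one byte instead of four, at the cost of a round-trip error bounded by half the quantization step.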

Instructor's Bio

Chris Fregly is Founder and Applied AI Engineer at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco. He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, author of the O’Reilly Training and Video Series titled, “High Performance TensorFlow in Production with Kubernetes and GPUs.” Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.

Chris Fregly

Founder & CEO at PipelineAI

Workshop: Make beautiful web apps from Jupyter notebooks

Are you a data scientist looking for new and powerful ways to ship data products to users within your organization?
This session offers a crash course in new open source Python, Jupyter and PyData tools to rapidly prototype interactive apps that are as expressive and powerful as any notebook you can build!
Come and learn how to quickly turn Jupyter notebooks – a central element in a data science workflow – into beautiful web apps to share with team members and stakeholders in your organization!…more details

Instructor's Bio

Michal leads a successful data science consultancy, delivering strategy and execution on data science, data engineering, and ML projects for clients in retail, transportation, finance, film, and building automation. Beyond technical contributions, he has developed approaches that create lasting change in organizational collaboration and data literacy, unlocking human potential otherwise held back by elusive inefficiencies in process and culture.
He holds a Master’s degree in Econometrics and Information Engineering from Poznan University of Economics. He has participated in commercial research projects, including mobile phone data research with published results.

Michal Mucha

Senior Data Scientist (independent consultant) at create.ml

Tutorial Sessions

More sessions added weekly

Tutorial: Autonomous Driving: Simulation and Navigation

Autonomous driving has been an active area of research and development over the last decade. Despite considerable progress, there are many open challenges, including automated driving in dense and urban scenes. In this talk, we give an overview of our recent work on simulation and navigation technologies for autonomous vehicles. We present a novel simulator, AutonoVi-Sim, that uses recent developments in physics-based simulation, robot motion planning, game engines, and behavior modeling. We describe novel methods for interactive simulation of multiple vehicles with unique steering or acceleration limits, taking into account vehicle dynamics constraints. In addition, AutonoVi-Sim supports navigation for non-vehicle traffic participants such as cyclists and pedestrians. AutonoVi-Sim also facilitates data analysis…more details

Instructor's Bio

Dinesh Manocha is the Paul Chrisman Iribe Chair in Computer Science & Electrical and Computer Engineering at the University of Maryland College Park. He is also the Phi Delta Theta/Matthew Mason Distinguished Professor Emeritus of Computer Science at the University of North Carolina – Chapel Hill. He has won many awards, including the Alfred P. Sloan Research Fellowship, the NSF Career Award, the ONR Young Investigator Award, and the Hettleman Prize for scholarly achievement. His research interests include multi-agent simulation, virtual environments, artificial intelligence, and robotics. His group has developed a number of packages for multi-agent simulation, crowd simulation, and physics-based simulation that have been used by hundreds of thousands of users and licensed to more than 60 commercial vendors. He has published more than 510 papers and supervised more than 36 PhD dissertations. He is an inventor of 10 patents, several of which have been licensed to industry. His work has been covered by the New York Times, NPR, Boston Globe, Washington Post, ZDNet, as well as a DARPA Legacy Press Release. He is a Fellow of AAAI, AAAS, ACM, and IEEE, a member of the ACM SIGGRAPH Academy, and a Pioneer of the Solid Modeling Association. He received the Distinguished Alumni Award from IIT Delhi and the Distinguished Career in Computer Science Award from the Washington Academy of Sciences. He was a co-founder of Impulsonic, a developer of physics-based audio simulation technologies, which was acquired by Valve Inc in November 2016.

Dinesh Manocha, PhD

Distinguished Professor at the University of Maryland

Tutorial: Matrix and Tensor Estimation in Action

In this workshop, we will provide an overview of techniques for matrix and tensor estimation. We will showcase a wide variety of applications for matrix estimation in analyzing large heterogeneous datasets that may have missing or incorrect entries, including retail, causal inference, sports and networks. These applications will form the basis for some practical demos with opportunities for hands-on experience. Subsequently we will explain the intuition for matrix and tensor estimation algorithms, with a focus on collaborative filtering.
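As a toy illustration of estimating missing entries, here is a rank-1 alternating-least-squares sketch in plain Python. The ratings matrix is invented, and real matrix-estimation systems use higher ranks, regularization, and proper linear algebra; this only shows the basic mechanic.

```python
# Rank-1 matrix completion by alternating least squares on observed entries.
M = [[5, 4, None, 1],
     [4, None, 3, 1],
     [None, 4, 4, 1]]  # None marks a missing entry
rows, cols = len(M), len(M[0])
obs = [(i, j, M[i][j]) for i in range(rows) for j in range(cols)
       if M[i][j] is not None]

u = [1.0] * rows  # row factors
v = [1.0] * cols  # column factors
for _ in range(50):
    for i in range(rows):   # fix v, solve each u[i] in closed form
        num = sum(v[j] * x for r, j, x in obs if r == i)
        den = sum(v[j] ** 2 for r, j, x in obs if r == i)
        u[i] = num / den
    for j in range(cols):   # fix u, solve each v[j] in closed form
        num = sum(u[r] * x for r, c, x in obs if c == j)
        den = sum(u[r] ** 2 for r, c, x in obs if c == j)
        v[j] = num / den

estimate = [[u[i] * v[j] for j in range(cols)] for i in range(rows)]
print(f"predicted value for the missing M[0][2]: {estimate[0][2]:.2f}")
```

The fitted factors reconstruct the observed entries closely and fill in the missing ones, which is the same principle behind collaborative filtering at scale.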

Instructor's Bio

Christina Lee Yu is an Assistant Professor at Cornell University in Operations Research and Information Engineering. Prior to Cornell, she was a postdoc at Microsoft Research New England. She received her PhD in 2017 and MS in 2013 in Electrical Engineering and Computer Science from Massachusetts Institute of Technology in the Laboratory for Information and Decision Systems. She received her BS in Computer Science from California Institute of Technology in 2011. She received honorable mention for the 2018 INFORMS Dantzig Dissertation Award. Her research focuses on designing and analyzing scalable algorithms for processing social data based on principles from statistical inference.

Christina Lee Yu, PhD

Assistant Professor at Cornell University

Tutorial: Recent Advances in Population-Based Search for Deep Neural Networks: Quality Diversity, Indirect Encodings, and Open-Ended Algorithms

We will cover new, exciting, unconventional techniques for improving population-based search. These ideas are already enabling us to solve hard problems. They also hold great promise for further advancing machine learning, including deep neural networks. Major topics covered include (1) explicitly searching for behavioral diversity (in a low-dimensional space where diversity is inherently interesting, such as the behavior of robots, rather than in the true search space, such as the weights of the DNN that controls the robot), especially Quality Diversity algorithms, which have produced state-of-the-art results in robotics and solved a version of the hard-exploration RL challenge of Montezuma’s Revenge…more details

Instructor's Bio

Jeff Clune is the Loy and Edith Harris Associate Professor in Computer Science at the University of Wyoming and a Senior Research Manager and founding member of Uber AI Labs, which was formed after Uber acquired a startup he was a part of. Jeff focuses on robotics and training deep neural networks via deep learning, including deep reinforcement learning. Since 2015, he has won the Presidential Early Career Award for Scientists and Engineers from the White House, had papers on the cover of Nature and PNAS, won an NSF CAREER award, received an Outstanding Paper of the Decade award, and had best paper awards, oral presentations, and invited talks at the top machine learning conferences (NeurIPS, CVPR, ICLR, and ICML). His research is regularly covered in the press, including the New York Times, NPR, NBC, Wired, the BBC, the Economist, Science, Nature, National Geographic, the Atlantic, and the New Scientist. Prior to becoming a professor, he was a Research Scientist at Cornell University and received degrees from Michigan State University (PhD, master’s) and the University of Michigan (bachelor’s). More on Jeff’s research can be found at JeffClune.com.

Jeff Clune, PhD

Senior Research Manager | Harris Associate Professor at Uber AI Labs | Computer Science at the University of Wyoming

Tutorial: Matrix and Tensor Estimation in Action

In this workshop, we will provide an overview of techniques for matrix and tensor estimation. We will showcase a wide variety of applications for matrix estimation in analyzing large heterogeneous datasets that may have missing or incorrect entries, including retail, causal inference, sports and networks. These applications will form the basis for some practical demos with opportunities for hands-on experience. Subsequently we will explain the intuition for matrix and tensor estimation algorithms, with a focus on collaborative filtering.

Instructor's Bio

Coming Soon!

Muhammad Jehangir Amjad, PhD

Software Engineer at Google

Tutorial: Principled Methods for Analyzing Weight Matrices of Modern Production Quality Neural Networks

An important practical challenge is to develop theoretically-principled tools that can be used to guide the use of production-scale deep neural networks. We will describe recent work that has focused on using spectral-based methods from scientific computing and statistical mechanics to develop such tools. Among other things, these tools can be used to develop metrics characterizing the quality of models, without even examining training or test data; and they can be used to predict trends in generalization (and not just bounds on generalization) for state-of-the-art production-scale models…more details

Instructor's Bio

Michael Mahoney is at ICSI and Department of Statistics at UC Berkeley. He works on algorithmic and statistical aspects of modern large-scale data analysis. He is a leader in Randomized Numerical Linear Algebra; he led the largest large-scale empirical evaluation of community structure in social and information networks; he has developed implicit regularization methods and scalable optimization methods for convex and non-convex problems; and he has applied these methods and complementary RMT methods to DNN problems.

Michael Mahoney, PhD

Professor at UC Berkeley

Tutorial: Principled Methods for Analyzing Weight Matrices of Modern Production Quality Neural Networks

An important practical challenge is to develop theoretically-principled tools that can be used to guide the use of production-scale deep neural networks. We will describe recent work that has focused on using spectral-based methods from scientific computing and statistical mechanics to develop such tools. Among other things, these tools can be used to develop metrics characterizing the quality of models, without even examining training or test data; and they can be used to predict trends in generalization (and not just bounds on generalization) for state-of-the-art production-scale models…more details

Instructor's Bio

Charles Martin holds a PhD in Theoretical Chemistry from the University of Chicago. He was then an NSF Postdoctoral Fellow and worked in a Theoretical Physics group at UIUC that studied the statistical mechanics of Neural Networks. He currently owns and operates Calculation Consulting, a boutique consultancy specializing in ML and AI, supporting clients doing applied research in AI. He maintains a well-recognized blog on practical ML theory and he has to date supported and performed the work on Implicit and Heavy Tailed Self Regularization in Deep Learning.

Charles Martin, PhD

CEO at Calculation Consulting

Tutorial: Tackling Climate Change with Machine Learning

Climate change is one of the greatest challenges facing humanity, and data scientists may wonder how we can help. In this talk, we will see how machine learning can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate. From smart grids to disaster management, we explore high impact problems where existing gaps can be filled by machine learning, in collaboration with other fields. These problems lead to exciting research questions as well as promising business opportunities.

Instructor's Bio

David Rolnick is an NSF Mathematical Sciences Postdoctoral Research Fellow at the University of Pennsylvania. His research focuses on the mathematical foundations of deep learning. David is co-founder of Climate Change AI, an organization dedicated to furthering applications of machine learning that meaningfully address the climate crisis.

David Rolnick, PhD

Postdoctoral Research Fellow at University of Pennsylvania

Tutorial: Deep Implicit Learning

There are many benefits of implicit models. In the talk, I will provide an overview of implicit learning and detail some exciting developments towards robustness, interpretability, and architecture optimization.

Instructor's Bio

Laurent graduated from Ecole Polytechnique (Palaiseau, France) in 1985, and obtained his PhD in Aeronautics and Astronautics at Stanford University in 1990. He joined the EECS department at Berkeley in 1999, then went on leave in 2003-2006 to work for SAC Capital Management. He teaches optimization and data science in EECS and within the Masters of Financial Engineering program at the Haas School of Business. Laurent’s research focuses on sparse and robust optimization and applications to data science, with a focus on finance. In 2016 Laurent co-founded Kayrros S.A.S., a company that delivers physical asset information for the energy markets from various sources such as satellite imagery; in 2018 he co-founded SumUp Analytics, which provides high-speed streaming text analytics for business applications.

Laurent El Ghaoui, PhD

Professor at University of California – Berkeley, BAIR Lab

Tutorial: The Robustness Problem

Despite impressive performance on many benchmarks, state-of-the-art machine learning algorithms have been shown to be extremely brittle in the presence of distribution shift. In this talk we will survey several recent works in the literature on robustness, discussing known causes of this brittleness and the best methods for mitigating the problem. We will focus on robustness in the image domain, where models have been shown to easily latch onto spurious correlations in the data. We will also discuss how the popular notion of adversarial examples relates to the problem of distribution shift.
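A toy simulation can make the spurious-correlation failure mode concrete: a classifier that latches onto a feature that merely co-occurs with the label does well in training, then collapses when the correlation flips under distribution shift. The data-generating setup below is entirely hypothetical.

```python
import random

random.seed(1)

def make_data(n, spurious_agreement):
    """Each example: (core_feature, spurious_feature, label). The core
    feature determines the label; the spurious one merely co-occurs with
    it at the given rate -- a rate that can change after deployment."""
    examples = []
    for _ in range(n):
        label = random.randint(0, 1)
        core = label
        spurious = label if random.random() < spurious_agreement else 1 - label
        examples.append((core, spurious, label))
    return examples

train = make_data(1000, spurious_agreement=0.95)  # cue works in training...
test = make_data(1000, spurious_agreement=0.05)   # ...and flips after a shift

# A "lazy" classifier that latched onto the spurious feature:
lazy_train = sum(s == y for _, s, y in train) / len(train)
lazy_test = sum(s == y for _, s, y in test) / len(test)
# A classifier using the core feature is unaffected by the shift:
core_test = sum(c == y for c, _, y in test) / len(test)
print(f"lazy: train {lazy_train:.2f}, shifted test {lazy_test:.2f}; "
      f"core: shifted test {core_test:.2f}")
```

In images, the "spurious feature" might be a background texture or watermark; the arithmetic of the failure is the same.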

Instructor's Bio

Justin is a Research Scientist at Google Brain working on statistical machine learning and artificial intelligence. Much of his current focus is on building robust statistical classifiers that can generalize well in dynamic environments in the real world. He holds a PhD in Theoretical Mathematics from Rutgers University.

Justin Gilmer, PhD

Research Scientist at Google Brain

Tutorial: When the Bootstrap Breaks

Resampling methods like the bootstrap are becoming increasingly common in modern data science. For good reason too; the bootstrap is incredibly powerful. Unlike t-statistics, the bootstrap doesn’t depend on a normality assumption or require any arcane formulas. You’re no longer limited to working with well-understood metrics like means. One can easily build tools that compute confidence intervals for an arbitrary metric. What’s the standard error of a median? Who cares! I used the bootstrap.

With all of these benefits the bootstrap begins to look a little magical. That’s dangerous. To understand your tool you need to understand how it fails, how to spot the failure, and what to do when it does. As it turns out, methods like the bootstrap and the t-test struggle with very similar types of data…more details
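The percentile bootstrap for a median can be sketched in a few lines of plain Python (the sample here is synthetic):

```python
import random
import statistics

random.seed(42)
sample = [random.gauss(50, 10) for _ in range(200)]  # synthetic metric values

def bootstrap_ci(data, stat=statistics.median, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for an arbitrary statistic."""
    boot_stats = sorted(
        stat(random.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_boot)
    )
    lo = boot_stats[int(alpha / 2 * n_boot)]
    hi = boot_stats[int((1 - alpha / 2) * n_boot)]
    return lo, hi

lo, hi = bootstrap_ci(sample)
print(f"95% CI for the median: ({lo:.1f}, {hi:.1f})")
```

Swapping `statistics.median` for any other statistic is exactly the flexibility described above, and exactly where the failure modes discussed in the session become important.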

Instructor's Bio

Ryan Harter is a Senior-Staff Data Scientist with Mozilla working on Firefox. He has years of experience solving business problems in the technology and energy industries both as a data scientist and data engineer. Ryan shares practical advice for applying data science as a mentor and in his blog.

Ryan Harter

Senior Staff Data Scientist at Mozilla

Tutorial: When the Bootstrap Breaks

Resampling methods like the bootstrap are becoming increasingly common in modern data science. For good reason too; the bootstrap is incredibly powerful. Unlike t-statistics, the bootstrap doesn’t depend on a normality assumption or require any arcane formulas. You’re no longer limited to working with well-understood metrics like means. One can easily build tools that compute confidence intervals for an arbitrary metric. What’s the standard error of a median? Who cares! I used the bootstrap.

With all of these benefits the bootstrap begins to look a little magical. That’s dangerous. To understand your tool you need to understand how it fails, how to spot the failure, and what to do when it does. As it turns out, methods like the bootstrap and the t-test struggle with very similar types of data…more details

Instructor's Bio

Saptarshi Guha is a Senior Staff Data Scientist with Mozilla working across domains at Firefox from marketing and software quality to product development. He has been at Firefox for seven years and witnessed the data team grow from ‘two guys and a dog’ to a sophisticated collaboration between product, data engineering and data science.

Saptarshi Guha, PhD

Senior Staff Data Scientist at Mozilla

Tutorial: Validate and monitor your AI and machine learning models

You’ve created a wicked AI or machine learning model that changes the way you do business.

Good job.

But how do you validate your model and monitor it in the long run?
Advanced machine learning and AI models get more and more powerful. They also tend to become more complicated to validate and monitor. This has a major impact on businesses’ adoption of models. Initial validation and monitoring are not only critical to ensure a model’s sound performance; they are also mandatory in some industries, such as banking and insurance…more details
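One common monitoring check, sketched below under toy assumptions, is the Population Stability Index (PSI), which compares the distribution of model scores at training time against fresh production scores. The thresholds in the docstring are an industry convention, not a guarantee, and the score distributions here are synthetic.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of model scores.
    A common rule of thumb (a convention, not a guarantee):
    < 0.1 stable, 0.1-0.25 worth watching, > 0.25 investigate."""
    lo = min(expected + actual)
    width = (max(expected + actual) - lo) / bins

    def fracs(values):
        counts = [0] * bins
        for v in values:
            b = min(int((v - lo) / width), bins - 1) if width else 0
            counts[b] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)

    e, a = fracs(expected), fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(7)
train_scores = [random.betavariate(2, 2) for _ in range(1000)]    # at training time
fresh_scores = [random.betavariate(2, 2) for _ in range(1000)]    # same population
drifted_scores = [random.betavariate(5, 2) for _ in range(1000)]  # shifted population

print(f"PSI (no drift):   {psi(train_scores, fresh_scores):.3f}")
print(f"PSI (with drift): {psi(train_scores, drifted_scores):.3f}")
```

A check like this runs on a schedule against production scores; a rising PSI is a prompt to investigate or revalidate, not an automatic verdict on the model.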

Instructor's Bio

Olivier is a data science expert whose leading field of expertise and cutting-edge knowledge of AI and machine learning led him to support many companies’ digital transformations, as well as implementing projects in different industries. He has led the data team and put in place a data culture in companies like Pratt & Whitney Canada, L’Oréal, GSoft and now Moov AI.

Olivier Blais

Co-Founder and Data Scientist at Moov AI

Tutorial: Learning From Limited Data

Tremendous progress has been achieved across a wide range of machine learning tasks with the introduction of deep learning in recent years. However, conventional deep learning approaches rely on large amounts of labeled data and suffer from performance decay in problems with limited training data. On the one hand, objects in the real world have a long-tailed distribution, and obtaining annotated data is expensive. On the other hand, novel categories of objects arise dynamically in nature, which fundamentally limits the scalability and applicability of supervised learning models in this dynamic scenario, where labeled examples are not available. Take surveillance traffic analysis as an example: current solutions need examples that span all weather conditions, times of day, cities of operation, and camera locations to produce a model robust to these variations…more details
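One common baseline for classifying novel categories from a handful of labeled examples is nearest-centroid ("prototype") classification. The 2-D sketch below, with invented class centers and synthetic points, shows the idea in a 5-shot setting.

```python
import math
import random

random.seed(3)

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest_centroid(prototypes, x):
    return min(prototypes, key=lambda label: math.dist(prototypes[label], x))

# Five labeled examples per novel class ("5-shot"), drawn around invented
# class centers in a toy 2-D feature space.
centers = {"cat": (0.0, 0.0), "dog": (4.0, 4.0)}
support = {label: [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5))
                   for _ in range(5)]
           for label, (cx, cy) in centers.items()}
prototypes = {label: centroid(pts) for label, pts in support.items()}

print(nearest_centroid(prototypes, (3.7, 4.2)))  # classifies a query point
```

In practice the raw features would be embeddings from a pretrained network, and meta-learning methods refine this basic prototype idea.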

Instructor's Bio

Dr. Shanghang Zhang is currently a researcher at Petuum Inc. Her research covers deep learning, computer vision, and natural language processing. She especially focuses on domain adaptation, meta-learning, and low-shot learning. She has been awarded “2018 Rising Stars in EECS” (a highly selective program launched at MIT in 2012, and since hosted annually at UC Berkeley, Carnegie Mellon, and Stanford). She is the recipient of the Adobe Academic Collaboration Fund and a Qualcomm Innovation Fellowship (QInF) Finalist Award. She was also selected for the CVPR 2018 Doctoral Consortium and invited to Facebook’s 3rd Annual Women in Research Lean In Event. Shanghang has co-organized the Human in the Loop Learning Workshop at ICML 2019, and the Special Session “MMA: Multi-Modal Affective Computing of Large-Scale Multimedia Data” at the ACM International Conference on Multimedia Retrieval (ICMR), 2019. Before joining Petuum, Shanghang received her Ph.D. from Carnegie Mellon University, supervised by Prof. Jose Moura and Prof. Joao Costeira.

Shanghang Zhang, PhD

Postdoctoral Researcher at Carnegie Mellon

Sign Up for ODSC West | Oct 29th – Nov 1st, 2019

Register Now

Highly Experienced Instructors

Our instructors are highly regarded in data science, coming from both academia and notable companies.

Real World Applications

Gain the skills and knowledge to use data science in your career and business, without breaking the bank.

Cutting Edge Subject Matter

Find training sessions offered on a wide variety of data science topics, from machine learning to data visualization.

ODSC Training Includes

Opportunities to form working relationships with some of the world’s top data scientists for follow-up questions and advice.

Access to 40+ training sessions  and 50 workshops.

Hands-on experience with the latest frameworks and breakthroughs in data science.

Affordable training–equivalent training at other conferences costs much more.

Professionally prepared learning materials, custom- tailored to each course.

Opportunities to connect with other ambitious, like-minded data scientists.