ODSC East 2019 Warm-Up: DataOps
Haftan Eckholdt, Ph.D.
Chief Data Science & Chief Science Officer, Understood.org
Making Data Science: AIG, Amazon, Albertsons
Developing an internal data science capability requires a cultural shift, a strategic mapping process that aligns with existing business objectives, a technical infrastructure that can host new processes, and an organizational structure that can alter business practice to create a measurable impact on business functions. This workshop will take you through ways to consider the vast opportunities for data science to identify and prioritize what will add the most value to your organization, and then budget and hire into commitments. Learn the most effective ways to establish data science objectives from a business perspective including recruiting, retention, goal setting, and improving business.
Haftan Eckholdt, Ph.D., is Chief Data Science Officer at Understood.org. His career began with research professorships in Neuroscience, Neurology, and Psychiatry followed by industrial research appointments at companies like Amazon and AIG. He holds graduate degrees in Biostatistics and Developmental Psychology from Columbia and Cornell Universities. In his spare time, he thinks about things like chess and cooking and cross country skiing and jogging and reading. When things get really really busy, he actually plays chess and cooks delicious meals and jogs a lot. Born and raised in Baltimore, Haftan has been a resident of Kings County, New York since the late 1900s.
Christopher P. Bergh
CEO, Head Chef, DataKitchen
The DataOps Manifesto
The list of failed big data projects is long. They leave end-users, data analysts and data scientists frustrated with long lead times for changes. This presentation will illustrate how to make changes to big data, models, and visualizations quickly, with high quality, using the tools analytic teams love. We synthesize DevOps, Deming, and direct experience into the DataOps Manifesto.
To paraphrase an old saying: “It takes a village to get insights from data.” Data analysts, data scientists, and data engineers are already working in teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up failing? Christopher Bergh presents the seven shocking steps to get these groups of people working together. These seven steps contain practical, doable steps that can help you achieve data agility.
After looking at trends in analytics and a brief review of Agile, Christopher outlines the steps to apply DevOps techniques from software development to create an Agile analytics operations environment, including how to add tests, modularize and containerize, do branching and merging, use multiple environments, parameterize your process, use simple storage, and use multiple workflows to deploy to production with W. Edwards Deming efficiency. He also explains why “don’t be a hero” should be the motto of analytic teams, emphasizing that while being a hero can feel good, it is not the path to success for individuals in analytic teams.
Christopher’s goal is to teach analytic teams how to deliver business value quickly and with high quality. He illustrates how to apply Agile processes to your department. However, a process is not enough. Walking through the seven shocking steps will demonstrate how to create a technical environment that truly enables speed and quality by supporting DataOps.
Christopher Bergh is a Founder and Head Chef at DataKitchen.
Chris has more than 20 years of research, engineering, analytics, and executive management experience. Previously, Chris was Regional Vice President in the Revenue Management Intelligence group in Model N. Before Model N, Chris was COO of LeapFrogRx, an analytics software and service provider. Chris led the acquisition of LeapFrogRx by Model N in January 2012. Prior to LeapFrogRx, Chris was CTO and VP of Product Management of MarketSoft (now part of IBM), an Enterprise Marketing Management software vendor. Prior to that, Chris developed Microsoft Passport, the predecessor to Windows Live ID, a distributed authentication system used by hundreds of millions of users today. He was awarded a US Patent for his work on that project. Before joining Microsoft, he led the technical architecture and implementation of Firefly Passport, an early leader in Internet Personalization and Privacy. Microsoft subsequently acquired Firefly. Chris led the development of the first travel-related e-commerce web site at NetMarket. Chris began his career at the Massachusetts Institute of Technology’s (MIT) Lincoln Laboratory and NASA Ames Research Center. There he created software and algorithms that provided aircraft arrival optimization assistance to Air Traffic Controllers at several major airports in the United States. Chris served as a Peace Corps Volunteer Math Teacher in Botswana, Africa. Chris has an M.S. from Columbia University and a B.S. from the University of Wisconsin-Madison. He is an avid cyclist, hiker, reader, and father of two teenagers.
More speakers will be announced soon!
ODSC East Ignite Accelerate AI Webinar Warmup
Senior Curriculum Lead, DataCamp
Building an Analytics Team
Based on her experience of building analytics teams from the ground up, Hillary will walk through the process of creating an analytics team.
We’ll begin by examining why analytics teams exist and how they are different from Data Science teams. Next, we’ll discuss possible structures for the analytics team, including embedded, independent, and hybrid structures.
We’ll talk about best practices in hiring a diverse and talented analytics team, including good interview questions and interview tools, such as CoderPad, to ensure that applicants have the necessary skill set.
Once the team is up and running, it needs to integrate with Product teams. Creating best practices around data creation and experimental design can make sure that your team is involved early before problems can surface.
Success can bring challenges, such as too many under-defined requests. Creating a ticketing system unique to your team can ensure that ad hoc requests can be handled in a systematic and efficient manner. This is key to scaling an analytics team.
There are many approaches to becoming the voice of data at a company. Building a data reporting ecosystem ensures that all internal clients have access to what they need when they need it. The talk will cover dashboarding, alert systems, and data newsletters. Finally, we’ll discuss promoting responsible data conception through continuous training in statistics and tooling for all members of an organization.
Hillary is a Senior Curriculum Lead at DataCamp. She is an expert in creating a data-driven product and curriculum development culture, having built the Product Intelligence team at Knewton and the Data Science team at Codecademy. She enjoys explaining data science in a way that is understandable to people with both PhDs in Math and BAs in English.
Customer Success Team Lead, Dataiku
Building and Managing World-Class Data Science Teams (Easier Said Than Done)
Despite the promise and opportunities of data science, many organizations are failing to see a return on their investment. The key issue holding organizations back is a lack of good data science management. This manifests in failure to effectively build and manage teams. In this workshop, we will go through a methodological approach for helping managers identify the needs of their organization and build the appropriate team. We will learn how to:
1 – put in place the appropriate foundational elements
2 – select and recruit the right team
3 – develop and manage that team to success
4 – create pipelines of good data science managers and technical rock stars
Conor Jensen is an experienced Data Science executive with over 15 years working in the analytics space across multiple industries as both a consumer and developer of analytics solutions. He is the founder of Renegade Science, a Data Science strategy and coaching consultancy and works as a Customer Success Team Lead at Dataiku, helping customers make the most of their Data Science platform and guiding them through building teams and processes to be successful. He has worked at multiple Data Science platform startups and has successfully built out analytics functions at two multinational insurance companies. This includes building out data and analytics platforms, Business Intelligence capabilities, and Data Science teams serving both internal and external customers.
Before moving to insurance, Conor was a Weather Forecaster in the US Air Force supporting operations in Southwest Asia. After leaving the military, Conor spent a number of years in store management at Starbucks Coffee while serving as an Emergency Management Technician in the Illinois Air National Guard.
Conor earned his Bachelor of Science degree in Mathematics from the University of Illinois at Chicago.
Adam Jenkins, Ph.D.
Data Science Lead, Biogen
Integrating Data Science into Commercial Pharma: The Good, The Bad, and The Validated
One of the most difficult industries for data science to take hold and gain effectiveness is the world of commercial pharma/biotech. Because of FDA regulation, a lack of identifiable patient data, and its status as one of the last industries to use a “traveling salesperson” approach, data science is still taking hold in this industry. This talk will discuss in depth the steps that companies in this space can take to make the most out of their data science teams and out of their data in general. These steps will include standardizing internal data, utilizing 3rd party data in unique methodologies, staying the course during marketing and sales initiatives, and creating validation methods.
We will dive into these issues through the context of how to bring the industry from one of “old school” sales and marketing techniques into one where machine learning can make tangible top- and bottom-line impacts. Through this lens, we will identify areas of opportunity that should first be tackled by any organization and those areas which are often pitfalls (even though they may seem lucrative). Additionally, an ideal team make-up and timeline will be outlined so that these companies can level-set where they are and where they can improve their data science processes.
Adam Jenkins is a Data Science Lead at Biogen, where he works on optimizing commercial outcomes through marketing, patient outreach, and field force infrastructure utilizing data science and predictive analytics. Biogen has been a leader in the treatment and research of neurological diseases for 40 years. Prior to being commercial lead, Adam was part of their Digital Health team where he worked on the next-generation application of wearable and neurological tests. Holding a Ph.D. in genomics, he also teaches management skills for data science and big data initiatives at Boston College.
Jennifer Kloke, Ph.D.
VP of Product Innovation, Ayasdi
AI and Value-Based Care: Reducing Costs and Enhancing Patient Outcomes
Politics aside, value-based care is the model that is transforming the practice and compensation of healthcare in the United States. Once laggards, payers and providers are increasingly becoming sophisticated enterprises when it comes to data, and the implications for healthcare are staggering. What lies within that data has the power to cure disease, reduce readmissions, enable precision medicine, improve population health, detect fraud and reduce waste.
Take Flagler Hospital, a 335-bed hospital in St. Augustine, Florida. They don’t have a single data scientist on staff. Nonetheless, they have orchestrated one of the most successful deployments of artificial intelligence in healthcare — delivering cost savings of more than 30%, reducing the length of stay by days and reducing readmissions by a factor of more than 7X.
In this talk, Dr. Jennifer Kloke, VP of Product Innovation at Ayasdi, will walk through how healthcare institutions small and large will be able to apply artificial intelligence in the pursuit of value-based care. She will discuss the strategy, implementation, and results seen to date and go over how these advances are transforming the healthcare industry.
Dr. Jennifer Kloke is the VP of Product Innovation at Ayasdi. For the last three years, she has been responsible for the automation and algorithm development for the entire Ayasdi codebase and led many efforts to develop cutting-edge analysis techniques utilizing TDA and AI. During that time, she was the principal investigator for a Phase 2 DARPA SBIR developing automation and data fusion capabilities. These have led to breakthroughs in the field and several patents. Jennifer also served five years as a Senior Data Scientist analyzing a wide variety of data including point cloud, text, and networks from diverse industries including large military contractors, finance, bio-tech, and electronics manufacturing. Her work includes developing prediction algorithms for reducing the number of false alarms for a large military jet manufacturer as well as developing and deploying a predictive program management application at a large government contractor.
Jennifer received her Ph.D. in Mathematics from Stanford University with an emphasis on topological data analysis. She has collaborated with chemists at Lawrence Berkeley National Laboratory and UC Berkeley to develop topological methods to mine large databases of chemical compounds to identify energy-efficient compounds for carbon capture. She also developed a de-noising algorithm to efficiently process high dimensional data and has published in the Journal of Differential Geometry.
ODSC East 2019 Warm-Up: Open Source
Founder, Dunder Data
Integrating Pandas with Scikit-Learn, an Exciting New Workflow
In this hands-on tutorial, we will use these new additions to Scikit-Learn to build a modern, robust, and efficient workflow for those starting from a Pandas DataFrame. There will be ample practice problems and detailed notes available so that you can use it immediately upon completion.
Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.
PyTorch Examples for the Most Common Neural Net Mistakes
It takes years to build intuition and tricks of the trade. Alternatively, we can learn the basics from the greats and focus on greater challenges. With deep learning and computer vision, there are many pitfalls and hacks to work around and debug. On June 30th, 2018, Andrej Karpathy, Director of AI at Tesla, tweeted a short list of first things to check when your neural network isn’t working.
In this session, you will see what these mistakes look like in code and performance metrics. Using a computer vision dataset and a PyTorch code sample – we’ll walk through each of these pieces of advice, test it and explain it. Expect a technical deep dive and a review of best practices when debugging a PyTorch computer vision experiment.
Yuval Greenfield has been an engineer and data enthusiast for the past 13 years in the fields of military cybersecurity, computer vision medical diagnostics, gaming, 360 cameras, and deep-learning tools. He holds a B.Sc. in Physics and Mathematics from the Hebrew University of Jerusalem as part of the IDF Talpiot program. At MissingLink, Yuval is in charge of developer relations, using the MissingLink platform for deep learning research, building tutorials, marketing content, and technical presentations.
Supervisor, Data Education, Children’s Hospital of Philadelphia
Mapping Geographic Data in R
In this hands-on workshop, we will use R to take public data from various sources and combine them to find statistically interesting patterns and display them in static and dynamic, web-ready maps. This session will cover topics including geojson and shapefiles, how to munge Census Bureau data, geocoding street addresses, transforming latitude and longitude to the containing polygon, and data visualization principles.
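As a rough illustration of what "transforming latitude and longitude to the containing polygon" involves, here is a minimal sketch of the classic ray-casting test. It is written in plain Python rather than the R used in the workshop, it treats coordinates as planar, and the region names and square boundaries are hypothetical. Real workflows would typically read boundaries from shapefiles or GeoJSON and rely on a spatial library instead.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. `polygon` is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings strictly to the east of the point.
            if lon < x_cross:
                inside = not inside
    return inside


def containing_region(lon, lat, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical regions: two adjacent squares standing in for census tracts.
regions = {
    "tract_a": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "tract_b": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}

print(containing_region(1.0, 1.0, regions))
print(containing_region(3.0, 1.0, regions))
```

Points that fall exactly on a shared boundary are assigned somewhat arbitrarily by this test, which is one reason production code usually defers to a tested geometry library.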
Joy Payton is a data scientist and data educator at the Children’s Hospital of Philadelphia (CHOP), where she helps biomedical researchers learn the reproducible computational methods that will speed time to science and improve the quality and quantity of research conducted at CHOP. A longtime open source evangelist, Joy develops and delivers data science instruction on topics related to R, Python, and git to an audience that includes physicians, nurses, researchers, analysts, developers, and other staff. Her personal research interests include using natural language processing to identify linguistic differences in a neurodiverse population as well as the use of government open data portals to conduct citizen science that draws attention to issues affecting vulnerable groups. Joy holds a degree in philosophy and math from Agnes Scott College, a divinity degree from the Universidad Pontificia de Comillas (Madrid), and a data science Masters from the City University of New York (CUNY).
Daniel Parton, Ph.D.
Lead Data Scientist, Bardess Group
Analyzing Legislative Burden upon Businesses Using NLP and ML
In this hands-on workshop, we’ll first describe the legislative/business context for the initiative, then walk attendees through the technical implementation. The work will be conducted by combining various techniques from the NLP toolbox, such as entity recognition, part-of-speech tagging, automatic summarization, and topic modeling. Work will be conducted in Python, making use of libraries for NLP such as spacy and nltk, and the ML library scikit-learn. We will also showcase interactive dashboards which have been created using the BI tool Qlik to allow exploration of the results of the analysis.
Daniel Parton & Serena Peruzzo (co-presenters) bios
Dr. Daniel Parton leads the data science practice at the analytics consultancy, Bardess. He has a background in academia, including a Ph.D. in computational biophysics from the University of Oxford, and previously worked in marketing analytics at Omnicom. He brings both technical and management experience to his role of leading cross-functional data analytics teams and has led successful and impactful projects for companies in finance, retail, tech, media, manufacturing, pharma, and sports/entertainment industries.
Serena Peruzzo is a senior data scientist at the analytics consultancy, Bardess. Her formal background is in Statistics with experience working both in industry and academia. She has worked as a consultant in the Australian, British and Canadian markets delivering data science solutions across a broad range of industries and led several startups through the process of bootstrapping their data science capabilities.
ODSC East 2019 Warm-Up: AI for Engineers
President, Enplus Advisors Inc.
Programming with Data: Python and Pandas
Whether in R, MATLAB, Stata, or Python, modern data analysis, for many researchers, requires some kind of programming. The preponderance of tools and specialized languages for data analysis suggests that general purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.
In this workshop, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality, specifically, loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.
Daniel Gerlanc has worked as a data scientist for more than a decade and written software professionally for 15 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors. At Enplus, he works with clients on data science and custom software development with a particular focus on projects requiring an expertise in both areas. He teaches data science and software development at introductory through advanced levels. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups.
Principal Software Engineer, Twilio
Real-ish Time Predictive Analytics with Spark Structured Streaming
In this workshop we will dive deep into what it takes to build and deliver an always-on “real-ish time” predictive analytics pipeline with Spark Structured Streaming.
The core focus of the workshop material will be on how to solve a common complex problem in which we have no labeled data in an unbounded timeseries dataset and need to understand the substructure of said chaos in order to apply common supervised and statistical modeling techniques to our data in a streaming fashion.
Scott Haines is a full stack engineer with a current focus on real-time, highly available, trustworthy analytics systems. He is currently working at Twilio (as Principal Engineer / Tech Lead of the Voice Insights team) where he helped drive Spark adoption and streaming pipeline architectures. Prior to Twilio, he worked writing the backend Java APIs for Yahoo Games, as well as the real-time game ranking/ratings engine (built on Storm) to provide personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo working for Flurry Analytics where he wrote the alerts/notifications system for mobile.
Leonardo De Marchi
Head of Data Science and Analytics, Badoo
Modern and Old Reinforcement Learning
Reinforcement Learning recently progressed greatly in the industry as one of the best techniques for sequential decision making and control policies.
In this presentation we will explore Reinforcement Learning, starting from its fundamentals and ending with creating our own algorithms.
We will use OpenAI gym to try our RL algorithms.
We will then explore other RL frameworks and more complex concepts like policy gradient methods and Deep Reinforcement Learning, which recently changed the field of Reinforcement Learning.
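To give a flavor of the fundamentals mentioned above, here is a minimal sketch of tabular Q-learning. It does not use OpenAI Gym: the five-state corridor environment, its reward, and all hyperparameters are hypothetical choices made only to keep the example self-contained.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the goal (hypothetical corridor)
ACTIONS = (0, 1)  # 0 = step left, 1 = step right


def step(state, action):
    """Deterministic corridor dynamics: reward 1.0 only on reaching the goal."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, max_steps=200, seed=0):
    """Learn a Q-table from uniformly random exploration (Q-learning is off-policy)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best next-state value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = q_learning()
# Greedy action in each non-terminal state after training.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

The same update rule carries over to richer environments; the main change is replacing the hand-written `step` function with an environment library's own interface.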
Leonardo De Marchi holds a Master's in Artificial Intelligence and has worked as a Data Scientist in the sport world, with clients such as New York Knicks and Manchester United, and with large social networks, like JustGiving. He now works as Lead Data Scientist at Badoo, the largest dating site with over 360 million users. He is also the lead instructor at ideai.io, a company specialized in Deep Learning and Machine Learning training, and a contractor for the European Commission.
Sourav Dey, PhD
Reproducible Data Science Using Orbyter
Artificial Intelligence is already helping many businesses become more responsive and competitive, but how do you move machine learning models efficiently from research to deployment at enterprise scale? It is imperative to plan for deployment from day one, both in tool selection and in the feedback and development process. Additionally, just as DevOps is about people working at the intersection of development and operations, there are now people working at the intersection of data science and software engineering who need to be integrated into the team with tools and support.
At Manifold, we’ve developed the Lean AI process to streamline machine learning projects and the open-source Orbyter package for Docker-first data science to help your engineers work as an integrated part of your development and production teams. In this workshop, Sourav and Alex will focus heavily on the DevOps side of things, demonstrating how to use Orbyter to spin up data science containers and discussing experiment management as part of the Lean AI process.
Sourav Dey & Alex Ng (co-presenters) bios
As CTO for Manifold, Sourav is responsible for the overall delivery of data science and data product services to make clients successful. Before Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google / Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He earned his PhD, MS, and BS degrees from MIT in Electrical Engineering and Computer Science.
Alexander Ng is a Senior Data Engineer at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Alex served as both a Sales Engineering Tech Lead and a DevOps Tech Lead for Kyruus, a startup that built SaaS products for enterprise healthcare organizations. Alex got his start as a Software Systems Engineer at the MITRE Corporation and the Naval Undersea Warfare Center in Newport, RI. His recent projects at the intersection of systems and machine learning continue to combine a deep understanding of the entire development lifecycle with cutting-edge tools and techniques. Alex earned his Bachelor of Science degree in Electrical Engineering from Boston University, and is an AWS Certified Solutions Architect.
Principal Data Scientist
Becoming The Complete Data Scientist with Data Literacy and Data Storytelling
I will review some of the key data literacy components that contribute to successful data science in real world applications. In discussing these concepts, I will give examples through the art of data storytelling, which aims to answer the core questions that your clients, colleagues, and stakeholders want to have answered: What? So what? Now what? By focusing your effort on addressing the user questions and user requirements, which then drive your project’s data and modeling activities, which then fuel your final data products and project deliverables, you will establish yourself as a key contributor to any analytics team. Your technical skills may bring you customers, but it’s not the technical stuff that you know (i.e., your successes) that brings your customers back. What brings customers back is your customers’ successes, which are nurtured and grown through clear explanations of the data, the modeling activities, and the results, which they can then share with others.
Kirk Borne is a data scientist and an astrophysicist who has used his talents at Booz Allen since 2015. He was professor of astrophysics and computational science at George Mason University (GMU) for 12 years. He served as undergraduate advisor for the GMU data science program and graduate advisor in the computational science and informatics Ph.D. program.
Kirk spent nearly 20 years supporting NASA projects, including NASA’s Hubble Space Telescope as data archive project scientist, NASA’s Astronomy Data Center, and NASA’s Space Science Data Operations Office. He has extensive experience in large scientific databases and information systems, including expertise in scientific data mining. He was a contributor to the design and development of the new Large Synoptic Survey Telescope, for which he contributed in the areas of science data management, informatics and statistical science research, galaxies research, and education and public outreach.
Ph.D., Author, Lecturer, Core Contributor of scikit-learn
Introduction to Machine Learning
Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, by using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.
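The idea of learning a function from input-output pairs can be sketched in a few lines. The example below is only an illustration, not material from the talk: it fits a straight line by ordinary least squares in plain Python, and the toy training pairs and held-out pairs are made up so that the true relationship is y = 2x + 1.

```python
def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned function to a new input."""
    slope, intercept = model
    return slope * x + intercept


def mean_squared_error(model, pairs):
    """Average squared gap between predictions and true outputs."""
    return sum((predict(model, x) - y) ** 2 for x, y in pairs) / len(pairs)


# Made-up training data (input-output pairs) and held-out evaluation data.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
held_out = [(4.0, 9.0), (5.0, 11.0)]

model = fit_line(train)                        # "learn" the mapping from x to y
test_error = mean_squared_error(model, held_out)  # evaluate on unseen pairs
print(model, test_error)
```

Evaluating on pairs that were not used for fitting mirrors the apply-then-evaluate workflow described in the abstract; libraries such as scikit-learn wrap the same pattern behind a common interface.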
Ph.D. in Physics and Data Scientist at Catalit LLC, Instructor at Udemy
Pre-trained models, Transfer Learning and Advanced Keras Features
You have been using Keras for deep learning models and are ready to bring your skills to the next level. In this workshop we will explore the use of pre-trained networks for image classification, transfer learning to adapt a pre-trained network to your use case, multi-GPU training, data augmentation, Keras callbacks and support for different kernels.
Francesco Mosconi, Ph.D. in Physics and Data Scientist at Catalit LLC, Instructor at Udemy. Formerly co-founder and Chief Data Officer at Spire, a YC-backed company that invented the first consumer wearable device capable of continuously tracking respiration and physical activity. Machine Learning and Python expert. Also served as Data Science lead instructor at General Assembly and The Data Incubator.
Senior Software Engineer at Comet.ML
Easy Visualizations for Deep Learning
Visualizations are important in order to debug and understand how a Deep Learning model is representing a problem. In this talk, I will introduce a layer of software (ConX) that was developed on top of Keras in Jupyter Notebooks for making useful (and beautiful) visualizations of activations of a neural network. We will develop a model from scratch, train it, test it, and explore various tools for visualizing learning over time in representational space.