Learning Python by Cleaning Data in Pandas


Data science and machine learning have become closely associated with languages like Python. Libraries like NumPy and Pandas have become the de facto standard for working with data. The DataFrame object provided by Pandas gives us the ability to work with the heterogeneous, unstructured data commonly found in "real world" datasets.

New learners are often drawn to Python and Pandas by the exciting range of models and insights the language makes possible, but are daunted by the initial learning curve.

By the end of this tutorial, you should have a solid foundation in working with datasets in Python. The last topic, encoding dummy variables, segues into using other libraries, such as scikit-learn and statsmodels, to fit models on your data.
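
As a rough illustration of what encoding dummy variables can look like, here is a minimal sketch using `pandas.get_dummies`. The toy data and column names (`species`, `weight_kg`) are hypothetical and not taken from the tutorial materials; the tutorial itself may approach this differently.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "weight_kg": [4.2, 12.5, 3.9, 0.1],
})

# Create one 0/1 indicator column per category of "species".
# drop_first=True omits the first category ("bird") so the remaining
# indicator columns are not perfectly collinear in a regression model.
encoded = pd.get_dummies(df, columns=["species"], drop_first=True, dtype=int)

print(encoded)
```

The resulting all-numeric table is the kind of input that modeling libraries generally expect.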


Daniel is a PhD candidate in Genetics, Bioinformatics, and Computational Biology at Virginia Tech, where he currently works in the Social and Decision Analytics Laboratory in the Biocomplexity Institute. His current interests are in understanding how attitudes change and spread within social networks, as well as performing analytics for precision medicine.

Daniel received his MPH in the Department of Epidemiology at Columbia University and holds a BA in Psychology - Behavioral Neuroscience with minors in Biology and Computer Science from the Macaulay Honors College at CUNY Hunter College.

Daniel enjoys teaching and volunteers some of his time to Software Carpentry by teaching, serving on the Mentoring subcommittee, and chairing the Assessment subcommittee.