Web Crawling for Research Data

Abstract: 

With the rapid development of Internet technologies and data centers, a vast amount of data has become available on online platforms such as electronic databases, blogs, review sites, and web forums. Such data have great value for both academic and industry research, yet many scientists do not know how to collect them when a website does not provide an API. Traditionally used in search engines, web crawling is a technique for automatically collecting data from the web. Recently, web crawling has also gained popularity in academic research, and it is supported by most programming languages, including Python. Meanwhile, many data providers are starting to realize the value of their data and restrict automated access to it through robot checks, login requirements, session-based dynamic data loading, and similar measures. As a result, crawling data consistently and robustly demands more technical skill from scientists than before.

This talk guides academics and practitioners in developing small-scale web crawling systems for research purposes. Developing such systems covers a range of subtopics, including programming, regular expressions, web protocols, database design and interaction, parallelizing data collection, coordination between crawlers, exception handling, avoiding robot checks from websites, incremental data collection, and legal issues. Python and its popular libraries for web crawling, including requests, BeautifulSoup, and Selenium, will be introduced and demonstrated. This talk will benefit individuals who need to collect their own copy of a dataset for research purposes. The audience is expected to have some programming background.
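
As a rough illustration of three of the subtopics listed above (respecting a site's robots.txt rules, extracting links from a page, and incremental collection), the following is a minimal sketch and not material from the talk. It is an assumption of this sketch to use only the Python standard library (html.parser, urllib) in place of requests and BeautifulSoup, so that it runs without third-party packages and without network access. The names LinkCollector, extract_links, is_allowed, and new_urls are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute link targets found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def new_urls(candidates, seen):
    """Incremental collection: keep only URLs not recorded in `seen`."""
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

In a real crawler, the HTML and robots.txt text would come from HTTP responses, and the set of seen URLs would typically be stored in a database so that a later run can resume where the previous one stopped.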

Bio: 

Shan Jiang is an assistant professor at the College of Management at the University of Massachusetts Boston. He received a BS in management information systems from Tsinghua University, China, and a PhD in management information systems from the University of Arizona. His research interests include business intelligence, social media analytics, computational linguistics, and social network analysis. His research has appeared in journals including IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Management Information Systems, Journal of the Association for Information Science and Technology, and Decision Support Systems.