16 September 2022
Automated Web Data Collection with Python
The continuously growing importance of the internet for everyday life and the correspondingly increasing volume of digital behavioral data on the Web allow us to study human behavior from new perspectives. However, accessing or collecting such data is not always straightforward. Moreover, the heterogeneity of the collected data makes pre-processing necessary before the data can be used effectively in further analyses. This course therefore introduces participants to data collection from online platforms and to the pre-processing needed to make the data usable for their research. Beyond these essential technical foundations, we will also discuss basic methods for enriching raw textual data with additional features. Lastly, we will present a framework for critically reflecting on data collection processes and for documenting the collected data.
This course will teach participants how content, comment, and interaction data can be collected automatically from social media platforms (e.g., Twitter, YouTube, Reddit) and other online platforms (e.g., eBay, Amazon). We will cover the main aspects of collecting data with the programming language Python, including APIs and custom scrapers for static and dynamic webpages. We will also show how collected data can be cleaned, pre-processed, and curated to enable further statistical analyses.
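To give a sense of what a custom scraper for a static webpage involves, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not course material: the sample HTML, the `comment` class name, and the names `CommentParser` and `extract_comments` are all hypothetical, and the page is an inline string so that the example runs without network access.

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Collect the text of every <p class="comment"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._in_comment = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "p" and ("class", "comment") in attrs:
            self._in_comment = True
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside a comment paragraph.
        if self._in_comment:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_comment:
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            self.comments.append(text)
            self._in_comment = False


def extract_comments(html):
    """Return the cleaned text of all comment paragraphs in an HTML string."""
    parser = CommentParser()
    parser.feed(html)
    parser.close()
    return parser.comments


# Assumed sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <h1>Product page</h1>
  <p class="comment">  Great   product,
     would buy again! </p>
  <p class="note">Not a comment.</p>
  <p class="comment">Arrived <b>late</b>.</p>
</body></html>
"""

comments = extract_comments(SAMPLE_HTML)
print(comments)
```

In practice the HTML would come from an HTTP request, and dynamic pages that build their content with JavaScript typically need a browser-automation tool instead of a plain parser.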
The course will include lectures on each topic that introduce the basic theoretical concepts needed to understand the practical implementations, which are then practiced in exercises. The exercises will be conducted in small groups with support from the instructors, who will help with questions and problems. In mini-projects, participants have the chance to discuss how they can apply and integrate the newly learned methods within their research.
Felix Soldner, Dr. Jun Sun, Leon Froehling
Participants will find the course useful if:
- they are interested in working with web data
- they want to learn how to collect web data through APIs or webpages
- they want to learn how to pre-process and augment the collected data for further analyses (basic NLP)
- they want to learn about frameworks for the critical reflection on web-data collection processes
By the end of the course, participants will:
- be able to collect online data with APIs and custom scrapers for static and dynamic websites
- be able to handle, (pre-)process and augment data for further statistical analyses
- be able to integrate the learned methods into their research
- be able to critically inspect and reflect on automatically collected data
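As a rough illustration of what pre-processing and augmenting collected text can look like, the sketch below cleans one raw social media post and derives a few simple features from it. The regular expressions, the feature names, and the function `preprocess` are assumptions made for this example, not the methods taught in the course.

```python
import re

# Simple patterns chosen for illustration; real data usually needs more care.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def preprocess(raw_text):
    """Clean one raw post and return it with a few derived features."""
    # Record URLs, then remove them so they do not pollute the text.
    urls = URL_PATTERN.findall(raw_text)
    text = URL_PATTERN.sub(" ", raw_text)

    # Keep hashtags as a separate feature, lowercased for consistency.
    hashtags = [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

    # Drop user mentions and the '#' symbol, keeping the hashtag word itself.
    text = MENTION_PATTERN.sub(" ", text)
    text = text.replace("#", "")

    # Lowercase and collapse whitespace.
    clean = " ".join(text.lower().split())

    return {
        "clean_text": clean,
        "n_tokens": len(clean.split()),
        "n_urls": len(urls),
        "hashtags": hashtags,
    }


record = preprocess("Loving the new phone!! #Tech @shop https://example.com/item?id=1")
print(record)
```

Storing the derived features next to the cleaned text keeps each record ready for later statistical analysis, for example as one row of a table.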
EUR 500: Students
EUR 750: Academics