Automated Web Data Collection with R

When: 12 September 2022 - 16 September 2022

Language: English

Duration: 1 week

Credits: 2 EC

Fee: EUR 500

The increasing availability of large amounts of data on the internet enables new lines of research in the social sciences. Although it has become easier to find data online that are relevant to social science research, such as social media content, election results, or organizations' press statements, extracting these data and bringing them into formats ready for downstream analyses can be challenging. Web data collection is thus an essential skill for researchers.

The goal of this course is to enable participants to collect web data and process it in R for their research. Course participants will learn about the characteristics of web data and their use in social science research, how to harvest content from different types of webpages, and how to collect social media data from application programming interfaces (APIs), such as the Twitter API.

We will cover tools and techniques that enable participants to collect web data relevant to their research, focusing on two common scenarios: (i) automating the collection of data spread across multiple pages of both static and dynamic websites (with RSelenium), and (ii) interacting with APIs to collect, for example, social media data or datasets from institutions, companies, and organizations. In addition, we will cover advanced topics such as using web sessions, interacting with HTML forms (e.g., login), managing user agents, error handling, and headless browsing.
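To give a flavour of scenario (i) combined with error handling, here is a minimal sketch in base R. It is not taken from the course materials: the function names (page_urls, fetch_with_retry, collect_pages), the example URL, and the "?page=" query parameter are all hypothetical. The download step is passed in as a function so that the retry logic can be tried out without a network connection.

```r
# Build the URLs of a paginated listing.
# Assumption: the (hypothetical) site paginates via a "?page=" query parameter.
page_urls <- function(base_url, pages) {
  paste0(base_url, "?page=", pages)
}

# Call fetch(url); on error, log a message and retry up to max_tries times,
# pausing `pause` seconds between attempts.
# Returns NULL if every attempt fails (a fetch that itself returns NULL
# is treated as a failure too), so one bad page does not stop the run.
fetch_with_retry <- function(url, fetch, max_tries = 3, pause = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      fetch(url),
      error = function(e) {
        message("Attempt ", attempt, " failed for ", url, ": ",
                conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
    if (attempt < max_tries) Sys.sleep(pause)
  }
  NULL
}

# Fetch every URL in turn; the result is a list named by URL.
collect_pages <- function(urls, fetch, ...) {
  results <- lapply(urls, fetch_with_retry, fetch = fetch, ...)
  names(results) <- urls
  results
}

urls <- page_urls("https://example.org/press-releases", 1:3)
# With the rvest package installed, the fetcher could be rvest::read_html:
# pages <- collect_pages(urls, fetch = rvest::read_html, pause = 2)
```

Passing the fetcher as an argument is only one possible design. It keeps the pagination and retry logic independent of the package that performs the actual request, so the same loop could wrap an API call instead of an HTML download.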

The course is hands-on, with daily lectures followed by exercises where participants can practice their newly learned skills.

Course leader

Dr. Theresa Gessler, Dr. Hauke Licht

Target group

Participants will find the course useful if they want to:
- collect larger amounts of web data from APIs or webpages
- learn about best practices in automated web data collection
- improve their existing web scraping skills by deepening their understanding of common web technologies and learning more about the process of developing robust web scrapers

Course aim

By the end of the course, participants will:
- Know the most important characteristics of web data, including webpage content and social media data
- Gain an understanding of a variety of scraping scenarios: APIs, static pages, dynamic pages
- Be able to write reproducible and robust code for web scraping tasks
- Be able to parse, clean, and process data collected from the web

Fee info

EUR 500: Students
EUR 750: Academics

Scholarships

None

Organizing institution

GESIS Fall Seminar in Computational Social Sciences
