25 August 2023
Web Scraping and Data Mining with R
The widespread accessibility of the internet and the ongoing digitisation of information have transformed how people communicate and share knowledge. These changes have opened new possibilities for social scientists, who can now use large sources of publicly available online data to gain new insights about individuals and their social environment. Creative uses of unsolicited online data can provide a unique window into people’s behaviours, attitudes, and beliefs in the context of significant social, political, and economic realities. Such “big data” approaches have led to significant advances in our understanding of the dynamics of political ideology, risk attitudes, health, wellbeing, misinformation, consumer behaviour, and sustainability, among many others.
The aim of this course is to introduce the core concepts and methodologies for web scraping and data mining in R. During the sessions, participants will learn about the sources and types of online data that can be accessed by social scientists. They will also develop new skills in scraping, harvesting, and extracting such data. In a series of practical exercises and activities, participants will learn about managing their own big data science projects and solving challenges associated with ethical considerations, data wrangling and formatting, web crawling, and data exploration through visualisation. During the course, participants will apply their new skills as they embark on their first data mining project under the supervision of the course lead. On completion of the course, participants will be equipped with the necessary skills to identify, extract, process, and visualise large volumes of online data.
Course Structure
The entire course is split into theoretical and practical parts. Each day will begin with a class covering one of the core web scraping and/or data mining topics. Early afternoons (after lunch) will be devoted to practical exercises and activities, which will require participants to apply a range of data mining techniques to collect, process, and visualise online data. During the remainder of the day (late afternoon), participants will be given the opportunity to work on their own data science project. Participants are welcome to work on their own chosen topics or select a topic from a series of pre-determined mini-projects. Materials (lecture slides, sample datasets, handouts, exercises with solutions, annotated R scripts) will be made openly available via a GitHub repository to all participants.
Detailed Overview (specific activities are exemplary and subject to change)
Day 1
Morning: Basic concepts.
• Fundamental aspects of web scraping and data mining.
• R tools and skills for acquiring and processing large volumes of online data.
Afternoon: First dataset.
• Accessing and downloading large and unstructured online datasets.
Day 2
Morning: Scraping fundamentals.
• Sources of data and methods for accessing them using web crawlers.
• Practical and theoretical challenges associated with the collection and management of unstructured online datasets.
Afternoon: APIs.
• How to access data via official REST APIs.
• Pre-processing data stored in unique data formats.
Day 3
Morning: Advanced scraping.
• Understanding the HTML structure of public websites.
• Extracting quantitative and textual data from unstructured web source code.
Afternoon: HTML crawlers.
• Building your own HTML scraper.
• Deploying a web crawler and extracting data from HTML and CSS code.
Day 4
Morning: Big data pitfalls.
• Pitfalls, challenges, and misuses of large volumes of data in social science.
• Good practices for performing and managing data mining projects.
Afternoon: Visualisation.
• Identifying sources of bias in a large and unstructured dataset of online communication.
• Visualisation of online data.
Day 5
Morning:
• Project work and presentation preparation.
Afternoon:
• Presentations of individual projects.
Course leader
Dr Lukasz Walasek is an associate professor at the Department of Psychology, University of Warwick, UK.
Target group
Graduate students, doctoral researchers, early career researchers, and experienced researchers.
Prerequisites
Participants are expected to have basic computer and statistical analysis skills. Good knowledge of R is necessary to participate in practical exercises and activities. The course will introduce participants to the basics of HTML, CSS, and JavaScript, but no prior knowledge of these topics is necessary.
Credits info
The Summer School cannot grant credits. We only deliver a Certificate of Participation, i.e. we certify your attendance.
If you are considering using Summer School workshops to obtain credits (ECTS), you will need to check with your home institution (contact the person or institute responsible for your degree) whether it recognises the Summer School and how many credits can be earned from a workshop/course with roughly 35 hours of teaching, no graded work, and no exams.
Make sure to investigate this matter before registering if this is important to you.
Fee info
CHF 700: Reduced fee: 700 Swiss Francs per weekly workshop for students (requires proof of student status).
CHF 1100: Normal fee: 1100 Swiss Francs per weekly workshop for all others.