Big Data Management and Analytics

when 12 September 2022 - 16 September 2022

language English

duration 1 week

credits 2 EC

fee EUR 500

This course introduces systems and techniques for storing, querying, and processing datasets that are too large, too complex, or simply too inconvenient to handle on a single machine or within a single programming language. Participants learn the foundations needed to use available “Big Data systems” on their own, whether in a local installation or via cloud computing. The course is organized in a workshop format: morning sessions introduce and discuss key concepts and techniques, and are followed by practical sessions in which participants gain hands-on experience with selected systems and applications. Python is the main programming language of the course, as it is widely used for data science with large, complex datasets.

We start with an introduction (or refresher, depending on the participant's background) to processing structured data (e.g., data frames), first directly within Python, then using a relational database system and the SQL query language for data access. Building on these foundations, the course introduces the large-scale computation engine Apache Spark for pre-processing and analysing data in a scalable fashion. We subsequently introduce and discuss non-relational data representation formats that are suitable for more complex data, most notably JSON (JavaScript Object Notation, for semi-structured data and documents) and, if time permits, RDF (Resource Description Framework, for graph data and knowledge graphs). The course concludes with an introduction to selected NoSQL databases that are useful for managing such data.
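To give a flavour of the difference between relational and semi-structured data, the following is a minimal sketch and not part of the official course materials. It assumes a hypothetical toy dataset of survey respondents, and it uses Python's built-in sqlite3 module purely as a stand-in for a server-based relational system such as MySQL; the table name, field names, and values are all invented for illustration.

```python
import json
import sqlite3

# Hypothetical toy data (assumption: not taken from the course materials).
rows = [
    ("r1", "DE", 34),
    ("r2", "FR", 51),
    ("r3", "DE", 46),
]

# Relational view: every row follows one fixed schema and is queried with SQL.
# sqlite3 is only a lightweight stand-in for a system such as MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE respondents (id TEXT PRIMARY KEY, country TEXT, age INTEGER)"
)
conn.executemany("INSERT INTO respondents VALUES (?, ?, ?)", rows)
avg_age_by_country = dict(
    conn.execute(
        "SELECT country, AVG(age) FROM respondents "
        "GROUP BY country ORDER BY country"
    ).fetchall()
)
conn.close()

# Semi-structured view: JSON documents may nest values and omit fields.
documents = json.loads("""
[
  {"id": "r1", "country": "DE", "interests": ["politics", "sports"]},
  {"id": "r2", "country": "FR"},
  {"id": "r3", "country": "DE", "interests": ["music"],
   "contact": {"email": "r3@example.org"}}
]
""")

# Missing fields have to be handled explicitly, here with a default value.
interest_counts = {doc["id"]: len(doc.get("interests", [])) for doc in documents}

print(avg_age_by_country)
print(interest_counts)
```

In the relational part, the schema is declared up front and the aggregation is expressed declaratively in SQL. In the JSON part, each document carries its own structure, so the code has to cope with optional and nested fields; this is the kind of data that document-oriented NoSQL systems are designed to manage.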

Course leader

Prof. Dr. Rainer Gemulla, Adrian Kochsiek

Target group

Participants will find the course useful if:
▪ They want to work with large and/or complex datasets.
▪ They want to leverage available data management and processing solutions (either locally or in the cloud) for improved efficiency and ease of use.

Course aim

By the end of the course participants will:
▪ Understand different data representations (including relational data, semi-structured data, and graph data) and their advantages/disadvantages.
▪ Be able to process structured data in Python (using Pandas).
▪ Know how to insert, update, and query structured data in a relational database system (MySQL) with the SQL query language.
▪ Be familiar with the Apache Spark framework for performing computations on large datasets.
▪ Be able to perform basic parallel data processing with Apache Spark.
▪ Know basic types of NoSQL systems as well as their properties.
▪ Be able to store, query, and process semi-structured data in a NoSQL database (e.g., Apache HBase or MongoDB).

Fee info

EUR 500: Students
EUR 750: Academics

Scholarships

None

Organizing institution

GESIS Fall Seminar in Computational Social Sciences
