16 September 2022
Big Data Management and Analytics
This course introduces systems and techniques for storing, querying, and working with datasets that are too large, too complex, or simply too inconvenient to handle on a single machine or in a single programming language. Participants learn the foundations necessary to work with available “Big Data systems” on their own, whether in a local installation or via cloud computing. The course is organized in a workshop format: morning sessions introduce and discuss key concepts and techniques, followed by practical sessions in which participants gain hands-on experience with selected systems and applications. Python is the main programming language of the course, as it is widely used for data science on large, complex datasets.
Prof. Dr. Rainer Gemulla, Adrian Kochsiek
Participants will find the course useful if:
▪ They want to work with large and/or complex datasets.
▪ They want to leverage available data management and processing solutions (either locally or in the cloud) for improved efficiency and ease of use.
By the end of the course participants will:
▪ Understand different data representations (including relational data, semi-structured data, and graph data) and their advantages/disadvantages.
▪ Be able to process structured data in Python (using Pandas).
▪ Know how to insert, update, and query structured data in a relational database system (MySQL) using SQL.
▪ Be familiar with the Apache Spark framework for performing computations on large datasets.
▪ Be able to perform basic parallel data processing with Apache Spark.
▪ Know basic types of NoSQL systems as well as their properties.
▪ Be able to store, query, and process semi-structured data in a NoSQL database (e.g., Apache HBase or MongoDB).
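To give a flavour of the relational part of these outcomes, the following is a minimal sketch of inserting, updating, and querying structured data with SQL from Python. It is an illustration only, not course material: it assumes Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server, and the table participants, its columns, and the rows are hypothetical.

```python
import sqlite3


def run_demo():
    """Insert, update, and query rows in an in-memory SQLite database.

    Assumption: SQLite stands in for MySQL here; the table name, columns,
    and rows are made up for illustration.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Create a small relational table.
    cur.execute(
        "CREATE TABLE participants (name TEXT, category TEXT, fee_eur INTEGER)"
    )

    # Insert: parameterized statements keep values separate from the SQL text.
    cur.executemany(
        "INSERT INTO participants VALUES (?, ?, ?)",
        [
            ("Ada", "student", 500),
            ("Bo", "student", 500),
            ("Cy", "academic", 750),
        ],
    )

    # Update: move one participant to the other category and fee.
    cur.execute(
        "UPDATE participants SET category = ?, fee_eur = ? WHERE name = ?",
        ("academic", 750, "Bo"),
    )

    # Query: count participants and total fees per category.
    rows = cur.execute(
        "SELECT category, COUNT(*), SUM(fee_eur) "
        "FROM participants GROUP BY category ORDER BY category"
    ).fetchall()

    conn.close()
    return rows


if __name__ == "__main__":
    for category, count, total in run_demo():
        print(category, count, total)
```

The same three statements (INSERT, UPDATE, SELECT with GROUP BY) carry over to MySQL with a different connection library; only the connection setup and the parameter placeholder style would be expected to change.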
EUR 500: Students
EUR 750: Academics