10 July 2020
on course website
An Introduction to Data Visualisation and Cluster Analysis (Using R)
The problem with too much data
We’re currently living in the most data-rich period of human history (to date). That seems like it should be a good thing, but data does not necessarily equal information, knowledge or insight. Quite apart from the ethics of how such data is collected and used, the sheer amount of data available can be more overwhelming than useful. Extracting useful information from the available data is the main role of the data scientist.
This course introduces a number of learning techniques that can assist in this process. The issue with many of the more advanced, complex techniques (including some in this course) is that, while they can uncover hidden, intricate relationships and evidence, many of them struggle when the data involved are big. Data can be “big” in three different ways:
• many observations,
• many variables,
• or both.
The more complicated models often do not scale well as the number of observations and/or the number of variables increases dramatically. Some kind of dimension reduction method is therefore needed, either prior to or in conjunction with the main model, in order to allow it to run.
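As a rough illustration of dimension reduction before modelling, here is a minimal sketch in R. It assumes principal component analysis (via the base R function prcomp) as the reduction method, which is only one possible choice and is not necessarily the one used in the course. The simulated data, the 95% variance threshold, and all variable names are hypothetical.

```r
# Hypothetical "wide" data: 200 observations on 50 variables that are
# really driven by only two underlying factors plus a little noise.
set.seed(1)
n <- 200
p <- 50
latent   <- matrix(rnorm(n * 2), n, 2)   # two hidden factors
loadings <- matrix(rnorm(2 * p), 2, p)   # how factors map to variables
x <- latent %*% loadings + matrix(rnorm(n * p, sd = 0.1), n, p)

# Principal component analysis on centred, standardised variables.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the smallest number of components that explain at least 95%
# of the variance (the threshold is an arbitrary choice for this sketch).
k <- which(cumsum(var_explained) >= 0.95)[1]
x_reduced <- pca$x[, seq_len(k), drop = FALSE]

# The reduced data set has the same observations but far fewer columns.
dim(x_reduced)
```

The reduced matrix x_reduced could then be passed to a more expensive model in place of the original 50 variables.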
Why should we care about clustering? Some examples:
Assume that you are responsible for the data analytics department of a small start-up company. This company has an online-only store where customers can buy different products. One of your jobs is to personalise each user’s shopping experience and recommend products that they are likely to buy. You do not know each user’s personal preferences, but you do have lots of information about each user (i.e. records of all their previous purchases). How can you use all this information to give each user a unique set of recommendations so that they buy more products?
You are also working part-time on a PhD in Social Sciences with the really catchy title “What Cambridge Analytica missed?”. You have access to the data set that was given to that company, and you are trying to group people together according to their characteristics and lifestyles (from their Facebook accounts).
Finally, the online store has gone bankrupt (since they did not listen to your sage advice) and you have just been hired by your city council, where your job is to identify groups of houses according to their type, their value, and their geographical location.
You can (and will) use clustering techniques to help answer all the previous questions.
Aims of cluster analysis
Clustering techniques are used extensively in many different fields, including image analysis, statistics, medical science, machine learning, pattern recognition and many others. One can perform a cluster analysis for any of the following reasons:
• for exploratory analysis,
• to detect a hidden pattern in the data,
• to get an estimate of the groups that underlie the population that we are interested in,
• as a stand-alone tool to get insight into the data distribution,
• as a preprocessing step for other algorithms/techniques,
• as a dimensionality reduction tool.
The main idea of cluster analysis is to group observations together such that observations within a group are similar to each other and observations in different groups are not. In other words, clustering tries to find hidden structure in data that are not already labelled: we organise the data into clusters to bring to the surface any internal structure the data might have. Like classification, cluster analysis is interested in group structure in multivariate data. Unlike classification (and this is pretty important), in cluster analysis we have no a priori group information. All the data are unlabelled with respect to group membership, and we want to find out whether there actually are any groups in the data and, if so, how many there are and what they look like.
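To make the idea of grouping unlabelled observations concrete, here is a minimal sketch in R. It assumes k-means (the base R function kmeans) as the clustering method and the built-in iris data as the example; both are illustrative choices and not necessarily what the course uses. The species column is withheld from the algorithm and only used afterwards for comparison.

```r
# Use the four numeric measurements only: the Species label is dropped,
# so the algorithm sees the data as unlabelled.
data(iris)
x <- scale(iris[, 1:4])   # standardise so every variable counts equally

# We do not know the number of groups in advance, so look at the total
# within-cluster sum of squares for several candidate values of k.
set.seed(42)
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))

# Fit one candidate solution (three clusters is an assumption made
# for this sketch, not something the data tell us directly).
km <- kmeans(x, centers = 3, nstart = 25)
print(table(km$cluster))   # how many observations fall in each cluster

# For comparison only: cross-tabulate the clusters found against the
# species labels that were withheld from the algorithm.
print(table(cluster = km$cluster, species = iris$Species))
```

A sharp drop followed by a flattening in wss is one informal guide to a plausible number of clusters. The cross-tabulation at the end is only possible here because this example data set happens to come with labels.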
Charalampos (Charis) Chanialidis, Lecturer, School of Mathematics and Statistics
University of Glasgow, UK
• Advanced Bachelor
This course is a perfect preparation for conducting cluster analysis research in your Bachelor's or Master's thesis or PhD research, or a refresher of what you might have learned before. It also serves as an enhancement to your pre-master programme.
After this course you will be able to:
1. Read existing datasets into R, visualise them, and perform simple statistical analysis on them.
2. Have an overview of the different clustering techniques, which will enable you to make the right methodological choices.
3. Apply these techniques and interpret their output (in the programming language R).
EUR 600: The fee includes the registration fees, course materials, access to library and IT facilities, coffee/tea, lunch, and a number of social activities.
We offer several reduced fees:
€ 540 early bird discount (10%), deadline 1 March 2020
€ 510 partner + RU discount (15%)
€ 450 early bird + partner + RU discount (25%)