Elementary knowledge of data structures (sparse matrices, hash tables, dictionaries, graphs), statistics (binomial distribution) and combinatorics (permutations, combinations). Basic programming skills in Python (with NumPy and Matplotlib packages).
Traditional data mining techniques focus mainly on solving classification, regression and clustering problems. However, recent developments in ICT have led to the emergence of new kinds of massive data sets and related data mining problems. Consequently, the field of data mining has rapidly expanded to cover new areas of research, such as:
processing huge (terabyte- or petabyte-scale) data sets,
fast searching for similar objects (documents, images, songs, routes, etc.) in collections of millions or billions of such objects,
clustering of massive data sets,
real-time analysis of data streams (internet traffic, sensor data, electronic transactions),
recommending items to visitors of internet shops,
analysing big (network) graphs, such as web sites, social networks, collaboration networks, etc.
During the course we will focus on these areas. We will start by introducing a powerful framework for processing massive data sets on distributed computers: Hadoop and MapReduce. Then a new, very general similarity search technique, Locality Sensitive Hashing, will be discussed, together with its applications to plagiarism detection, searching databases of fingerprints, finding clients with similar buying behaviour, etc. Next, several algorithms for real-time mining of data streams will be introduced: Bloom filters, random sampling, counting, estimating moments. Finally, some state-of-the-art recommendation systems and algorithms for dimensionality reduction and data visualization will be introduced. The practical part of the course will consist of several programming assignments (in Python) and writing reports.
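To give a flavour of the stream-mining techniques mentioned above, here is a minimal sketch of a Bloom filter in Python. It is an illustration only, not course material: the class name, the default sizes and the way hash functions are derived from SHA-256 are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: num_hashes hash functions over a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary example values, not tuned for any workload.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several positions by salting a SHA-256 digest with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for url in ["a.example/1", "a.example/2"]:
    seen.add(url)

print("a.example/1" in seen)  # an added item is always reported as present
print("b.example/9" in seen)  # usually False, but a false positive is possible
```

The filter uses a fixed amount of memory regardless of how many items pass through the stream. The price is a small probability of false positives, which grows as more items are added.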
After completing the course, the students should:
have a general knowledge of the recent developments in the field of Data Mining
have detailed knowledge of selected techniques and their applications
gain some hands-on experience with several algorithms for mining complex data sets
be able to apply the acquired knowledge and skills to new problems
gain some experience with mining big data sets on a cluster computer
The most recent timetable can be found at the students' website.
Mode of instruction
The final mark is composed of
(1) written exam (40%)
(2) practical assignments (60%)
A. Rajaraman, J. Leskovec, J. Ullman, Mining of Massive Datasets.
You have to sign up for courses and exams (including retakes) in uSis. Check this link for information about how to register for courses.
Please also register for the course in Blackboard.
Lecturer: Dr. Wojtek Kowalczyk.