Distributed Data Processing Systems, 2022-2023 - Prospectus

Admission requirements

This is a practice-intensive course that involves a lot of low-level systems programming. The workload is demanding, but mastering the material is highly rewarding. Students are assumed to have taken BSc-level courses in (advanced) programming, computer architecture/organization, computer networks, and operating systems.

Description

Distributed (data processing) systems are already pervasive: most members of our society interact with them daily. Social networks, government services, and media streaming services, all of which interface heavily with members of our society, are powered by distributed (data processing) systems. Such systems are composed of many physically distributed computers, all connected through a network that acts as the means of communication. The data processing component of large-scale distributed systems covers all services that handle data. Such systems include data storage (e.g., HDFS, Ceph), data transmission and queueing (e.g., gRPC, ZeroMQ), key-value stores (e.g., Redis, Cassandra), analytics (e.g., Spark, Hive), batch processing (e.g., Hadoop), stream processing (e.g., Flink), and machine learning (e.g., TensorFlow, PyTorch). Most of our online interactions generate data that passes through one or more of these systems.

As computer scientists, systems engineers, or DevOps engineers, most of the large-scale systems we work with, either directly or indirectly, are actually distributed data processing systems. Both academia and industry invest significant effort into: (i) understanding the performance of these systems; (ii) understanding the interaction between these systems and their underlying computing infrastructure; (iii) defining principled design processes for building such systems; and (iv) building more efficient systems that scale seamlessly with the number of users, machines, and workloads.

This course takes a highly practical, systems-first approach to understanding distributed data processing systems. We will discuss general topics in the design of data processing systems, such as the components they are built from (e.g., storage, resource management, scheduling, communication). We will also cover general topics in performance evaluation, such as benchmarking, workloads, metrics, statistical analysis, and the design of repeatable experiments. Finally, we will discuss more general distributed systems topics, such as their (non-)functional requirements: communication/RPC, fault-tolerance, consensus.

Course objectives

After following this course, students will be able to:
1. Explain the main concepts of distributed data processing systems, e.g., communication, resource management and scheduling, data consistency, fault-tolerance, performance.
2. Explain and identify trade-offs for designing certain components of distributed data processing systems, e.g., erasure-coding vs. replication for fault-tolerance.
3. Apply state-of-the-art experiment design techniques for reproducible performance evaluation.
4. Design, build, and evaluate distributed data processing systems.
5. Deploy and run state-of-the-art distributed data processing systems (e.g., Spark, Hadoop, Flink) on real-world distributed infrastructure (e.g., DAS-5).

Timetable

The most recent timetable can be found at the Computer Science (MSc) student website.

You will find the timetables for all courses and degree programmes of Leiden University in the tool MyTimetable (login). Any teaching activities that you have successfully registered for in MyStudyMap will automatically be displayed in MyTimetable. Any timetables that you add manually will be saved and automatically displayed the next time you sign in.

MyTimetable allows you to integrate your timetable with your calendar apps such as Outlook, Google Calendar, Apple Calendar and other calendar apps on your smartphone. Any timetable changes will be automatically synced with your calendar. If you wish, you can also receive an email notification of the change. You can turn notifications on in ‘Settings’ (after login).

For more information, watch the video or go to the 'help-page' in MyTimetable. Please note: Joint Degree students Leiden/Delft have to merge their two different timetables into one. This video explains how to do this.

Mode of instruction

The course is composed of three components:
1. Lectures given by the instructor, to set up the course format and assignments, and to teach key topics in distributed data processing systems.
2. Self-study Lab Assignments. Students are expected to work on the lab assignments autonomously, outside of the lectures. This is a hands-on course, where the lab constitutes a large part of the final grade. There are two large-scale lab components: (1) a reproducibility study, where students implement several experiments from existing papers to get acquainted with modern technology and the state of the art in experiment design; (2) a build-your-own-distributed-system project, where students learn how to build a prototype distributed system.
3. Student Presentations and in-class discussion. Each student should prepare a 10-min presentation on a scientific article. This is followed by a 10-min Q&A discussion session, in which all other students participate.

Course load

Total hours of study: 168 hrs
Practical work: 100 hrs
Lectures & Tutoring: 20 hrs
Self-tuition: 48 hrs

Assessment Method

The assessment method is based on the following student assignments:

Reproducibility study: 35% (group-based, mandatory)
Build a system: 35% (group-based, mandatory)
Presentation: 20% (group-based, mandatory)
Position paper: 10% (group-based, optional)

Each student must obtain a sufficient score (>= 5.5) in each of the four sub-parts of the grade. Partial scores are not carried over between academic years.
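
As an illustration of how the weights and the pass mark combine, here is a minimal sketch. It assumes that the final grade is the weighted sum of the four sub-part scores and that a score is supplied for every sub-part; the prospectus does not spell out the exact formula, nor how a skipped optional position paper is treated, so the function name `final_grade` and its behaviour are hypothetical and not the official grading procedure.

```python
# Weights copied from the assessment list (fractions of the final grade).
WEIGHTS = {
    "reproducibility_study": 0.35,
    "build_a_system": 0.35,
    "presentation": 0.20,
    "position_paper": 0.10,
}
PASS_MARK = 5.5  # minimum sufficient score per sub-part


def final_grade(scores):
    """Return (weighted_grade, passed) for a dict of sub-part scores.

    Assumption: the final grade is the weighted sum of all four scores.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    grade = sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)
    # Every sub-part must individually reach the pass mark.
    passed = all(scores[part] >= PASS_MARK for part in WEIGHTS)
    return grade, passed


example = {
    "reproducibility_study": 8.0,
    "build_a_system": 7.0,
    "presentation": 6.0,
    "position_paper": 9.0,
}
grade, passed = final_grade(example)
print(f"weighted grade: {grade:.2f}, all sub-parts sufficient: {passed}")
```

Under these assumptions, a high weighted grade does not compensate for an insufficient sub-part: a single score below 5.5 makes `passed` false regardless of the total.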

The teacher will inform the students how the inspection and follow-up discussion of the graded work will take place.

Reading list

There is no textbook for this course. At the beginning of the course, students will receive a literature list, which will be available via Brightspace.

Registration

From the academic year 2022-2023 onward, every student has to register for courses with the new enrollment tool MyStudyMap. There are two registration periods per year: registration for the fall semester opens in July and registration for the spring semester opens in December. Please see this page for more information.

Please note that it is compulsory to both preregister and confirm your participation for every exam and retake. Not being registered for a course means that you are not allowed to participate in the final exam of the course. Confirming your exam participation is possible until ten days before the exam.

Extensive FAQs on MyStudyMap can be found here.

Contact

Please contact the course coordinator: Alexandru Uta, a.uta@liacs.leidenuniv.nl
