General description

The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.

Big Data Processing provides an introduction to systems used to process Big Data. The main focus of the course is understanding the underpinnings of big data systems, as well as programming and engineering them. Initially, the course explores general programming primitives that span big data systems and touches upon distributed systems. The course then examines in detail the implementation of data analysis algorithms in Spark, in the context of batch processing applications, and in Flink, in the context of streaming applications.
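As a rough illustration of what such general programming primitives can look like (this sketch is not taken from the course material), the following plain-Python word count chains flatMap-, filter-, map- and reduce-style steps over a tiny in-memory dataset. The sample data and the names `lines`, `add_pair` and `word_counts` are hypothetical; systems such as Spark offer operations of a similar shape, but run them across many machines.

```python
from functools import reduce

# Hypothetical toy dataset; a real big data job would read from distributed storage.
lines = [
    "big data is big",
    "fast data is fast",
]

# flatMap-style step: split every line into words and flatten into one list.
words = [word for line in lines for word in line.split()]

# filter: keep only words longer than two characters.
long_words = list(filter(lambda w: len(w) > 2, words))

# map: turn every word into a (key, value) pair.
pairs = list(map(lambda w: (w, 1), long_words))


def add_pair(counts, pair):
    """Fold one (word, n) pair into a new dictionary of running counts."""
    word, n = pair
    return {**counts, word: counts.get(word, 0) + n}


# reduce: combine all pairs into per-word totals.
word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

Each step takes a collection and returns a new one without modifying its input, which is the property that lets such pipelines be split across machines.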

Learning objectives

After the end of the course, all students should be able to:

Course organization

Contents and tentative schedule

Week | Date | Topic | Teacher | Assignment (Deadline)
1 | 6/9 | Course introduction, Big and Fast data | |
1 | 9/9 | Intro to course PLs, Programming for Big Data (1): Basic data types | |
2 | 13/9 | Programming for Big Data (2): FP in a nutshell | BKO | FP (24/09)
2 | 16/9 | Programming for Big Data (3): Enumerating datasets | BKO |
3 | 20/9 | Data Engineering on the Command Line (1), Diomidis’s slides | |
3 | 23/9 | Data Engineering on the Command Line (2), Diomidis’s slides | DS | Unix (08/10)
4 | 27/9 | Distributed Systems | BKO |
4 | 30/9 | Distributed Databases | BKO |
5 | 4/10 | Distributed Filesystems | BKO |
5 | 7/10 | Spark: RDDs and Pair RDDs | BKO | Spark (22/10)
6 | 11/10 | Spark Internals | BKO |
6 | 14/10 | Spark SQL, Spark use cases: Synonyms with Word2Vec (html, notebook, dataset from Kaggle); Recommending bands; Predicting pull request merges | |
7 | 18/10 | Stream processing | GG | Flink (05/11)
7 | 21/10 | Stream processing systems | GG |
8 | 25/10 | Graph Processing | BKO |
8 | 28/10 | No lecture | |
9 | 1/11 | Recap | |

The schedule is subject to small changes during the term.

The lecture notes are based on the lectures taught by Georgios Gousios in 2017-2020. You can access the websites of the previous editions for 2021 and 2020.


The lab work is supervised by Thomas Overklift.

Online resources

Portions of this course have been converted to online educational material by other TU Delft teachers. Please take a look at the following EdX MOOCs / ProfEds:

Use them at your discretion to improve your skills.

(TU Delft only): You can find the Collegerama recordings from 2019 here. Please note that the course contents have slightly changed, so do not base your exam studying on the old lectures.


The course assignments are available on Brightspace and are linked from this page.

All assignments are mandatory.

Submissions are made through Weblab. The course name is CSE2520: Big Data Processing.

The student groups must submit each assignment before 23:59 on the day of the deadline.

Late submission: All submissions must be handed in on time, with no exceptions. Any late submission will be discarded and graded with a 0. In case of provable illness, please contact the course teacher to arrange a case-specific deadline.


Resit policy

There will be an exam-only resit during Q2/3. You may transfer your assignment grade to the resit only as a whole; individual assignments cannot be resubmitted. In effect, only the written exam can be retaken.

Course resources

The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found. If you were to buy a single book for this course, we recommend [1].


[1] M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.
[2] J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available:
[3] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.
[4] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.
[5] H. Karau and R. Warren, High performance Spark. O’Reilly Media, Inc., 2017.
[6] B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.
[7] T. Akidau, S. Chernyak, and R. Lax, Streaming systems: The what, where, when, and how of large-scale data processing. O’Reilly Media, Inc., 2018.
[8] C. Martella, R. Shaposhnik, D. Logothetis, and S. Harenberg, Practical graph analytics with Apache Giraph. Springer, 2015.
[9] I. Robinson, J. Webber, and E. Eifrem, Graph databases: New opportunities for connected data. Springer, 2015.