CSE2520: Big Data Processing, 2024

General description

The term “Big Data” describes datasets that are too big, change too fast, or both, to be processed on a single computer.

Big Data Processing provides an introduction to systems used to process Big Data. The main focus of the course is understanding the underpinnings of big data systems, and programming and engineering them. Initially, the course explores general programming primitives that span big data systems and touches upon distributed systems. Then, the course examines in detail the implementation of data analysis algorithms in Spark, in the context of batch processing applications, and in Flink, in the context of streaming applications.

Learning objectives

After the end of the course, all students should be able to:

Apply basic data processing operations (filtering, folding, projecting, etc.)
Explain basic techniques (vector clocks, consensus) in distributed systems
Explain basic data management techniques in distributed databases
Explain the major components of batch processing systems
Create batch algorithms for novel (unseen) practical problems
Explain the difference between iterative/non-iterative algorithms
Explain basic graph algorithms and their implementation in batch processing systems
Describe in which scenarios streaming algorithms are most applicable
Apply basic streaming algorithms to practical problems
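
As a rough illustration of the basic operations named in the first objective (filtering, projecting, folding), here is a minimal sketch in plain Python. The language choice, the `records` data and the 50-byte threshold are assumptions made for this example only; they are not taken from the course material or its assignments.

```python
from functools import reduce

# Hypothetical dataset, invented for this illustration.
records = [
    {"user": "ann", "bytes": 120},
    {"user": "bob", "bytes": 30},
    {"user": "ann", "bytes": 50},
]

# Filtering: keep only the records that satisfy a predicate.
large = [r for r in records if r["bytes"] >= 50]

# Projecting: keep only one field of each remaining record.
sizes = [r["bytes"] for r in large]

# Folding: combine all values into a single result, starting from 0.
total = reduce(lambda acc, b: acc + b, sizes, 0)

print(total)  # 170
```

The same three steps map onto the `filter`, `map` and `fold`/`reduce` style operations found in most big data processing APIs.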

Course organization

5 ECTS: This means that you need to devote at least 140 hours of study to this course.
Online meetings: The course consists of 12 2-hour meetings. You are not required, but you are strongly encouraged, to attend.
Homework: In the homework assignments, you will have to write code or reply to open questions. You will always work in pairs.
Groups: Students are responsible for forming pairs and communicating them to the course TAs by registering them on Weblab.
Labs: 4 hours per week, designed to help you work together and get support from teaching assistants.
Teaching Assistants: Teaching assistants will be available during lab hours to provide you with feedback on your assignments. Do be active in asking questions, but don’t expect them to provide you with solutions.

Contents and tentative schedule

Week	Date	Topic	Teacher	Assignment (Deadline)
1	4/9	Course introduction, Big and Fast data, Intro to course PL	BKO
1	6/9	Programming for Big Data (1): Basic data types, FP in a nutshell	BKO
2	11/9	no lecture		FP (28/9)
2	13/9	Programming for Big Data (2): Enumerating datasets	BKO
3	18/9	Data Engineering on the Command Line (1), Diomidis’s slides	DS	Unix (5/10)
3	20/9	Data Engineering on the Command Line (2), Diomidis’s slides	DS
4	25/9	Distributed Systems	BKO
4	27/9	Distributed Databases	BKO
5	2/10	Distributed Filesystems	BKO
5	4/10	Spark: RDDs and Pair RDDs	BKO	Spark (23/10)
6	9/10	Spark Internals, Spark SQL	BKO
6	11/10	Graph Processing	BKO
7	16/10	Stream processing	BKO	Flink (02/11)
7	18/10	Stream processing systems	BKO
8	23/10	Recap (no lecture)

The schedule is subject to small changes during the term.

You can access the websites of the previous edition here. The lecture notes are based on the lectures taught by Georgios Gousios in 2017-2020.

Teachers

The lab work is supervised by Thomas Overklift.

Online resources

Portions of this course have been converted to online educational material by other TU Delft teachers. Please take a look at the following EdX MOOCs / ProfEds:

Unix Tools: Data, Software and Production Engineering, by Diomidis Spinellis
Introduction to Functional Programming for Big Data Processing, by Jan Rellermeyer
Taming Big Data Streams: Real-time Data Processing at Scale, by Asterios Katsifodimos

Use them at your discretion to improve your skills.

(TU Delft only): You can find the Collegerama recordings from 2019 here. Please note that the course contents have slightly changed, so do not base your exam studying on the old lectures.

Assignments

You can find the course assignments on Brightspace and linked through this page.

All assignments are mandatory.

For submission, we will use Weblab. The course name is CSE2520: Big Data Processing.

You need to sign up to enroll and also declare your pair
All the assignments have deadlines
Feedback and grading are automatic: the results are available on Weblab.
Technical support: ask in the Mattermost channel
- If you get no feedback after 1 hour: DO ASK THE TAs.

The student groups must submit each assignment before 23:59 on the day of the deadline.

Late submission: All submissions must be handed in on time, with no exceptions. Any late submission will be discarded and graded with 0. In case of provable sickness, please contact the course teacher to arrange a case-specific deadline.

Assessment

Lab assignments (40%): Grade calculated as mean grade for all assignments. There is no minimum grade per individual assignment. If you don’t submit an assignment, or the submission is late, you will get a 0. Each assignment counts for 25% of the lab part. The final lab grade has a minimum of 5.
Written Exam (60%): Closed-book exam. Minimum grade: 5
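
The weighting above can be sketched as a small calculation. This is only an illustration under stated assumptions: it reads the two "minimum" rules as thresholds that the lab grade and the exam grade must each reach, it applies no rounding, and the function names are hypothetical. The official rules remain the ones stated in the text.

```python
LAB_WEIGHT = 0.4   # lab assignments count for 40%
EXAM_WEIGHT = 0.6  # written exam counts for 60%
MINIMUM = 5.0      # assumed threshold for both the lab and the exam grade


def lab_grade(assignment_grades):
    """Mean over all assignments; a missing or late one is entered as 0."""
    return sum(assignment_grades) / len(assignment_grades)


def final_grade(lab, exam):
    """Weighted combination of the lab grade and the exam grade."""
    return LAB_WEIGHT * lab + EXAM_WEIGHT * exam


def meets_minimums(lab, exam):
    """Assumption: both parts must individually reach the minimum of 5."""
    return lab >= MINIMUM and exam >= MINIMUM


# Hypothetical student: four assignments (one not submitted) and an exam.
lab = lab_grade([8, 6, 0, 10])
print(lab, final_grade(lab, 7.0), meets_minimums(lab, 7.0))
```

With four assignments, the mean gives each one a weight of 25% of the lab part, so a single missing submission lowers the lab grade by a quarter of the grade it would otherwise have earned.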

Example exam material

Resit policy

There will be an exam resit during Q2/3. You are allowed to transfer your assignment grade to the resit as a whole.

If the final assignment grade is between 4 and 5.7, one of the previously submitted assignments can be repaired.

Partial grades (assignments or exam) cannot be transferred to subsequent years.

Disclaimer: information may change depending on unforeseen circumstances or measures (see: TER).

Course resources

Online lab sessions every Wednesday afternoon
- Ask questions about the next assignment
- Ask questions about your grades
You are welcome to join the BDP 2024-2025 Mattermost team.
- General questions can be asked here. Questions specific to your solutions should be asked during the lab
- TAs won’t always be active; don’t expect answers outside of lab hours
- You are encouraged to help each other, but refrain from sharing solutions (this is considered fraud)
- All group members are responsible for the submission of a group assignment. Piggybacking is considered fraud.
- Be nice :)

The course, by design, touches upon various current technologies; as such, there is no single source of truth. The following is an indicative list of resources where more information can be found. If you were to buy a single book for this course, we recommend [1].

Bibliography

[1]

M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.

[2]

J. Laskowski, “Mastering Apache Spark 2,” 2017. [Online]. Available: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details.

[3]

S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.

[4]

H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.

[5]

H. Karau and R. Warren, High performance Spark. O’Reilly Media, Inc., 2017.

[6]

B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.

[7]

T. Akidau, S. Chernyak, and R. Lax, Streaming systems: The what, where, when, and how of large-scale data processing. O’Reilly, 2018.

[8]

C. Martella, R. Shaposhnik, D. Logothetis, and S. Harenberg, Practical graph analytics with Apache Giraph. Springer, 2015.

[9]

I. Robinson, J. Webber, and E. Eifrem, Graph databases: New opportunities for connected data. O’Reilly Media, Inc., 2015.