CSE2520: Exam November 2021

Please read the following information carefully.

This exam consists of 16 multiple-choice questions (16 pts), 3 open-ended questions (7 pts), and 3 programming questions (9 pts).
The points for the multiple-choice part of the exam are computed as: \[\#correct\_answers - (\#incorrect\_answers / 3)\] This accounts for a 25% guessing correction, corresponding to the four-choice questions we use.
You have 2 hours to complete this exam.
Before you hand in your answers, check that the sheet contains your name and student number, in both the human-readable and computer-readable formats.
The use of the book, notes, calculators or other sources is strictly prohibited.
Note that the order of the letters next to the boxes on your multiple-choice sheet may not always be A-B-C-D! Tip: mark your answers on this exam first, and only after you are certain of your answers, copy them to the multiple-choice answer form.
Read every question carefully and, for the open questions, give all the information requested, which should always include a brief explanation of your answer. However, do not give irrelevant information, as this could lead to a deduction of points.
Note that the minimum score per (sub)question is 0 points.
You may write on this exam paper and take it home.
Exam is © 2021 TU Delft.

Multiple choice questions (16 questions, 16 pts)

The command ls -l lists the file runTests.py as follows:
```
-rwxr-xr-x 1 alice alice 1728 Nov 21 10:32 runTests.py
```
Which of the following statements about runTests.py is not correct?

The size of the file is 1728 bytes
It can only be written by the user alice
It can only be executed by the user alice
Its permissions shortcode is 755

Which of the following pipelines lists the duplicate lines (lines that occur more than once) in the file file.txt together with their number of occurrences? Assume that no line occurs ten or more times.

Reminder of the relevant options of the uniq command:
-c prefixes lines by the number of occurrences
-d only prints duplicate lines, one for each group

sort file.txt | uniq -c | sort -n | tr -s ' ' | grep "^ [1-9]* .*"
sort file.txt | uniq -c | sort -n | tr -s ' ' | grep "^ [2-9]* .*"
sort file.txt | uniq -d | sort -n | tr -s ' ' | grep "^ [1-9]* .*"
sort file.txt | uniq -d | sort -n | tr -s ' ' | grep "^ [2-9]* .*"

What does unknown evaluate to after the following lines?

val list = List(1,2,3,4)
val unknown = list.foldRight(Nil:List[Int])(_ :: _)

List(4, List(3, List(2, List(1))))
List(1,2,3,4)
List(4,3,2,1)
List(1, List(2, List(3, List(4))))

Choose the correct implementation of the Monad interface in Scala:

    trait Monad[M[_]] {
      def unit[S](a: S) : M[S]
      def flatMap[S, T] (m: M[S], f: S => M[T]) : Monad[T]
    }

    trait Monad[M[_]] {
      def unit[S](a: S) : M[S]
      def map[T] (m: M[S], f: S => M[T]) : Monad[T]
    }

    trait Monad[M[_]] {
      def map[T] (m: M[S], f: S => M[T]) : Monad[T]
      def reduce[T,B](init: B, f: (B,T) => M[B]): Monad[B]
    }

    trait Monad[M[_]] {
      def map[T] (m: M[S], f: S => M[T]) : Monad[T]
      def flatMap[S, T] (m: M[S], f: S => M[T]) : T
    }

Which of the following statements about fault tolerance in Spark is false?

Relies on the immutability of RDDs
Uses RDD lineage to identify what needs to be recomputed
Recomputes DAG to reassign the task of the faulty node
There is no need to restart the whole application from scratch if a worker node fails

Which Spark RDD function does the following signature correspond to? \[ RDD[A].f(z: B, op1: (B, A) \rightarrow B, op2: (B, B) \rightarrow B ) \rightarrow B \]

reduce
fold
aggregate
mapPartitions

Which of the following statements about Spark is false?

Transformations are not executed until an action is called
mapValues is an example of a function that requires shuffling
Allows the programmer to customize the partitioning of pair RDDs
Lazy transformations enable Spark to run more efficiently

In a system with 3 nodes A, B, C, six events took place in the following order:
- Event 1. A sends msg1 to B
- Event 2. B sends msg2 to A
- Event 3. B receives msg1
- Event 4. A receives msg2
- Event 5. B sends msg3 to C
- Event 6. C receives msg3
Which pair of events is not causally related (i.e., the two events are concurrent)? (You can compute vector clocks to check causal dependencies.)

(Event 2, Event 4)
(Event 3, Event 4)
(Event 3, Event 5)
(Event 3, Event 6)

Which of these statements about replication architectures is false?

With synchronous replication, the write operations cannot be processed if a follower does not respond
Asynchronous replication guarantees that the followers are up-to-date with the leader
Multi-leader replication requires a policy to resolve conflicting write requests
Leaderless replication processes read/write queries based on a quorum of replicas

Which of these sentences about the Raft consensus algorithm is false?

Raft defines 2 cluster states: “Leader election” and “Log replication”.
Every Raft term has exactly one leader.
Raft ensures that only servers with up-to-date logs can become leader.
The Raft algorithm works by assuming that the exchanged messages are valid and true.

What is Byzantine fault tolerance?

Resilience against multiple node crashes
Resilience against multiple message losses
Resilience against partitioned network
Resilience against malicious nodes

We’re designing a system that replicates a sequence of user commands on a geographically distributed set of nodes. Our system ensures that all previous commands are replicated to all the nodes before processing a new one.

Given that our system runs on an unreliable network, which of the following properties cannot always be satisfied by our system?

Atomicity
Availability
Consistency
Durability

The responsibility of a GFS chunkserver/HDFS datanode is to:

Store filesystem path information
Split the data into partitions
Store data partitions
Replicate the data onto multiple disks

Which of the following statements about Pregel and the Bulk Synchronous Parallel (BSP) model is false?

BSP is an edge-centric approach where the algorithm iterates over the edges
BSP executes in supersteps each of which involves local computation followed by message sending
Supersteps are globally synchronized among all vertices
Pregel is an adaptation of BSP for graph processing

An advertising company wants to monitor the sequence of videos viewed by a registered user in one sitting. Which type of streaming window would be most suitable for analyzing the user's behavior?

Sliding window
Tumbling window
Global window
Session window

What will be produced by the following Flink code snippet?
```
dataStream.map(c => (c.id, 1)).keyBy(x => x._1).sum(1)
```
- Stream of the sums per ID
- Stream with a single number (the total sum)
- Stream of increasing integers
- Runtime Exception

Open-ended questions (3 questions, 7 pts)

(2pts) The figure below depicts an execution history that operates on a distributed integer variable x whose initial value is 0. The operations in the history are labelled with A-G. Is the given execution history linearizable? If yes, provide a valid linearization. If not, provide an example operation that violates linearizability.

Answer:

It is not linearizable.

Operation E cannot read x=1, because operation G, which takes effect before E, already reads x=2.

It would be linearizable if G read x=1 or E read x=2.

(3pts) Given the following Spark program, write the types of the variables a, b, and c.

  case class Student(sid: Int, sname: String)
  case class Enrollment(sid: Int, cid: Int)

  val students: RDD[Student] = ??? // reads students from a file into an RDD
  val enrollments: RDD[Enrollment] = ??? // reads enrollments from a file into an RDD

  val a = students.map(s => (s.sid, s.sname))
  val b = enrollments.groupBy(_.sid).mapValues(_.size)
  val c = a.join(b)

Answer:

val a: RDD[(Int, String)]

val b: RDD[(Int, Int)]

val c: RDD[(Int, (String, Int))]
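
The types above can be checked without a Spark installation by mirroring the program on plain Scala collections. The following is only a rough sketch: the sample data and the object name `TypesSketch` are hypothetical, `Seq` and `Map` stand in for `RDD`, and the inner join is written by hand because plain collections have no `join`.

```scala
// Plain-collections analogue of the Spark program; no Spark dependency needed.
case class Student(sid: Int, sname: String)
case class Enrollment(sid: Int, cid: Int)

object TypesSketch {
  // Hypothetical sample data (not part of the exam question).
  val students: Seq[Student] = Seq(Student(1, "Ann"), Student(2, "Bob"))
  val enrollments: Seq[Enrollment] =
    Seq(Enrollment(1, 10), Enrollment(1, 20), Enrollment(2, 10))

  // Map to (key, value) pairs: Seq[(Int, String)] mirrors RDD[(Int, String)].
  val a: Seq[(Int, String)] = students.map(s => (s.sid, s.sname))

  // Group by student id, then count: Map[Int, Int] mirrors RDD[(Int, Int)].
  val b: Map[Int, Int] =
    enrollments.groupBy(_.sid).map { case (sid, es) => (sid, es.size) }

  // Hand-written inner join on the key: mirrors RDD[(Int, (String, Int))].
  val c: Seq[(Int, (String, Int))] =
    a.flatMap { case (sid, name) =>
      b.get(sid).toList.map(n => (sid, (name, n)))
    }
}
```

The explicit type annotations on `a`, `b`, and `c` make the compiler confirm the element types given in the answer.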

(2pts) What are wide and narrow transformations in Spark? Describe them in 1-2 sentences and provide an example of each type of transformation.

Answer:

Wide transformations are transformations that involve a shuffle of the data between the partitions.

Examples: `groupByKey()`, `join()`

Narrow transformations are transformations in which each input partition contributes to only one output partition, so they do not require a shuffle between partitions.

Examples: `map()`, `filter()`
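
The difference can be illustrated with a toy model in plain Scala, where a dataset is a vector of partitions. This is a sketch under stated assumptions, not Spark's implementation: the names `ShuffleSketch`, `narrowMap`, and `wideGroupByKey` are hypothetical, and hash partitioning is assumed for the shuffle.

```scala
object ShuffleSketch {
  type Partition[A] = Vector[A]

  // Narrow: each output partition is computed from exactly one input partition,
  // so no records move between partitions.
  def narrowMap[A, B](parts: Vector[Partition[A]])(f: A => B): Vector[Partition[B]] =
    parts.map(_.map(f))

  // Wide: all records with the same key must land in the same output partition,
  // so an output partition may need records from every input partition (a shuffle).
  def wideGroupByKey[K, V](
      parts: Vector[Partition[(K, V)]],
      numOut: Int
  ): Vector[Partition[(K, Vector[V])]] = {
    // Assumed hash partitioning: the key's hash decides the target partition.
    val shuffled = parts.flatten.groupBy { case (k, _) =>
      java.lang.Math.floorMod(k.hashCode, numOut)
    }
    Vector.tabulate(numOut) { i =>
      shuffled
        .getOrElse(i, Vector.empty)
        .groupBy(_._1)
        .map { case (k, kvs) => (k, kvs.map(_._2)) }
        .toVector
    }
  }
}
```

In `narrowMap` the partition boundaries are untouched, whereas `wideGroupByKey` first has to flatten and redistribute all records by key, which is the step that corresponds to the shuffle.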

Functional programming questions (3 questions, 9 pts)

See the FP practice questions in Weblab.

Burcu Kulahcioglu Ozkan

5 November 2021
