Name:
Student Number:
BSc CS/TI or minor student?:
The total number of points is 75. You have 180 minutes: Good Luck!
Please provide brief answers to the following questions. Be precise!
Provide the function signatures for the following functions of the type List[A].
foldL:
reduceR:
flatMap:
scan:
groupBy:
(2 points) Implement groupBy for List[A] using foldL.
Implement foldL for List[A] using foldR.
Implement leftJoin(xs: [(K, A)], ys: [(K, B)]) for KV pairs xs and ys, where the type K signifies the key. What is the return type?
Example 1, in Scala
var sum = 0
for (i <- 1 to 10) {
  sum = sum + i
}
Example 2, in Python
x = {}
for i in [1,2,3,4,5,6]:
    key = i % 2
    if key in x.keys():
        x[key].append(i)
    else:
        x[key] = [i]
A:
C:
I:
D:
(10 points) You are tasked to design a system that reliably delivers messages from one client to another (think WhatsApp). The process to deliver a message from client A to client B is the following:
The system needs to work 24/7 and be resilient to database and queue failures. You will be using ready-made databases and queues that support replication, sharding and transactions. How would you design it? Discuss the advantages and disadvantages of your design decisions.
You are given the following dataset; it contains citation information in the IEEE CS format (lines that end with \ continue to the next line):
S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with spark: \
Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.
H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning spark: \
Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.
B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.
M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.
H. Karau and R. Warren, High performance spark. O’Reilly Media, Inc., 2017.
T. H. Cormen, C. E. Leiserson, Ronald L. Rivest, and C. Stein, Introduction \
to algorithms (3rd ed.). MIT press, 2009.
P. Louridas, Real world algorithms. MIT press, 2017.
The format looks like this:
author1, author2, ... , authorN, title. publisher, year.
O’Reilly Media, Inc., 5
MIT press, 2
M. Zaharia.
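The citation format described above can be split into its fields with plain string operations. The following is a minimal Python sketch, not a reference solution: the helper names `join_continuations` and `parse_citation` are hypothetical, and the splitting rules assume that titles contain no ", " and that publisher names contain no ". ", which holds for the seven records shown but not for citations in general.

```python
import re
from collections import Counter

# The dataset from the exam text; a trailing backslash marks a continued line.
RAW = r"""
S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with spark: \
Patterns for learning from data at scale. O’Reilly Media, Inc., 2015.
H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning spark: \
Lightning-fast big data analysis. O’Reilly Media, Inc., 2015.
B. Chambers and M. Zaharia, Spark: The definitive guide. O’Reilly Media, Inc., 2017.
M. Kleppmann, Designing data-intensive applications. O’Reilly Media, Inc., 2017.
H. Karau and R. Warren, High performance spark. O’Reilly Media, Inc., 2017.
T. H. Cormen, C. E. Leiserson, Ronald L. Rivest, and C. Stein, Introduction \
to algorithms (3rd ed.). MIT press, 2009.
P. Louridas, Real world algorithms. MIT press, 2017.
""".strip()


def join_continuations(text):
    """Merge each physical line ending in a backslash with the line after it."""
    records, buffer = [], ""
    for line in text.splitlines():
        if line.endswith("\\"):
            buffer += line[:-1]  # drop the backslash, keep the space before it
        else:
            records.append(buffer + line)
            buffer = ""
    return records


def parse_citation(record):
    """Split one record into (authors, title, publisher, year)."""
    body = record.rstrip(".")                    # remove the final full stop
    head, year = body.rsplit(", ", 1)            # the year follows the last ", "
    front, publisher = head.rsplit(". ", 1)      # the publisher follows the last ". "
    author_part, title = front.rsplit(", ", 1)   # assumes the title has no ", "
    # Authors are separated by ", ", ", and " or " and ".
    authors = re.split(r",\s*(?:and\s+)?|\s+and\s+", author_part)
    return authors, title, publisher, int(year)


records = join_continuations(RAW)
parsed = [parse_citation(r) for r in records]
publisher_counts = Counter(publisher for _, _, publisher, _ in parsed)
```

On this dataset, `publisher_counts` holds 5 for O’Reilly Media, Inc. and 2 for MIT press, which agrees with the two count lines listed above. The same `parsed` list can be filtered by author, for example to collect everyone who shares a record with M. Zaharia.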
Only answer the following questions if you are a BSc TI student!
For the next 2 questions, you are given a CSV file whose contents look like the following:
$ cat fileA
user_id, account_id, balance
10, 12, 1000
10, 1, 32
12, 122, 5
...
$ cat fileB
account_id, transaction_id, amount
12, 332, 21
122, 21, 20
...
fileA
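The two listings share the account_id column, so rows of fileB can be related to rows of fileA through it. The sketch below only illustrates how the listings could be loaded and matched in Python. It assumes every field is an integer and that a space follows each comma, as in the sample rows. The function names `read_rows` and `transactions_by_account` are hypothetical, and the "..." rows are left out rather than guessed.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the listings above; the "..." lines are omitted.
FILE_A = """\
user_id, account_id, balance
10, 12, 1000
10, 1, 32
12, 122, 5
"""
FILE_B = """\
account_id, transaction_id, amount
12, 332, 21
122, 21, 20
"""


def read_rows(text):
    """Parse CSV text into dicts of ints, ignoring the space after each comma."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [{name: int(value) for name, value in row.items()} for row in reader]


def transactions_by_account(accounts, transactions):
    """Pair every account row with the (possibly empty) list of its transactions."""
    index = defaultdict(list)
    for tx in transactions:
        index[tx["account_id"]].append(tx)
    return [(acc, index.get(acc["account_id"], [])) for acc in accounts]


accounts = read_rows(FILE_A)
transactions = read_rows(FILE_B)
pairs = transactions_by_account(accounts, transactions)
for acc, txs in pairs:
    print(acc["account_id"], [tx["amount"] for tx in txs])
```

Keeping accounts without transactions (account 1 in the sample) makes this behave like a left join from fileA to fileB. Dropping those pairs instead would give inner-join behaviour.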
Please select and circle only one answer per question.
(1 point) What is the correct function signature for reduce on Spark RDDs?
RDD[A].reduce(f: (A,B) -> B)
RDD[A].reduce(f: (A,A) -> A)
RDD[A].reduce(init: B, seqOp: (A, B) -> A, combOp: (B, B) -> B)
RDD[A].reduce(init:A, f: (A,A) -> A)
(1 point) Distributed systems:
(1 point) Lamport timestamps
(1 point) The serializable transaction isolation level protects transactions against:
(1 point) Which of the following methods is part of the Observer interface, for dealing with push-based data consumption?
def subscribe(obs: Observer[A]): Unit
def onNext(a: A): Unit
def map(f: (A) -> B): [B]
def onExit(): Unit
(1 point) A transformation in Spark:
(1 point) Which of the following is a likely computation order of applying reduceR to a list of 10 integers with the ‘+’ operator?
(1 point) Collaborative filtering:
(1 point) What is precision in the context of Machine Learning? (\(TP\) = True Positive, \(FP\) = False Positive, \(TN\) = True Negative, \(FN\) = False Negative)
(1 point) What is Byzantine fault tolerance?
(1 point) What is eventual consistency?
(1 point) Which of the following RDD API calls is a performance killer in Spark?
reduceByKey
keyBy
groupByKey
aggregateByKey
(1 point) Copy-on-write is a technique to:
(1 point) What is the correct signature for rightOuterJoin on Spark RDDs?
RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, W))]
RDD[(K,V)].rightOuterJoin(other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
(1 point) A GFS chunkserver/HDFS datanode is responsible for: