Programming Languages for Big Data Processing

Scala and Python

The de facto languages of Big Data and data science are

  • Scala: mostly used for data-intensive systems
  • Python: mostly used for data analytics tasks

Other languages include

  • Java: the “assembly” of big data systems; the language that most big data infrastructure is written in.
  • R: the statistician’s tool of choice. Great selection of libraries for serious data analytics, great plotting tools.

In our course, we will be using Scala.

Scala and Python from 10k feet

  • Both support object orientation, functional programming and imperative programming
    • Scala’s strong point is the combination of FP and OO
    • Python’s strong point is the combination of OO and IP
  • Python is interpreted, Scala is compiled

Hello world

Scala

object Hello extends App {
    println("Hello, world")
    for (i <- 1 to 10) {
      System.out.println("Hello")
    }
}
  • Scala is compiled to JVM bytecode
  • Can interoperate with JVM libraries
  • Scala is not sensitive to spaces/tabs. Blocks are denoted by { and }
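
As a minimal sketch of the JVM interoperability mentioned above, Scala code can instantiate and call Java classes directly. The object name Interop and its members are made up for illustration; java.util.ArrayList and java.time.LocalDate are standard JDK classes.

```scala
import java.time.LocalDate
import java.util.{ArrayList => JArrayList}

object Interop {
  // Build and fill a Java collection directly from Scala
  def javaList(): JArrayList[String] = {
    val xs = new JArrayList[String]()
    xs.add("spark")
    xs.add("flink")
    xs
  }

  // Call a static Java factory method, then an instance method
  def year2020IsLeap: Boolean = LocalDate.of(2020, 1, 1).isLeapYear()
}
```

No wrappers or bindings are needed: Java types are used like any Scala type.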

Declarations

Scala

val a: Int = 5
val b = 5    // Type Int is inferred
b = 6        // Compile error: reassignment to val

// Type of foo is inferred
val foo = new ImportantClass(...)

var c = "Foo"
c = "Bar"
c = 4        // Compile error: type mismatch, c is a String
  • Type inference used extensively
  • Two types of variables: vals are single-assignment, vars are multiple assignment

Declaring functions

Scala

def max(x: Int, y: Int): Int = 
  if (x >= y) x else y
  • Statically typed
  • Evaluated expressions have types
  • When the return type is omitted, it is inferred as the most specific common supertype of all result expressions
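
The following sketch illustrates how the result type of a function is inferred from its branches. The object name Inference and the function describe are hypothetical and only serve the example.

```scala
object Inference {
  // Both branches are Int, so the inferred result type is Int
  def max(x: Int, y: Int) = if (x >= y) x else y

  // One branch is an Int, the other a String: the result type is
  // inferred as a common supertype of both (Any in Scala 2)
  def describe(x: Int) = if (x > 0) x else "non-positive"

  // Writing the type explicitly documents intent and is checked by the compiler
  def maxExplicit(x: Int, y: Int): Int = if (x >= y) x else y
}
```

In practice, declaring result types on public functions is a common habit, since an accidental mix of branch types would otherwise be silently accepted.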

Higher order functions

Scala

def bigger(x: Int, y: Int,
  f: (Int,Int) => Boolean) =
  f(x, y)

bigger (1, 2, (x, y) => (x < y))
bigger (1, 2, (x, y) => (x > y))
// Compile error
bigger (1, 2, x => x)

bigger is a higher-order (HO) function, i.e. a function whose behaviour is parametrised by another function; f is its function parameter. To call a HO function, we need to pass a function with the appropriate argument and result types. In Scala, the compiler checks this.
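
A runnable sketch of the same idea follows, showing three common ways of supplying the function argument. The object name HigherOrder and the helper lessThan are assumptions made for this example; filter is a standard method on Scala collections.

```scala
object HigherOrder {
  // Same shape as bigger above: the comparison is passed in as f
  def bigger(x: Int, y: Int, f: (Int, Int) => Boolean): Boolean = f(x, y)

  // A named method can be passed where a function value is expected
  def lessThan(x: Int, y: Int): Boolean = x < y

  val viaLambda: Boolean = bigger(1, 2, (x, y) => x > y)   // false
  val viaMethod: Boolean = bigger(1, 2, lessThan)          // true
  val viaPlaceholder: Boolean = bigger(1, 2, _ < _)        // true

  // Standard-library collections are built around the same idea
  val evens: List[Int] = List(1, 2, 3, 4).filter(_ % 2 == 0) // List(2, 4)
}
```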

Declaring classes

Scala

class Foo(val x: Int,
          var y: Double = 0.0)

// Type of a is inferred
val a = new Foo(1, 4.0)
println(a.x)  // x is read-only
println(a.y)  // y is read-write
a.y = 10.0
println(a.y)  // prints the updated value
a.y = "Foo"   // Compile error: type mismatch, y is a Double
  • val means a read-only attribute. var is read-write
  • The class parameters define the primary constructor, which is created automatically

Object-Oriented programming

Scala

class Foo(val x: Int,
          var y: Double = 0.0)

class Bar(x: Int, y: Int, z: Int)
  extends Foo(x, y)

trait Printable {
  val s: String
  def asString() : String
}

class Baz(x: Int, y: Double, private val z: Int)
  extends Foo(x, y) with Printable {
  override val s: String = "Baz"
  override def asString(): String = ???
}

In both languages, the traditional rules of method overriding apply. Traits in Scala are similar to Java interfaces with default methods (Java 8 and later); in addition, they can include attributes (state).
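
The sketch below shows a trait that carries both a default method implementation and state, mixed into a class together with a second trait. The names TraitsDemo, Counter and Baz2 are hypothetical; only Printable mirrors the example above.

```scala
object TraitsDemo {
  trait Printable {
    val s: String                        // abstract attribute
    def asString(): String = s"<$s>"     // method with a default implementation
  }

  trait Counter {
    private var count = 0                // traits can hold mutable state
    def increment(): Int = { count += 1; count }
  }

  // A class can mix in several traits
  class Baz2 extends Printable with Counter {
    override val s: String = "Baz2"
  }
}
```

A usage example: new TraitsDemo.Baz2().asString() returns "<Baz2>", and each call to increment() on the same instance returns the next integer.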

Data classes

Scala

case class Address(street: String, 
  number: Int)
case class Person(name: String, 
  address: Address)

val p = Person("G", Address("a", 2))

Data classes are blueprints for immutable objects. We use them to represent data records. Both languages implement equals (or __eq__) for them, so we can compare objects directly.
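
A short sketch of what the compiler generates for case classes: structural equality, copy and toString. The object name Records and the values p1, p2 and moved are made up for the example.

```scala
object Records {
  case class Address(street: String, number: Int)
  case class Person(name: String, address: Address)

  val p1 = Person("G", Address("a", 2))
  val p2 = Person("G", Address("a", 2))

  // Structural equality: the generated equals compares field by field
  val same: Boolean = p1 == p2                 // true

  // copy builds a modified copy; p1 itself is left unchanged
  val moved: Person = p1.copy(address = Address("b", 3))

  // The generated toString lists the fields
  val shown: String = p1.toString              // Person(G,Address(a,2))
}
```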

Pattern matching in Scala

Pattern matching is if..else on steroids

// Code for demo only, won't compile

value match {
  // Match on a value, like if
  case 1 => "One"
  // Match on the contents of a list
  case x :: xs => "The remaining contents are " + xs
  // Match on a case class, extract values
  case Email(addr, title, _) => s"New email: $title..."
  // Match on the type
  case xs : List[_] => "This is a list"
  // With a pattern guard (the element type Int is erased at runtime, so only the guard is checked)
  case xs : List[Int] if xs.head == 5 => "This is a list of integers"
  case _ => "This is the default case"
}
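
Since the snippet above is for demonstration only, here is a compilable sketch along the same lines. The Email case class is a hypothetical stand-in for the one used above, and the object name Matching and the function describe are assumptions for this example. Cases are tried top to bottom, so the more specific list pattern comes before the general one.

```scala
object Matching {
  // Hypothetical record type standing in for the Email class used above
  case class Email(addr: String, title: String, body: String)

  def describe(value: Any): String = value match {
    // Match on a value
    case 1                       => "One"
    // Match on a case class, extract values
    case Email(_, title, _)      => s"New email: $title..."
    // Match on the empty list
    case Nil                     => "An empty list"
    // Match on the head's type, with a pattern guard
    case (x: Int) :: _ if x == 5 => "A list starting with 5"
    // Match on the contents of a list
    case x :: xs                 => s"Head $x, the remaining contents are $xs"
    // Match on the type
    case s: String               => s"A string of length ${s.length}"
    case _                       => "This is the default case"
  }
}
```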

Reading ahead

This is by no means a complete introduction to either programming language. Please read more here
