Programming Languages for Big Data Processing

Scala and Python

The de facto languages of Big Data and data science are

  • Scala: mostly used for data-intensive systems
  • Python: mostly used for data analytics tasks

Other languages include

  • Java: the “assembly” of big data systems; the language in which most big data infrastructure is written.
  • R: the statistician’s tool of choice, with a great selection of libraries for serious data analytics and great plotting tools.

In our course, we will be using Scala and Python.

Scala and Python from 10k feet

  • Both support object-oriented (OO), functional (FP) and imperative (IP) programming
    • Scala’s strong point is the combination of FP and OO
    • Python’s strong point is the combination of OO and IP
  • Python is interpreted, Scala is compiled

Hello world

Scala

object Hello extends App {
    println("Hello, world")
    for (i <- 1 to 10) {
      System.out.println("Hello")
    }
}
  • Scala is compiled to JVM bytecode
  • Can interoperate with JVM libraries
  • Scala is not sensitive to spaces/tabs. Blocks are denoted by { and }
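
For comparison, a minimal Python sketch of the same program. The function name `hello` and its `times` parameter are illustrative choices, not part of the original slides; the function returns the lines instead of only printing them so that the result is easy to inspect.

```python
# Python counterpart of the Scala Hello program (illustrative sketch).
def hello(times: int = 10) -> list[str]:
    """Return the lines the program prints."""
    lines = ["Hello, world"]
    # Blocks are delimited by indentation, not by { and }.
    for i in range(1, times + 1):  # same range as Scala's `1 to 10`
        lines.append("Hello")
    return lines


if __name__ == "__main__":
    for line in hello():
        print(line)
```

Unlike Scala, Python is sensitive to indentation: changing the indentation of the loop body changes the program, or makes it invalid.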

Declarations

Scala

val a: Int = 5
val b = 5
b = 6 // compile error: reassignment to val

// Type of foo is inferred
val foo = new ImportantClass(...)

var c = "Foo"
c = "Bar"
c = 4 // compile error: type mismatch, c is a String
  • Type inference used extensively
  • Two kinds of variables: vals are single-assignment, vars can be reassigned
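
A rough Python counterpart, as a sketch: Python has no val/var distinction, and type annotations are not enforced by the interpreter. The names `c` and `MAX_RETRIES` are illustrative only.

```python
from typing import Final

a: int = 5   # annotated, but the annotation is not checked at runtime
b = 5        # the type is determined dynamically
b = 6        # rebinding is always allowed; there is no equivalent of val

c = "Foo"
c = "Bar"
c = 4        # no type mismatch error: a name can be rebound to any type

# Final marks a name as single-assignment for static checkers such as mypy;
# the interpreter itself does not enforce it.
MAX_RETRIES: Final = 3
```

The errors that the Scala compiler reports for reassignment and type mismatches have no runtime equivalent here; an external type checker is needed to catch them.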

Declaring functions

Scala

def max(x: Int, y: Int): Int = 
  if (x >= y) x else y
  • Statically typed
  • Evaluated expressions have types
  • When omitted, the return type is inferred as the least general type that covers all result expressions (their least upper bound)
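
A Python sketch of the same function, for comparison. It is named `maximum` here (an illustrative choice) to avoid shadowing Python's built-in `max`. The type hints document intent but are not checked when the code runs.

```python
def maximum(x: int, y: int) -> int:
    # A conditional expression, like Scala's if/else, yields a value.
    return x if x >= y else y


largest = maximum(3, 7)

# The hints are not enforced at runtime: strings are accepted as well,
# because >= is also defined for them.
not_an_int = maximum("a", "b")
```

A static checker such as mypy would flag the second call; the interpreter does not.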

Higher order functions

Scala

def bigger(x: Int, y: Int,
  f: (Int,Int) => Boolean) =
  f(x, y)

bigger (1, 2, (x, y) => (x < y))
bigger (1, 2, (x, y) => (x > y))
// Compile error
bigger (1, 2, x => x)

bigger is a higher-order function, i.e. a function whose behaviour is parametrised by another function; f is a function parameter. To call a higher-order function, we need to pass a function with the appropriate argument types. In Scala, the compiler checks this.
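
A Python sketch of the same higher-order function. The `Callable` hint plays the role of Scala's `(Int, Int) => Boolean`, but it is only checked by external tools; passing a function with the wrong number of parameters fails when the function is called, not before. The variable names below are illustrative.

```python
from typing import Callable


def bigger(x: int, y: int, f: Callable[[int, int], bool]) -> bool:
    # The behaviour of bigger is determined by the function f.
    return f(x, y)


less = bigger(1, 2, lambda x, y: x < y)
greater = bigger(1, 2, lambda x, y: x > y)

# The Scala compiler rejects a one-argument function here.
# Python only fails at runtime, when f is called with two arguments.
try:
    bigger(1, 2, lambda x: x)
    failed_at_runtime = False
except TypeError:
    failed_at_runtime = True
```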

Declaring classes

Scala

class Foo(val x: Int,
          var y: Double = 0.0)

// Type of a is inferred
val a = new Foo(1, 4.0)
println(a.x) //x is read-only
println(a.y) //y is read-write
a.y = 10.0
println(a.y) //y is read-write
a.y = "Foo"   // Type mismatch, y is double
  • val means a read-only attribute. var is read-write
  • The class parameters define the primary constructor, so no separate constructor needs to be written
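
One way to approximate this class in Python, as a sketch: a read-only attribute is commonly modelled with a property that has no setter, and the constructor is written explicitly as `__init__`. The `_x` field and the `x_is_read_only` flag are illustrative names.

```python
class Foo:
    def __init__(self, x: int, y: float = 0.0):
        self._x = x   # stored privately and exposed read-only below
        self.y = y    # plain attribute: read-write

    @property
    def x(self) -> int:
        return self._x


a = Foo(1, 4.0)
a.y = 10.0            # fine, y is read-write

x_is_read_only = False
try:
    a.x = 2           # the property has no setter
except AttributeError:
    x_is_read_only = True

a.y = "Foo"           # accepted at runtime: no type mismatch error
```

In contrast to Scala, the final assignment is not rejected; the `float` hint on `y` is only visible to static checkers.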

Object-Oriented programming

Scala

class Foo(val x: Int,
          var y: Double = 0.0)

class Bar(x: Int, y: Int, z: Int)
  extends Foo(x, y)

trait Printable {
  val s: String
  def asString() : String
}

class Baz(x: Int, y: Double, private val z: Int)
  extends Foo(x, y) with Printable {
  override val s: String = "Baz" // "= s" would refer to the field itself
  override def asString(): String = ???
}

In both languages, the traditional rules of method overriding apply. Traits in Scala are similar to interfaces with default methods in Java (8 and later); in addition, they can include attributes (state).
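
A Python sketch of a similar hierarchy. Python has no traits; an abstract base class combined with multiple inheritance is one common approximation, used here as an assumption rather than as the course's prescribed approach. The method name `as_string` follows Python naming conventions.

```python
from abc import ABC, abstractmethod


class Foo:
    def __init__(self, x: int, y: float = 0.0):
        self.x = x
        self.y = y


class Printable(ABC):
    # Abstract method: concrete subclasses must override it.
    @abstractmethod
    def as_string(self) -> str: ...


class Baz(Foo, Printable):
    def __init__(self, x: int, y: float, z: int):
        super().__init__(x, y)   # calls Foo.__init__
        self._z = z              # leading underscore: private by convention

    def as_string(self) -> str:
        return f"Baz({self.x}, {self.y}, {self._z})"


baz = Baz(1, 2.0, 3)
text = baz.as_string()
```

Instantiating `Printable` directly raises a `TypeError`, which is roughly comparable to Scala refusing to instantiate a trait with abstract members.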

Data classes

Scala

case class Address(street: String, 
  number: Int)
case class Person(name: String, 
  address: Address)

val p = Person("G", Address("a", 2))

Data classes are blueprints for immutable objects. We use them to represent data records. Both languages implement equals (or __eq__) for them, so we can compare objects directly.
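
A Python sketch of the same records using the standard `dataclasses` module. Note that Python data classes are mutable by default; `frozen=True` is needed to obtain the immutability described above. The variables `q` and `mutation_rejected` are illustrative.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Address:
    street: str
    number: int


@dataclass(frozen=True)
class Person:
    name: str
    address: Address


p = Person("G", Address("a", 2))
q = Person("G", Address("a", 2))

same_value = (p == q)      # the generated __eq__ compares field by field
same_object = (p is q)     # two distinct objects

try:
    p.name = "H"           # frozen instances reject assignment
    mutation_rejected = False
except FrozenInstanceError:
    mutation_rejected = True
```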

Pattern matching in Scala

Pattern matching is if..else on steroids

// Code for demo only, won't compile

value match {
  // Match on a value, like if
  case 1 => "One"
  // Match on the contents of a list
  case x :: xs => "The remaining contents are " + xs
  // Match on a case class, extract values
  case Email(addr, title, _) => s"New email: $title..."
  // Match on the type
  case xs : List[_] => "This is a list"
  // With a pattern guard
  case xs : List[Int] if xs.head == 5 => "This is a list of integers"
  case _ => "This is the default case"
}

Reading ahead

This is by no means a complete introduction to either programming language. Please read more here
