Apache Spark first steps

There are two main kinds of operations in Spark:

1. Transformations – apply a function to every record in a dataset and produce a new dataset. For example, map transforms each input record into a new output record.

2. Actions – run a computation or aggregation over the dataset and return the result to the driver. For example, count and reduce.

Spark's transformations are lazy: nothing is computed until you call an action, so transformations alone never produce a result.

By default Spark recomputes the transformations before each action. If you do not want that, you can cache the dataset in memory by calling:

> transformationDataSet.persist()
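Plain Scala collections can illustrate the same idea, no Spark required: a view is evaluated lazily and is recomputed on every traversal, while forcing it into a strict List plays the role of persist(). This is only an analogy to the Spark behavior, not the Spark API itself:

```scala
var evaluations = 0

// A view is lazy: the map body runs only when elements are demanded.
val lazyData = (1 to 3).view.map { x => evaluations += 1; x * 2 }

println(evaluations) // 0 -- nothing has been computed yet

lazyData.sum         // first "action": the map body runs 3 times
lazyData.sum         // recomputed from scratch: 3 more times
println(evaluations) // 6

// Forcing into a List is analogous to persist(): computed once, reused.
val cached = lazyData.toList
cached.sum
cached.sum
println(evaluations) // 9 -- only the toList traversal added 3
```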

 

Load data

scala> val data = sc.textFile("./demo.csv") // load the data file

val csvData = data.map(l => l.split(",").map(_.trim)) // apply split(",") to each row in the file, then trim each element in the row

inFile.map(x => x.split(' ')(0)).reduce((a,b) => a)

Here we apply the map transformation to each line in the inFile dataset: we split each row on ' ' and take the first element. Then we apply the reduce action, which, due to laziness, is where the real work actually happens.
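The same split-then-reduce pattern can be tried on a plain Scala list (the line contents below are made up for illustration). Note that `(a, b) => a` simply keeps the left operand, so the sequential result is the first extracted field:

```scala
val inFile = List("alice 30", "bob 25", "carol 41")

// Take the first whitespace-separated field of each line,
// then reduce keeping the left operand on every step.
val first = inFile.map(x => x.split(' ')(0)).reduce((a, b) => a)

println(first) // alice
```

On a local list this always yields the first element; on a distributed dataset such a non-commutative function is exactly the kind reduce makes no ordering guarantees about.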

reduce – aggregates the elements of a dataset using a function that takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.

scala> foo
res94: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> foo.reduce((a, b) => a+b)
res95: Int = 21

As I pointed out, the function has to be associative and commutative. For example (1+6)+(2+5)+(3+4)=21 and (2+6)+(1+5)+(3+4)=21: the elements can be paired and grouped in any order and the result stays the same.
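A quick way to convince yourself: pair the elements in different orders, as parallel workers might, and check that every grouping agrees with the straightforward sequential reduce:

```scala
val foo = List(1, 2, 3, 4, 5, 6)

// Two different pairings of the same elements, as different
// workers/partitions might produce them
val grouping1 = (1 + 6) + (2 + 5) + (3 + 4)
val grouping2 = (2 + 6) + (1 + 5) + (3 + 4)

// Sequential left-to-right reduce
val sequential = foo.reduce((a, b) => a + b)

println(grouping1)  // 21
println(grouping2)  // 21
println(sequential) // 21
```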

filter (transformation)

scala> foo
res51: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> foo.filter(x => x > 3)
res52: List[Int] = List(4, 5, 6)

scala> csvData.map(l => l(0).toInt).map(_+1).map(_-1).reduce((a,b) => a+b) // a contrived example of chaining map transformations before a reduce
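Outside of Spark the same chained pipeline looks like this on a plain list (standing in for the parsed first CSV column); since the +1 and -1 maps cancel out, it is just a sum:

```scala
// Stand-in for the first CSV column already parsed to Int
val firstColumn = List(10, 20, 30)

val total = firstColumn
  .map(_ + 1)              // increment every element
  .map(_ - 1)              // ...and undo it again
  .reduce((a, b) => a + b) // sum everything up

println(total) // 60
```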

Anonymous functions (an anonymous function is a function literal defined without a name. It can be assigned to a variable, and it accepts inputs and returns outputs just as a named function does. In Scala, unlike in some languages, its body may contain a whole block of statements, not just a single expression.)

scala> val increment = (x: Int) => x + 1

scala> increment(1)
res157: Int = 2
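Because a Scala anonymous function may carry a whole block as its body, it is not limited to one expression; the value of the last expression in the block is what gets returned:

```scala
// Anonymous function with a multi-statement body
val describe = (x: Int) => {
  val parity = if (x % 2 == 0) "even" else "odd"
  s"$x is $parity" // last expression = return value
}

println(describe(2)) // 2 is even
println(describe(7)) // 7 is odd
```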

scala> csvData.map(l => l(0).toInt).map(increment(_)).reduce((a,b) => a+b)

scala> val sum2 = (x: Int, y: Int) => x + y

scala> csvData.map(l => l(0).toInt).map(increment(_)).reduce((a,b) => sum2(a,b))