Skip to content

Margus Roo –

If you're inventing and pioneering, you have to be willing to be misunderstood for long periods of time

  • Cloudbreak Autoscale fix
  • Endast

Apache-Spark first steps

Posted on June 16, 2015 - June 17, 2015 by margusja

There are two main operations in Spark:

1. Transformations – apply some function to all records in dataset. In example map – mapping input to some new output.

2. Actions – runs some computation and aggregation operation and returns result to driver. In example count and reduce.

Spark’s transformations are lazy. It means if you do not use any action you do not get any results after transformations.

Spark recomputes transformation again before each action. In case you do not want it you can save dataset into memory calling:

> transformationDataSet.persist()

 

Load data

scala> val data = sc.textFile(“./demo.csv”) // load datafile

var csvData = data.map(l => l.split(“,”).map(_.trim)) // apply split(“,”) to each row in file. Apply trim to each element in row.

inFile.map(x => x.split(‘ ‘)(0)).reduce((a,b)  => a)

Here we apply transformation map to each line in inFile dataset. We split each row using ‘ ‘ and take first element. Then we apply action reduce where real staff actually happens due the laziness.

reduce – aggregates elements in dataset using function which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.

scala> foo
res94: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> foo.reduce((a, b) => a+b)
res95: Int = 21

As I pointed out function have to be associative. In example (1+6)+(2+5)+(3+4)=21 or (2+6)+(1+5)+(3+4)=21

filter (action)

scala> foo
res51: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> foo.filter(x => x > 3)
res52: List[Int] = List(4, 5, 6)

scala> csvData.map(l => l(0).toInt).map(_+1).map(_-1).reduce((a,b) => a+b) // one pointless example how to pipe transfer operations.

Anonymous functions (An anonymous function is a function that is not stored in a program file, but is associated with a variable whose data type is function_handle . Anonymous functions can accept inputs and return outputs, just as standard functions do. However, they can contain only a single executable statement)

scala> val increment = (x: Int) => x + 1

scala> increment(1)
res157: Int = 2

scala> csvData.map(l => l(0).toInt).map(increment(_)).reduce((a,b) => a+b)

scala> val sum2 = (x: Int, y: Int) => x + y

scala> csvData.map(l => l(0).toInt).map(increment(_)).reduce((a,b) => sum2(a,b))

Posted in Machine Learning, Math

Post navigation

Clear to land
Arduino and GPRS shield

The Master

Categories

  • Apache
  • Apple
  • Assembler
  • Audi
  • BigData
  • BMW
  • C
  • Elektroonika
  • Fun
  • Hadoop
  • help
  • Infotehnoloogia koolis
  • IOT
  • IT
  • IT eetilised
  • Java
  • Langevarjundus
  • Lapsed
  • lastekodu
  • Linux
  • M-401
  • Mac
  • Machine Learning
  • Matemaatika
  • Math
  • MSP430
  • Muusika
  • neo4j
  • openCL
  • Õpetaja identiteet ja tegevusvõimekus
  • oracle
  • PHP
  • PostgreSql
  • ProM
  • R
  • Turvalisus
  • Varia
  • Windows
Proudly powered by WordPress | Theme: micro, developed by DevriX.