Quick Start
We provide a quick guide to developing stochastic algorithms via Splash. After reading this section, you may look at the examples or learn about the Splash Machine Learning Package.
Install Splash
To install Splash, you need to:
- Download and install the latest version of Scala, sbt and Apache Spark.
- Download the Splash jar file and put it in your project classpath.
- Add the Splash jar file as a dependency when submitting Spark jobs.
For steps 2 and 3, see how to compile and submit these examples.
Import Splash
Once Splash is in your project classpath, you can write a self-contained Scala application using the Splash API. Besides importing Spark packages, you also need to import the Splash package in your Scala code:
import splash.core._
Create Dataset
The first step is to create a distributed dataset. Splash provides an abstraction called a Parametrized RDD. The Parametrized RDD is very similar to the Resilient Distributed Dataset (RDD) of Spark. It can be created from a standard RDD:
val paramRdd = new ParametrizedRDD(rdd)
where rdd is the RDD that holds your dataset.
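Putting this together, a minimal sketch of creating a Parametrized RDD; the application name, local master, and file path docs.txt are hypothetical placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import splash.core._

// Hypothetical setup: a local Spark context and a plain-text dataset.
val conf = new SparkConf().setAppName("SplashQuickStart").setMaster("local[*]")
val sc = new SparkContext(conf)

// Each element of the RDD is one line of the file.
val rdd = sc.textFile("docs.txt")

// Wrap the standard RDD into a Parametrized RDD for Splash.
val paramRdd = new ParametrizedRDD(rdd)
```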
Set Up the Data Processing Function
To execute the algorithm, set a data processing function process for the dataset by invoking
paramRdd.setProcessFunction(process)
The process function is implemented by the user. It takes four objects as input: a data element from the dataset, the weight of this element, the shared variable shared across the dataset, and the local variable associated with this element. The process function reads these inputs and performs updates on the variables. The ability to process weighted elements is mandatory even if the original data is unweighted, because Splash automatically assigns non-unit weights to the samples as part of its parallelization strategy. For example, an algorithm for computing document statistics can be implemented as:
val process = (line: String, weight: Double, sharedVar: SharedVariableSet, localVar: LocalVariableSet) => {
  // accumulate weighted counts into the shared variable
  sharedVar.add("lines", weight)
  sharedVar.add("words", weight * line.split(" ").length)
  sharedVar.add("characters", weight * line.length)
}
In the programming interface, all variables are stored as key-value pairs. The key must be a string. The value can be either a real number or an array of real numbers. In the code above, the shared variable is updated by the add method. If the algorithm needs to read a variable, use the get method:
val v1 = localVar.get(key)
val v2 = sharedVar.get(key)
The local variable can be updated as follows:
localVar.set(key,value)
The shared variable must be updated by transformations. Splash provides three types of operators for transforming the shared variable: add, delayed-add and multiply. See the Splash API section for more information.
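As a sketch of the three operators inside a process function: the operator names (add, delayedAdd, multiply) follow the Splash API section, but their exact signatures and semantics should be treated as assumptions to verify there, and the update rule itself is an arbitrary illustration:

```scala
// Illustrative process function exercising the three shared-variable
// operators. Method names follow the Splash API section; exact
// signatures are assumptions to verify against the API docs.
val updateProcess = (x: Double, weight: Double,
                     sharedVar: SharedVariableSet, localVar: LocalVariableSet) => {
  val w = sharedVar.get("w")               // read the current value
  sharedVar.multiply("w", 0.9)             // multiply: scale the value
  sharedVar.add("w", weight * 0.1 * x)     // add: immediate increment
  sharedVar.delayedAdd("w", 0.1 * (x - w)) // delayed-add: applied later
}
```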
Running the Algorithm
After setting up the processing function, call the run() method to start running the algorithm:
val spc = (new SplashConf).set("max.thread.number", 8)
paramRdd.run(spc)
so that every thread starts processing its local dataset and synchronizes at the end. By default, each call to the run() method makes every thread take a full pass over its local dataset. You can change the amount of data processed per call by configuring the spc object, and you can take multiple passes over the dataset by calling the run() method multiple times.
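For instance, several passes can be taken with one run() call per pass; the pass count and thread number below are arbitrary choices for illustration:

```scala
val spc = (new SplashConf).set("max.thread.number", 8)
// Take five passes over the dataset; each run() call synchronizes
// the shared variables across threads at the end of the pass.
for (pass <- 1 to 5) {
  paramRdd.run(spc)
}
```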
Output and Evaluation
After processing the dataset, you can output the result by:
val sharedVar = paramRdd.getSharedVariable()
println("Lines: " + sharedVar.get("lines").toLong)
println("Words: " + sharedVar.get("words").toLong)
println("Characters: " + sharedVar.get("characters").toLong)
It is also possible to MapReduce the Parametrized RDD. For example, by calling
val loss = paramRdd.map(func).sum()
every element in the dataset is mapped by the func function. If func returns a real number, the results are aggregated across the dataset. This is useful for evaluating the performance of the algorithm. See the Splash API section for more options.
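As a hypothetical example, the total character count could be recomputed by mapping each line to its length. Note that the exact signature of the mapping function may also include the shared and local variables; check the Splash API section before relying on this form:

```scala
// Hypothetical evaluation sketch: map each line to its length and
// aggregate across the dataset with sum().
val func = (line: String) => line.length.toDouble
val totalChars = paramRdd.map(func).sum()
```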