Parametrized RDD Operators

The Parametrized RDD provides a set of operators similar to those of Spark's RDD. Since the Parametrized RDD also maintains local variables and shared variables, it additionally provides operators for manipulating these data structures.

| Operator | Meaning |
|---|---|
| this(rdd) | Constructor. It returns a Parametrized RDD object constructed from rdd. The partitioning of the original rdd is preserved. |
| reshuffle() | Reshuffle all elements across partitions. If the original dataset has not been shuffled, this operation is recommended at the creation of the Parametrized RDD. |
| map(func) | Return an RDD formed by mapping each element by function func. The function takes the element and the associated local/shared variables as input. |
| foreach(func) | Process each element by function func. The function takes the element and the associated local/shared variables as input. |
| reduce(func) | Reduce all elements by function func. The function takes two elements as input and returns a single element as output. |
| mapSharedVariable(func) | Return an RDD formed by mapping the shared variable set by function func. |
| foreachSharedVariable(func) | Process the shared variable set by function func. |
| reduceSharedVariable(func) | Reduce all shared variable sets by function func. The function takes two SharedVariableSet objects as input and returns a single SharedVariableSet object as output. |
| syncSharedVariable() | Synchronize the shared variables across all partitions. This operation often follows the execution of the above four operations. If the shared variables are manually changed but not synchronized, the change may not actually take effect. |
| getSharedVariable() | Return the set of shared variables in the first partition. |
| getAllSharedVariables() | Return the sets of shared variables in all partitions. |
| setProcessFunction(func) | Set the data processing function. The function func takes an element, the weight of the element, and the associated local/shared variables as input. It performs updates on the local/shared variables. |
| setLossFunction(func) | Set a loss function for the stochastic algorithm. The function func takes an element and the associated local/shared variables as input and returns the loss incurred by this element. Setting a loss function is optional, but a reasonable loss function may help Splash choose a better degree of parallelism. |
| run(spc) | Use the data processing function to process the dataset. spc is a SplashConf object; it specifies the hyper-parameters that the system needs. |
| duplicateAndReshuffle(n) | Make n-1 copies of every element and reshuffle them across partitions. This enlarges the dataset by a factor of n. Parallel threads can reduce communication costs by processing a larger local dataset. |
| duplicate(n) | Make n-1 copies of every element without reshuffling. |
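
As a brief illustration, the sketch below wires these operators together for a toy stochastic update. This is a minimal sketch, not a definitive implementation: the import path splash.core, the parameter order (element, weight, localVar, sharedVar) of the process function, and the key w.0 are assumptions made for illustration; consult the API of your Splash version for the exact signatures.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Assumed import path; adjust to the package layout of your Splash build.
import splash.core.{ParametrizedRDD, SplashConf}

object ParametrizedRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("splash-sketch"))
    val data = sc.textFile("data.txt").map(_.toDouble)

    // Wrap the RDD; reshuffle because the raw file is not pre-shuffled.
    val paramRdd = new ParametrizedRDD(data).reshuffle()

    // Process function: one stochastic update per element (assumed argument order).
    paramRdd.setProcessFunction((elem, weight, localVar, sharedVar) => {
      sharedVar.add("w.0", 0.01 * weight * (elem - sharedVar.get("w.0")))
    })

    // Optional loss function; may help Splash pick the degree of parallelism.
    paramRdd.setLossFunction((elem, localVar, sharedVar) =>
      math.abs(elem - sharedVar.get("w.0")))

    val spc = new SplashConf()
    paramRdd.run(spc)   // one round of processing
    println("w.0 = " + paramRdd.getSharedVariable().get("w.0"))
  }
}
```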


Local Variable Operators

The local variables associated with a data element are organized as a LocalVariableSet instance. The supported operators are:

| Operator | Meaning |
|---|---|
| get(key) | Return the value associated with the key. The value is 0 if the variable has never been set. |
| set(key, value) | Set the variable indexed by key to be equal to value. |
| toArray() | Convert the variable set to an array of key-value pairs. |
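
Local variables are read and written inside the process function. The fragment below uses the same assumed argument order as the earlier sketch and hypothetical keys; it counts how many times each element has been visited and contributes to a shared counter only on the first visit.

```scala
paramRdd.setProcessFunction((elem, weight, localVar, sharedVar) => {
  // get returns 0 for a variable that has never been set.
  val visits = localVar.get("visits") + 1
  localVar.set("visits", visits)
  if (visits == 1) {
    // Count each distinct element once in the shared variable set.
    sharedVar.add("visited.elements", weight)
  }
})
```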


Shared Variable Operators

The shared variables are organized as a SharedVariableSet instance. The supported operators are:

| Operator | Meaning |
|---|---|
| get(key) | Return the value of the key. The initial value is 0. |
| add(key, delta) | Add delta to the value of the key. |
| delayedAdd(key, delta) | Same as add, but the operation is not executed instantly. Instead, it is executed the next time the same element is processed. The delayed operation is useful for reversing a previous operation on the same element, or for passing information to the future. |
| multiply(key, gamma) | Multiply the value of the key by gamma. |
| declareArray(key, length) | Declare an array associated with the key. The length argument indicates the dimension of the array. The array has to be declared before it can be manipulated. Generally speaking, manipulating an array of real numbers is faster than manipulating the same number of key-value pairs. |
| getArray(key) | Return the array associated with the key. It returns null if the array has not been declared. |
| getArrayElement(key, ind) | Return the array element with index ind. It returns 0 if the array has not been declared. |
| getArrayElements(key, inds) | Return the array elements with the specified indices inds: Array[Int]. |
| addArray(key, delta) | Add delta: Array[Double] to the array associated with the key. |
| addArrayElement(key, ind, delta) | Add delta to the array element with index ind. |
| addArrayElements(key, inds, delta) | Add delta: Array[Double] at the specified indices inds: Array[Int]. The dimensions of delta and inds should be equal. |
| delayedAddArray(key, delta) | The same as addArray, but the operation is not executed until the next time the same element is processed. |
| delayedAddArrayElement(key, ind, delta) | The same as addArrayElement, but the operation is not executed until the next time the same element is processed. |
| multiplyArray(key, gamma) | Multiply all elements of the array by a real number gamma. The computational complexity of this operation is O(1), independent of the dimension of the array. |
| dontSync(key) | The system will not synchronize this variable at the next round of synchronization. This improves communication efficiency, but may cause unpredictable consistency issues. Don't register a variable as dontSync unless you are sure that it will never be used by other partitions. The dontSync declaration is only effective for the current iteration. |
| dontSyncArray(key) | The same as dontSync, but the object is an array. |
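
A common pattern combines the array operators with delayed updates: apply an element-specific correction now and schedule its reversal for the next visit of the same element. The sketch below is illustrative only; the array key w, the dimension 10, the step size, and the process-function signature are the same assumptions as in the earlier sketches.

```scala
val dim = 10

// Declare the array in every partition, then synchronize the declaration.
paramRdd.foreachSharedVariable(sharedVar => sharedVar.declareArray("w", dim))
paramRdd.syncSharedVariable()

paramRdd.setProcessFunction((elem, weight, localVar, sharedVar) => {
  val step = 0.01 * weight
  // Apply a correction to coordinate 0 for this element ...
  sharedVar.addArrayElement("w", 0, step * elem)
  // ... and reverse it the next time this element is processed.
  sharedVar.delayedAddArrayElement("w", 0, -step * elem)
  // O(1) shrinkage of the whole array, regardless of its dimension.
  sharedVar.multiplyArray("w", 0.999)
})
```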


Splash Configurations

The Splash Configuration class allows the user to customize the execution of the algorithm. Given a SplashConf instance spc, the properties are set by

```scala
spc.set(propertyName, propertyValue)
```

Here is a list of configurable properties:

| Property Name | Default Value | Meaning |
|---|---|---|
| max.thread.num | 0 | The maximum number of threads to run the algorithm. If max.thread.num = 0, then the maximum number of threads is the number of partitions. |
| auto.thread | true | If the value is true, the system automatically determines the number of threads to run the algorithm. Otherwise, the number of threads will be equal to the maximum number of threads. |
| data.per.iteration | 1.0 | Proportion of local data processed in each iteration. In each iteration, every local thread goes through a data.per.iteration proportion of its partition, then all threads synchronize at the end of the iteration. This proportion cannot be greater than 1. If you want to take multiple passes over the dataset in one iteration, use duplicate or duplicateAndReshuffle to enlarge the dataset. |
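
For example, the fragment below configures a run that caps the thread count and processes half of each partition per iteration. The numeric values are arbitrary, and the value types passed to set follow the pattern in the table above; your version of SplashConf may expect string values instead.

```scala
val spc = new SplashConf()
spc.set("max.thread.num", 64)        // at most 64 parallel threads
spc.set("auto.thread", false)        // use exactly max.thread.num threads
spc.set("data.per.iteration", 0.5)   // half of each partition per iteration

for (iter <- 1 to 10) {
  paramRdd.run(spc)                  // one iteration, synchronized at the end
  println(s"iteration $iter: w.0 = ${paramRdd.getSharedVariable().get("w.0")}")
}
```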