This update also includes some news for R users. First, the DataFrame object continues to be the primary interface for R (and Python users). Although the DataSets structure in Spark is more general, using the single-table DataFrame construct makes sense for R and Python which have analogous native (or near-native, in Python's case) data structures. In addition, Spark 2.0 is set to add a few new distributed statistical modeling algorithms: generalized linear models (in addition to the Normal least-squares and logistic regression models in Spark 1.6); Naive Bayes; survival (censored) regression; and K-means clustering. The addition of survival regression is particularly interesting. It's a type of model used in situations where the outcome isn't always completely known: for example, some (but not all) patients may have yet experienced remission in a cancer study. It's also used for reliability analysis and lifetime estimation in manufacturing, where some (but not all) parts may have failed by the end of the observation period. To my knowledge, this is the first distributed implementation of the survival regression algorithm.
For R users, these models can be applied to large data sets stored in Spark DataFrames, and then computed using the Spark distributed computing framework. Access to the algorithms is via the SparkR package, which hews closely to standard R interfaces for model training. You'll need to first create a SparkDataFrame object in R, as a reference to a Spark DataFrame. Then, to perform logistic regression (for example), you'll use R's standard glm function, using the SparkDataFrame object as the data argument. (This elegantly uses R's object-oriented dispatch architecture to call the Spark-specific GLM code.) The example below creates a Spark DataFrame, and then uses Spark to fit a logisic regression to it:
df <- createDataFrame(sqlContext, iris) model <- glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")
The object model contains most (but not all) of the output created by R's traditional glm algorithm, so most standard R functions that work with GLM objects work here as well.
For more on what's coming in Spark 2.0 check out the DataBricks blog post below, and the preview documentation for SparkR contains more info as well. Also, you might want to check out how you can use Spark on HDInsight with Microsoft R Server.
Databricks: Preview of Apache Spark 2.0 now on Databricks Community Edition