# Parallelizing Linear Regression or Using Multiple Sources

Share Tweet

My previous post was explaining how mathematically it was possible to parallelize computation to estimate the parameters of a linear regression. More speficially, we have a matrix

which is
matrix and
a
-dimensional vector, and we want to compute
by spliting the job. Instead of using the
observations, we’ve seen that it was to possible to compute “something” using the first
rows, then the next
rows, etc. Then, finally, we “aggregate” the
objects created to get our overall estimate.

## Parallelizing on multiple cores

Let us see how it works from a computational point of view, to run each computation on a different core of the machine. Each core will see a slave, computing what we’ve seen in the previous post. Here, the data we use are

 1 2 3  y = cars$dist X = data.frame(1,cars$speed) k = ncol(X)

On my laptop, I have three cores, so we will split it in

chunks

 1 2 3 4  library(parallel) library(pbapply) ncl = detectCores()-1 cl = makeCluster(ncl)

This is more or less what we will do: we have our dataset, and we split the jobs,

We can then create lists containing elements that will be sent to each core, as Ewen suggested,

 1 2 3 4 5  chunk = function(x,n) split(x, cut(seq_along(x), n, labels = FALSE)) a_parcourir = chunk(seq_len(nrow(X)), ncl) for(i in 1:length(a_parcourir)) a_parcourir[[i]] = rep(i, length(a_parcourir[[i]])) Xlist = split(X, unlist(a_parcourir)) ylist = split(y, unlist(a_parcourir))

It is also possible to simplify the QR functions we will use

 1 2 3 4 5 6 7 8  compute_qr = function(x){ list(Q=qr.Q(qr(as.matrix(x))),R=qr.R(qr(as.matrix(x)))) } get_Vlist = function(j){ Q3 = QR1[[j]]$Q %*% Q2list[[j]] t(Q3) %*% ylist[[j]] } clusterExport(cl, c("compute_qr", "get_Vlist"), envir=environment()) Then, we can run our functions on each core. The first one is  1   QR1 = parLapply(cl=cl,Xlist, compute_qr) note that it is also possible to use  1   QR1 = pblapply(Xlist, compute_qr, cl=cl) which will include a progress bar (that can be nice when the database is rather large). Then use  1 2 3 4 5 6 7   R1 = pblapply(QR1, function(x) x$R, cl=cl) %>% do.call("rbind", .) Q1 = qr.Q(qr(as.matrix(R1))) R2 = qr.R(qr(as.matrix(R1))) Q2list = split.data.frame(Q1, rep(1:ncl, each=k)) clusterExport(cl, c("QR1", "Q2list", "ylist"), envir=environment()) Vlist = pblapply(1:length(QR1), get_Vlist, cl=cl) sumV = Reduce('+', Vlist)

and finally the ouput is

 1 2 3 4  solve(R2) %*% sumV [,1] X1 -17.579095 X2 3.932409

which is what we were expecting…

## Using multiple sources

In practice, it might also happen that various “servers” have the data, but we cannot get a copy. But it is possible to run some functions on their server, and get some output, that we can use afterwards.

Datasets are supposed to be available somewhere. We can send a request, and get a matrix. Then we we aggregate all of them, and send another request. That’s what we will do here. Provider

should run
on his part of the data, that function will return
. More precisely, to the first provider, send

 1 2 3  function1 = function(subX){ return(qr.R(qr(as.matrix(subX))))} R1 = function1(Xlist[[1]])

and actually, send that function to all providers, and aggregate the output

 1  for(j in 2:m) R1 = rbind(R1,function1(Xlist[[j]]))

The create on your side the following objects

 1 2 3 4  Q1 = qr.Q(qr(as.matrix(R1))) R2 = qr.R(qr(as.matrix(R1))) Q2list=list() for(j in 1:m) Q2list[[j]] = Q1[(j-1)*k+1:k,]

Finally, contact one last time the providers, and send one of your objects

 1 2 3 4  function2=function(subX,suby,Q){ Q1=qr.Q(qr(as.matrix(subX))) Q2=Q return(t(Q1%*%Q2) %*% suby)}

Provider

should then run
on his part of the data, using also
as argument (that we obtained on own side) and that function will return
. For instance, ask the first provider to run

 1  sumV = function2(Xlist[[1]],ylist[[1]], Q2list[[1]])

and do the same with all providers

 1  for(j in 2:m) sumV = sumV+ function2(Xlist[[j]],ylist[[j]], Q2list[[j]])
 1 2 3 4  solve(R2) %*% sumV [,1] X1 -17.579095 X2 3.932409

which is what we were expecting…

Share Tweet
__data