Introduction to parallel computing in R

2014-02-01 Kun Ren

Tags: r, parallel

Original link: https://renkun-ken.github.io/blog/2014/02/01/introduction-to-parallel-computing-in-r.html


For R beginners, the for loop is an elementary flow-control construct that simplifies calling a function repeatedly with different parameters. A typical block of code looks like this:

run <- function(i) {
  # a simple computation on the iteration index
  return((i + 1)/(i^2 + 1))
}
for (i in 1:100) {
  run(i)  # note: the return values are discarded here
}

In this code, we first define a function that calculates something, and then call it from i = 1 to i = 100. The same pattern could drive a Monte Carlo simulation in which we estimate the distribution of a statistic, or compute the theoretical price of a European call option in a binomial tree.
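As a concrete illustration, here is a hypothetical Monte Carlo variant of run (the name run_mc and the choice of statistic are mine, not from the original post): each call produces one realization of a sample mean.

# Hypothetical example: each call simulates the sample mean of 1000
# standard normal draws, one realization of the statistic of interest.
run_mc <- function(i) {
  set.seed(i)        # seed with the iteration index for reproducibility
  mean(rnorm(1000))  # one realization of the sample mean
}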

The code above can be shortened with the higher-order functions lapply or sapply. For example, we can eliminate the for loop with lapply:

lapply(1:100, run)

The code returns a list of values, each equal to run(i) where i iterates over the numeric vector 1:100. The code is simpler, but its internal mechanism does not change at all: the calls still happen one after another.
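For completeness, if a plain vector is preferred over a list, sapply (or unlist applied to the lapply result) does the simplification:

# sapply() simplifies the result to a numeric vector; unlist() flattens
# the list produced by lapply(). Both give the same 100 values.
values <- sapply(1:100, run)
identical(values, unlist(lapply(1:100, run)))  # TRUE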

However, in many Monte Carlo simulations the task is divisible into sub-tasks that are independent of each other. This lets us use parallel computing to speed up the calculation. In other words, if a task can be divided into, say, 3 sub-tasks that can be solved independently, we can use three computers (or three cores) to work on them simultaneously and then aggregate the results.

In the example above, if the run function depends only on the parameter i, we know we can run this for loop in parallel, because it does not matter whether we evaluate i = 10 before i = 20 or the other way around.

This idea leads to parallel computing. Parallelism on a local machine employs multiple cores to perform computing tasks at the same time. It is especially useful for statisticians and econometricians who need to find the distribution of a statistic produced by a particular data generating process. The distribution is estimated from a number of realizations of the statistic, and if the data generating process for each realization is independent of the others, parallel computing can greatly reduce the time needed to get the result.

Here, I introduce two ways to perform parallel computing in R using different packages.

Packages for parallel computing

A considerable number of packages have been developed to support various paradigms of parallel computing. The official list CRAN Task View: High-Performance and Parallel Computing with R offers a brief introduction to the different paradigms and the packages available.

In this article, I only introduce the parallel package and the parallelMap package.

parallel package

The parallel package supports local multi-core parallelism. It has shipped with base R since version 2.14.0, so there is nothing to install; simply load it with library(parallel).

The back-end mechanism is quite transparent: first, we set up a local cluster over multiple CPU cores, which run in parallel and can process data simultaneously. Then we send a task, specified by a function, to all the cluster nodes (cores). Below is a minimal example:

library(parallel)
cl <- makeCluster(detectCores())        # one worker per logical core
result <- clusterApply(cl, 1:100, run)  # evaluate run() on the nodes
values <- do.call(c, result)            # combine the list into a vector
stopCluster(cl)                         # release the workers

First, we load the parallel package. Then we create a cluster of several nodes; detectCores() returns the number of logical processors on your machine. Next, we call clusterApply to distribute the computation over the cluster cl we just created: the elements of the vector 1:100 are split among the nodes, and each node calls the run function defined above on the elements it receives. The computation yields a list of the values returned by run. To aggregate these numbers, we use do.call to pass the list as arguments to the function c, which combines them into a numeric vector. Finally, we stop the cluster and release its resources.
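To see when parallelism actually pays off, here is a rough timing sketch. The slow_run function is a made-up stand-in for an expensive task, and actual timings depend on your machine; for a function as cheap as run, the overhead of shipping data to the nodes can exceed the savings.

# Rough timing comparison: parallelism helps when each call is expensive.
library(parallel)
slow_run <- function(i) {   # hypothetical expensive task
  Sys.sleep(0.05)           # pretend to work for 50 ms
  (i + 1)/(i^2 + 1)
}
cl <- makeCluster(detectCores())
system.time(lapply(1:100, slow_run))            # sequential: about 5 s
system.time(clusterApply(cl, 1:100, slow_run))  # parallel: roughly 5 s / cores
stopCluster(cl)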

If our task returns a vector containing more than one value, we still do not have to change much of the code above. For example, suppose the run function returns a named vector of a, b, and c each time:

run <- function(i) {
  return(c(a = i, b = i + 1, c = i * 2))
}

We don't need to change anything except how we aggregate the results:

library(parallel)
cl <- makeCluster(detectCores())
result <- clusterApply(cl, 1:100, run)
values <- do.call(rbind, result)  # bind the named vectors row by row
stopCluster(cl)

We only change c to rbind in the do.call call, so that the list of returned named vectors is combined row by row, which finally produces a matrix with column names a, b, and c. If you want a data frame as the final result, there are two ways to get one.

One way is to call data.frame to convert the matrix to a data frame after we have obtained it:

values.df <- data.frame(values)

The other way is to change run function so that it directly returns a data frame with a single row.

run <- function(i) {
  return(data.frame(a = i, b = i + 1, c = i * 2))
}

Here we don't need to change anything in the rest of the code.
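To confirm, a minimal sketch assuming the data.frame-returning run above: rbind applied to one-row data frames yields a data frame directly, with no conversion step.

# With run() returning one-row data frames, do.call(rbind, ...) now
# produces a data frame rather than a matrix.
result <- lapply(1:100, run)   # or clusterApply(cl, 1:100, run)
values <- do.call(rbind, result)
class(values)                  # "data.frame"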

parallelMap package

The functionality of the parallelMap package is quite similar to that of the parallel package, except that we don't need to explicitly manage the cluster object. If you don't have this package installed, run:

install.packages("parallelMap")

To initialize a cluster, we run the following code:

library(parallelMap)
parallelStart("socket", cpus = 4)

Then the environment has an implicitly defined local cluster of 4 worker processes, whose nodes communicate with each other over sockets. You don't need to know anything about sockets here. We have just created a cluster similar to the one we made with the parallel package, but this time we don't manage the cluster object ourselves; the package manages it automatically.
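Sockets are just one back end. On Linux or macOS, a fork-based mode is also available (a sketch; check ?parallelStart for the modes supported by your version of parallelMap):

# Alternative back end (not available on Windows): forked workers
# inherit the parent process's memory, so no socket setup is needed.
# Stop any running cluster with parallelStop() before switching modes.
parallelStart("multicore", cpus = 4)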

To run the same task we did before, we run the code:

result <- parallelLapply(1:100, run)
values <- do.call(rbind, result)

Note that we use parallelLapply here and don't need to specify which cluster to use, since on a local machine we usually have only one. The code looks much simpler. In addition, the ways of producing data frames shown above apply here too.
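Putting it together, a minimal sketch of the full parallelMap workflow with the data.frame-returning run, including cleanup with parallelStop when we are done:

library(parallelMap)
parallelStart("socket", cpus = 4)     # implicit local cluster
result <- parallelLapply(1:100, run)  # evaluate run() on the nodes
values <- do.call(rbind, result)      # a 100-row data frame
parallelStop()                        # shut the cluster down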

Conclusion

A tip for writing an R loop whose iterations are independent of each other: eliminate it. I rarely use for when sapply or lapply can finish the same task. If you use these higher-order functions, they can usually be switched to a parallel version with little effort.

As a result, a better development procedure is this: first, write the code with sapply or lapply and make sure it works; then switch these functions to their parallel versions when you need higher performance.

This post only covers the situation where the function each node runs requires no non-base packages and refers to no outside resources in the environment. In later posts, I will introduce how to run functions from standalone code files on the cluster nodes, including ones that require non-base packages, and how to pass variables from the current environment to the environments of the cluster nodes.
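As a small preview of that topic, the parallel package already provides clusterExport and clusterEvalQ for pushing variables and packages to the nodes (a sketch, assuming the workers need a global variable x):

# Preview: exporting a variable and loading a package on every node.
library(parallel)
cl <- makeCluster(2)
x <- 10
clusterExport(cl, "x")            # copy x into each node's global env
clusterEvalQ(cl, library(stats))  # load a package on every node
clusterApply(cl, 1:4, function(i) i * x)
stopCluster(cl)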