Introduction to parallel computing in R
Original post: https://renkun-ken.github.io/blog/2014/02/01/introduction-to-parallel-computing-in-r.html
For R beginners, the `for` loop is an elementary flow-control device that simplifies repeatedly calling a function with different parameters. A typical block of code looks like this:

```r
run <- function(i) {
  return((i + 1) / (i^2 + 1))
}
for (i in 1:100) {
  run(i)
}
```
In this code, we first define a function that calculates something, and then run the function from `i = 1` to `i = 100`. The same pattern extends to a Monte Carlo simulation in which we estimate the distribution of a statistic, or to calculating the theoretical price of a European call option in a binomial tree.
The code above can be reduced with the higher-order functions `lapply` or `sapply`. For example, we can eliminate the `for` loop with `lapply`:

```r
lapply(1:100, run)
```
The call returns a list of values, each of which equals `run(i)`, where `i` is taken in turn from the numeric vector `1:100`. The code is simpler, but its internal mechanism does not change at all: the iterations still run one after another.
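If a plain numeric vector is more convenient than a list, `sapply` simplifies the result automatically, and `unlist` flattens a list after the fact:

```r
values <- sapply(1:100, run)   # simplifies the result to a numeric vector
# equivalently:
values <- unlist(lapply(1:100, run))
```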
However, in many Monte Carlo simulations the task is divisible into sub-tasks that are independent of each other. This allows us to take advantage of parallel computing to speed up the calculation. In other words, if a task can be divided into, say, three sub-tasks that can be solved independently, we could use three computers (or three cores) to work on them simultaneously and then aggregate the results.
In the example above, if the `run` function depends only on the parameter `i`, we know that we can run this `for` loop in parallel, because it does not matter whether we compute `i = 10` before `i = 20` or the other way around.
This idea leads to the use of parallel computing. Parallelism on a local machine employs multiple cores to perform computing tasks at the same time. It is especially useful for statisticians and econometricians when they need to figure out the distribution of a statistic produced by a particular data generating process. The distribution is estimated from a number of realizations of the statistic. If the data generating process for each realization is independent of the others, parallel computing can largely reduce the time needed to obtain the result.
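As a concrete (and entirely made-up) example of such a task, each replication below draws one sample from a data generating process and returns one realization of a statistic; the function name and sample sizes are only for illustration:

```r
# Hypothetical Monte Carlo task: the sampling distribution of the
# t-statistic for samples of size 30 under the null hypothesis.
one_replication <- function(i) {
  x <- rnorm(30)                # one draw from the data generating process
  unname(t.test(x)$statistic)   # one realization of the statistic
}
stats <- sapply(1:10000, one_replication)
hist(stats)                     # empirical distribution of the statistic
```

Since each call of `one_replication` is independent of the others, the `sapply` here is exactly the kind of loop that can be parallelized.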
Here, I introduce two ways to perform parallel computing in R using different packages.
Packages for parallel computing
A considerable number of packages have been developed to support various paradigms of parallel computing. The official list CRAN Task View: High-Performance and Parallel Computing with R offers a brief introduction to the different paradigms and the packages available.
In this article, I only introduce the `parallel` package and the `parallelMap` package.
The `parallel` package
The `parallel` package supports local multi-core parallelism. It ships with the base R distribution (since R 2.14.0), so there is nothing extra to install: just load it with `library(parallel)`.
The back-end mechanism is quite straightforward: first, we set up a local cluster over multiple CPU cores, which run in parallel and can process data simultaneously. Then we send commands to all cluster nodes (cores) to run a task specified by a function. Below is a minimal example:
```r
library(parallel)
cl <- makeCluster(detectCores())        # one worker per logical processor
result <- clusterApply(cl, 1:100, run)  # evaluate run() on the workers
values <- do.call(c, result)            # combine the list into a vector
stopCluster(cl)                         # release the workers
```
First, we load the `parallel` package. Then we create a cluster of several nodes; `detectCores()` returns the number of logical processors on your machine. Next we call `clusterApply` to run the computation over the cluster `cl` we just created: the elements of the vector `1:100` are distributed over the nodes, and each node calls the `run` function defined above. The computation yields a list of the values returned by `run`. To aggregate the numbers in this list, we use `do.call` to pass the list as arguments to the function `c`, which combines all the values into a numeric vector. Finally, we stop the cluster and release its resources.
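To check that the parallel version actually pays off, we can time both approaches. The sketch below uses a made-up `slow_run` that simulates a costly task; the exact timings will vary by machine:

```r
library(parallel)

# A stand-in for an expensive computation: each call takes about 10 ms.
slow_run <- function(i) {
  Sys.sleep(0.01)
  (i + 1) / (i^2 + 1)
}

system.time(lapply(1:100, slow_run))           # sequential: about 1 second

cl <- makeCluster(detectCores())
system.time(clusterApply(cl, 1:100, slow_run)) # roughly 1 second / #cores
stopCluster(cl)
```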
If our task returns a vector containing more than one value, we still do not have to change much of the code above. For example, suppose the `run` function returns a named vector of `a`, `b`, and `c` each time:
```r
run <- function(i) {
  return(c(a = i, b = i + 1, c = i * 2))
}
```
We don't need to change anything but how we aggregate the results. Here it is:
```r
library(parallel)
cl <- makeCluster(detectCores())
result <- clusterApply(cl, 1:100, run)
values <- do.call(rbind, result)  # row-bind the named vectors into a matrix
stopCluster(cl)
```
We only change `c` to `rbind` in the `do.call` call, so that the list of returned named vectors is combined row by row, which yields a matrix with column names `a`, `b`, and `c`. If you want a data frame as the final result, there are two ways to get one.
One way is to call `data.frame` to convert the matrix to a data frame after we have obtained it:

```r
values.df <- data.frame(values)
```
The other way is to change the `run` function so that it directly returns a data frame with a single row:

```r
run <- function(i) {
  return(data.frame(a = i, b = i + 1, c = i * 2))
}
```
Here we don't need to change anything in the rest of the code: `do.call(rbind, result)` now row-binds one-row data frames instead of vectors, so `values` is already a data frame.
The `parallelMap` package
The functionality of the `parallelMap` package is quite similar to that of the `parallel` package, except that we don't need to operate on the cluster object explicitly. If you don't have this package installed, run:

```r
install.packages("parallelMap")
```
To initialize a cluster, we run the following code:

```r
library(parallelMap)
parallelStart("socket", cpus = 4)
```
Now the environment has an implicitly defined local cluster of 4 workers whose nodes communicate with each other over sockets; you don't need to know anything about sockets here. We have created a cluster much like the one we made with the `parallel` package, but this time we don't manage the cluster object ourselves: the package manages it automatically.
To run the same task as before, we run:

```r
result <- parallelLapply(1:100, run)
values <- do.call(rbind, result)
```
Note that here we use `parallelLapply` and don't need to specify which cluster to use, since on a local machine we usually have only one. The code looks much simpler, and the data frame techniques above apply here just as well.
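One detail the example above leaves implicit: when the computation is finished, the implicit cluster should be shut down as well, which `parallelMap` does with a single call:

```r
parallelStop()  # shut down the implicitly managed cluster
```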
Conclusion
A tip for writing an R loop whose iterations are independent of each other: eliminate it. I rarely use `for` when `sapply` or `lapply` can finish the same task. If you use these higher-order functions, they can usually be switched to a parallel version with little effort.
As a result, a better development procedure looks like this: first, write the code with `sapply` or `lapply` to ensure it works; then switch these functions to their parallel versions if you need higher performance, as the sketch below shows.
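In code, the whole change is a one-line swap (using the `run` function from the examples above):

```r
# Step 1: get it right sequentially.
result <- lapply(1:100, run)

# Step 2: swap in the parallel version for performance.
library(parallelMap)
parallelStart("socket", cpus = 4)
result <- parallelLapply(1:100, run)
parallelStop()
```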
This post only covers the situation where the function each node runs requires no additional packages and refers to no outside resources in the environment. In later posts, I will introduce how to run functions from standalone code files over cluster nodes, which may require additional packages, and how to pass variables from the current environment to the environments of the cluster nodes.