boot() in R in parallel mode on HPC

Dear all

I appear to have a problem using the boot() function in R in parallel mode on the HPC cluster. More specifically, despite telling boot() to run in parallel on as many cores as are available (I tried up to 36), the job takes much longer than when it is run on my laptop with 4 cores or on the GPU VM (also with 4 cores). So there seems to be a problem in my setup that escapes me.

If anyone has some experience with boot() in R on hpc, any help/guidance would be greatly appreciated.

best wishes, simon

Hi @Simon.Hug

please show us your sbatch script. For other users reading this post: sharing your sbatch script is always a good idea :wink:

Best

Dear all

I attach the script below. The issue is, however, at the level of R and the boot
function interacting with the cluster. To explain it (abstractly): I have two
functions f and g. The function g generates coefficient estimates of a
glm, while f generates coefficient estimates based on another estimator.

Bootstrapping g with

boot(demo_data, g, ...) and the appropriately configured parallel
option makes a big difference in computing time on Baobab, my Windows
laptop, and the GPU VM.

Bootstrapping f with

boot(demo_data, f, ...) and the appropriately configured parallel
option makes a big difference in computing time on my Windows laptop
and the GPU VM, but not on Baobab.

Thus, boot() in parallel mode appears to interact differently with the
cluster than with my Windows laptop and the GPU VM (as well as with a
Mac, which a colleague has checked).
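(One detail that may matter for these cross-platform comparisons: parallel = "multicore" relies on fork() and is not available on Windows, where boot() silently falls back to serial execution, so a Windows run with that option is effectively sequential. A minimal sketch of a portable setup, with a toy statistic standing in for f and g:)

```r
library(boot)
library(parallel)

# Toy statistic standing in for f/g: mean of the resampled values
stat_mean <- function(data, inds) mean(data[inds])

x <- rnorm(1000)
n_cores <- 2

if (.Platform$OS.type == "windows") {
  # fork() is unavailable on Windows: use a socket ("snow") cluster instead
  cl <- makeCluster(n_cores)
  b <- boot(x, stat_mean, R = 199, parallel = "snow", ncpus = n_cores, cl = cl)
  stopCluster(cl)
} else {
  # On Linux/macOS, fork-based "multicore" needs no cluster object at all
  b <- boot(x, stat_mean, R = 199, parallel = "multicore", ncpus = n_cores)
}
```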

best, simon

slurm-script:

#!/bin/bash

#SBATCH --partition=mono-EL7
#SBATCH --time=47:00:00
#SBATCH --cpus-per-task=8
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1000 # in MB
#SBATCH -o myjob-%A_%a.out

module load GCC/9.3.0 OpenMPI/4.0.3 R/4.0.0
module load rgdal/1.4-8-R-4.0.0

OUTFILE=srcvr_mc4s$SLURM_ARRAY_TASK_ID

Rscript --slave srcvr_mc4s.r > $OUTFILE

Dear @Simon.Hug thanks for your sbatch.

The partition mono-EL7 hasn't existed for a couple of years now. Can you please let me know how you submit your job to the cluster?

You could also update the rgdal version you are using; I've installed a new version for you if you're interested: New software installed: rgdal version 1.6-6

Best

Yep, the mono-EL7 is a remnant of old times, and I specify the public-cpu partition when submitting the job:

sbatch --ntasks=1 --partition=public-cpu --array=1-10 myslurm9.sh

Just as a reminder: this job runs perfectly fine in parallel mode for function g (see above, a simple glm model) but not for function f (which is another estimator).

best wishes, simon

Can you show a snippet of functions g and f, and some time measurements comparing your laptop, the GPU VM, and Baobab? Maybe your function f could benefit from having access to a GPU and you didn't request one on Baobab?

Dear all

Below are the functions g and f (g is basically adapted from the boot vignette), as well as the boot calls that use them. It bears noting that the boot call using f runs perfectly fine when the parallel option is set to "no", but seems to get stuck when it is set to "multicore" (again, this behavior is impossible to reproduce under either Windows or macOS). best, simon

################ function g
g <- function(dat, inds, i.pred, fit.pred, x.pred)
{
  # Refit the glm on the resampled residuals, return coefficients
  # plus the prediction error
  lm.b <- glm(fit + resid[inds] ~ date + log(cap) + ne + ct + log(cum.n) + pt,
              data = dat)
  pred.b <- predict(lm.b, x.pred)
  c(coef(lm.b), pred.b - (fit.pred + dat$resid[i.pred]))
}

nuke.boot <- boot(nuke.data, g, R = 9999, m = 1,
                  fit.pred = new.fit, x.pred = new.data,
                  ncpus = cores, parallel = "multicore", cl = cl)

################ function f
f <- function(tmp_data, f1, coms) {
  require(rollcall)
  coms <- ~ x1 + x2
  f1 <- y1 | y2 | y3 ~ 1 | 1 | 1

  res_com <- rcr(tmp_data, f1, coms, verbose = TRUE)
  return(coef(res_com))
}

nuke.boot <- boot(tmp_data, statistic = f, R = 100, m = 1,
                  ncpus = n_cores, parallel = "multicore", cl = cl)
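(A detail worth flagging in the calls above: boot() only uses the cl argument when parallel = "snow"; with parallel = "multicore" it forks and cl is ignored. A minimal sketch of the two configurations, using a toy statistic in place of f:)

```r
library(boot)
library(parallel)

stat_mean <- function(data, inds) mean(data[inds])  # stand-in for f
x <- rnorm(500)
n_cores <- 2

# Option 1: fork-based, Unix only -- cl is ignored, only ncpus matters
b_mc <- boot(x, stat_mean, R = 200, parallel = "multicore", ncpus = n_cores)

# Option 2: socket ("snow") cluster -- create cl explicitly and pass it;
# snow workers start in fresh R sessions, so packages f needs (e.g. the
# one providing rcr) must be loaded on each worker first
cl <- makeCluster(n_cores)
# clusterEvalQ(cl, library(rollcall))
b_snow <- boot(x, stat_mean, R = 200, parallel = "snow", ncpus = n_cores, cl = cl)
stopCluster(cl)
```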

What is the value of cores in ncpus=cores ?

n_cores comes from

n_cores <- parallel::detectCores()

Dear @Simon.Hug

There is an issue with detectCores: it detects all the cores of the compute node, not only the cores allocated to you. If the compute node has 128 cores and you requested 8, this is very bad.

Please check my post about it Parallelization and loop for with R - #3 by Yann.Sagon
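A minimal sketch of making the core count allocation-aware, falling back to detectCores() when the job is not running under Slurm (e.g. on your laptop):

```r
# Prefer the Slurm allocation over the physical core count of the node;
# fall back to detectCores() when SLURM_CPUS_PER_TASK is not set.
slurm_cpus <- Sys.getenv("SLURM_CPUS_PER_TASK")
n_cores <- if (nzchar(slurm_cpus)) {
  as.integer(slurm_cpus)
} else {
  parallel::detectCores()
}
```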

Taking this into account (and setting the number of cores manually to
the number requested on the debug node), I obtained the following
timings when bootstrapping my function f either sequentially or in
parallel (on Yggdrasil with a debug CPU node):

system.time(b3s<-boot(tmp_data, f, R = 2))
user system elapsed
685.105 0.739 43.189

system.time(b3p <- boot(tmp_data, f, R = 2, parallel = "snow", ncpus = n_cores))
user system elapsed
330.730 0.350 143.724

For comparison, a similar run carried out on my Linux box (Ubuntu,
with 4 cores that were also doing some other work) gave the following:

system.time(b2p<-boot(tmp_data, f, R = 2))
user system elapsed
80.936 0.015 81.088

system.time(b2p <- boot(tmp_data, f, R = 2, parallel = "snow", ncpus = n_cores))
user system elapsed
1.636 0.000 31.349
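(One pattern I notice in these numbers: in the sequential run on the cluster, user time (685 s) is far above elapsed time (43 s), which suggests f already runs multithreaded internally; stacking snow workers on top of that could oversubscribe the allocated cores. A sketch of capping the implicit threading before starting workers, under the unverified assumption that the extra threads come from an OpenMP/OpenBLAS layer:)

```r
# Cap implicit threading (OpenMP / OpenBLAS) before launching snow workers,
# so that each bootstrap worker uses a single thread. That f's extra
# threads come from one of these layers is an assumption, not verified.
Sys.setenv(OMP_NUM_THREADS = "1", OPENBLAS_NUM_THREADS = "1")
```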

If that offers some insights on where the problem might lie, that would be great.

best wishes, simon