Running R on a Supercomputer

Oct 4, 2012. | By: Paul

After literally months of trial and error, I finally managed to run a large analysis using our program Jaatha on my new local supercomputer, superMUC. Jaatha is written in R (with the performance critical part implemented in C and C++) and normally does not require a super computer at all. However, we wanted to conduct a huge likelihood ratio test using a computationally demanding finite sites model; so it came in handy when Europe's fastest computer opened just a few kilometers away. As there is very little written about running R on a supercomputer, I want to share my solution here so that it may hopefully be a bit easier for others to do something similar.

The Parallelization Model

SuperMUC consists of over 18.000 nodes, each of which I imagine of as a separate little computer with 16 cpu cores and it's own RAM. The nodes are connected through a fast network and share a (network) hard disk. As far a I know most supercomputers are build in a similar way.

Within a node, different processes can communicate quite fast using the memory, while the commutation of processes running on different nodes has to go over the network and therefore is much slower. Hence, it is quite nice if you can do a "two step" parallelization: First a "grand master" process distributes big tasks to a "node master" processes on each node. This tasks should be relative autonomous, so that there is only very little communication between the grand master and the node masters needed. Each node master now creates 16 Workers and distributes its big task on them. Heavy communication between the node master and the workers is not a performance problem here.

Implementation in R

Luckily, doing an LRT with Jaatha fitted quite well into this model, so that I could implement it without big changes to Jaatha's algorithm. An easy way to parallelize side effect free loops in R is the awesome foreach package. Using it, you basically only have to replace the existing loops with a foreach loop (as explained in foreach's vignette) and choose one of several parallelization backends. I use two different backends, namely doRedis for the grand master to node master connection and doMC for the node master to worker one.

  • doMC is the "simpler" of the two (simpler to use at least). It is using RAM for the communication between (node) master and workers. Hence it is amazingly fast but works only within nodes. For me, it always worked right out of the box and I can really recommend using it when ever possible.
  • doRedis seems to be more complex. It uses the redis database for the interprocess communication, which is a quite nice idea because it is a database design for (a) resting in RAM rather than an slow hard disks and (b) for being accessed over network. However, you have to set a redis server first. On superMUC, you can load redis with "module load redis". Additionally to setting the two options mentioned in its vignette, I also changed all the files mentioned in the config file to point to ~/.redis/ because /var is not user writable on superMUC.
With all that, my main R script looked like this:
library(doRedis, quietly=T)

# Get command line arguments
args <- commandArgs(TRUE)
queue <- args[1]
cat("Queue: ", queue, "\n")
threads.per.node <- as.numeric(args[2])
cat("Workers per Master: ", threads.per.node, "\n")

registerDoRedis(queue)
# [...]
output <-  foreach(i=folders, .combine=c) %dopar% {
   library(doMC, quietly=T)
   registerDoMC(threads.per.node)
   # [...]
   inner.output <- foreach(i=folders, .combine=rbind) %dopar% {
      # [...]
   }
}

Now running it on superMUC...

The really tricky part was now to write a "job command file", which are instructions for superMUC's loadleveler how the job should be run. For us, this means it should reserve a certain number of nodes, start a doRedis node master on every node and afterwards execute the script. After lots of trail-and-error, this file works quite well for me:
#!/bin/bash

# Instructions for the loadleveler
#@ wall_clock_limit = 48:00:00
#@ job_name = LRT-4par-1-100
#@ node = 200
#@ tasks_per_node = 1
#@ initialdir = $(home)/LRT-4par
#@ job_type=MPICH
#@ class = general
#@ output = logs/$(job_name)-log-$(jobid)
#@ error = logs/$(job_name)-err-$(jobid)
#@ notification=always
#@ notify_user=xxx@bio.lmu.de
#@ restart = no
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh

workers_per_task=16
rscript=/lrz/sys/applications/R/2.14.0_shared/bin/Rscript

# Load R and redis
echo "Loading Modules..."
module load R/serial/2.14
module load redis

# Set some variables
wd=$LOADL_STEP_INITDIR
runfile=$wd/runfile-$LOADL_STEP_ID
server_node=`hostname`
job_name=$LOADL_JOB_NAME

echo "Total Number of Tasks: $LOADL_TOTAL_TASKS"
echo "Tasks per Node: $tasks_per_node"

# Start the redis server
echo "Starting the redis server..."
redis-server ~/.redis/redis.conf

# Start the workers
echo "Starting workers..."
for i in `seq 1 $LOADL_TOTAL_TASKS`
do
        echo "cd $wd; $rscript doRedis_addNode.R $server_node $job_name $i" &gt;&gt; $runfile
done

ksh -c "/opt/ibmll/LoadL/resmgr/full/samples/autonomous/autonomous_master.ksh -f $runfile" &amp;

# Work!
echo "Staring to work..."
$rscript LRT_4par.R $job_name $workers_per_task

echo "Finshed"

where LRT_4par.R is the main script above and the autonomous_master.ksh is an example script from IBM that I use to execute the doRedis_addNode.R script on every node. The latter looks like this:

#/usr/bin/Rscript --vanilla
library('doRedis', quietly=T)

cat("Node: ", Sys.info()['nodename'], "\n")

args <- commandArgs(TRUE)
server <- as.character(args[1])
cat("Server:", server, "\n")
queue <- as.character(args[2])
cat("Queue:", queue, "\n")
#threads.per.node <- as.numeric(args[3])
threads.per.node <- 1
cat("Threads:", threads.per.node, "\n")

startLocalWorkers(n=threads.per.node, queue=queue, host=server)
cat("\n", threads.per.node," local thread(s) started.\n")
Sys.sleep(500)

This will run our script on 200 nodes. It is can also easily be adapted to other situations, for example if you have a one step parallelization with the grand master controlling many workers, you can set threads.per.node in the addNodes script to 16 and skip the doMC part in the main script.

I hope the rest of the script is pretty self explaining. If you have any question or comment about my approach, please post it on  Google+.

Android Bloatware compiler benchmark scrm ETL Spark