===== R on Farber =====
==== Learning R ====
=== SWIRL ===
In addition to other resources, SWIRL is installed on the Farber cluster and is available as an interactive learning guide
inside R:
$ vpkg_require r/3 r-cran
$ R -q --no-save
> library(swirl)
> swirl()
==== R libraries and extensions ====
=== Installed library bundles ===
The cluster also has the majority of [[http://cran.us.r-project.org/|CRAN]]
and [[http://www.bioconductor.org/|Bioconductor]] R libraries already
installed. These are installed as point-in-time snapshots of their
respective catalogs and are broken down into different VALET packages
based on dependencies. The current bundles are listed below; together
they provide access to over 6,600 R modules, pre-compiled and ready
for use.
^r-cran |All CRAN modules which compile and install cleanly without any additional dependencies. N.B. all of the library bundles below require this CRAN bundle as a base.|
^r-bioconductor |The full suite of [[http://www.bioconductor.org/|Bioconductor]] modules. |
^r-fftw |CRAN modules which need FFTW |
^r-gsl |CRAN modules which need GSL (GNU Scientific Library), GLPK (GNU Linear Programming Kit), or MPFR (GNU MPFR Library) |
^r-gdal |CRAN modules which need GDAL (Geospatial Data Abstraction Library) and GEOS (Geometry Engine, Open Source) |
^r-jags |CRAN modules which need JAGS (Just Another Gibbs Sampler) and the r-gsl bundle mentioned above. |
^r-mpi |CRAN modules which need the OpenMPI libraries for parallel computing. |
^r-netcdf |CRAN modules which need NetCDF, HDF4, HDF5, and UDUNITS libraries. |
^r-all |In addition to loading all the previously mentioned bundles, any CRAN module with multiple dependencies from the above list is also included. |
^r-cuda |Not currently available. |
=== Searching for modules ===
The HPC team provides a tool, loaded along with R, to help cluster
users locate modules in these various bundles. It is called
''r-search'' and takes arguments which are interpreted as extended
regular expressions for the UNIX
[[http://linux.die.net/man/1/egrep|egrep]] command (with case-insensitivity
enabled by default).
If, for example, you are looking for packages to help with a copula regression,
you may search for them as follows:
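Because the arguments are ordinary egrep patterns, you can prototype a pattern in the shell before handing it to ''r-search''. A minimal sketch (the module names here are made up for illustration):

```shell
# Hypothetical module names piped through a case-insensitive
# extended-regex filter, similar to how r-search matches names
printf '%s\n' acopula pencopula CopulaRegression VineCopula ggplot2 \
  | grep -Ei 'copula'
```

This prints the four names containing "copula" (in any case) and filters out ''ggplot2''.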
$ r-search copula
R Location ValetPackage Library
------- -------- --------------- -----------
R/3.1.1 add-ons r-cran/20140905 : acopula
R/3.1.1 add-ons r-cran/20140905 : pencopula
R/3.1.1 add-ons r-gsl/20140912 : CopulaRegression
R/3.1.1 add-ons r-gsl/20140912 : VineCopula
R/3.1.1 add-ons r-gsl/20140912 : copula
R/3.1.1 add-ons r-gsl/20140912 : copulaedas
R/3.1.1 add-ons r-gsl/20140912 : nacopula
$
Now it is clear that two bundles for version 3.1.1 of R contain modules
which may be of help. If you require the "CopulaRegression" module, you
may use VALET to load it into your environment via the "r-gsl/20140912"
bundle.
=== Loading library bundles for use ===
$ vpkg_require r-gsl/20140912
Adding dependency `r-cran/20140905` to your environment
Adding dependency `gsl/1.16` to your environment
Adding dependency `glpk/4.55` to your environment
Adding dependency `mpfr/3.1.2` to your environment
Adding package `r-gsl/20140912` to your environment
$
The library can then be used in R as normal.
$ R --no-save -q
> library(CopulaRegression)
Loading required package: MASS
Loading required package: VineCopula
>
=== Learning about modules ===
IT provides a small script called ''r-info'' which will display the internal
documentation of R modules. This is helpful to get basic information on
a module to decide if it requires more research. To use this tool, the library
must be installed, and the module bundle must be loaded with ''vpkg_require''.
For example:
$ vpkg_require r/3.1.1 r-cran/20140905
$ r-info car
car-package package:car R Documentation
Companion to Applied Regression
Description:
This package accompanies Fox, J. and Weisberg, S., _An R Companion
to Applied Regression_, Second Edition, Sage, 2011.
Details:
...
Maintainer: John Fox
$
==== Personal/program-specific R libraries and extensions ====
You can create your own library of R modules which contains different
versions than those provided through VALET, or modules not available via VALET.
R reads an environment variable called ''R_LIBS'' to obtain a
colon-separated list of locations to search for modules. You should ensure
your entry is first in the list; this allows your library to override any
conflicting modules installed on the system. It is also important because
R installs modules into the first entry in this list by default.
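The prepend step can be sketched in the shell as follows; both paths here are illustrative (the system path mimics the VALET-provided value shown later, and the personal path is a hypothetical choice):

```shell
# Example starting value of R_LIBS, as a VALET bundle might set it
R_LIBS="/opt/shared/r/add-ons/r3.1.1/cran/20140905"

# Prepend a personal library directory; entries are colon-separated,
# and R both searches and installs into the first entry by default
MY_RLIB="$HOME/sw/r/add-ons/r3.1.1/testing/default"
export R_LIBS="$MY_RLIB:$R_LIBS"

echo "$R_LIBS"
```

Inside R, ''.libPaths()'' will then list the personal directory first.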
=== Simple example ===
Once ''R_LIBS'' is set up, you can install modules using ''install.packages''. Here
is an example:
$ vpkg_require r r-cran
Adding package `r/3.1.1` to your environment
Adding package `r-cran/20140905` to your environment
$ mkdir -p $WORKDIR/sw/r/add-ons/r3.1.1/testing/default
$ echo $R_LIBS
/opt/shared/r/add-ons/r3.1.1/cran/20140905
$ R_LIBS="$WORKDIR/sw/r/add-ons/r3.1.1/testing/default:$R_LIBS"
$ R -q --no-save
> .libPaths()
[1] "/home/work/it_nss/sw/r/add-ons/r3.1.1/testing/default"
[2] "/home/software/r/add-ons/r3.1.1/cran/20140905"
[3] "/home/software/r/3.1.1/lib64/R/library"
> chooseCRANmirror(graphics=FALSE)
CRAN mirror
1: 0-Cloud 2: Argentina (La Plata)
3: Argentina (Mendoza) 4: Australia (Canberra)
5: Australia (Melbourne) 6: Austria
7: Belgium 8: Brazil (BA)
9: Brazil (PR) 10: Brazil (RJ)
11: Brazil (SP 1) 12: Brazil (SP 2)
13: Canada (BC) 14: Canada (NS)
15: Canada (ON) 16: Canada (QC 1)
17: Canada (QC 2) 18: Chile
19: China (Beijing 1) 20: China (Beijing 2)
21: China (Hefei) 22: China (Xiamen)
23: Colombia (Bogota) 24: Colombia (Cali)
25: Czech Republic 26: Denmark
27: Ecuador 28: Estonia
29: France (Lyon 1) 30: France (Lyon 2)
31: France (Montpellier) 32: France (Paris 1)
33: France (Paris 2) 34: France (Strasbourg)
35: Germany (Berlin) 36: Germany (Bonn)
37: Germany (Goettingen) 38: Greece
39: Hungary 40: Iceland
41: India 42: Indonesia (Jakarta)
43: Indonesia (Jember) 44: Iran
45: Ireland 46: Italy (Milano)
47: Italy (Padua) 48: Italy (Palermo)
49: Japan (Hyogo) 50: Japan (Tokyo)
51: Japan (Tsukuba) 52: Korea (Seoul 1)
53: Korea (Seoul 2) 54: Lebanon
55: Mexico (Mexico City) 56: Mexico (Texcoco)
57: Netherlands (Amsterdam) 58: Netherlands (Utrecht)
59: New Zealand 60: Norway
61: Philippines 62: Poland
63: Portugal 64: Russia
65: Singapore 66: Slovakia
67: South Africa (Cape Town) 68: South Africa (Johannesburg)
69: Spain (A Coruña) 70: Spain (Madrid)
71: Sweden 72: Switzerland
73: Taiwan (Chungli) 74: Taiwan (Taichung)
75: Taiwan (Taipei) 76: Thailand
77: Turkey 78: UK (Bristol)
79: UK (Cambridge) 80: UK (London)
81: UK (London) 82: UK (St Andrews)
83: USA (CA 1) 84: USA (CA 2)
85: USA (IA) 86: USA (IN)
87: USA (KS) 88: USA (MD)
89: USA (MI) 90: USA (MO)
91: USA (OH) 92: USA (OR)
93: USA (PA 1) 94: USA (PA 2)
95: USA (TN) 96: USA (TX 1)
97: USA (WA 1) 98: USA (WA 2)
99: Venezuela 100: Vietnam
Selection: 88
> install.packages("KernSmooth", dependencies=TRUE)
Installing package into '/home/work/it_nss/sw/r/add-ons/r3.1.1/testing/default'
(as 'lib' is unspecified)
trying URL 'http://watson.nci.nih.gov/cran_mirror/src/contrib/KernSmooth_2.23-13.tar.gz'
Content type 'application/octet-stream' length 24471 bytes (23 Kb)
opened URL
==================================================
downloaded 23 Kb
* installing *source* package 'KernSmooth' ...
** package 'KernSmooth' successfully unpacked and MD5 sums checked
** libs
gfortran -fpic -g -O2 -c blkest.f -o blkest.o
gfortran -fpic -g -O2 -c cp.f -o cp.o
gfortran -fpic -g -O2 -c dgedi.f -o dgedi.o
gfortran -fpic -g -O2 -c dgefa.f -o dgefa.o
gfortran -fpic -g -O2 -c dgesl.f -o dgesl.o
gcc -std=gnu99 -I/opt/shared/r/3.1.1/lib64/R/include -DNDEBUG -I/usr/local/include -fpic -g -O2 -c init.c -o init.o
gfortran -fpic -g -O2 -c linbin.f -o linbin.o
gfortran -fpic -g -O2 -c linbin2D.f -o linbin2D.o
gfortran -fpic -g -O2 -c locpoly.f -o locpoly.o
gfortran -fpic -g -O2 -c rlbin.f -o rlbin.o
gfortran -fpic -g -O2 -c sdiag.f -o sdiag.o
gfortran -fpic -g -O2 -c sstdiag.f -o sstdiag.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o KernSmooth.so blkest.o cp.o dgedi.o dgefa.o dgesl.o init.o linbin.o linbin2D.o locpoly.o rlbin.o sdiag.o sstdiag.o -L/opt/shared/r/3.1.1/lib64/R/lib -lRblas -lgfortran -lm -lgfortran -lm -L/opt/shared/r/3.1.1/lib64/R/lib -lR
installing to /home/work/it_nss/sw/r/add-ons/r3.1.1/testing/default/KernSmooth/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (KernSmooth)
The downloaded source packages are in
'/tmp/RtmpylqWXj/downloaded_packages'
> library(KernSmooth)
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
>
Notice that the output of ''.libPaths()'' above lists the personal library
directory first.
=== Using IT's udbuild environment ===
IT has developed a formalized environment for installing modules called [[/abstract/farber/install_software|udbuild]]
which can simplify the installation of modules. Here is an example ''udbuild''
script which can be used to install a personal R library.
#!/bin/bash -l
PKGNAME=testing
VERSION=default
UDBUILD_HOME=$WORKDIR/sw
PKG_LIST='
WideLM rpud permGPU magma gputools cudaBayesregData cudaBayesreg
CARramps
'
vpkg_devrequire udbuild r/3.1.1 r-cran/20140905
init_udbuildenv r-addon cuda/6.5
#Sometimes R doesn't properly use CPPFLAGS which is set by VALET, fix that here:
CPATH=$CUDA_PREFIX/include:$CPATH
LIBRARY_PATH=$CUDA_PREFIX/lib64:$CUDA_PREFIX/lib64/stubs:$LIBRARY_PATH
#CRAN_MIRROR='http://cran.cs.wwu.edu/'
CRAN_MIRROR='http://lib.stat.cmu.edu/R/CRAN/'
quote() { printf '"%s", ' "$@" | sed 's/, $/\n/'; }
R -q --no-save <<EOT
install.packages(c($(quote $PKG_LIST)), repos="$CRAN_MIRROR")
EOT
This script will attempt to build the CUDA-capable R modules against
CUDA 6.5, installing them into ''$WORKDIR/sw/r/add-ons/r3.1.1/testing/default-cuda-6.5''.
====== R script in batch ======
==== matmul.R script ====
Consider this simple R script, which multiplies a 3x4 matrix by its transpose to produce a 3x3 result:
# Calculate and print small matrix AA'
a <- matrix(1:12,3,4);
a%*%t(a)
Let's test this R script using ''Rscript'' from the command line on a compute node. Don't forget to set your [[general/userguide/04_compute_environ?using-workgroup-and-directories|workgroup]] to define your cluster group or //investing-entity// compute nodes before you use ''qlogin'' to get on a compute node. For example,
workgroup -g it_css
qlogin
vpkg_require r/3
Rscript matmul.R
The output to the screen:
[,1] [,2] [,3]
[1,] 166 188 210
[2,] 188 214 240
[3,] 210 240 270
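As a quick sanity check on one entry of the result: ''matrix(1:12,3,4)'' fills by column, so row 1 of ''a'' is (1, 4, 7, 10), and the (1,1) entry of AA' is the sum of its squares. The same arithmetic in the shell:

```shell
# (AA')[1,1] = 1^2 + 4^2 + 7^2 + 10^2 = 1 + 16 + 49 + 100
echo $(( 1*1 + 4*4 + 7*7 + 10*10 ))   # prints 166
```

This matches the [1,1] entry in the output above.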
To return to the head node, type
exit
==== matmul.qs file ====
Running an R script in batch instead of on the command line involves nearly the same steps.
Consider the queue submission script file:
#$ -N matmultiply
# Add vpkg_require commands after this line:
vpkg_require r/3
# Syntax: Rscript [options] filename.R [arguments]
Rscript matmul.R
Now to run the R script simply submit the job from the head node with the
''qsub'' command.
qsub matmul.qs
You should see a notification that your job was submitted, something like this:
Your job 2283886 ("matmultiply") has been submitted
After the job completes, the output of the script will appear in the file
''matmultiply.o2283886'': ''-N matmultiply'' defines the job name in ''matmul.qs'' (shown as ''("matmultiply")'' in the notification above), and ''2283886'' is the job ID assigned at submission. Type
more matmultiply.o2283886
to display the contents of the output file on the screen. For example,
Adding dependency `x11/RHEL6.1` to your environment
Adding package `r/3.0.2` to your environment
[,1] [,2] [,3]
[1,] 166 188 210
[2,] 188 214 240
[3,] 210 240 270
====== Using R script in batch array job ======
===== sweep.R file =====
Consider this simple script, which prints a fraction computed from its argument list:
args <- commandArgs(trailingOnly = TRUE)
# print fraction from argument list
as.numeric(args[1])/as.numeric(args[2])
This is an R script which can be run from the command line on a compute node with the commands
qlogin
vpkg_require r/3
Rscript sweep.R 5 200
The output to the screen:
[1] 0.025
===== sweep.qs file =====
Consider the queue script file
#$ -N sweep
#$ -t 1-200
##
## Parameter sweep array job to run the sweep.R with
## lambda = 0, 1, 2, ..., 199
##
# Add vpkg_require commands after this line:
vpkg_require r/3
date "+Start %s"
echo "Host $HOSTNAME"
let lambda="$SGE_TASK_ID-1"
let taskCount=200
# Syntax: Rscript [options] filename.R [arguments]
Rscript --vanilla sweep.R $lambda $taskCount
date "+Finish %s"
The ''date'' and ''echo Host'' lines are just a way of keeping track of when and where the jobs are run.
There will be 200 array tasks, all running the same script with different parameters (arguments). The ''--vanilla'' option
is used to prevent the multiple tasks from reading or writing the same startup and workspace files.
To run this in batch you must submit the job from the head node with the
''qsub'' command.
qsub sweep.qs
After the code completes, the output of the script will appear in the files
''sweep.o535064.1'' through ''sweep.o535064.200''. The number 535064 is the job ID assigned when the job was submitted, and 1 to 200 is the task ID (corresponding to ''-t 1-200''). For example, one task's output file might contain:
Adding dependency `x11/RHEL6.1` to your environment
Adding package `r/3.0.2` to your environment
[1] 0.025
You will want to do more than just print one fraction in your script. The integer parameter can be used for
a one-dimensional parameter sweep, to construct unique input and output file names for each task,
or as a seed for the R random number generator (RNG).
==== Writing files from an array job ====
You are running many tasks in the same directory. Grid Engine handles the standard output by writing to
separate files with "dot task ID" appended to the job ID, but you must take care of any other file output in your R script yourself.
Make sure that no two of your tasks will write to the same file. Look through your R script to see if you
are writing files: look for the ''**sink**'' command or any graphics writing commands such as ''**pdf**'' or ''**png**''.
If you are using these R functions, use a unique file name constructed from the task ID.
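For example, the queue script can derive a per-task output name from ''$SGE_TASK_ID'' and hand it to the R script as an argument. A minimal sketch (the file name pattern is just an example):

```shell
# Grid Engine sets SGE_TASK_ID for each task of an array job;
# a default is supplied here only so the sketch runs outside the queue
SGE_TASK_ID=${SGE_TASK_ID:-7}

# Build a unique per-task output file name
OUTFILE="sweep_task_${SGE_TASK_ID}.csv"
echo "$OUTFILE"

# The R script could then open it by name, e.g.:
#   Rscript --vanilla sweep.R $SGE_TASK_ID "$OUTFILE"
```

Inside R, the name arrives via ''commandArgs(trailingOnly = TRUE)'' just like the numeric arguments in ''sweep.R''.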
==== vanilla option ====
The command-line option ''--vanilla'' implies ''--no-site-file'', ''--no-init-file'' and ''--no-environ''. This way the tasks will not
be reading or writing the same startup files. If you need initialization commands, put them in your R script instead of
in the init file ''.Rprofile''. If you need environment variables, export them in your bash script instead of assigning
them in your environ file ''.Renviron''.
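For instance, instead of setting a variable in ''.Renviron'', export it in the queue script so every child process, including ''Rscript'', inherits it. The variable name here is hypothetical:

```shell
# Exported variables are inherited by child processes; in R this
# value would be readable with Sys.getenv("SWEEP_OUTPUT_DIR")
export SWEEP_OUTPUT_DIR="$WORKDIR/sweep_results"

# Demonstrate that a child shell (standing in for Rscript) sees it
sh -c 'echo "$SWEEP_OUTPUT_DIR"'
```

Because the value lives in the job script rather than ''.Renviron'', each array task gets exactly the environment its script sets up.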