TensorFlow on Caviness
TensorFlow is available on Caviness only in containerized form: each installed version is provided as a container. Use vpkg_versions to list the available versions:
$ vpkg_versions tensorflow

Available versions in package (* = default version):

[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow        an end-to-end open source machine learning platform
  1.12.0          release 1.12.0
  1.12.0:gpu      release 1.12.0-gpu (uses CUDA toolkit 9.0)
  1.12.0:gpu,py3  release 1.12.0-gpu-py3 (uses CUDA toolkit 9.0)
  1.12.0:py3      release 1.12.0-py3
* 1.13.1          release 1.13.1
  1.13.1:gpu      release 1.13.1-gpu (uses CUDA toolkit 10.0)
  1.13.1:gpu,py3  release 1.13.1-gpu-py3 (uses CUDA toolkit 10.0)
  1.13.1:py3      release 1.13.1-py3
Write your Python code either in your home directory ($HOME) or in the workgroup directory ($WORKDIR). Check with other members of your group about how the workgroup directory should be used, e.g. whether you should create a directory for yourself.
Remember that you must specify your workgroup, which defines your cluster group or investing-entity compute nodes, before submitting any job. This applies both to starting an interactive session and to submitting a batch job.
$ workgroup -g «investing-entity»
Assuming you created your personal workgroup storage area as $WORKDIR/$USER, create a directory therein for your first TensorFlow job:
$ mkdir -p ${WORKDIR}/${USER}/tf-test-001
$ cd ${WORKDIR}/${USER}/tf-test-001
For example, if your TensorFlow Python script is called tf-script.py, copy or create that file in the tf-test-001 directory, then copy the tensorflow.qs job script template:
$ cp /opt/shared/tensorflow/tensorflow.qs .
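If you do not yet have a script to test with, the following is a minimal sketch of what tf-script.py could contain. The script body is an illustrative assumption, not a file provided on Caviness. It uses the TensorFlow 1.x session API, matching the 1.12.0 and 1.13.1 versions listed above, and imports print_function so it behaves the same in the Python 2 and py3 containers.

```shell
# Create a small test script in the current directory (tf-test-001).
# The script body is a hypothetical example, not a file shipped on Caviness.
cat > tf-script.py <<'EOF'
from __future__ import print_function  # same print() behavior in Python 2 and 3

import tensorflow as tf

# Build a tiny graph: multiply a 2x2 matrix by the identity matrix.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
identity = tf.constant([[1.0, 0.0], [0.0, 1.0]])
product = tf.matmul(a, identity)

# TensorFlow 1.x evaluates graphs inside a session.
with tf.Session() as sess:
    print("TensorFlow version:", tf.__version__)
    print(sess.run(product))
EOF
```

When run inside one of the TensorFlow containers, this should print the TensorFlow version followed by the original matrix, since multiplying by the identity leaves it unchanged.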
You will need to modify your copy of tensorflow.qs to suit the job: change --cpus-per-task=2 to however many CPU cores you need (1 to 36), change --mem-per-cpu=1024M to alter the maximum memory limit, change --job-name=tensorflow to a name of your choosing, etc. The template has extensive documentation that should assist you in customizing it for the job. Also change the last line to match your Python script name; for this example, that is tf-script.py:
#
# Execute our TensorFlow Python script:
#
python tf-script.py
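As a rough illustration of the edits described above, the Slurm options in your copy of tensorflow.qs might end up looking like the fragment below. This assumes the template sets its options through standard #SBATCH directive lines; the actual layout of the template may differ, and the values shown (4 cores, 2048M per core, the job name tf-test-001) are hypothetical choices for this example.

```shell
# Hypothetical values; adjust them to what your job actually needs.
#SBATCH --job-name=tf-test-001   # name shown in the queue (template default: tensorflow)
#SBATCH --cpus-per-task=4        # CPU cores for the job, 1 to 36 (template default: 2)
#SBATCH --mem-per-cpu=2048M      # memory limit per CPU core (template default: 1024M)
```

Because --mem-per-cpu is a per-core limit, the total memory requested in this sketch would be 4 x 2048M = 8192M.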
Finally, submit the job using the sbatch command:
$ sbatch tensorflow.qs