TensorFlow on DARWIN
TensorFlow is a combination of Python-scripted software and compiled libraries and tools. Building TensorFlow from source is extremely involved because of the number of dependencies and additional software packages it requires. Pre-built TensorFlow environments are available as container images on DockerHub, and conda packages are also available (but tend to lag significantly behind the current TensorFlow release).
On DARWIN, only container images are provided to users. Users are welcome to curate their own Python TensorFlow virtual environments. Use of both variants is documented here.
Container images
IT RCI maintains TensorFlow Singularity containers for all users of DARWIN:
$ vpkg_versions tensorflow

Available versions in package (* = default version):

[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow    official TensorFlow containers
  2.3:rocm    TF 2.3 with ROCM 4.2 AMD GPU support
* 2.8:rocm    TF 2.8 with ROCM 5.2.0 AMD GPU support
  2.9:rocm    TF 2.9 with ROCM 5.2.0 AMD GPU support
  2.14.0      TF 2.14.0 official Docker runtime image
  2.15:rocm   TF 2.15 with ROCM 6.1 AMD GPU support
  2.16.1      TF 2.16.1 official Docker runtime image
Write your Python code either somewhere in your home directory ($HOME) or somewhere under your workgroup directory ($WORKDIR). Speak to other group members to understand how the workgroup directory should be used, e.g. whether to create a directory for yourself.
Assuming you will use your personal workgroup storage directory ($WORKDIR_USER), create a directory therein for your first TensorFlow job:
$ mkdir -p ${WORKDIR_USER}/tf-test-001
$ cd ${WORKDIR_USER}/tf-test-001
For example, if your TensorFlow Python script is called tf-script.py, copy or create that file in the tf-test-001 directory, then copy the tensorflow.qs job script template:
$ cp /opt/shared/templates/slurm/applications/tensorflow.qs .
The job script template has extensive documentation that should assist you in customizing it for the job. Last but not least, specify the version of TensorFlow you want via VALET, and change the last line to match your Python script name; for this example, that is tf-script.py:
...
#
# Add a TensorFlow container to the environment:
#
vpkg_require tensorflow/2.16.1
#
# Execute our TensorFlow Python script:
#
python3 tf-script.py
Finally, submit the job using the sbatch command:
$ sbatch tensorflow.qs
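If you do not yet have a script of your own, a small smoke test can stand in for tf-script.py to confirm that the container loads and TensorFlow runs. The following is only a minimal sketch and not part of the job script template: the function name run_smoke_test is hypothetical, and the guarded import is an assumption made so that the script prints a readable hint, rather than a traceback, when TensorFlow is not available in the environment.

```python
def run_smoke_test():
    """Return a one-line report, or None if TensorFlow cannot be imported.

    Hypothetical helper for illustration; not provided by DARWIN or TensorFlow.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # Most likely the TensorFlow package was not added to the environment.
        return None

    # A tiny computation to confirm that TensorFlow executes operations.
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    total = float(tf.reduce_sum(tf.matmul(a, a)).numpy())

    # List any GPU devices that TensorFlow can see on this node.
    gpus = tf.config.list_physical_devices("GPU")

    return (
        f"TensorFlow {tf.__version__}; "
        f"visible GPUs: {len(gpus)}; "
        f"sum(A @ A) = {total}"
    )


if __name__ == "__main__":
    report = run_smoke_test()
    if report is None:
        print("TensorFlow could not be imported; check the vpkg_require line in the job script.")
    else:
        print(report)
```

Once this runs cleanly in a batch job, replace the body with your real workload.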
Coprocessor usage
The DARWIN cluster includes nodes with NVIDIA (CUDA-based) GPGPUs and AMD (ROCM-based) GPUs. TensorFlow images with support for these coprocessors are available. Check the vpkg_versions tensorflow listing for versions with the tag rocm and gpu.
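Scanning the listing by eye is usually enough, but the check can also be scripted. The sketch below assumes that a tag follows a colon in the version identifier, as in the 2.8:rocm entry of the listing shown earlier. The function name versions_with_tag and the embedded SAMPLE text are illustrative only; in practice you would feed it the saved output of vpkg_versions tensorflow.

```python
def versions_with_tag(listing, tag):
    """Return version identifiers in a vpkg_versions listing that carry `tag`.

    Assumes identifiers look like "<version>:<tag>", possibly preceded by
    a "*" marking the default version. Hypothetical helper for illustration.
    """
    found = []
    for line in listing.splitlines():
        # Drop the default-version marker, then take the first token.
        tokens = line.replace("*", " ").split()
        if not tokens:
            continue
        version_id = tokens[0]
        if ":" in version_id and version_id.split(":", 1)[1] == tag:
            found.append(version_id)
    return found


# Sample text copied from the listing earlier on this page.
SAMPLE = """\
Available versions in package (* = default version):

[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow    official TensorFlow containers
  2.3:rocm    TF 2.3 with ROCM 4.2 AMD GPU support
* 2.8:rocm    TF 2.8 with ROCM 5.2.0 AMD GPU support
  2.9:rocm    TF 2.9 with ROCM 5.2.0 AMD GPU support
  2.14.0      TF 2.14.0 official Docker runtime image
  2.15:rocm   TF 2.15 with ROCM 6.1 AMD GPU support
  2.16.1      TF 2.16.1 official Docker runtime image
"""

if __name__ == "__main__":
    for tag in ("rocm", "gpu"):
        print(tag, versions_with_tag(SAMPLE, tag))
```

A matching identifier can then be used as-is in the vpkg_require line of the job script, e.g. vpkg_require tensorflow/2.15:rocm.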