====== TensorFlow on DARWIN ======
TensorFlow combines Python-scripted software with compiled libraries and tools. Building TensorFlow from source is extremely involved because of the number of dependencies and additional software packages required. Pre-built TensorFlow environments are available as container images on DockerHub; conda packages are also available, but they tend to lag significantly behind the current TensorFlow release.
On DARWIN, only container images are provided to users. Users are welcome to curate their own Python TensorFlow virtual environments. Use of both variants is documented here.
===== Container images =====
IT RCI maintains TensorFlow Singularity containers for all users of DARWIN:
$ vpkg_versions tensorflow
Available versions in package (* = default version):
[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow official TensorFlow containers
2.3:rocm TF 2.3 with ROCM 4.2 AMD GPU support
* 2.8:rocm TF 2.8 with ROCM 5.2.0 AMD GPU support
2.9:rocm TF 2.9 with ROCM 5.2.0 AMD GPU support
2.14.0 TF 2.14.0 official Docker runtime image
2.15:rocm TF 2.15 with ROCM 6.1 AMD GPU support
2.16.1 TF 2.16.1 official Docker runtime image
Write your Python code either somewhere in your home directory (''$HOME'') or somewhere under your workgroup directory (''$WORKDIR''). Speak to other group members to understand how the workgroup directory should be used, e.g. whether to create a directory for yourself.
Assuming you will use your personal workgroup storage directory (''$WORKDIR_USER''), create a directory therein for your first TensorFlow job:
$ mkdir -p ${WORKDIR_USER}/tf-test-001
$ cd ${WORKDIR_USER}/tf-test-001
For example, if your TensorFlow Python script is called ''tf-script.py'', copy or create that file in the ''tf-test-001'' directory, then copy the ''tensorflow.qs'' job script template:
$ cp /opt/shared/templates/slurm/applications/tensorflow.qs .
The job script template has extensive documentation that should assist you in customizing it for the job. Finally, specify the version of TensorFlow you want via VALET and change the last line to match your Python script name; for this example that is ''tf-script.py'':
:
#
# Add a TensorFlow container to the environment:
#
vpkg_require tensorflow/2.16.1
#
# Execute our TensorFlow Python script:
#
python3 tf-script.py
Finally, submit the job using the ''sbatch'' command:
$ sbatch tensorflow.qs
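For a first test job, ''tf-script.py'' can be very small. The following is a minimal sketch, not a script shipped with the template: it reports the TensorFlow version and any GPUs visible to the job. The function names ''describe_tensorflow'' and ''format_report'' are made up for this illustration; ''tf.config.list_physical_devices'' is the standard TensorFlow 2 call for listing devices.

```python
"""Minimal smoke test for a TensorFlow job (illustrative sketch)."""
import importlib.util


def describe_tensorflow():
    """Return a dict describing the TensorFlow install visible to this interpreter."""
    # Look for the package before importing so a missing install is reported
    # cleanly instead of raising ImportError.
    if importlib.util.find_spec("tensorflow") is None:
        return {"available": False, "version": None, "gpus": []}
    import tensorflow as tf

    gpus = [device.name for device in tf.config.list_physical_devices("GPU")]
    return {"available": True, "version": tf.__version__, "gpus": gpus}


def format_report(info):
    """Turn the dict from describe_tensorflow() into a one-line message."""
    if not info["available"]:
        return "TensorFlow is not importable in this environment"
    gpu_text = ", ".join(info["gpus"]) if info["gpus"] else "none"
    return f"TensorFlow {info['version']}; GPUs visible: {gpu_text}"


if __name__ == "__main__":
    print(format_report(describe_tensorflow()))
```

If the job's output file shows the version you requested with ''vpkg_require'', the container and the job script are wired together correctly.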
==== Coprocessor usage ====
The DARWIN cluster includes nodes with NVIDIA (CUDA-based) GPGPUs and AMD (ROCm-based) GPUs. TensorFlow images with support for these coprocessors are available. Check the ''vpkg_versions tensorflow'' listing for versions tagged ''rocm'' or ''gpu''.
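Picking out the tagged versions can be sketched in a few lines of Python. This is only an illustration: the version ids are copied from the listing earlier on this page, the helper names are hypothetical, and treating everything after the colon as a comma-separated list of tags is an assumption about the id format rather than documented VALET behavior.

```python
# Version ids copied from the vpkg_versions listing shown earlier on this page.
AVAILABLE = ["2.3:rocm", "2.8:rocm", "2.9:rocm", "2.14.0", "2.15:rocm", "2.16.1"]


def split_version_id(version_id):
    """Split an id such as '2.8:rocm' into (release, list of tags)."""
    release, _, tags = version_id.partition(":")
    # Assumption: multiple tags, if present, are separated by commas.
    return release, [tag for tag in tags.split(",") if tag]


def versions_with_tag(version_ids, tag):
    """Return the ids that carry the given tag, in their original order."""
    return [v for v in version_ids if tag in split_version_id(v)[1]]


if __name__ == "__main__":
    print(versions_with_tag(AVAILABLE, "rocm"))
```

With the listing above, only the ''rocm'' versions are selected; the untagged ''2.14.0'' and ''2.16.1'' entries are the official Docker runtime images.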
===== Virtual environments =====
As of 2024, Anaconda virtual environments are the suggested way to build your own TensorFlow environment. This recipe assumes the user is adding the software to shared workgroup storage, under ''${WORKDIR_SW}/tensorflow'' and ''${WORKDIR_SW}/valet''.
Start by adding the Anaconda distribution base to the environment (here ''2024.02:python3'' is used, but you should always check for newer versions with ''vpkg_versions''):
[(my_workgroup:user)@login01.darwin ~]$ vpkg_require anaconda/2024.02:python3
Adding package `anaconda/2024.02:python3` to your environment
[(my_workgroup:user)@login01.darwin ~]$
The ''conda search tensorflow'' command can be used to locate the specific version you wish to install. An abbreviated listing is shown:
[(my_workgroup:user)@login01.darwin ~]$ conda search tensorflow
Loading channels: done
# Name Version Build Channel
tensorflow 1.4.1 0 pkgs/main
tensorflow 1.5.0 0 pkgs/main
:
tensorflow 2.11.0 eigen_py310h0f08fec_0 pkgs/main
:
tensorflow 2.12.0 gpu_py38h03d86b3_0 pkgs/main
:
tensorflow 2.12.0 mkl_py39h5ea9445_0 pkgs/main
Note that the build tag provides the distinction between variants built on top of specific devices or libraries. For example, the final item above is built atop the Intel MKL infrastructure and translates to the qualified conda package name ''tensorflow[version=2.12.0,build=mkl_py39h5ea9445_0]''.
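The translation from a search row to a qualified package name is mechanical, as the following sketch shows. The helper names are hypothetical, the four-column row layout is taken from the listing above, and reading the text before the first underscore of the build string as the variant (''eigen'', ''gpu'', ''mkl'') is an assumption based on those rows only.

```python
def parse_search_row(row):
    """Split one 'conda search' result row into its four columns."""
    fields = row.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {row!r}")
    name, version, build, channel = fields
    return {"name": name, "version": version, "build": build, "channel": channel}


def build_variant(build):
    """Return the leading variant word of a build string, or None if absent."""
    # Assumption: variants appear as a prefix, e.g. 'mkl_py39h5ea9445_0'.
    if "_" not in build:
        return None
    return build.split("_", 1)[0]


def conda_spec(name, version, build):
    """Format the qualified package name accepted by 'conda create'."""
    return f"{name}[version={version},build={build}]"


if __name__ == "__main__":
    pkg = parse_search_row("tensorflow 2.12.0 mkl_py39h5ea9445_0 pkgs/main")
    print(build_variant(pkg["build"]), conda_spec(pkg["name"], pkg["version"], pkg["build"]))
```

Quote the resulting string on the command line, since the square brackets are otherwise subject to shell globbing.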
All versions of the TensorFlow virtualenv will be stored in the common base directory, ''${WORKDIR_SW}/tensorflow''; each virtualenv must have a unique name that will become the VALET version. In this tutorial, the latest version of TensorFlow with MKL support will be installed using the tag ''mkl'' on the version:
[(my_workgroup:user)@login01 ~]$ vpkg_id2path --version-id=2.12.0:mkl
2.12.0-mkl
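The mapping in this example can be mimicked in Python when scripting around the install. This sketch reproduces only the single case shown above (the colon becomes a hyphen); it is an assumption that nothing else is needed, and ''vpkg_id2path'' itself remains the authoritative tool. The function names are hypothetical.

```python
import posixpath


def version_id_to_dirname(version_id):
    """Mimic the example above: '2.12.0:mkl' becomes '2.12.0-mkl'."""
    # Assumption: replacing the colon is the whole transformation.
    return version_id.replace(":", "-")


def env_prefix(workdir_sw, version_id, package="tensorflow"):
    """Build the directory that would be passed to 'conda create --prefix'."""
    return posixpath.join(workdir_sw, package, version_id_to_dirname(version_id))


if __name__ == "__main__":
    print(env_prefix("/lustre/my_workgroup/sw", "2.12.0:mkl"))
```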
The virtualenv is created using the ''%%--%%prefix'' option to direct the installation to the desired directory:
[(my_workgroup:user)@login01 ~]$ conda create --prefix=${WORKDIR_SW}/tensorflow/2.12.0-mkl 'tensorflow[version=2.12.0,build=mkl_py39h5ea9445_0]'
:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate /lustre/my_workgroup/sw/tensorflow/2.12.0-mkl
#
# To deactivate an active environment, use
#
# $ conda deactivate
==== VALET package definition ====
Assuming the workgroup does //not// already have a TensorFlow VALET package definition, the following YAML config can be modified (e.g. alter the ''prefix'' path) and added to the file ''${WORKDIR_SW}/valet/tensorflow.vpkg_yaml'':
tensorflow:
    prefix: /lustre/my_workgroup/sw/tensorflow
    description: TensorFlow Python environments
    url: "https://www.tensorflow.org"
    flags:
        - no-standard-paths
    versions:
        "2.12.0:mkl":
            description: 2.12.0, mkl_py39h5ea9445_0 build
            dependencies:
                - anaconda/2024.02:python3
            actions:
                - action: source
                  script:
                      sh: anaconda-activate-2024.sh
                  success: 0
If the ''${WORKDIR_SW}/valet/tensorflow.vpkg_yaml'' file already exists, add the new version at the same level as others (under the ''versions'' key):
:
        "2.12.0:mkl":
            description: 2.12.0, mkl_py39h5ea9445_0 build
            dependencies:
                - anaconda/2024.02:python3
            actions:
                - action: source
                  script:
                      sh: anaconda-activate-2024.sh
                  success: 0
        "2.12.0:gpu":
            description: 2.12.0, gpu_py311h65739b5_0 build
:
With a properly-constructed package definition file, you can now check for your versions of TensorFlow:
[(it_nss:frey)@login00 ~]$ vpkg_versions tensorflow
Available versions in package (* = default version):
[/lustre/my_workgroup/sw/valet/tensorflow.vpkg_yaml]
tensorflow
* 2.12.0:mkl 2.12.0, mkl_py39h5ea9445_0 build
:
==== Job scripts ====
Any job script designed to run Python scripts using this virtualenv should include something like the following toward its end:
:
#
# Setup TensorFlow virtualenv:
#
vpkg_require tensorflow/2.12.0:mkl
#
# Run a Python script in that virtualenv:
#
python3 my_tf_work.py
rc=$?
#
# Do cleanup work, etc....
#
#
# Exit with whatever exit code our Python script handed back:
#
exit $rc
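The ''rc=$?'' capture is only useful if the Python script reports failure through its exit code. One way to make that explicit inside ''my_tf_work.py'' is sketched below; ''run_and_report'' and ''train'' are hypothetical names, and ''train'' stands in for the real TensorFlow workload.

```python
import sys
import traceback

EXIT_OK = 0
EXIT_FAILED = 1


def run_and_report(work):
    """Run work() and translate its outcome into a process exit code."""
    try:
        work()
    except Exception:
        # Send the traceback to stderr so it lands in the job's error output.
        traceback.print_exc(file=sys.stderr)
        return EXIT_FAILED
    return EXIT_OK


def train():
    """Placeholder for the real TensorFlow workload (hypothetical)."""
    print("training would run here")


# In the real script, the final line would hand the code to the shell:
#     sys.exit(run_and_report(train))
```

The job script then sees 0 on success and 1 on failure in ''$rc'', and its own ''exit $rc'' passes that value on as the job's exit code.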