====== TensorFlow using AMD GPUs ======
Complementing the NVIDIA T4 and V100 GPU nodes in the DARWIN cluster are two nodes with AMD GPU coprocessors: one with an AMD Mi50 and one with an AMD Mi100. The AMD GPU hardware and its ROCm software stack are far newer than NVIDIA's CUDA stack, and support for them is still evolving.
One popular AI/ML framework, TensorFlow, has official ROCm-enabled container releases on DockerHub, and DARWIN includes versions of the Singularity toolset that can make use of Docker container images. With only a single node of each GPU type present in the cluster, multi-node parallelism is unlikely to be necessary, so an official container should satisfy most users' needs.
Producing a working container image, however, is a challenge. Several versions (TensorFlow 2.9, 2.8, and an "ancient" 2.3) were tried, with mixed results:
* The 2.3 image worked as-is
* The 2.8 and 2.9 images were missing a number of Python modules (''numpy'', ''google'', ''protobuf'', etc.), which prevented TensorFlow from loading
* The blobs that comprise the container images are quite large: I hit my home directory quota thanks to what Singularity cached in ''~/.singularity''
Singularity by default bind-mounts the user's home directory into the container. This means that Python modules installed under ''~/.local'' are visible inside the container and may silently satisfy some of the missing module dependencies. It's possible that the maintainers of the official DockerHub containers had the same scenario masking unresolved dependencies when the images were built.
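One way to tell whether a module is really present in an image, rather than being picked up from ''~/.local'' via the home bind mount, is to run the container with the home mount disabled. A quick check along these lines (the module tested here is just an example):
$ singularity exec --no-home docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev \
      python3 -c "import numpy; print(numpy.__file__)"
If the import fails without the home directory but succeeds with it, the module is coming from ''~/.local'', not the image.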
The solution for the 2.8 and 2.9 images was to first build them as read+write //sandbox// images and then enter them as user root. Root has privileges to write to ''/usr'' inside the sandbox, so ''pip'' can be used to iteratively install the missing dependencies. Once all dependencies are satisfied, a read-only image file is generated from the sandbox image.
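In outline, the procedure detailed in the following sections looks like this (the full image paths used later are shortened here for readability):
$ singularity build --sandbox tensorflow-sb docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev
$ sudo singularity shell --writable tensorflow-sb     # pip3 install the missing modules as root
$ singularity build tensorflow.sif tensorflow-sb      # generate the read-only image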
====== Producing the 2.9 image ======
After adding Singularity to the runtime environment, a directory to hold the sandbox and read-only images is created, along with a temporary Singularity cache directory:
$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment
$ mkdir -p /opt/shared/singularity/images/tensorflow/2.9-rocm
$ export SINGULARITY_CACHEDIR="$(mktemp -d)"
For the duration of the container builds in this shell, the temporary directory will hold all cached blobs (rather than landing in my NFS home directory). Before exiting this shell, it is important to remove that cache directory: don't forget!
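When all builds in this shell are finished, the cleanup amounts to:
$ rm -rf "$SINGULARITY_CACHEDIR"
$ unset SINGULARITY_CACHEDIR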
The sandbox image is built as follows:
$ singularity build --sandbox /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif \
docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev
INFO: Starting build...
Getting image source signatures
Copying blob 8751bf8569be done
:
2022/07/07 10:12:48 info unpack layer: sha256:4da118ab357bd39e3ce7f4cf1924ae3e4c4421c1a96460bdf0ad33cea2abc496
2022/07/07 10:12:51 info unpack layer: sha256:e9d0198e6dd5d7f70a7a516c3b7fdcdd43844dbb3efec79d68a30f6aa00d3cd8
2022/07/07 10:12:54 info unpack layer: sha256:2206655d3afe836157d63f1a7dc5951244a2058a457abdf6ac0715d482f45636
INFO: Creating sandbox directory...
INFO: Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
Next, a shell is started in the sandbox container and used to iteratively add missing Python packages. Since Singularity runs processes as the initiating user inside the container, it's necessary to be root at this point (so DARWIN users are out of luck here, unless they replicate this work on a machine on which they have root).
$ sudo -s
[root]$ singularity shell --pid --ipc --writable /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
:
ModuleNotFoundError: No module named 'absl'
Singularity> pip3 install absl-py
Collecting absl-py
Downloading absl_py-1.1.0-py3-none-any.whl (123 kB)
|████████████████████████████████| 123 kB 13.2 MB/s
:
Installing collected packages: absl-py
Successfully installed absl-py-1.1.0
Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
:
ModuleNotFoundError: No module named 'numpy'
Singularity> pip3 install numpy
Collecting numpy
Using cached numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
:
Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
2022-07-07 11:33:23.743348: E tensorflow/stream_executor/rocm/rocm_driver.cc:305] failed call to hipInit: HIP_ERROR_InvalidDevice
2022-07-07 11:33:23.743406: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:112] retrieving ROCM diagnostic information for host: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743421: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:119] hostname: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743466: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:142] librocm reported version is: NOT_FOUND: was unable to find librocm.so DSO loaded into this program
2022-07-07 11:33:23.743522: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:146] kernel reported version is: UNIMPLEMENTED: kernel reported driver version not implemented
Since the node on which I was building this container does not have an AMD GPU, the now-working TensorFlow module finds no GPU device; the ''hipInit'' error above is expected, and the successful import is what matters here. At this point the read-only container image can be created:
[root]$ exit
$ singularity build /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif \
/opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
INFO: Starting build...
INFO: Creating SIF file...
INFO: Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif
====== Test the 2.9 image ======
At this point it's necessary to check that the read-only image can see the AMD GPU device, which requires a remote shell on a GPU node. Note that for testing purposes the ''idle'' partition is used here; production usage of the GPU nodes should go through the ''gpu-mi50'' and ''gpu-mi100'' partitions on DARWIN.
$ salloc --partition=idle --gpus=amd_mi100:1
salloc: Pending job allocation ######
salloc: job 918608 queued and waiting for resources
salloc: job 918608 has been allocated resources
salloc: Granted job allocation ######
salloc: Waiting for resource configuration
salloc: Nodes r0m01 are ready for job
Add Singularity to the runtime environment and reference the 2.9 container image:
$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment
$ export SINGULARITY_IMAGE=/opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif
$ Sshell
Singularity> rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 27.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
Excellent! Inside the container, the ''rocm-smi'' utility is able to see the AMD GPU and read statistics from it. Finally, will TensorFlow be able to see it?
Singularity> python3
Python 3.9.13 (main, May 23 2022, 22:01:06)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>>
====== VALET configuration ======
To make the containers as easy as possible for end users to use, a VALET package definition is provided:
$ cat /opt/shared/valet/etc/tensorflow.vpkg_yaml
tensorflow:
    prefix: /opt/shared/singularity/images/tensorflow
    description: official TensorFlow containers
    url: "https://hub.docker.com/r/rocm/tensorflow"
    actions:
        - variable: SINGULARITY_IMAGE
          action: set
          value: ${VALET_PATH_PREFIX}/tensorflow.sif
    dependencies:
        - singularity/default
    versions:
        "2.9:rocm":
            description: TF 2.9 with ROCM 5.2.0 AMD GPU support
        "2.8:rocm":
            description: TF 2.8 with ROCM 5.2.0 AMD GPU support
        "2.3:rocm":
            description: TF 2.3 with ROCM 4.2 AMD GPU support
$ vpkg_versions tensorflow
Available versions in package (* = default version):

[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow   official TensorFlow containers
  2.3:rocm   TF 2.3 with ROCM 4.2 AMD GPU support
  2.8:rocm   TF 2.8 with ROCM 5.2.0 AMD GPU support
* 2.9:rocm   TF 2.9 with ROCM 5.2.0 AMD GPU support
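Loading one of these versions should leave ''SINGULARITY_IMAGE'' pointing at the corresponding read-only image; a quick sanity check (the path shown here is inferred from where the 2.9 image was placed above):
$ vpkg_require tensorflow/2.9:rocm
$ echo "$SINGULARITY_IMAGE"
/opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif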
====== End-user usage of 2.9 image ======
To test a Python script running in the TensorFlow container, we first write the Python script:
#!/usr/bin/env python
import tensorflow as tf

print('There are {:d} GPU devices visible to this TF'.format(
    len(tf.config.list_physical_devices('GPU')))
)
The job script requests all of the node's memory (via ''--mem=0''), all 128 CPU cores, and the Mi100 GPU device:
#!/bin/bash
#
#SBATCH --partition=gpu-mi100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=0
#SBATCH --gpus=amd_mi100:1
#
vpkg_require tensorflow/2.9:rocm
Srun python3 tf-test.py
These two files were copied to a working directory named ''tf-2.9-test'' (in my home directory) and the batch job submitted:
$ cd ~/tf-2.9-test
$ ls -l
total 19
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs
$ sbatch tf-test.qs
Submitted batch job 918657
After the job has run:
$ ls -l
total 29
-rw-r--r-- 1 frey it_nss 220 Jul 7 12:55 slurm-918657.out
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs
$ cat slurm-918657.out
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding dependency `singularity/3.10.0` to your environment
Adding package `tensorflow/2.9:rocm` to your environment
There are 1 GPU devices visible to this TF
Hooray!