====== TensorFlow using AMD GPUs ======

Complementing the NVIDIA T4 and V100 GPU nodes in the DARWIN cluster are nodes with an AMD Mi50 and an AMD Mi100 GPU coprocessor, respectively. The AMD GPU hardware and software stack are far newer than NVIDIA's CUDA stack, and support for them is still evolving. One popular AI/ML framework, TensorFlow, has an official DockerHub container release, and DARWIN includes versions of the Singularity toolset which can make use of Docker container images. With only a single node of each GPU type present in the cluster, multi-node parallelism is not likely to be necessary: an official container should satisfy most users' needs.

Building the container, however, is a challenge. Several versions were tried (2.9, 2.8, and an "ancient" 2.3 TensorFlow) with mixed results:

  * The 2.3 image worked as-is
  * The 2.8 and 2.9 images lacked a number of Python modules, which prevented tensorflow from loading (numpy, google, protobuf, etc.)
  * The blobs that comprise the container images are quite large: the blobs Singularity cached in ''~/.singularity'' pushed my home directory over quota

Singularity by default maps the user's home directory into the container. This means that Python modules present under ''~/.local'' will be visible inside the container and may satisfy some of the missing module dependencies. It's possible those maintaining the official containers on DockerHub had the same scenario interfering with dependency resolution while building the container images.

The solution with the 2.8 and 2.9 images was to initially build them as read+write //sandbox// images, then run them as user root. Root has privileges to write to ''/usr'' therein, so ''pip'' can be used to iteratively install missing dependencies. When all dependencies are satisfied, a read-only image file is generated from the sandbox image.

====== Producing the 2.9 image ======

After adding Singularity to the runtime environment, a directory is created for the sandbox and read-only images, along with a temporary Singularity cache directory:

<code bash>
$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment

$ mkdir -p /opt/shared/singularity/images/tensorflow/2.9-rocm

$ export SINGULARITY_CACHEDIR="$(mktemp -d)"
</code>

For the duration of container builds in this shell, the temp directory will hold all cached blobs (rather than putting them in my NFS home directory). Before exiting this shell it is important that the cache directory be removed: don't forget!
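Since that cleanup is easy to forget, one option (not used in the session documented here) is to register an EXIT trap in the build shell so the cache directory is removed automatically; a minimal sketch:

<code bash>
# Create a throwaway Singularity cache directory and remove it automatically
# when this shell exits (this would replace the plain export shown above):
export SINGULARITY_CACHEDIR="$(mktemp -d)"
trap 'rm -rf "$SINGULARITY_CACHEDIR"' EXIT
</code>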
The sandbox image is built thusly:

<code bash>
$ singularity build --sandbox /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif \
    docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev
INFO:    Starting build...
Getting image source signatures
Copying blob 8751bf8569be done
Copying blob 8751bf8569be done
   :
2022/07/07 10:12:48  info unpack layer: sha256:4da118ab357bd39e3ce7f4cf1924ae3e4c4421c1a96460bdf0ad33cea2abc496
2022/07/07 10:12:51  info unpack layer: sha256:e9d0198e6dd5d7f70a7a516c3b7fdcdd43844dbb3efec79d68a30f6aa00d3cd8
2022/07/07 10:12:54  info unpack layer: sha256:2206655d3afe836157d63f1a7dc5951244a2058a457abdf6ac0715d482f45636
INFO:    Creating sandbox directory...
INFO:    Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
</code>

Next, a shell is started in the sandbox container and used to iteratively add the missing Python packages. Since Singularity runs processes as the initiating user inside the container, it's necessary to be root at this point (so DARWIN users are out of luck here, unless they replicate this work on a machine on which they have root).

<code bash>
$ sudo -s

[root]$ singularity shell --pid --ipc --writable /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif

Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
   :
ModuleNotFoundError: No module named 'absl'

Singularity> pip3 install absl-py
Collecting absl-py
  Downloading absl_py-1.1.0-py3-none-any.whl (123 kB)
     |████████████████████████████████| 123 kB 13.2 MB/s
   :
Installing collected packages: absl-py
Successfully installed absl-py-1.1.0

Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
   :
ModuleNotFoundError: No module named 'numpy'

Singularity> pip3 install numpy
Collecting numpy
  Using cached numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
   :

Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
2022-07-07 11:33:23.743348: E tensorflow/stream_executor/rocm/rocm_driver.cc:305] failed call to hipInit: HIP_ERROR_InvalidDevice
2022-07-07 11:33:23.743406: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:112] retrieving ROCM diagnostic information for host: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743421: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:119] hostname: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743466: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:142] librocm reported version is: NOT_FOUND: was unable to find librocm.so DSO loaded into this program
2022-07-07 11:33:23.743522: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:146] kernel reported version is: UNIMPLEMENTED: kernel reported driver version not implemented
</code>

Since the node on which this container was built does not have an AMD GPU, no GPU is found by the now-working tensorflow module. At this point the read-only container image can be created:

<code bash>
[root]$ exit

$ singularity build /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif \
    /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
INFO:    Starting build...
INFO:    Creating SIF file...
INFO:    Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif
</code>
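For future rebuilds (e.g. when a new TensorFlow release appears), the manual import-and-install iteration above could be scripted. The loop below is only a sketch, not something used for this build: it assumes the missing module's name matches its PyPI distribution name, which is not always true (e.g. the ''absl'' module is provided by ''absl-py''), so some manual fix-ups may still be needed.

<code bash>
# Run as root inside the writable sandbox.  Keep trying to import tensorflow;
# each time the import fails with ModuleNotFoundError, install the named
# module with pip and try again.  Stop when the import succeeds or pip
# cannot resolve the name (in which case finish the installation by hand).
while true; do
    missing="$(python3 -c 'import tensorflow' 2>&1 | \
        sed -n "s/.*ModuleNotFoundError: No module named '\([^']*\)'.*/\1/p")"
    [ -z "$missing" ] && break
    echo "Installing missing module: $missing"
    pip3 install "$missing" || break
done
</code>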
====== Test the 2.9 image ======

At this point it's necessary to check that the read-only image sees the AMD GPU device, which requires a remote shell on a GPU node. Note that for testing purposes the ''idle'' partition will be used; production usage of the GPU nodes should make use of the ''gpu-mi50'' and ''gpu-mi100'' partitions on DARWIN.

<code bash>
$ salloc --partition=idle --gpus=amd_mi100:1
salloc: Pending job allocation ######
salloc: job 918608 queued and waiting for resources
salloc: job 918608 has been allocated resources
salloc: Granted job allocation ######
salloc: Waiting for resource configuration
salloc: Nodes r0m01 are ready for job
</code>

Add Singularity to the runtime environment and reference the 2.9 container image:

<code bash>
$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment

$ export SINGULARITY_IMAGE=/opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif

$ Sshell
Singularity> rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    27.0c  34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%     0%
================================================================================
============================= End of ROCm SMI Log ==============================
</code>

Excellent! Inside the container, the ''rocm-smi'' utility is able to see the AMD GPU and read stats from it. Finally, will tensorflow be able to see it?

<code bash>
Singularity> python3
Python 3.9.13 (main, May 23 2022, 22:01:06)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>>
</code>

====== VALET configuration ======

To make the containers as easy as possible for end users to use, VALET is leveraged:

<code bash>
$ cat /opt/shared/valet/etc/tensorflow.vpkg_yaml
tensorflow:
    prefix: /opt/shared/singularity/images/tensorflow
    description: official TensorFlow containers
    url: "https://hub.docker.com/r/rocm/tensorflow"
    actions:
        - variable: SINGULARITY_IMAGE
          action: set
          value: ${VALET_PATH_PREFIX}/tensorflow.sif
    dependencies:
        - singularity/default
    versions:
        "2.9:rocm":
            description: TF 2.9 with ROCM 5.2.0 AMD GPU support
        "2.8:rocm":
            description: TF 2.8 with ROCM 5.2.0 AMD GPU support
        "2.3:rocm":
            description: TF 2.3 with ROCM 4.2 AMD GPU support

$ vpkg_versions tensorflow

Available versions in package (* = default version):

[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow  official TensorFlow containers
  2.3:rocm  TF 2.3 with ROCM 4.2 AMD GPU support
  2.8:rocm  TF 2.8 with ROCM 5.2.0 AMD GPU support
* 2.9:rocm  TF 2.9 with ROCM 5.2.0 AMD GPU support
</code>
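With the VALET package definition in place, interactive use of the container on a GPU node reduces to a few commands. The sequence below is illustrative (not part of the original test session); it follows the ''salloc'' and ''Sshell'' pattern shown above, but targets the production ''gpu-mi100'' partition:

<code bash>
$ salloc --partition=gpu-mi100 --gpus=amd_mi100:1
$ vpkg_require tensorflow/2.9:rocm    # sets SINGULARITY_IMAGE and loads Singularity
$ Sshell                              # opens a shell inside the container image
Singularity> rocm-smi
</code>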
====== End-user usage of 2.9 image ======

To test a Python script running in the tensorflow container, we first write the Python script:

<code python>
#!/usr/bin/env python
import tensorflow as tf

print('There are {:d} GPU devices visible to this TF'.format(
    len(tf.config.list_physical_devices('GPU'))))
</code>

The job script requests all memory, all 128 CPU cores, and the Mi100 GPU device:

<code bash>
#!/bin/bash
#
#SBATCH --partition=gpu-mi100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=0
#SBATCH --gpus=amd_mi100:1
#

vpkg_require tensorflow/2.9:rocm

Srun python3 tf-test.py
</code>

These two files were copied to a working directory named ''tf-2.9-test'' (in my home directory) and the batch job submitted:

<code bash>
$ cd ~/tf-2.9-test

$ ls -l
total 19
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs

$ sbatch tf-test.qs
Submitted batch job 918657
</code>

After the job has run:

<code bash>
$ ls -l
total 29
-rw-r--r-- 1 frey it_nss 220 Jul 7 12:55 slurm-918657.out
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs

$ cat slurm-918657.out
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding dependency `singularity/3.10.0` to your environment
Adding package `tensorflow/2.9:rocm` to your environment
There are 1 GPU devices visible to this TF
</code>

Hooray!
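The Mi50 node was not exercised as part of this work, but the same recipe should in principle carry over by targeting the ''gpu-mi50'' partition. The job script below is an untested sketch: the ''amd_mi50'' GRES name is an assumption by analogy with ''amd_mi100'', and the CPU and memory requests would need to be sized to that node's resources.

<code bash>
#!/bin/bash
#
#SBATCH --partition=gpu-mi50
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=0
#SBATCH --gpus=amd_mi50:1    # assumed GRES name, by analogy with amd_mi100
#

vpkg_require tensorflow/2.9:rocm

Srun python3 tf-test.py
</code>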