
TensorFlow using AMD GPUs

Complementing the NVIDIA T4 and V100 GPU nodes in the DARWIN cluster are nodes with AMD Mi50 and AMD Mi100 GPU coprocessors. The AMD GPU devices and software stack are far newer than NVIDIA's CUDA stack: support for them is still evolving.

One popular AI/ML framework, TensorFlow, has official container releases on DockerHub. DARWIN includes versions of the Singularity toolset that can make use of Docker container images. With only a single node of each GPU type present in the cluster, multi-node parallelism is unlikely to be needed, so an official container should satisfy most users' needs.

Building the container, however, is a challenge. Several versions were tried (2.9, 2.8, and an "ancient" 2.3 TensorFlow) with mixed results:

  • The 2.3 image worked as-is (see the build sketch following this list)
  • The 2.8 and 2.9 images lacked a number of Python modules (numpy, google, protobuf, etc.), which prevented tensorflow from loading
  • The blobs that comprise the container images are quite large — I hit my home directory quota thanks to the blobs Singularity cached in ~/.singularity
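
For the 2.3 image, a direct build from the DockerHub reference was sufficient. A minimal sketch of that approach follows; the exact tag name is illustrative and should be checked against the rocm/tensorflow repository on DockerHub:

$ singularity build tensorflow.sif docker://rocm/tensorflow:rocm4.2-tf2.3-dev
$ singularity exec tensorflow.sif python3 -c "import tensorflow as tf; print(tf.__version__)"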

Singularity by default maps the user's home directory into the container. This means that Python modules present under ~/.local will be visible inside the container and may satisfy some of the missing module dependencies. It's possible that those maintaining the official containers on DockerHub had a similar scenario masking missing dependencies when the container images were built.
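
One way to check whether ~/.local is hiding missing dependencies is to compare the Python module search path with and without the home-directory bind. A quick sketch, assuming a built image named tensorflow.sif (the --no-home flag suppresses the home-directory mapping):

$ singularity exec tensorflow.sif python3 -c "import sys; print(sys.path)"
$ singularity exec --no-home tensorflow.sif python3 -c "import sys; print(sys.path)"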

The solution with the 2.8 and 2.9 images was to initially build them as read+write sandbox images, then run them as user root. Root has privileges to write to /usr therein, so pip can be used to iteratively install missing dependencies. When all dependencies are satisfied, a read-only image file is generated from the sandbox image.

Producing the 2.9 image

After adding Singularity to the runtime environment, a directory for the sandbox and read-only images and a temporary Singularity cache directory are created:

$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment
 
$ mkdir -p /opt/shared/singularity/images/tensorflow/2.9-rocm
 
$ export SINGULARITY_CACHEDIR="$(mktemp -d)"

For the duration of container builds in this shell, the temp directory will hold all cached blobs (rather than putting them in my NFS home directory). Before exiting this shell it is important that the cache directory be removed; don't forget!
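
For example, the final steps before leaving the build shell would be along these lines:

$ rm -rf "$SINGULARITY_CACHEDIR"
$ unset SINGULARITY_CACHEDIR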

The sandbox image is built thusly:

$ singularity build --sandbox /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif \
    docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev
INFO:    Starting build...
Getting image source signatures
Copying blob 8751bf8569be done  
Copying blob 8751bf8569be done 
   :
2022/07/07 10:12:48  info unpack layer: sha256:4da118ab357bd39e3ce7f4cf1924ae3e4c4421c1a96460bdf0ad33cea2abc496
2022/07/07 10:12:51  info unpack layer: sha256:e9d0198e6dd5d7f70a7a516c3b7fdcdd43844dbb3efec79d68a30f6aa00d3cd8
2022/07/07 10:12:54  info unpack layer: sha256:2206655d3afe836157d63f1a7dc5951244a2058a457abdf6ac0715d482f45636
INFO:    Creating sandbox directory...
INFO:    Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif

Next, a shell is started in the sandbox container and used to iteratively add missing Python packages. Since Singularity runs processes as the initiating user inside the container, it's necessary to be root at this point (so DARWIN users are out of luck here, unless they replicate this work on a machine on which they have root).
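
Users without root access might be able to approximate this step with Singularity's --fakeroot feature, provided the administrators have enabled it for their account. This is an untested alternative sketch, not the procedure followed here, and the sandbox would need to live somewhere the user can write:

$ singularity build --fakeroot --sandbox ~/tensorflow-sb docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev
$ singularity shell --fakeroot --writable ~/tensorflow-sb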

$ sudo -s
[root]$ singularity shell --pid --ipc --writable /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
 
Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
   :
ModuleNotFoundError: No module named 'absl'
 
Singularity> pip3 install absl-py
Collecting absl-py
  Downloading absl_py-1.1.0-py3-none-any.whl (123 kB)
     |████████████████████████████████| 123 kB 13.2 MB/s 
   :
Installing collected packages: absl-py
Successfully installed absl-py-1.1.0
 
Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
 
   :
ModuleNotFoundError: No module named 'numpy'
Singularity> pip3 install numpy  
Collecting numpy
  Using cached numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
   :
 
Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
2022-07-07 11:33:23.743348: E tensorflow/stream_executor/rocm/rocm_driver.cc:305] failed call to hipInit: HIP_ERROR_InvalidDevice
2022-07-07 11:33:23.743406: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:112] retrieving ROCM diagnostic information for host: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743421: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:119] hostname: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743466: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:142] librocm reported version is: NOT_FOUND: was unable to find librocm.so DSO loaded into this program
2022-07-07 11:33:23.743522: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:146] kernel reported version is: UNIMPLEMENTED: kernel reported driver version not implemented

Since the node on which I was building this container does not have an AMD GPU, no GPU is found by the now-working tensorflow module. At this point the read-only container image can be created:

[root]$ exit
$ singularity build /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif \
    /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
INFO:    Starting build...
INFO:    Creating SIF file...
INFO:    Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif

Testing the 2.9 image

At this point it's necessary to check that the read-only image sees the AMD GPU device. This requires a remote shell on the GPU node. Note that for testing purposes the "idle" partition will be used; production usage of the GPU nodes should make use of the "gpu-mi50" and "gpu-mi100" partitions on DARWIN.
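
For production work on the Mi100 node, the equivalent allocation would look something like the following sketch (add whatever time and resource options the workload needs); the test below uses the idle partition instead:

$ salloc --partition=gpu-mi100 --gpus=amd_mi100:1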

$ salloc --partition=idle --gpus=amd_mi100:1
salloc: Pending job allocation ######
salloc: job 918608 queued and waiting for resources
salloc: job 918608 has been allocated resources
salloc: Granted job allocation ######
salloc: Waiting for resource configuration
salloc: Nodes r0m01 are ready for job

Add Singularity to the runtime environment and reference the 2.9 container image:

$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment
 
$ export SINGULARITY_IMAGE=/opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif
 
$ Sshell
Singularity> rocm-smi
 
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    27.0c  34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
================================================================================
============================= End of ROCm SMI Log ==============================

Excellent! Inside the container, the rocm-smi utility is able to see the AMD GPU and read stats from it. Finally, will tensorflow be able to see it?

Singularity> python3
Python 3.9.13 (main, May 23 2022, 22:01:06) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> 

VALET configuration

To make the containers as easy as possible for end users to use, VALET is leveraged:

$ cat /opt/shared/valet/etc/tensorflow.vpkg_yaml
tensorflow:
    prefix: /opt/shared/singularity/images/tensorflow
    description: official TensorFlow containers
    url: "https://hub.docker.com/r/rocm/tensorflow"
 
    actions:
        - variable: SINGULARITY_IMAGE
          action: set
          value: ${VALET_PATH_PREFIX}/tensorflow.sif
 
    dependencies:
        - singularity/default
 
    versions:
        "2.9:rocm":
            description: TF 2.9 with ROCM 5.2.0 AMD GPU support
 
        "2.8:rocm":
            description: TF 2.8 with ROCM 5.2.0 AMD GPU support
 
        "2.3:rocm":
            description: TF 2.3 with ROCM 4.2 AMD GPU support
 
$ vpkg_versions tensorflow
 
Available versions in package (* = default version):
 
[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow  official TensorFlow containers
  2.3:rocm  TF 2.3 with ROCM 4.2 AMD GPU support
  2.8:rocm  TF 2.8 with ROCM 5.2.0 AMD GPU support
* 2.9:rocm  TF 2.9 with ROCM 5.2.0 AMD GPU support

End-user usage of the 2.9 image

To test a Python script running in the tensorflow container, we first write the Python script:

tf-test.py
#!/usr/bin/env python
 
import tensorflow as tf
 
print('There are {:d} GPU devices visible to this TF'.format(
    len(tf.config.list_physical_devices('GPU')))
  )

The job script requests all memory, all 128 CPU cores, and the Mi100 GPU device:

tf-test.qs
#!/bin/bash
#
#SBATCH --partition=gpu-mi100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=0
#SBATCH --gpus=amd_mi100:1
#
 
vpkg_require tensorflow/2.9:rocm
 
Srun python3 tf-test.py
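
Srun, like the Sshell helper used earlier, appears to be a site-provided wrapper around the container referenced by SINGULARITY_IMAGE. If it were unavailable, a plain singularity invocation along these lines is the presumed equivalent (an assumption, not verified here):

# assumption: Srun roughly wraps a call like this; --rocm binds the ROCm
# devices and libraries into the container
singularity exec --rocm "$SINGULARITY_IMAGE" python3 tf-test.py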

These two files were copied to a working directory named tf-2.9-test (in my home directory) and the batch job submitted:

$ cd ~/tf-2.9-test
 
$ ls -l
total 19
-rw-r--r-- 1 frey it_nss 164 Jul  7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul  7 12:54 tf-test.qs
 
$ sbatch tf-test.qs
Submitted batch job 918657

After the job has run:

$ ls -l
total 29
-rw-r--r-- 1 frey it_nss 220 Jul  7 12:55 slurm-918657.out
-rw-r--r-- 1 frey it_nss 164 Jul  7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul  7 12:54 tf-test.qs
 
$ cat slurm-918657.out 
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding dependency `singularity/3.10.0` to your environment
Adding package `tensorflow/2.9:rocm` to your environment
There are 1 GPU devices visible to this TF

Hooray!
