| + | ====== TensorFlow using AMD GPUs ====== | ||
| + | Complementing the NVIDIA T4 and V100 GPU nodes in the DARWIN cluster are nodes with an AMD Mi50 and an AMD Mi100 GPU coprocessor, | ||
| + | |||
| + | One popular AI/ML framework, TensorFlow, has an official DockerHub container release. | ||
| + | |||
| + | Building the container, however, is a challenge. | ||
| + | |||
| + | * The 2.3 image worked as-is | ||
| + | * The 2.8 and 2.9 images lacked a number of Python modules that prevented tensorflow from loading (numpy, google, protobuf, etc.) | ||
| + | * The blobs that comprise the container images are quite large — I hit my home directory quota thanks to the blobs Singularity cached in '' | ||
| + | |||
<WRAP center round important 60%>
Singularity by default maps the user's home directory into the container, so anything a containerized process writes under ''$HOME'' lands in your real home directory.
</WRAP>

The solution with the 2.8 and 2.9 images was to initially build them as read+write //sandbox// images, then run them as user root. Root has privileges to write to the container's file system, so the missing Python modules could be pip-installed directly into the image before converting it to a read-only production image.

====== Producing the 2.9 image ======

After adding Singularity to the runtime environment, a scratch directory is created and designated as the Singularity cache. (The paths and version numbers shown throughout this recipe are illustrative; substitute your own.)

<code bash>
$ vpkg_require singularity
Adding dependency `squashfs-tools/...` to your environment
Adding package `singularity/...` to your environment

$ mkdir -p /tmp/sing-cache

$ export SINGULARITY_CACHEDIR="/tmp/sing-cache"
</code>

For the duration of container builds in this shell, that directory will hold all cached blobs (rather than their landing in my NFS home directory).

The sandbox image is built as follows (replace ''<tag>'' with the ROCm TensorFlow 2.9 tag listed on DockerHub):

<code bash>
$ singularity build --sandbox /tmp/tensorflow-2.9-rocm \
    docker://rocm/tensorflow:<tag>
INFO:    Starting build...
Getting image source signatures
Copying blob 8751bf8569be done
   :
INFO:    Creating sandbox directory...
INFO:    Build complete: /tmp/tensorflow-2.9-rocm
</code>

Next, a shell is started in the sandbox container and used to iteratively add missing Python packages.

| + | <code bash> | ||
| + | $ sudo -s | ||
| + | [root]$ singularity shell --pid --ipc --writable / | ||
| + | |||
| + | Singularity> | ||
| + | : | ||
| + | ModuleNotFoundError: | ||
| + | |||
| + | Singularity> | ||
| + | Collecting absl-py | ||
| + | Downloading absl_py-1.1.0-py3-none-any.whl (123 kB) | ||
| + | | ||
| + | : | ||
| + | Installing collected packages: absl-py | ||
| + | Successfully installed absl-py-1.1.0 | ||
| + | |||
| + | Singularity> | ||
| + | |||
| + | : | ||
| + | ModuleNotFoundError: | ||
| + | Singularity> | ||
| + | Collecting numpy | ||
| + | Using cached numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB) | ||
| + | : | ||
| + | |||
| + | Singularity> | ||
| + | 2022-07-07 11: | ||
| + | 2022-07-07 11: | ||
| + | 2022-07-07 11: | ||
| + | 2022-07-07 11: | ||
| + | 2022-07-07 11: | ||
| + | </ | ||
| + | |||
Since the node on which I was building this container does not have an AMD GPU, no GPU is found by the now-working tensorflow module. With the import succeeding, the sandbox can be converted to a read-only production (SIF) image:

<code bash>
[root]$ exit
$ singularity build /tmp/tensorflow-2.9-rocm.sif \
    /tmp/tensorflow-2.9-rocm
INFO:    Starting build...
INFO:    Creating SIF file...
INFO:    Build complete: /tmp/tensorflow-2.9-rocm.sif
</code>

====== Test the 2.9 image ======

At this point it's necessary to check that the read-only image sees the AMD GPU device, so an interactive job is requested on a node with an Mi100 GPU:

<code bash>
$ salloc --partition=idle --gpus=amd_mi100:1
salloc: Pending job allocation 918608
salloc: job 918608 queued and waiting for resources
salloc: job 918608 has been allocated resources
salloc: Granted job allocation 918608
salloc: Waiting for resource configuration
salloc: Nodes r0m01 are ready for job
</code>

Add Singularity to the runtime environment and reference the 2.9 container image (by now copied out of the build scratch area to its production home):

<code bash>
$ vpkg_require singularity
Adding dependency `squashfs-tools/...` to your environment
Adding package `singularity/...` to your environment

$ export SINGULARITY_IMAGE=/opt/shared/tensorflow/2.9/tensorflow-rocm.sif

$ Sshell
Singularity> rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%
0    27.0c  34.0W   ...
================================================================================
============================= End of ROCm SMI Log ==============================
</code>

Excellent! The GPU is visible from inside the container.

<code bash>
Singularity> python3
Python 3.9.13 (main, May 23 2022, 22:...)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>>
</code>

====== VALET configuration ======

To make the containers as easy as possible for end users to use, VALET is leveraged (again, the paths and descriptions below are representative):

<code bash>
$ cat /opt/shared/valet/etc/tensorflow.vpkg_yaml
tensorflow:
    prefix: /opt/shared/tensorflow
    description: TensorFlow AI/ML framework (ROCm builds for AMD GPUs)
    url: "https://www.tensorflow.org/"

    actions:
        - variable: SINGULARITY_IMAGE
          action: set
          value: ${VALET_PATH_PREFIX}/tensorflow-rocm.sif

    dependencies:
        - singularity

    versions:
        "2.3":
            description: version 2.3 with ROCm support
        "2.8":
            description: version 2.8 with ROCm support
        "2.9":
            description: version 2.9 with ROCm support

$ vpkg_versions tensorflow

Available versions in package (* = default version):

[/opt/shared/valet/etc/tensorflow.vpkg_yaml]
tensorflow  TensorFlow AI/ML framework (ROCm builds for AMD GPUs)
  2.3:      version 2.3 with ROCm support
  2.8:      version 2.8 with ROCm support
* 2.9:      version 2.9 with ROCm support
</code>

====== End-user usage of 2.9 image ======

To test a Python script running in the tensorflow container, we first write the Python script:

<file python tf-test.py>
#!/usr/bin/env python3

import tensorflow as tf

print('There are {:d} GPU devices visible to this TF'.format(
        len(tf.config.list_physical_devices('GPU'))
    ))
</file>

The job script requests all of the node's memory (''--mem=0'' is Slurm shorthand for all memory on the node), all 128 CPU cores, and the Mi100 GPU device:

<file bash tf-test.qs>
#!/bin/bash
#
#SBATCH --partition=gpu-mi100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=0
#SBATCH --gpus=amd_mi100:1
#

vpkg_require tensorflow/2.9

Srun python3 tf-test.py
</file>

These two files were copied to a working directory (named ''tf-test'' here) from which the job was submitted:

<code bash>
$ cd ~/tf-test

$ ls -l
total 19
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs

$ sbatch tf-test.qs
Submitted batch job 918657
</code>

After the job has run:

<code bash>
$ ls -l
total 29
-rw-r--r-- 1 frey it_nss 220 Jul 7 12:55 slurm-918657.out
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs

$ cat slurm-918657.out
Adding dependency `squashfs-tools/...` to your environment
Adding dependency `singularity/...` to your environment
Adding package `tensorflow/2.9` to your environment
There are 1 GPU devices visible to this TF
</code>

Hooray!