
Support building GPU docker image for MaxDiffusion Model #121

Merged: 9 commits into main, Oct 29, 2024

Conversation

@sizhit2 (Collaborator) commented Oct 8, 2024

Add a device option in docker_build_dependency_image.sh. The default option is still TPU. To build the docker image for GPU, just run bash docker_build_dependency_image.sh DEVICE=gpu.
Add maxdiffusion_gpu_dependencies.Dockerfile.
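The KEY=value invocation above can be sketched as follows. This is a hypothetical parsing loop, not the script's verified implementation; the real docker_build_dependency_image.sh may parse its arguments differently.

```shell
# Hypothetical sketch of how KEY=value CLI arguments (e.g. DEVICE=gpu)
# become environment variables; the real script's parsing may differ.
set -- MODE=stable DEVICE=gpu   # simulates: bash docker_build_dependency_image.sh DEVICE=gpu
for ARG in "$@"; do
  if [[ "$ARG" == *"="* ]]; then
    export "$ARG"
  fi
done
echo "DEVICE=${DEVICE:-tpu}"    # falls back to the TPU default when DEVICE is unset
```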

parambole previously approved these changes Oct 8, 2024
@parambole (Collaborator) left a comment

LGTM. Thanks for adding this functionality.

Nit: Can you please update the README with the instructions?

@@ -0,0 +1,49 @@
# syntax=docker/dockerfile:experimental
ARG BASEIMAGE=ghcr.io/nvidia/jax:base
Collaborator:

@sizhit2 

Nit:

We could probably add a link to where the base image is derived from, or a short description of why it is the default, even though navigating via the base image link would be sufficient.
Also, probably add a note that this pulls in the latest jax:base, which is built nightly AFAIR, and that the tag's contents change over time; only a dated tag (the image built on that day) is stable, i.e. does not change.

Super Nit:

If you feel like contributing back to the JAX Toolbox build process, you could add a maxdiffusion test as one of its CI/CD tests, since maxdiffusion now depends on the jax:base image from JAX Toolbox.
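Concretely, the pinning this comment asks about could look like the fragment below. The dated tag shown is illustrative only, not a verified tag name; check the ghcr.io/nvidia/jax registry for the actual dated tags.

```dockerfile
# Floating tag: rebuilt nightly, so image contents drift over time.
ARG BASEIMAGE=ghcr.io/nvidia/jax:base
# Pinned, dated tag (illustrative name): contents do not change.
# ARG BASEIMAGE=ghcr.io/nvidia/jax:base-2024-10-08
```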

@parambole parambole added the enhancement New feature or request label Oct 8, 2024
@wang2yn84 (Collaborator) left a comment

Thank you for adding the GPU support! I left a couple of comments.

@@ -46,6 +46,11 @@ if [[ -z ${MODE} ]]; then
  echo "Default MODE=${MODE}"
fi

if [[ -z ${DEVICE} ]]; then
  export DEVICE=tpu
  echo "Default DEVICE=${DEVICE}"
Collaborator:

Can we move this echo outside of the if statement so that it always prints the default device type?

Collaborator Author:

Wouldn't it be confusing if we set DEVICE=gpu, but we are echoing Default DEVICE=tpu? Or do you suggest we always echo the device information we are currently using?

Collaborator:

We won't get the device as TPU if we set it to GPU. This piece of code only sets the device type to TPU if no device type is assigned. So if you assign the device type as GPU, you will get GPU.

Collaborator Author:

I see what you are suggesting. I added an additional line to always print the current DEVICE.
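The resolved behavior can be sketched as follows: the default is still set inside the if, but the device actually in use is always echoed afterward. This is a sketch of the outcome of the discussion, not the verbatim final diff.

```shell
# Sketch of the resolution: default only when unset, always print what is used.
if [[ -z ${DEVICE} ]]; then
  export DEVICE=tpu
  echo "Default DEVICE=${DEVICE}"
fi
# Added line: always report the device the build will use.
echo "Running with DEVICE=${DEVICE}"
```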

if [[ ! -v BASEIMAGE ]]; then
  echo "Erroring out because BASEIMAGE is unset, please set it!"
  exit 1
fi

if [[ ${DEVICE} == "gpu" ]]; then
Collaborator:

Can you add a "pinned" mode here?

Collaborator Author:

Done.
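The pinned mode requested above could look roughly like this. The flag and file names (MODE, constraints_gpu.txt) follow the PR discussion and file list, but the exact install command is an assumption, not the merged code.

```shell
# Hypothetical sketch of a pinned GPU install path:
# when MODE=pinned and DEVICE=gpu, constrain pip to pinned versions.
DEVICE=gpu
MODE=pinned
if [[ "${DEVICE}" == "gpu" && "${MODE}" == "pinned" ]]; then
  PIP_CMD="pip3 install -r requirements.txt -c constraints_gpu.txt"
fi
echo "${PIP_CMD}"
```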

…etup, add hardware gpu option in yml, add jax multi-host support for gpu
…ment to fix import error, add jax[cuda] install instruction more non-pinned mode when device is GPU
@sizhit2 changed the title from "Adding gpu docker dependency file" to "Adding GPU support for MaxDiffusion Model" Oct 23, 2024
setup.sh: outdated review thread, resolved
constraints_gpu.txt: outdated review thread, resolved
constraints_gpu.txt: outdated review thread, resolved
@sizhit2 changed the title from "Adding GPU support for MaxDiffusion Model" to "Support building GPU docker image for MaxDiffusion Model" Oct 28, 2024
setup.sh: outdated review thread, resolved
Comment on lines +18 to +78
set_nccl_gpudirect_tcpx_specific_configuration() {
  if [[ "$USE_GPUDIRECT" == "tcpx" ]] || [[ "$USE_GPUDIRECT" == "fastrak" ]]; then
    export CUDA_DEVICE_MAX_CONNECTIONS=1
    export NCCL_CROSS_NIC=0
    export NCCL_DEBUG=INFO
    export NCCL_DYNAMIC_CHUNK_SIZE=524288
    export NCCL_NET_GDR_LEVEL=PIX
    export NCCL_NVLS_ENABLE=0
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_P2P_NVL_CHUNKSIZE=1048576
    export NCCL_P2P_PCI_CHUNKSIZE=524288
    export NCCL_PROTO=Simple
    export NCCL_SOCKET_IFNAME=eth0
    export NVTE_FUSED_ATTN=1
    export TF_CPP_MAX_LOG_LEVEL=100
    export TF_CPP_VMODULE=profile_guided_latency_estimator=10
    export XLA_PYTHON_CLIENT_MEM_FRACTION=0.85
    shopt -s globstar nullglob
    IFS=:$IFS
    set -- /usr/local/cuda-*/compat
    export LD_LIBRARY_PATH="${1+:"$*"}:${LD_LIBRARY_PATH}:/usr/local/tcpx/lib64"
    IFS=${IFS#?}
    shopt -u globstar nullglob

    if [[ "$USE_GPUDIRECT" == "tcpx" ]]; then
      echo "Using GPUDirect-TCPX"
      export NCCL_ALGO=Ring
      export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING,NET,VERSION
      export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
      export NCCL_GPUDIRECTTCPX_FORCE_ACK=0
      export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=1000000
      export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,124-139;eth2:22-35,124-139;eth3:74-87,178-191;eth4:74-87,178-191"
      export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
      export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
      export NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000
      export NCCL_MAX_NCHANNELS=12
      export NCCL_MIN_NCHANNELS=12
      export NCCL_NSOCKS_PERTHREAD=4
      export NCCL_P2P_PXN_LEVEL=0
      export NCCL_SOCKET_NTHREADS=1
    elif [[ "$USE_GPUDIRECT" == "fastrak" ]]; then
      echo "Using GPUDirect-TCPFasTrak"
      export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
      export NCCL_ALGO=Ring,Tree
      export NCCL_BUFFSIZE=8388608
      export NCCL_FASTRAK_CTRL_DEV=eth0
      export NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL=0
      export NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING=0
      export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
      export NCCL_FASTRAK_NUM_FLOWS=2
      export NCCL_FASTRAK_USE_LLCM=1
      export NCCL_FASTRAK_USE_SNAP=1
      export NCCL_MIN_NCHANNELS=4
      export NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/nvidia/lib64/a3plus_guest_config.textproto
      export NCCL_TUNER_CONFIG_PATH=/usr/local/nvidia/lib64/a3plus_tuner_config.textproto
      export NCCL_TUNER_PLUGIN=libnccl-tuner.so
    fi
  else
    echo "NOT using GPUDirect"
  fi
}
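Everything in the function above keys off USE_GPUDIRECT. A reduced sketch of just that dispatch (the function name below is invented for illustration; the NCCL exports are elided):

```shell
# Reduced sketch of the mode dispatch in
# set_nccl_gpudirect_tcpx_specific_configuration; exports elided.
select_gpudirect_mode() {
  case "${USE_GPUDIRECT}" in
    tcpx)    echo "Using GPUDirect-TCPX" ;;
    fastrak) echo "Using GPUDirect-TCPFasTrak" ;;
    *)       echo "NOT using GPUDirect" ;;
  esac
}
USE_GPUDIRECT=fastrak
MODE_MSG=$(select_gpudirect_mode)
echo "${MODE_MSG}"
```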
Collaborator:

@sizhit2 Currently, are we deploying the workload using XPK? I am trying to understand which component determines the source of truth for these flags.

Collaborator:

@parambole, yes, we are deploying through xpk. It will execute this file and call this function. Can you elaborate on "which component determines the source of truth for these flags"? What does that mean?

Collaborator:

@wang2yn84 Oh I see, in the past there was a conversation around XPK setting the required env variables. But as you mentioned, I still see XPK calling this file for that.

ref: https://github.com/AI-Hypercomputer/xpk/blob/d88b092dc71a9d3f6d06fc8984370de0fbcbbe51/src/xpk/core/core.py#L2133

> "which component determines the source of truth for these flags"?

Here the two components are XPK and the workload. I thought XPK was taking care of setting the env vars and all the other setup related to running a workload on the A3 series, like installing the NCCL plugin and running the side-car container.

Collaborator:

I see. Do you have a pointer to that? What's the conclusion?

Collaborator:

> I see. Do you have a pointer to that? What's the conclusion?

I am not sure what the conclusion was. @gobbleturk might have more context.

@parambole parambole merged commit 51e1db1 into main Oct 29, 2024
3 checks passed
Labels: enhancement (New feature or request)
3 participants