AWS questions regarding multi-GPU, Trainium & Inferentia #8433

asutermo · 2024-06-07T22:11:29Z


Hi there,

I have a process that fine-tunes SDXL with ControlNet on SageMaker using the AWS + Hugging Face estimator. Most of the materials were written for Transformers, but I've largely gotten this to work.

The exception is multi-GPU. I'm able to run the estimator by referencing the diffusers git URL, but this also seems to bypass invoking accelerate. This is all plain Python (not Jupyter, not a SageMaker notebook).

        est = HuggingFace(
            entry_point="examples/controlnet/train_controlnet_sdxl.py",
            git_config=git_config,
            image_uri=image_uri,
            ...

How can I get multi-GPU to work? All of the Transformers notebooks suggest this:

distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

but it appears to have no effect. I've also dabbled with MPI, following other notebooks, but that causes other issues.
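
One possible workaround for the estimator not invoking accelerate, sketched here under assumptions and not tested on SageMaker: point `entry_point` at a small wrapper script that builds an `accelerate launch` command itself and forwards the hyperparameters. The wrapper file and the helper `build_accelerate_command` are hypothetical names, not part of diffusers or the SageMaker SDK. The sketch assumes a single multi-GPU instance, that the training script sits next to the wrapper, and that the container sets the `SM_NUM_GPUS` environment variable (the SageMaker training toolkit normally does).

```python
import os
import shlex
import sys


def build_accelerate_command(script, script_args, num_gpus):
    """Build an `accelerate launch` command for one node (hypothetical helper)."""
    cmd = ["accelerate", "launch", "--num_processes", str(num_gpus)]
    if num_gpus > 1:
        # --multi_gpu asks accelerate to start one process per GPU
        cmd.append("--multi_gpu")
    cmd.append(script)
    # SageMaker passes hyperparameters to the entry point as CLI arguments;
    # forward them unchanged to the real training script
    cmd.extend(script_args)
    return cmd


if __name__ == "__main__":
    # Assumption: SM_NUM_GPUS is set inside the training container; default to 1
    num_gpus = int(os.environ.get("SM_NUM_GPUS", "1"))
    cmd = build_accelerate_command("train_controlnet_sdxl.py", sys.argv[1:], num_gpus)
    # Only print here; a real wrapper would run subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

As written the script only prints the command, so the result can be checked locally before swapping in a `subprocess.run(cmd, check=True)` call inside the container.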

There's little documentation on Trainium and Inferentia. Is it possible to fine-tune ControlNet (SDXL) on Trainium?
I was going to ask about Inferentia, but I forgot I had already posted in optimum-neuron, and it appears there's a PR there for those interested: Add Stable Diffusion ControlNet support optimum-neuron#622
