[REQUEST] Support for XLA/TPU #6901

Open
radna0 opened this issue Dec 21, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@radna0

radna0 commented Dec 21, 2024

Is your feature request related to a problem? Please describe.
Currently, DeepSpeed lacks support for TPUs via the XLA backend. This limits the use of DeepSpeed's advanced parallelism techniques, such as pipeline parallelism and ZeRO optimizations, for TPU users. Frameworks like PyTorch/XLA and Accelerate offer TPU support, but they lack the comprehensive optimization features that DeepSpeed provides.

This is particularly frustrating for users who want to scale models efficiently on TPUs across multiple nodes while leveraging DeepSpeed's features.

Describe the solution you'd like
I propose integrating XLA as a backend for DeepSpeed, enabling TPU users to take advantage of DeepSpeed's optimizations, including pipeline parallelism, ZeRO, and advanced scheduling mechanisms.

Describe alternatives you've considered

  • PyTorch/XLA + Torchrun: While it provides basic TPU support, it doesn't include DeepSpeed's advanced features.
  • HuggingFace Accelerate: Offers TPU support but lacks pipeline parallelism and other optimization techniques available in DeepSpeed.

Additional context
There is growing interest in having TPU support in DeepSpeed, as evidenced by multiple community requests. Adding XLA as a backend would make DeepSpeed accessible to a wider audience, particularly researchers and engineers working with TPUs.

I'm willing to lead this integration if needed, and I hope this request sparks discussion and collaboration within the DeepSpeed team.

Link to Pytorch XLA Feature Request: pytorch/xla#8514 (comment)

@radna0 radna0 added the enhancement New feature or request label Dec 21, 2024
@tjruwase
Contributor

@radna0, this sounds exciting. It is awesome that you are willing to lead the integration. We will be glad to collaborate with you to make this happen.

@radna0
Author

radna0 commented Dec 24, 2024

Thank you @tjruwase. I currently have a setup partially running. Here is what I've done so far and the roadblocks I've hit.

Basics:

I added an XLA_Accelerator class in xla_accelerator.py, which is then returned by get_accelerator() in real_accelerator.py. The basic APIs, such as device, communication backend, metrics, and the RNG APIs, all go through PyTorch/XLA and work with TPUs. The ops, however, still come from the CPU implementation; I think this is fine for now, and the XLA ops can be added later. Right now I just want something minimal working with PyTorch/XLA on TPUs.
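For reference, the class looks roughly like this (a trimmed-down sketch, not the actual file: the real class subclasses DeepSpeed's abstract accelerator and implements its full interface, and the torch_xla calls shown are just the ones I expect to need):

import torch_xla.core.xla_model as xm


class XLA_Accelerator:

    def __init__(self):
        self._name = "xla"
        # torch_xla provides an "xla" backend for torch.distributed
        self._communication_backend_name = "xla"

    def device_name(self, device_index=None):
        return "xla" if device_index is None else f"xla:{device_index}"

    def device(self, device_index=None):
        # the XLA device owned by the current process
        return xm.xla_device()

    def communication_backend_name(self):
        return self._communication_backend_name

    def manual_seed(self, seed):
        # RNG state also goes through torch_xla
        xm.set_rng_state(seed)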

Roadblock 1:

This concerns the DeepSpeed launcher. XLA only allows a single parent process to perform the multi-process spawn, but the DeepSpeed launcher (for example, deepspeed train.py) first runs runner.py, which then calls launcher.py to spawn the processes. XLA therefore sees runner.py as the first process. Here is the workaround I have.

Run with: DS_ACCELERATOR=cpuxla deepspeed train.py.
This sets DS_ACCELERATOR to cpu within runner.py; then, just before runner.py invokes launcher.py, it changes DS_ACCELERATOR to xla. This way, the parent process (from XLA's point of view) is launcher.py instead of runner.py.

So DS_ACCELERATOR=cpuxla effectively behaves like:
DS_ACCELERATOR=cpu python deepspeed/runner.py
DS_ACCELERATOR=xla python deepspeed/launcher.py
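In code, the swap is roughly this (a sketch only; the helper name and its exact placement in runner.py are illustrative):

import os


def build_launcher_env():
    # Sketch of the DS_ACCELERATOR swap: the runner process stays on CPU so it
    # never initializes XLA, while the launcher (the real parent of the workers)
    # is handed the xla accelerator through its environment.
    env = os.environ.copy()
    if env.get("DS_ACCELERATOR") == "cpuxla":
        os.environ["DS_ACCELERATOR"] = "cpu"
        env["DS_ACCELERATOR"] = "xla"
    return env


# e.g. subprocess.Popen(launcher_cmd, env=build_launcher_env())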

Roadblock 2:

Once we reach launcher.py, the worker processes are spawned with subprocess.Popen. This is also not compatible with PyTorch/XLA, since only one process may access the devices. The only option seems to be torch_xla.distributed.xla_multiprocessing.spawn. I don't know exactly how subprocess.Popen differs from the torch_xla spawn method, but Popen starts an entirely separate child process, while torch_xla's spawn forks workers from the parent process.

I haven't tested other approaches such as os.fork in detail, but the PyTorch/XLA team states in their documentation that worker processes can only be spawned with the PyTorch/XLA spawn method.
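For reference, the PyTorch/XLA entry point that would replace Popen is xmp.spawn; a minimal standalone example (the worker function here is just a placeholder):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _worker(index):
    # each forked worker gets its own XLA device; index is the local ordinal
    device = xm.xla_device()
    print(f"worker {index} using {device}")


if __name__ == "__main__":
    # nprocs=None lets torch_xla decide based on the available devices
    xmp.spawn(_worker, args=(), nprocs=None)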

Updates

I'm currently testing a training script to see whether multiprocessing, pipeline parallelism, etc. work properly, and will post updates when I have findings.

@tjruwase
Contributor

Run with: DS_ACCELERATOR=cpuxla deepspeed train.py.
This sets DS_ACCELERATOR to cpu within runner.py; then, just before runner.py invokes launcher.py, it changes DS_ACCELERATOR to xla. This way, the parent process (from XLA's point of view) is launcher.py instead of runner.py.

I am curious about the implementation of the above. Are you handling the case where DS_ACCELERATOR is not available?

I assume you have seen these tutorials:

  1. https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
  2. https://www.deepspeed.ai/tutorials/accelerator-setup-guide/

@tjruwase
Contributor

I don't know exactly how subprocess.Popen differs from the torch_xla spawn method, but Popen starts an entirely separate child process, while torch_xla's spawn forks workers from the parent process.

Does it make sense to add the spawn method into the accelerator abstraction class?
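Roughly, something of this shape is what I have in mind (purely hypothetical, none of these names exist in DeepSpeed today). The awkward part is that Popen takes a command line while torch_xla's spawn takes a Python callable, so the abstraction would need to accommodate both:

import subprocess


class DefaultProcessSpawner:
    def spawn(self, cmd, env, worker_fn=None):
        # current launcher behavior: each rank is an independent child process
        return subprocess.Popen(cmd, env=env)


class XLAProcessSpawner:
    def spawn(self, cmd, env, worker_fn=None):
        # XLA path: fork workers from the single device-owning parent process
        import torch_xla.distributed.xla_multiprocessing as xmp
        return xmp.spawn(worker_fn, args=())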

@tjruwase
Contributor

Another idea for roadblock 1 is to add DS_ACCELERATOR=xla to .deepspeed_env rather than setting it on the command line. Please see this description of .deepspeed_env usage.

@radna0
Author

radna0 commented Dec 24, 2024

Run with: DS_ACCELERATOR=cpuxla deepspeed train.py.
This sets DS_ACCELERATOR to cpu within runner.py; then, just before runner.py invokes launcher.py, it changes DS_ACCELERATOR to xla. This way, the parent process (from XLA's point of view) is launcher.py instead of runner.py.

I am curious about the implementation of the above. Are you handling the case where DS_ACCELERATOR is not available?

I assume you have seen these tutorials:

  1. https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
  2. https://www.deepspeed.ai/tutorials/accelerator-setup-guide/

No, I'm not handling the case where the accelerator is None. And yes, when nothing is available it defaults to CPU or another accelerator. I will add a check for xla.

I hadn't seen those tutorials, but my implementation of the xla_accelerator is based on the cpu and cuda accelerators, so it is actually compatible and works just as the tutorials suggest.

As for a setup guide for PyTorch/XLA, I can document it when I open my PR to merge the changes, or I can do it right away if needed. How should I go about that?


I don't know exactly how subprocess.Popen differs from the torch_xla spawn method, but Popen starts an entirely separate child process, while torch_xla's spawn forks workers from the parent process.

Does it make sense to add the spawn method into the accelerator abstraction class?

Yes, I think this would be ideal. For now I'm just checking the accelerator on the fly within launcher.py and spawning processes accordingly.


Another idea for roadblock 1 is to add DS_ACCELERATOR=xla to .deepspeed_env rather than setting it on the command line. Please see this description of .deepspeed_env usage.

I'm not sure if I'm doing it correctly. I created a .deepspeed_env file in the local directory whose only content is DS_ACCELERATOR=xla, but launching with either of these commands doesn't seem to work; it only uses CPU as the accelerator:

  • deepspeed train.py
  • DS_ENV_FILE=path_to_env/.deepspeed_env deepspeed train.py

@tjruwase
Contributor

launching with either of these commands doesn't seem to work; it only uses CPU as the accelerator

Hmm, that is strange. Can you examine this code to see what is wrong?

for environ_path in DEEPSPEED_ENVIRONMENT_PATHS:
    environ_file = os.path.join(environ_path, DEEPSPEED_ENVIRONMENT_NAME)
    if os.path.isfile(environ_file):
        logger.info(f"deepspeed_env file = {environ_file}")
        with open(environ_file, 'r') as fd:
            for var in fd.readlines():
                key, val = var.split('=', maxsplit=1)
                runner.add_export(key, val)
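For a quick standalone sanity check of whether the file is being found and parsed at all, something like the following can help (the search locations below are my assumption of what DEEPSPEED_ENVIRONMENT_PATHS holds; adjust them if the real constants differ):

import os

DEEPSPEED_ENVIRONMENT_NAME = ".deepspeed_env"
DEEPSPEED_ENVIRONMENT_PATHS = [os.path.expanduser("~"), "."]  # assumed search order

for environ_path in DEEPSPEED_ENVIRONMENT_PATHS:
    environ_file = os.path.join(environ_path, DEEPSPEED_ENVIRONMENT_NAME)
    print(f"{environ_file}: exists={os.path.isfile(environ_file)}")
    if os.path.isfile(environ_file):
        with open(environ_file, "r") as fd:
            for var in fd.readlines():
                key, val = var.split("=", maxsplit=1)
                print(f"would export {key}={val.strip()}")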

@tjruwase
Contributor

tjruwase commented Dec 24, 2024

  • deepspeed train.py
  • DS_ENV_FILE=path_to_env/.deepspeed_env deepspeed train.py

Try adding the --force_multi command-line argument, i.e., deepspeed --force_multi ...

@radna0
Author

radna0 commented Dec 24, 2024

@tjruwase I tried all kinds of commands; all of them just use CPU, and with --force_multi it results in a connection error. But I don't think this env check should only apply to the multi-node setting. Can we have it for both multi_node_exec and the single-node path?

Also, for now I'm leaving out the xla check when accelerator_name is None, since the first process (runner.py) needs CPU and only launcher.py can use xla. That check breaks if DS_ACCELERATOR=cpuxla is not set, and in my opinion the default should still fall back to CPU.

@tjruwase
Contributor

@tjruwase I tried all kinds of commands; all of them just use CPU, and with --force_multi it results in a connection error. But I don't think this env check should only apply to the multi-node setting. Can we have it for both multi_node_exec and the single-node path?

Agreed. Will fix.

@tjruwase
Contributor

Also, for now I'm leaving out the xla check when accelerator_name is None, since the first process (runner.py) needs CPU and only launcher.py can use xla. That check breaks if DS_ACCELERATOR=cpuxla is not set, and in my opinion the default should still fall back to CPU.

CPU should be the fall-through if no accelerator is detected or specified. So, it seems your use case is exposing an issue here. Is it possible to share a stack trace of the failure?

@tjruwase
Contributor

So DS_ACCELERATOR=cpuxla effectively behaves like:
DS_ACCELERATOR=cpu python deepspeed/runner.py

On further thought, I think the above strategy will create problems in runner.py for logic such as

  1. This
  2. This
  3. This

Something to keep in mind.

@radna0
Author

radna0 commented Dec 24, 2024

Also, for now I'm leaving out the xla check when accelerator_name is None, since the first process (runner.py) needs CPU and only launcher.py can use xla. That check breaks if DS_ACCELERATOR=cpuxla is not set, and in my opinion the default should still fall back to CPU.

CPU should be the fall-through if no accelerator is detected or specified. So, it seems your use case is exposing an issue here. Is it possible to share a stack trace of the failure?

There are two ways we could try setting accelerator_name for xla:
*Note: torch_xla has no check for device availability by default, so we rely on the import error.

  • Setting accelerator_name = "xla" => we hit Roadblock 1. runner.py must stay on CPU for the xla accelerator to work; otherwise we can't access the xla device at all outside of runner.py.

  • Setting accelerator_name = "cpuxla" won't work either, because it only sets the accelerator for get_accelerator(). Maybe, when accelerator_name is None, we could also set the global DS_ACCELERATOR env var; if we did that, 'cpuxla' would work as I described above.

# TODO: This will not work right now, only with the cpuxla env
# (cpu first for runner.py, xla set for launcher.py)
if accelerator_name is None:
    try:
        import torch_xla  # noqa: F401,F811

        # torch_xla has no check such as torch_xla.is_available()
        accelerator_name = "cpuxla"  # or "xla"
    except ImportError:
        pass
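If we go with the second option, the detection would also need to export the env var so that launcher.py sees it; roughly like this (sketch only, with the same caveat as above):

import os


def detect_xla_accelerator(accelerator_name):
    # Sketch of the "also set the global DS_ACCELERATOR env" variant.
    if accelerator_name is None:
        try:
            import torch_xla  # noqa: F401

            accelerator_name = "cpuxla"
            # make the choice visible to launcher.py (and other child processes)
            os.environ["DS_ACCELERATOR"] = "cpuxla"
        except ImportError:
            pass
    return accelerator_name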

So DS_ACCELERATOR=cpuxla effectively behaves like:
DS_ACCELERATOR=cpu python deepspeed/runner.py

On further thought, I think the above strategy will create problems in runner.py for logic such as

  1. This
  2. This
  3. This

Something to keep in mind.

With DS_ACCELERATOR=cpuxla:

  1. visible_devices_envs() does not affect how cpuxla works. I see that running this way sets CUDA_VISIBLE_DEVICES, but everything still runs fine.
  2. device_count() will be a problem here, since cpuxla relies on cpu_accelerator.py, and I believe any PyTorch/XLA import initializes the runtime in the importing (parent) process. This could be resolved with a dedicated cpuxla_accelerator.py that still uses the CPU implementation for most of the basic APIs but obtains device_count() through the tpu-info metrics package, which has APIs for detecting basic info like devices and memory usage without initializing the XLA runtime (see the sketch after this list).
  3. export_envs() also does not affect this. I could potentially set the PJRT_DEVICE=TPU env var, but XLA works with GPUs as well, and the env is detected automatically by PyTorch/XLA.
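Roughly, such a cpuxla device_count() could look like the following. Counting /dev/accel* device files is an assumption about the TPU VM layout (the real implementation would go through tpu-info), and TPU_NUM_DEVICES is a hypothetical manual override, not an existing setting:

import glob
import os


def cpuxla_device_count():
    override = os.environ.get("TPU_NUM_DEVICES")  # hypothetical manual override
    if override is not None:
        return int(override)
    # Cloud TPU VMs typically expose one /dev/accel<N> file per chip
    accel_files = glob.glob("/dev/accel*")
    return len(accel_files) if accel_files else 1  # fall back to a single device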
