[REQUEST] Support for XLA/TPU #6901

Open
radna0 opened this issue Dec 21, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@radna0

radna0 commented Dec 21, 2024

Is your feature request related to a problem? Please describe.
Currently, DeepSpeed lacks support for TPUs via the XLA backend. This limits the use of DeepSpeed's advanced parallelism techniques, such as pipeline parallelism and ZeRO optimizations, for TPU users. Frameworks like PyTorch/XLA and Accelerate offer TPU support, but they lack the comprehensive optimization features that DeepSpeed provides.

This is particularly frustrating for users who want to scale models efficiently on TPUs across multiple nodes while leveraging DeepSpeed's features.

Describe the solution you'd like
I propose integrating XLA as a backend for DeepSpeed, enabling TPU users to take advantage of DeepSpeed's optimizations, including pipeline parallelism, ZeRO, and advanced scheduling mechanisms.

Describe alternatives you've considered

  • PyTorch/XLA + Torchrun: While it provides basic TPU support, it doesn't include DeepSpeed's advanced features.
  • HuggingFace Accelerate: Offers TPU support but lacks pipeline parallelism and other optimization techniques available in DeepSpeed.

Additional context
There is growing interest in having TPU support in DeepSpeed, as evidenced by multiple community requests. Adding XLA as a backend would make DeepSpeed accessible to a wider audience, particularly researchers and engineers working with TPUs.

I'm willing to lead this integration if needed, and I hope this request sparks discussion and collaboration within the DeepSpeed team.

Link to Pytorch XLA Feature Request: pytorch/xla#8514 (comment)

@radna0 radna0 added the enhancement New feature or request label Dec 21, 2024
@tjruwase
Contributor

@radna0, this sounds exciting. It is awesome that you are willing to lead the integration. We will be glad to collaborate with you to make this happen.

@radna0
Author

radna0 commented Dec 24, 2024

Thank you @tjruwase. I currently have a setup partially running. Here is what I've done so far and the roadblocks I've hit.

Basics:

I added an XLA_Accelerator class in xla_accelerator.py, which is then returned by get_accelerator() in real_accelerator.py. The basic APIs, such as device, communication backend, metrics, and the RNG APIs, all go through PyTorch/XLA and work with TPUs. The ops, however, still come from the CPU implementation; I think this is fine for now, and the XLA ops can be added later. Right now I just want something minimal working with PyTorch/XLA on TPUs.
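For reference, the class looks roughly like this (a trimmed-down sketch, not the actual file: the real class subclasses DeepSpeed's abstract accelerator and implements its full interface, and the torch_xla calls shown are just the ones I expect to need):

import torch_xla.core.xla_model as xm


class XLA_Accelerator:

    def __init__(self):
        self._name = "xla"
        # torch_xla provides an "xla" backend for torch.distributed
        self._communication_backend_name = "xla"

    def device_name(self, device_index=None):
        return "xla" if device_index is None else f"xla:{device_index}"

    def device(self, device_index=None):
        # the XLA device owned by the current process
        return xm.xla_device()

    def communication_backend_name(self):
        return self._communication_backend_name

    def manual_seed(self, seed):
        # RNG state also goes through torch_xla
        xm.set_rng_state(seed)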

Roadblock 1:

This concerns the DeepSpeed launcher. XLA only allows a single parent process to perform the multi-process spawn, but the DeepSpeed launcher (for example, deepspeed train.py) first runs runner.py, which then calls launcher.py to spawn the processes. XLA therefore sees runner.py as the first process. Here is the workaround I have.

Run with: DS_ACCELERATOR=cpuxla deepspeed train.py.
This sets DS_ACCELERATOR to cpu within runner.py; then, just before runner.py invokes launcher.py, it changes DS_ACCELERATOR to xla. This way, the parent process (from XLA's point of view) is launcher.py instead of runner.py.

So DS_ACCELERATOR=cpuxla effectively behaves like:
DS_ACCELERATOR=cpu python deepspeed/runner.py
DS_ACCELERATOR=xla python deepspeed/launcher.py
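In code, the swap is roughly this (a sketch only; the helper name and its exact placement in runner.py are illustrative):

import os


def build_launcher_env():
    # Sketch of the DS_ACCELERATOR swap: the runner process stays on CPU so it
    # never initializes XLA, while the launcher (the real parent of the workers)
    # is handed the xla accelerator through its environment.
    env = os.environ.copy()
    if env.get("DS_ACCELERATOR") == "cpuxla":
        os.environ["DS_ACCELERATOR"] = "cpu"
        env["DS_ACCELERATOR"] = "xla"
    return env


# e.g. subprocess.Popen(launcher_cmd, env=build_launcher_env())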

Roadblock 2:

Once we reach launcher.py, the worker processes are spawned with subprocess.Popen. This is also not compatible with PyTorch/XLA, since only one process may access the devices. The only option seems to be torch_xla.distributed.xla_multiprocessing.spawn. I don't know exactly how subprocess.Popen differs from the torch_xla spawn method, but Popen starts an entirely separate child process, while torch_xla's spawn forks workers from the parent process.

I haven't tested other approaches such as os.fork in detail, but the PyTorch/XLA team states in their documentation that worker processes can only be spawned with the PyTorch/XLA spawn method.
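For reference, the PyTorch/XLA entry point that would replace Popen is xmp.spawn; a minimal standalone example (the worker function here is just a placeholder):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _worker(index):
    # each forked worker gets its own XLA device; index is the local ordinal
    device = xm.xla_device()
    print(f"worker {index} using {device}")


if __name__ == "__main__":
    # nprocs=None lets torch_xla decide based on the available devices
    xmp.spawn(_worker, args=(), nprocs=None)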

Updates

I'm currently testing a training script to see whether multiprocessing, pipeline parallelism, etc. work properly, and will post updates when I have findings.

@tjruwase
Contributor

Run with: DS_ACCELERATOR=cpuxla deepspeed train.py.
This sets DS_ACCELERATOR to cpu within runner.py; then, just before runner.py invokes launcher.py, it changes DS_ACCELERATOR to xla. This way, the parent process (from XLA's point of view) is launcher.py instead of runner.py.

I am curious about the implementation of the above. Are you handling the case where DS_ACCELERATOR is not available?

I assume you have seen these tutorials:

  1. https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
  2. https://www.deepspeed.ai/tutorials/accelerator-setup-guide/

@tjruwase
Contributor

I don't know exactly how subprocess.Popen differs from the torch_xla spawn method, but Popen starts an entirely separate child process, while torch_xla's spawn forks workers from the parent process.

Does it make sense to add the spawn method into the accelerator abstraction class?
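Roughly, something of this shape is what I have in mind (purely hypothetical, none of these names exist in DeepSpeed today). The awkward part is that Popen takes a command line while torch_xla's spawn takes a Python callable, so the abstraction would need to accommodate both:

import subprocess


class DefaultProcessSpawner:
    def spawn(self, cmd, env, worker_fn=None):
        # current launcher behavior: each rank is an independent child process
        return subprocess.Popen(cmd, env=env)


class XLAProcessSpawner:
    def spawn(self, cmd, env, worker_fn=None):
        # XLA path: fork workers from the single device-owning parent process
        import torch_xla.distributed.xla_multiprocessing as xmp
        return xmp.spawn(worker_fn, args=())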

@tjruwase
Contributor

Another idea for roadblock 1 is to add DS_ACCELERATOR=xla to .deepspeed_env rather than setting it on the command line. Please see this description of .deepspeed_env usage.

@radna0
Author

radna0 commented Dec 24, 2024

Run with: DS_ACCELERATOR=cpuxla deepspeed train.py.
This sets DS_ACCELERATOR to cpu within runner.py; then, just before runner.py invokes launcher.py, it changes DS_ACCELERATOR to xla. This way, the parent process (from XLA's point of view) is launcher.py instead of runner.py.

I am curious about the implementation of the above. Are you handling the case where DS_ACCELERATOR is not available?

I assume you have seen these tutorials:

  1. https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
  2. https://www.deepspeed.ai/tutorials/accelerator-setup-guide/

No, I'm not handling the case where the accelerator is None. And yes, when nothing is available it defaults to CPU or another accelerator. I will add a check for xla.

I hadn't seen those tutorials, but my implementation of the xla_accelerator is based on the cpu and cuda accelerators, so it is actually compatible and works just as the tutorials suggest.

As for a setup guide for PyTorch/XLA, I can document it when I open my PR to merge the changes, or I can do it right away if needed. How should I go about that?


I don't know exactly how subprocess.Popen differs from the torch_xla spawn method, but Popen starts an entirely separate child process, while torch_xla's spawn forks workers from the parent process.

Does it make sense to add the spawn method into the accelerator abstraction class?

Yes, I think this would be ideal. For now I'm just checking the accelerator on the fly within launcher.py and spawning processes accordingly.


Another idea for roadblock 1 is to add DS_ACCELERATOR=xla to .deepspeed_env rather than setting it on the command line. Please see this description of .deepspeed_env usage.

I'm not sure if I'm doing it correctly. I created a .deepspeed_env file in the local directory whose only content is DS_ACCELERATOR=xla, but launching with either of these commands doesn't seem to work; it only uses CPU as the accelerator:

  • deepspeed train.py
  • DS_ENV_FILE=path_to_env/.deepspeed_env deepspeed train.py

@tjruwase
Contributor

launching with either of these commands doesn't seem to work; it only uses CPU as the accelerator

Hmm, that is strange. Can you examine this code to see what is wrong?

for environ_path in DEEPSPEED_ENVIRONMENT_PATHS:
    environ_file = os.path.join(environ_path, DEEPSPEED_ENVIRONMENT_NAME)
    if os.path.isfile(environ_file):
        logger.info(f"deepspeed_env file = {environ_file}")
        with open(environ_file, 'r') as fd:
            for var in fd.readlines():
                key, val = var.split('=', maxsplit=1)
                runner.add_export(key, val)
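For a quick standalone sanity check of whether the file is being found and parsed at all, something like the following can help (the search locations below are my assumption of what DEEPSPEED_ENVIRONMENT_PATHS holds; adjust them if the real constants differ):

import os

DEEPSPEED_ENVIRONMENT_NAME = ".deepspeed_env"
DEEPSPEED_ENVIRONMENT_PATHS = [os.path.expanduser("~"), "."]  # assumed search order

for environ_path in DEEPSPEED_ENVIRONMENT_PATHS:
    environ_file = os.path.join(environ_path, DEEPSPEED_ENVIRONMENT_NAME)
    print(f"{environ_file}: exists={os.path.isfile(environ_file)}")
    if os.path.isfile(environ_file):
        with open(environ_file, "r") as fd:
            for var in fd.readlines():
                key, val = var.split("=", maxsplit=1)
                print(f"would export {key}={val.strip()}")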

@tjruwase
Contributor

tjruwase commented Dec 24, 2024

  • deepspeed train.py
  • DS_ENV_FILE=path_to_env/.deepspeed_env deepspeed train.py

Try adding the --force_multi command-line argument, i.e., deepspeed --force_multi ...

@radna0
Author

radna0 commented Dec 24, 2024

@tjruwase I tried all kinds of commands; all of them just use CPU, and with --force_multi it results in a connection error. But I don't think this env check should only apply to the multi-node setting. Can we have it for both multi_node_exec and the single-node path?

Also, for now I'm leaving out the xla check when accelerator_name is None, since the first process (runner.py) needs CPU and only launcher.py can use xla. That check breaks if DS_ACCELERATOR=cpuxla is not set, and in my opinion the default should still fall back to CPU.

@tjruwase
Contributor

@tjruwase I tried all kinds of commands; all of them just use CPU, and with --force_multi it results in a connection error. But I don't think this env check should only apply to the multi-node setting. Can we have it for both multi_node_exec and the single-node path?

Agreed. Will fix.

@tjruwase
Contributor

Also, for now I'm leaving out the xla check when accelerator_name is None, since the first process (runner.py) needs CPU and only launcher.py can use xla. That check breaks if DS_ACCELERATOR=cpuxla is not set, and in my opinion the default should still fall back to CPU.

CPU should be the fall-through if no accelerator is detected or specified. So, it seems your use case is exposing an issue here. Is it possible to share a stack trace of the failure?

@tjruwase
Contributor

So DS_ACCELERATOR=cpuxla effectively behaves like:
DS_ACCELERATOR=cpu python deepspeed/runner.py

On further thought, I think the above strategy will create problems in runner.py for logic such as

  1. This
  2. This
  3. This

Something to keep in mind.

@radna0
Author

radna0 commented Dec 24, 2024

Also, for now I'm leaving out the xla check when accelerator_name is None, since the first process (runner.py) needs CPU and only launcher.py can use xla. That check breaks if DS_ACCELERATOR=cpuxla is not set, and in my opinion the default should still fall back to CPU.

CPU should be the fall-through if no accelerator is detected or specified. So, it seems your use case is exposing an issue here. Is it possible to share a stack trace of the failure?

There are two ways we could try setting accelerator_name for xla:
*Note: torch_xla has no check for device availability by default, so we rely on the import error.

  • Setting accelerator_name = "xla" => we hit Roadblock 1. runner.py must stay on CPU for the xla accelerator to work; otherwise we can't access the xla device at all outside of runner.py.

  • Setting accelerator_name = "cpuxla" won't work either, because it only sets the accelerator for get_accelerator(). Maybe, when accelerator_name is None, we could also set the global DS_ACCELERATOR env var; if we did that, 'cpuxla' would work as I described above.

# TODO: This will not work right now, only with the cpuxla env
# (cpu first for runner.py, xla set for launcher.py)
if accelerator_name is None:
    try:
        import torch_xla  # noqa: F401,F811

        # torch_xla has no check such as torch_xla.is_available()
        accelerator_name = "cpuxla"  # or "xla"
    except ImportError:
        pass
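If we go with the second option, the detection would also need to export the env var so that launcher.py sees it; roughly like this (sketch only, with the same caveat as above):

import os


def detect_xla_accelerator(accelerator_name):
    # Sketch of the "also set the global DS_ACCELERATOR env" variant.
    if accelerator_name is None:
        try:
            import torch_xla  # noqa: F401

            accelerator_name = "cpuxla"
            # make the choice visible to launcher.py (and other child processes)
            os.environ["DS_ACCELERATOR"] = "cpuxla"
        except ImportError:
            pass
    return accelerator_name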

So DS_ACCELERATOR=cpuxla effectively behaves like:
DS_ACCELERATOR=cpu python deepspeed/runner.py

On further thought, I think the above strategy will create problems in runner.py for logic such as

  1. This
  2. This
  3. This

Something to keep in mind.

With DS_ACCELERATOR=cpuxla:

  1. visible_devices_envs() does not affect how cpuxla works. I see that running this way sets CUDA_VISIBLE_DEVICES, but everything still runs fine.
  2. device_count() will be a problem here, since cpuxla relies on cpu_accelerator.py, and I believe any PyTorch/XLA import initializes the runtime in the importing (parent) process. This could be resolved with a dedicated cpuxla_accelerator.py that still uses the CPU implementation for most of the basic APIs but obtains device_count() through the tpu-info metrics package, which has APIs for detecting basic info like devices and memory usage without initializing the XLA runtime (see the sketch after this list).
  3. export_envs() also does not affect this. I could potentially set the PJRT_DEVICE=TPU env var, but XLA works with GPUs as well, and the env is detected automatically by PyTorch/XLA.
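Roughly, such a cpuxla device_count() could look like the following. Counting /dev/accel* device files is an assumption about the TPU VM layout (the real implementation would go through tpu-info), and TPU_NUM_DEVICES is a hypothetical manual override, not an existing setting:

import glob
import os


def cpuxla_device_count():
    override = os.environ.get("TPU_NUM_DEVICES")  # hypothetical manual override
    if override is not None:
        return int(override)
    # Cloud TPU VMs typically expose one /dev/accel<N> file per chip
    accel_files = glob.glob("/dev/accel*")
    return len(accel_files) if accel_files else 1  # fall back to a single device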
