tweak libpaths in TensorFlow easyblock by adding directory containing libnccl.so.2 #3497

Open
wants to merge 6 commits into
base: develop
Changes from 3 commits
9 changes: 9 additions & 0 deletions easybuild/easyblocks/t/tensorflow.py
@@ -701,13 +701,22 @@ def configure_step(self):
    })
else:
    raise EasyBuildError("TensorFlow has a strict dependency on cuDNN if CUDA is enabled")

if nccl_root:
    nccl_version = get_software_version('NCCL')
    # Ignore the PKG_REVISION identifier if it exists (i.e., report 2.4.6 for 2.4.6-1 or 2.4.6-2)
    nccl_version = nccl_version.split('-')[0]
    config_env_vars.update({
        'NCCL_INSTALL_PATH': nccl_root,
    })

    # add path to libnccl.so.2 directory provided by NCCL when both sysroot
    # and RPATH are used (such as in EESSI)
    if build_option('sysroot') and self.toolchain.use_rpath:
Member

Is build_option('sysroot') a necessary condition here? The core issue is the use of RPATH combined with the lack of LD_LIBRARY_PATH.

Contributor Author

It's not a necessary condition; it only limits when this code is executed.

An alternative approach could be to use some environment variable which could contain paths to be added to LIBRARY_PATH. In this easyblock we could check if it is set and append the paths to the third tuple element. In EESSI, we could then set this in a hook.
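A minimal sketch of that alternative, for illustration only. The variable name EB_TENSORFLOW_EXTRA_LIBRARY_PATHS and the helper extend_libpaths are hypothetical, and the tuple layout (system libs, include paths, library paths) is assumed from the discussion in this thread:

```python
import os

# Hypothetical variable name, used only for this illustration.
EXTRA_LIBPATHS_VAR = 'EB_TENSORFLOW_EXTRA_LIBRARY_PATHS'


def extend_libpaths(system_libs_info, environ=None):
    """Return a copy of the 3-tuple with extra library paths appended to its third element."""
    environ = os.environ if environ is None else environ
    # Split on the platform path separator and drop empty entries.
    extra = [p for p in environ.get(EXTRA_LIBPATHS_VAR, '').split(os.pathsep) if p]
    system_libs, cpaths, libpaths = system_libs_info
    return system_libs, cpaths, list(libpaths) + extra
```

A hook would then only need to set the variable before the easyblock runs. The price is that the build result depends on an ad hoc setting outside the easyconfig.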

Member

This is a general issue, though: with that configuration (RPATH and no LD_LIBRARY_PATH) you will run into this problem. The solution is specific to NCCL, and that is fine. I wouldn't introduce something arbitrary that would affect reproducibility.

You could use something similar to

filtered_env_vars = build_option('filter_env_vars') or []
if 'LD_LIBRARY_PATH' in filtered_env_vars and 'LIBRARY_PATH' not in filtered_env_vars:
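The suggested condition can be sketched as a standalone predicate. The function name is hypothetical, and the filtered variables are passed in as an argument so the sketch runs without EasyBuild; in the easyblock the value would come from build_option('filter_env_vars'):

```python
def ld_library_path_filtered_only(filter_env_vars):
    """True if LD_LIBRARY_PATH is filtered out while LIBRARY_PATH is still available."""
    filtered = filter_env_vars or []  # the option may be unset (None)
    return 'LD_LIBRARY_PATH' in filtered and 'LIBRARY_PATH' not in filtered
```

Only in that situation would the NCCL library directory need to be added explicitly.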

Member

Indeed, if this is not specific to using an alternate sysroot, then remove that part of the condition; there's no need to artificially restrict this to the EESSI context.

Contributor Author

@trz42 Nov 8, 2024

That's right, it does not seem to be specific to using an alternate sysroot.

Changed the conditional expression (see edc9bfe) and tested this. After the change, libnccl.so.2 is found.

Note that there is another issue with building TensorFlow in EESSI, but I may know how to fix it. It likely requires changing the shebang so that scripts use env from the compat layer. Changing the scripts will need some additional work to circumvent a sanity check run by Bazel (see https://stackoverflow.com/questions/47775668/bazel-how-to-skip-corrupt-installation-on-centos6).

The latter fix can be done in the easyconfig or in a hook.
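As a rough illustration of the shebang change mentioned above, assuming the affected scripts start with '#!/usr/bin/env ...'. The helper name and the env path passed to it are hypothetical, and the sketch does not address the Bazel sanity check:

```python
def rewrite_env_shebang(script_text, env_path):
    """Point a '#!/usr/bin/env ...' shebang at a different env binary."""
    prefix = '#!/usr/bin/env'
    first, sep, rest = script_text.partition('\n')
    # Only touch an exact env shebang, with or without arguments.
    if first == prefix or first.startswith(prefix + ' '):
        first = '#!' + env_path + first[len(prefix):]
    return first + sep + rest
```

Scripts with any other interpreter line are returned unchanged.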

        system_libs_info_as_list = list(self.system_libs_info)
        system_libs_info_as_list[2].append(os.path.join(nccl_root, get_software_libdir('NCCL')))
        self.system_libs_info = tuple(system_libs_info_as_list)
Contributor

@Flamefire Nov 8, 2024

That part looks very fishy, and I had to read the full code to verify that it is correct. We should make the return value of get_system_libs, and hence self.system_libs_info, a named tuple instead, to make it easier to understand.
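A sketch of what such a named tuple could look like. The type name and the field name system_libs are assumptions; cpaths and libpaths follow the names used elsewhere in this thread, and the paths are placeholders:

```python
from collections import namedtuple

# Hypothetical type; the field order mirrors the existing 3-tuple.
SystemLibsInfo = namedtuple('SystemLibsInfo', ['system_libs', 'cpaths', 'libpaths'])

info = SystemLibsInfo(system_libs=['zlib'],
                      cpaths=['/opt/zlib/include'],
                      libpaths=['/opt/zlib/lib'])

# Existing positional code keeps working ...
cpaths, libpaths = info[1:]
# ... while new code can name the element it changes instead of using index 2.
info.libpaths.append('/opt/nccl/lib')
```

Because a named tuple is still a tuple, slicing and unpacking at existing call sites would not need to change.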

However, from a semantic point of view this is the wrong place to add NCCL: "system libs" in the context of TensorFlow are dependencies that can be vendored in a way TF understands, see https://github.com/tensorflow/tensorflow/blob/master/third_party/systemlibs/syslibs_configure.bzl#L11

I'd rather put this into the build_step, where action_env['LIBRARY_PATH'] is set. The easiest way would be to (conditionally) append to libpaths right after the line cpaths, libpaths = self.system_libs_info[1:].

This is also easier to understand due to the comment there: "Make TF find our modules. LD_LIBRARY_PATH gets automatically added by configure.py"
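A standalone sketch of that idea, not the code from the PR. The function name is hypothetical, and the use_rpath flag merely stands in for whatever condition the easyblock ends up using:

```python
import os


def library_path_for_build(system_libs_info, nccl_root=None, nccl_libdir='lib', use_rpath=False):
    """Compose a LIBRARY_PATH value, optionally including the NCCL library directory."""
    cpaths, libpaths = system_libs_info[1:]
    libpaths = list(libpaths)  # copy, so the shared tuple element is not modified
    if nccl_root and use_rpath:
        # Directory that is expected to contain libnccl.so.2
        libpaths.append(os.path.join(nccl_root, nccl_libdir))
    return os.pathsep.join(libpaths)
```

Working on a copy inside the build step leaves self.system_libs_info limited to the libraries TensorFlow treats as system libs.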

Contributor Author

Sounds right to me.

Contributor Author

Made an attempt in f697d97

Not tested yet. Will let you know if it works or not.

Contributor Author

The changes in f697d97 worked.


else:
    nccl_version = '1.3'  # Use simple downloadable version
    config_env_vars.update({