tweak libpaths in TensorFlow easyblock by adding directory containing libnccl.so.2 #3497
base: develop
Changes from 3 commits
13dd418
0f331a8
f6a9afd
edc9bfe
f697d97
68d89b9
@@ -701,13 +701,22 @@ def configure_step(self):
                    })
                else:
                    raise EasyBuildError("TensorFlow has a strict dependency on cuDNN if CUDA is enabled")

                if nccl_root:
                    nccl_version = get_software_version('NCCL')
                    # Ignore the PKG_REVISION identifier if it exists (i.e., report 2.4.6 for 2.4.6-1 or 2.4.6-2)
                    nccl_version = nccl_version.split('-')[0]
                    config_env_vars.update({
                        'NCCL_INSTALL_PATH': nccl_root,
                    })

                    # add path to libnccl.so.2 directory provided by NCCL when both sysroot
                    # and RPATH are used (such as in EESSI)
                    if build_option('sysroot') and self.toolchain.use_rpath:
                        system_libs_info_as_list = list(self.system_libs_info)
                        system_libs_info_as_list[2].append(os.path.join(nccl_root, get_software_libdir('NCCL')))
                        self.system_libs_info = tuple(system_libs_info_as_list)

That part looks very fishy and I had to read the full code to verify this is correct. We should make [...]

However, from a semantic POV this is the wrong place to add NCCL: "system libs" in the context of TensorFlow are dependencies that can be vendored in a way TF understands, i.e. https://github.com/tensorflow/tensorflow/blob/master/third_party/systemlibs/syslibs_configure.bzl#L11

I'd rather put this into the [...] This is also easier to understand due to the comment there: "Make TF find our modules. LD_LIBRARY_PATH gets automatically added by configure.py"

Sounds right to me.

Made an attempt in f697d97. Not tested yet. Will let you know if it works or not.

The changes in f697d97 worked.

                else:
                    nccl_version = '1.3'  # Use simple downloadable version
                config_env_vars.update({
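
The `nccl_version.split('-')[0]` line in the diff drops the package revision so that TensorFlow sees only the upstream version. A minimal standalone sketch of that step follows; the helper name `strip_pkg_revision` is made up for this illustration and is not part of the easyblock.

```python
def strip_pkg_revision(version):
    """Return the version without a package revision suffix (everything after the first '-')."""
    # Illustrative helper only; the easyblock performs the split inline.
    return version.split('-')[0]


if __name__ == '__main__':
    for raw in ('2.4.6-1', '2.4.6-2', '2.4.6'):
        print(raw, '->', strip_pkg_revision(raw))
```

A version string without a `-` passes through unchanged, so the split is safe to apply unconditionally.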

Is `build_option('sysroot')` a necessary condition here? The core issue is the use of RPATH and the lack of `LD_LIBRARY_PATH`.

It's not a necessary condition. It rather limits when this code would be executed.

An alternative approach could be to use some environment variable which could contain paths to be added to `LIBRARY_PATH`. In this easyblock we could check if it is set and append the paths to the third tuple element. In EESSI, we could then set this in a hook.
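
The environment-variable alternative described above could look roughly like the sketch below. It is only an illustration under assumptions: the variable name `EXAMPLE_EXTRA_LIBRARY_PATHS` and the helper `append_extra_libpaths` are hypothetical, and the tuple is assumed to have three elements with a list of library paths in the third position.

```python
import os

# Hypothetical variable name, chosen only for this illustration.
EXTRA_LIBPATHS_VAR = 'EXAMPLE_EXTRA_LIBRARY_PATHS'


def append_extra_libpaths(system_libs_info, environ=None):
    """Return a copy of the 3-tuple with paths from the environment appended to its third element."""
    environ = os.environ if environ is None else environ
    raw = environ.get(EXTRA_LIBPATHS_VAR, '')
    # Split on the platform's path separator and skip empty entries.
    extra = [path for path in raw.split(os.pathsep) if path]
    first, second, libpaths = system_libs_info
    return (first, second, list(libpaths) + extra)


if __name__ == '__main__':
    fake_env = {EXTRA_LIBPATHS_VAR: os.pathsep.join(['/opt/nccl/lib', '/opt/other/lib'])}
    print(append_extra_libpaths(([], [], ['/usr/lib']), fake_env))
```

When the variable is unset or empty, the tuple comes back with an unchanged third element.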
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a general issue though, if you have that configuration (rpath and no
LD_LIBRARY_PATH
) then you will run into this problem. The solution is specific to NCCL, and that is fine. I wouldn't introduce something arbitrary that would affect reproducibility.You could use something similar to
easybuild-easyblocks/easybuild/easyblocks/p/python.py
Lines 199 to 200 in 2ab3cbc

Indeed, if this is not specific to using an alternate sysroot, then remove that part of the condition, there's no need to artificially restrict this to the EESSI context...
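
A minimal sketch of the relaxed condition, with the sysroot check dropped so that only RPATH usage decides whether the NCCL library directory is added. This is not the code from the pull request: the function `nccl_libpaths`, its parameters, and the example paths are assumptions made for illustration, and plain arguments stand in for `self.toolchain.use_rpath` and `get_software_libdir('NCCL')`.

```python
import os


def nccl_libpaths(libpaths, nccl_root, nccl_libdir, use_rpath):
    """Return a copy of libpaths, with the NCCL library directory appended when RPATH linking is used."""
    result = list(libpaths)
    # No sysroot check: the problem occurs whenever RPATH is used
    # and LD_LIBRARY_PATH is not available to locate libnccl.so.2.
    if nccl_root and use_rpath:
        result.append(os.path.join(nccl_root, nccl_libdir))
    return result


if __name__ == '__main__':
    print(nccl_libpaths(['/usr/lib'], '/opt/nccl', 'lib', use_rpath=True))
    print(nccl_libpaths(['/usr/lib'], '/opt/nccl', 'lib', use_rpath=False))
```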

That's right, it seems not specific to using an alternate sysroot.

Changed the conditional expression (see edc9bfe) and tested this. After the change, `libnccl.so.2` is found.

Note, there is another issue building TensorFlow with EESSI, but I may know how to fix that. It likely requires changing the shebang so scripts use `env` from the compat layer. Changing the scripts will need some additional work to circumvent a sanity check run by Bazel (see https://stackoverflow.com/questions/47775668/bazel-how-to-skip-corrupt-installation-on-centos6).

The latter fix can be done in the easyconfig or in a hook.
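
The shebang change mentioned above could be sketched as follows. This is a rough illustration only: the helper `rewrite_env_shebang` and the path `/compat/usr/bin/env` are hypothetical, and the sketch does not deal with the Bazel sanity check.

```python
def rewrite_env_shebang(script_text, env_path):
    """Point a '#!/usr/bin/env <interpreter>' first line at another env binary."""
    prefix = '#!/usr/bin/env '
    lines = script_text.split('\n')
    # Only the first line can be a shebang; other lines are left untouched.
    if lines[0].startswith(prefix):
        lines[0] = '#!' + env_path + ' ' + lines[0][len(prefix):]
    return '\n'.join(lines)


if __name__ == '__main__':
    # '/compat/usr/bin/env' is a placeholder, not a real EESSI path.
    print(rewrite_env_shebang('#!/usr/bin/env bash\necho hi\n', '/compat/usr/bin/env'))
```

Scripts whose first line does not start with `#!/usr/bin/env ` are returned unchanged.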