Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert build system to Bazel. #2887

Closed
robertnishihara opened this issue Sep 15, 2018 · 36 comments · Fixed by #4493
Closed

Convert build system to Bazel. #2887

robertnishihara opened this issue Sep 15, 2018 · 36 comments · Fixed by #4493
Assignees

Comments

@robertnishihara
Copy link
Collaborator

It seems like using Bazel (https://github.com/bazelbuild/bazel) would improve our build system in a lot of ways.

@rshin did an early version of this a long time ago in ray-project/ray-legacy#408. @ericl, @concretevitamin, @pcmoritz have all used Bazel I believe and are proponents. @rsepassi and @dbieber have recently worked on this and have a good sense of what is involved.

Any thoughts about this? Any drawbacks to be aware of?

cc @chuxi @guoyuhong @raulchen

@robertnishihara robertnishihara added the question Just a question :) label Sep 15, 2018
@robertnishihara robertnishihara changed the title Convert build system to bazel. Convert build system to Bazel. Sep 15, 2018
@guoyuhong
Copy link
Contributor

Will CMake script be retired or will both ways be kept? Currently, there are duplicated building for thirdparty libraries, e.g. zlib for Arrow and Parquet, etc. Avoiding duplicated building may improve the building time.

@chuxi
Copy link
Contributor

chuxi commented Sep 17, 2018

Hi @robertnishihara ,
actually, while I was working on cmake, I was confused with two problems:

  1. how to manage dependency and its transitive dependency in cmake.
  2. how to package and deploy.

I ever referred to tensorflow project to find out cmake best practice. And draw a principle for thirdparty dependency: ExternalProject only.

In maven, which is the cornerstone, it has maven center repository and local repository. It helps manage all downloaded jars and you can use them in your maven project.

So when it comes to cmake, cmake should also be able to reuse the downloaded dependencies. I would suggest to use FindXXXX.cmake to speed up the (second) build process. However, I still can not deal with the transitive problem unless you wrote each ExternalProject.

So ... for c++ dependency, I need a mechanism to support it. cmake is not enough.

I did not package ray yet, because it is already packaged by python wheels. But for cmake package mechanism, which is based on gcc library links, I found in tensorflow, it uses static library instead of dynamic for thirdparty libraries. em... I have to admit it is quite reasonable.


I have not used bazel yet, but I found it is easy to understand once you know cmake. And it has the ability to support other languages, which is good for python and java. Also It has better third-party dependency management.

The key problem may be it is too new to cooperate with IDEs, which for me it is important to debug.

And at last, if you'd like, we can keep cmake and bazel both solutions. In tensorflow, it has cmake in a standalone directory which contains the whole tensorflow project.

@robertnishihara
Copy link
Collaborator Author

@guoyuhong @chuxi We could potentially keep around cmake for IDE reasons, but I think it would be substantially less work to just retire cmake. @chuxi where is the TensorFlow directory that you're talking about?

@chuxi
Copy link
Contributor

chuxi commented Sep 17, 2018

Will CMake script be retired or will both ways be kept? Currently, there are duplicated building for thirdparty libraries, e.g. zlib for Arrow and Parquet, etc. Avoiding duplicated building may improve the building time.

yes, transitive dependency problem. And I am considering to write some findXXX modules, which support to find a directory which contains all dependency.

we define a cmake property: CMAKE_EXTERNAL_HOME, by default it is ${CMKAE_CURRENT_BINARY_DIR}/build/external. And then we write some findXXX.cmake modules based on the directory.

Then in the ${CMAKE_EXTERNAL_HOME}, all dependencies are kept and the user could get benefit from it.

And for user he can change it to his own directory. So for each build he can get the benefit from local pre build projects.

@chuxi
Copy link
Contributor

chuxi commented Sep 17, 2018

@guoyuhong @chuxi We could potentially keep around cmake for IDE reasons, but I think it would be substantially less work to just retire cmake. @chuxi where is the TensorFlow directory that you're talking about?

tensorflow/contrib/cmake the directory has cmake files for tensorflow, I read and found it works, but may not be always updated as bazel.

@raulchen
Copy link
Contributor

  1. I don't have much experience with Bazel. From its web site https://www.bazel.build/, it doesn't claim support for Python. Would this be a problem for us?
  2. Will buck (https://buckbuild.com/) also be an option? It's similar to bazel. I don't have a in-depth comparison for buck and bazel. But IFAIK, buck has good support for C++, python and java.

@concretevitamin
Copy link
Contributor

Bazel supports Python: https://docs.bazel.build/versions/master/be/python.html and is backed by Google.

@ericl
Copy link
Contributor

ericl commented Sep 17, 2018

I looked a bit into the currently available python support:

Overall it looks workable, though it would probably make sense to tackle Python build conversion incrementally since things aren't fully baked here.

Re: buck, my impression is that does not have much adoption outside of facebook.

@rsepassi
Copy link
Contributor

We've got all C++ dependencies building internally at v0.5.2. I've opened a PR with a partial BUILD file that still needs the external stuff to be added. Totally fine to not merge that, but thought I'd send along so that you all have a starting point.

@rsepassi
Copy link
Contributor

Ideally the code would be restructured so that there's a BUILD file per directory and things are a bit more decoupled. But that would be a lot more work to disentangle stuff. The BUILD file in the PR was written so as to require no/minimal changes to the source given the cmake build.

@rsepassi
Copy link
Contributor

For the external dependencies, it looks like each would have a new_git_repository rule to download them.

A genrule could be used to call the cmake command (or whatever bash command needs to be called to build it).

A cc_library can then wrap all the compiled units, or if it's easy you can write BUILD rules for them (gtest example).

@mitar
Copy link
Member

mitar commented Sep 27, 2018

I had quite some issues in the past installing Bazel to compile some libraries on some supercomputing clusters (like Berkeley's Savio) because it depends on Java. And then you have this problem that to compile a simple thing which does not use Java, you have to get Java onto the system. Was not impressed.

@robertnishihara
Copy link
Collaborator Author

@pcmoritz, if we use Arrow binary artifacts instead of compiling Arrow, then the switch to Bazel will probably be much easier.

Note

  • This requires frequent (e.g., nightly) and persistent binary artifacts from Arrow.
  • This makes it harder to test changes to Arrow in Ray.

@rsepassi
Copy link
Contributor

rsepassi commented Dec 4, 2018 via email

@laurentlb
Copy link

To interface with cmake, see also https://github.com/bazelbuild/rules_foreign_cc (cc @irengrig).

@robertnishihara
Copy link
Collaborator Author

Thanks @rsepassi and @laurentlb!

@robertnishihara
Copy link
Collaborator Author

Mostly done in #3898. We still aren't using Bazel to build our wheels because the Bazel build upgraded Arrow, which is causing other issues (see #4129).

@zbarry
Copy link
Contributor

zbarry commented Feb 25, 2019

Might already know this, but just in case, I think Bazel wasn't added as a dependency yet on master, so this breaks compilation unless the user explicitly installs it before running pip install -v -e .

@zbarry
Copy link
Contributor

zbarry commented Feb 25, 2019

Looks like some error also pops up during redis compilation:

    config.status: creating src/glog/stl_logging.h
    config.status: creating libglog.pc
    config.status: creating src/config.h
    config.status: executing depfiles commands
    config.status: executing libtool commands
    'src/config.h' -> 'bazel-out/k8-opt/genfiles/external/com_github_google_glog/config.h'
    'src/glog/logging.h' -> 'bazel-out/k8-opt/genfiles/external/com_github_google_glog/src/glog/logging.h'
    'src/glog/raw_logging.h' -> 'bazel-out/k8-opt/genfiles/external/com_github_google_glog/src/glog/raw_logging.h'
    'src/glog/stl_logging.h' -> 'bazel-out/k8-opt/genfiles/external/com_github_google_glog/src/glog/stl_logging.h'
    'src/glog/vlog_is_on.h' -> 'bazel-out/k8-opt/genfiles/external/com_github_google_glog/src/glog/vlog_is_on.h'
    [137 / 163] Compiling src/ray/raylet/reconstruction_policy.cc; 80s processwrapper-sandbox ... (24 actions running)
    INFO: From Compiling src/ray/thirdparty/hiredis/dict.c:
    src/ray/thirdparty/hiredis/dict.c:53:21: warning: 'dictGenHashFunction' defined but not used [-Wunused-function]
     static unsigned int dictGenHashFunction(const unsigned char *buf, int len) {
                         ^
    src/ray/thirdparty/hiredis/dict.c:73:14: warning: 'dictCreate' defined but not used [-Wunused-function]
     static dict *dictCreate(dictType *type, void *privDataPtr) {
                  ^
    src/ray/thirdparty/hiredis/dict.c:160:12: warning: 'dictReplace' defined but not used [-Wunused-function]
     static int dictReplace(dict *ht, void *key, void *val) {
                ^
    src/ray/thirdparty/hiredis/dict.c:182:12: warning: 'dictDelete' defined but not used [-Wunused-function]
     static int dictDelete(dict *ht, const void *key) {
                ^
    src/ray/thirdparty/hiredis/dict.c:238:13: warning: 'dictRelease' defined but not used [-Wunused-function]
     static void dictRelease(dict *ht) {
                 ^
    src/ray/thirdparty/hiredis/dict.c:258:22: warning: 'dictGetIterator' defined but not used [-Wunused-function]
     static dictIterator *dictGetIterator(dict *ht) {
                          ^
    src/ray/thirdparty/hiredis/dict.c:268:19: warning: 'dictNext' defined but not used [-Wunused-function]
     static dictEntry *dictNext(dictIterator *iter) {
                       ^
    src/ray/thirdparty/hiredis/dict.c:288:13: warning: 'dictReleaseIterator' defined but not used [-Wunused-function]
     static void dictReleaseIterator(dictIterator *iter) {
                 ^
    ERROR: /data/dtk-pipeline/ray/BUILD.bazel:511:1: Executing genrule //:redis failed (Exit 2) bash failed: error executing command /bin/bash -c ... (remaining 1 argument(s) skipped)

    Use --sandbox_debug to see verbose messages from the sandbox
    + curl -sL https://github.com/antirez/redis/archive/5.0.3.tar.gz
    + tar xz --strip-components=1 -C .

    gzip: stdin: unexpected end of file
    tar: Child returned status 1
    tar: Error is not recoverable: exiting now
    [166 / 191] Compiling src/ray/raylet/worker.cc; 183s processwrapper-sandbox ... (20 actions running)
    Target //:ray_pkg failed to build
    Use --verbose_failures to see the command lines of failed build steps.
    INFO: Elapsed time: 1093.283s, Critical Path: 239.37s
    INFO: 136 processes: 136 processwrapper-sandbox.
    FAILED: Build did NOT complete successfully
    FAILED: Build did NOT complete successfully
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/data/dtk-pipeline/ray/python/setup.py", line 188, in <module>
        license="Apache 2.0")
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
        return distutils.core.setup(**attrs)
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/site-packages/setuptools/command/develop.py", line 38, in run
        self.install_for_development()
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/site-packages/setuptools/command/develop.py", line 140, in install_for_development
        self.run_command('build_ext')
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/data/dtk-pipeline/ray/python/setup.py", line 81, in run
        subprocess.check_call(["../build.sh", "-p", sys.executable])
      File "/scratch/anaconda3/envs/pipe/lib/python3.7/subprocess.py", line 347, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['../build.sh', '-p', '/scratch/anaconda3/envs/pipe/bin/python']' returned non-zero exit status 1.
Cleaning up...
Removed build tracker '/tmp/pip-req-tracker-kethrq6g'

@pcmoritz
Copy link
Contributor

Hmm, I haven't seen this one. What happens if you do curl -sL https://github.com/antirez/redis/archive/5.0.3.tar.gz manually and try to extract the file with tar xz --strip-components=1 -C .? Does it give the same error?

@zbarry
Copy link
Contributor

zbarry commented Feb 28, 2019

Sorry for delay. Found something really weird. Using the system curl, it works just fine. The (my) anaconda curl is what is "broken".

System: CentOS 7
uname -r: 3.10.0-693.21.1.el7.x86_64

[user@host ~]$ /usr/bin/curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.28.4 zlib/1.2.11 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp 
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz unix-sockets 

With anaconda3:

[user@host ~]$ which curl
/scratch/anaconda3/bin/curl
[user@host ~]$ curl --version
curl 7.63.0 (x86_64-conda_cos6-linux-gnu) libcurl/7.63.0 OpenSSL/1.1.1a zlib/1.2.11 libssh2/1.8.0
Release-Date: 2018-12-12
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP UnixSockets HTTPS-proxy 

It appears that curl -sL https://github.com/antirez/redis/archive/5.0.3.tar.gz just returns instantly with the anaconda 3 curl, with no content. For what it's worth, curl https://google.com -sL seems to spit out the google homepage to the terminal just fine with this same curl... weird.

E: ok, same problem / error message either way. Not knowing much about curl, it appears that the curl blah -sL just spits the output straight to the terminal (at least for my system / anaconda curl versions). If tar is expecting a file on disk instead of a stream based on how the build system is set up (no idea), I guess this might be the problem?

    ERROR: /data/ray/BUILD.bazel:515:1: Executing genrule //:redis failed (Exit 2) bash failed: error executing command /bin/bash -c ... (remainin
g 1 argument(s) skipped)                                                                                                                          

    Use --sandbox_debug to see verbose messages from the sandbox
    + curl -sL https://github.com/antirez/redis/archive/5.0.3.tar.gz
    + tar xz --strip-components=1 -C .

    gzip: stdin: unexpected end of file
    tar: Child returned status 1
    tar: Error is not recoverable: exiting now
    [306 / 317] Compiling src/ray/raylet/reconstruction_policy.cc; 1s processwrapper-sandbox ... (7 actions running)
    [306 / 317] Compiling src/ray/raylet/reconstruction_policy.cc; 11s processwrapper-sandbox ... (7 actions running)

@zbarry
Copy link
Contributor

zbarry commented Mar 4, 2019

Any idea what may be happening? Weird that none of the previous curls fail until this one is hit. Would be nice to be able to compile bleeding-edge Ray versions

@robertnishihara
Copy link
Collaborator Author

Hm, for me curl -sL "https://github.com/antirez/redis/archive/5.0.3.tar.gz" also appears to return nothing, but the composition curl -sL "https://github.com/antirez/redis/archive/5.0.3.tar.gz" | tar xz --strip-components=1 -C . seems to do the right thing. This is with

$ which curl
/Users/rnishihara/anaconda3/bin/curl
$ curl --version
curl 7.58.0 (x86_64-apple-darwin13.4.0) libcurl/7.58.0 OpenSSL/1.0.2n zlib/1.2.11
Release-Date: 2018-01-24
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP UnixSockets HTTPS-proxy 

@zbarry
Copy link
Contributor

zbarry commented Mar 4, 2019

Your composition works for me, too, so even more confused now... is that the actual command being run by the build system, though?

@robertnishihara
Copy link
Collaborator Author

The remaining issues that need to be addressed before we deprecate cmake are #4273 and #4272.

@robertnishihara
Copy link
Collaborator Author

@zbarry I've seen the same error that you mentioned in our CI (on Travis) nondeterministically. It's possible that we need to just retry that command if it fails, though it'd be nice to have a more reliable approach.

@zbarry
Copy link
Contributor

zbarry commented Mar 5, 2019

Gotcha. Let me get a fresh repo + conda env going and see what happens

@zbarry
Copy link
Contributor

zbarry commented Mar 5, 2019

Hmm. At least in my case, I don't think the issue is a Monte Carlo curl.

I changed the bazel build line in build.sh to be:

bazel build //:ray_pkg -c opt --sandbox_debug --verbose_failures

which gave the exact command being executed. Maybe this is helpful(?). Here, you can see clearly that the command run is indeed curl -sL "https://github.com/antirez/redis/archive/5.0.3.tar.gz" | tar xz --strip-components=1 -C ., which works if I just paste into a terminal.

    [99 / 177] Executing genrule @com_github_google_glog//:run_configure; 9s processwrapper-sandbox ... (24 actions running)
    ERROR: /data/external_repos/ray/BUILD.bazel:515:1: Executing genrule //:redis failed (Exit 2) process-wrapper failed: error executing command
      (cd /data/external_repos/ray/cache/_bazel_me/e4ce00b73b332e14ab5d91799de1ce7a/execroot/__main__ && \
      exec env - \
        LD_LIBRARY_PATH=/cm/shared/apps/uge/8.6.0/lib/lx-amd64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/scratch/anaconda3/lib/libjpeg-turbo/lib:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/scratch/anaconda3/lib/libjpeg-turbo/lib \
        PATH=/scratch/anaconda3/envs/raybuild/bin:/scratch/anaconda3/condabin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin \
        TMPDIR=/tmp \
      /data/external_repos/ray/cache/_bazel_me/install/25bd58125e78809ebbc928ae699c5e3d/_embedded_binaries/process-wrapper '--timeout=0' '--kill_delay=15' /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh;
            set -x &&
            curl -sL "https://github.com/antirez/redis/archive/5.0.3.tar.gz" | tar xz --strip-components=1 -C . &&
            make &&
            mv ./src/redis-server bazel-out/k8-opt/genfiles/redis-server &&
            chmod +x bazel-out/k8-opt/genfiles/redis-server &&
            mv ./src/redis-cli bazel-out/k8-opt/genfiles/redis-cli &&
            chmod +x bazel-out/k8-opt/genfiles/redis-cli
        '): process-wrapper failed: error executing command
      (cd /data/external_repos/ray/cache/_bazel_me/e4ce00b73b332e14ab5d91799de1ce7a/execroot/__main__ && \
      exec env - \
        LD_LIBRARY_PATH=/cm/shared/apps/uge/8.6.0/lib/lx-amd64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/scratch/anaconda3/lib/libjpeg-turbo/lib:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/scratch/anaconda3/lib/libjpeg-turbo/lib \
        PATH=/scratch/anaconda3/envs/raybuild/bin:/scratch/anaconda3/condabin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin \
        TMPDIR=/tmp \
      /data/external_repos/ray/cache/_bazel_me/install/25bd58125e78809ebbc928ae699c5e3d/_embedded_binaries/process-wrapper '--timeout=0' '--kill_delay=15' /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh;
            set -x &&
            curl -sL "https://github.com/antirez/redis/archive/5.0.3.tar.gz" | tar xz --strip-components=1 -C . &&
            make &&
            mv ./src/redis-server bazel-out/k8-opt/genfiles/redis-server &&
            chmod +x bazel-out/k8-opt/genfiles/redis-server &&
            mv ./src/redis-cli bazel-out/k8-opt/genfiles/redis-cli &&
            chmod +x bazel-out/k8-opt/genfiles/redis-cli
        ')
    + curl -sL https://github.com/antirez/redis/archive/5.0.3.tar.gz
    + tar xz --strip-components=1 -C .

    gzip: stdin: unexpected end of file
    tar: Child returned status 1
    tar: Error is not recoverable: exiting now
    Target //:ray_pkg failed to build
    INFO: Elapsed time: 15.938s, Critical Path: 11.03s
    INFO: 67 processes: 67 processwrapper-sandbox.
    FAILED: Build did NOT complete successfully
    FAILED: Build did NOT complete successfully
    Traceback (most recent call last):

@pcmoritz
Copy link
Contributor

pcmoritz commented Mar 5, 2019

Hmm, I don't really see what is going wrong here. As a workaround, do you want to try replacing curl by wget and see if that fixes the problem?

@zbarry
Copy link
Contributor

zbarry commented Mar 5, 2019

Ugh. I figured it out because wget gave me a useful error message. For whatever reason, that particular genrule is not pulling my http_proxy and https_proxy environment variables in, so github.com is not being resolved... I added them explicitly to the rule via export http_proxy=blah, and Ray built to completion. Is there a way to make sure these vars (I guess all the user's environment vars?) are present in the build environment?

@pcmoritz
Copy link
Contributor

pcmoritz commented Mar 5, 2019

So not including your environment variables is actually a feature of Bazel (it's supposed to create the same build environment on every machine, this also means having a well-defined set of environment variables).

For use cases like this, you can use the following functionality I think: https://bazel.build/designs/2016/06/21/environment.html

If this helps, contributing a patch that allows to pass arguments from setup.py into bazel would be welcome!

@pcmoritz
Copy link
Contributor

pcmoritz commented Mar 5, 2019

Actually, I think putting the option into your .bazelrc would also work and is probably a better solution.

@zbarry
Copy link
Contributor

zbarry commented Mar 5, 2019

Thanks. Will look into modding .bazelrc. Those --sandbox_debug --verbose_failures flags might be worth adding into the build script, by the way!

@zbarry
Copy link
Contributor

zbarry commented Mar 5, 2019

Ok. Good to go. Since that Bazel env config page is not super explicit about how to specify vars to pass through the build environment, for others that have the misfortune of living life behind a proxy, just plop these into ~/.bazelrc:

build --action_env=http_proxy
build --action_env=https_proxy

@robertnishihara
Copy link
Collaborator Author

@zbarry thanks for the fix, @jliagouris just ran into the same problem and your suggestion worked.

@pcmoritz
Copy link
Contributor

pcmoritz commented Mar 6, 2019

@zbarry Thanks for the suggestion, I added the flags here: #4278

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.