Use statically linked loader #2500

Open
mering opened this issue Dec 11, 2024 · 11 comments

@mering

mering commented Dec 11, 2024

🚀 feature request

Relevant Rules

py_binary

Description

With a motivation similar to #691, we would like to package a py_binary (including runfiles) into an oci_image and run it on a minimal base image like distroless_base in order to minimize the attack surface. That base image does not come with a shell or the other tools required by #1929, so #1929 unfortunately doesn't help us.

Describe the solution you'd like

Use a statically linked executable as loader.

Describe alternatives you've considered

Add more stuff to the base image. This is suboptimal because it increases not only the image size but also the attack surface.

@mering
Author

mering commented Dec 11, 2024

I also noticed that Windows already uses a launcher executable:

"_launcher": attr.label(
cfg = "target",
# NOTE: This is an executable, but is only used for Windows. It
# can't have executable=True because the backing target is an
# empty target for other platforms.
default = "//tools/launcher:launcher",
),

I wrote such a launcher executable (template) for Linux to replace the stage1 bootloader script:

#include <errno.h>
#include <unistd.h>

#include <cstdlib>
#include <cstring>
#include <filesystem>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "tools/cpp/runfiles/runfiles.h"
using bazel::tools::cpp::runfiles::Runfiles;

std::string find_python_interpreter(Runfiles& runfiles,
                                    const std::string& interpreter_path) {
  if (interpreter_path.length() > 0 && interpreter_path[0] == '/') {
    // An absolute path, i.e. platform runtime
    return interpreter_path;
  } else if (interpreter_path.find('/') != std::string::npos) {
    // A runfiles-relative path
    return runfiles.Rlocation(interpreter_path);
  } else {
    // A plain word, e.g. "python3". Rely on searching PATH
    return interpreter_path;
  }
}

int main(int argc, char** argv) {
  std::string STAGE1_BOOTSTRAP = argv[0];
  std::string STAGE2_BOOTSTRAP = "%stage2_bootstrap%";
  std::string PYTHON_BINARY = "%python_binary%";

  std::string error;
  std::unique_ptr<Runfiles> runfiles(
      Runfiles::Create(STAGE1_BOOTSTRAP, BAZEL_CURRENT_REPOSITORY, &error));
  if (runfiles == nullptr) {
    std::cerr << "ERROR: Could not resolve runfiles root: " << error << std::endl;
    return 1;
  }

  std::string python_exe = find_python_interpreter(*runfiles, PYTHON_BINARY);
  if (!std::filesystem::is_regular_file(python_exe)) {
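    // Stream the variable; "$python_exe" inside a C++ string literal is not
    // interpolated and would be printed verbatim.
    std::cerr << "ERROR: Python interpreter not found: " << python_exe
              << std::endl;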
    return 1;
  }
  // TODO check if executable to provide better error

  std::string stage2_bootstrap = runfiles->Rlocation(STAGE2_BOOTSTRAP);

  // Don't prepend a potentially unsafe path to sys.path
  // See: https://docs.python.org/3.11/using/cmdline.html#envvar-PYTHONSAFEPATH
  // NOTE: Only works for 3.11+
  // We inherit the value from the outer environment in case the user wants to
  // opt-out of using PYTHONSAFEPATH. To opt-out, they have to set
  // `PYTHONSAFEPATH=` (empty string). This is because Python treats the empty
  // value as false, and any non-empty value as true.
  int result = setenv("PYTHONSAFEPATH", "1", false);
  if (result != 0) {
    std::cerr << "ERROR: Failed to set PYTHONSAFEPATH: " << strerror(errno)
              << std::endl;
  }

  // TODO set RUNFILES_DIR env var to runfiles root
  // Why does runfiles->Rlocation(".") not work?

  // We use `exec` instead of a child process so that signals sent directly
  // (e.g. using `kill`) to this process (the PID seen by the calling process)
  // are received by the Python process. Otherwise, this process receives the
  // signal and would have to manually propagate it. See
  // https://github.com/bazelbuild/rules_python/issues/2043#issuecomment-2215469971
  // for more information.
  std::vector<const char*> args(argv + 1, argv + argc);
  args.insert(args.begin(), {python_exe.c_str(), stage2_bootstrap.c_str()});
  // execvp requires the argument array to be terminated by a null pointer.
  args.push_back(nullptr);
  //  const_cast is safe: https://stackoverflow.com/a/19505361
  execvp(args[0], const_cast<char**>(args.data()));
  // If execvp returns, there was an error.
  std::cerr << "ERROR: Failed to execute " << python_exe << ": "
            << strerror(errno) << std::endl;
  return 1;
}

This template needs to be evaluated to resolve the following variables:

  • %stage2_bootstrap%
  • %python_binary%
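
The three-way decision that find_python_interpreter makes on the substituted %python_binary% value can be checked in isolation, without the runfiles library. The following is only a sketch: the enum and the function name classify_interpreter_path are hypothetical and not part of rules_python or the launcher above.

```cpp
#include <string>

// Hypothetical names for illustration; they mirror the three branches of
// find_python_interpreter in the launcher template.
enum class InterpreterKind {
  kAbsolute,          // e.g. "/usr/bin/python3": platform runtime, used as-is
  kRunfilesRelative,  // e.g. "repo/bin/python3": resolved via Rlocation()
  kPathLookup         // e.g. "python3": left for execvp to search PATH
};

InterpreterKind classify_interpreter_path(const std::string& path) {
  if (!path.empty() && path[0] == '/') {
    return InterpreterKind::kAbsolute;
  }
  if (path.find('/') != std::string::npos) {
    return InterpreterKind::kRunfilesRelative;
  }
  // No slash at all (including the empty string): treated as a plain word.
  return InterpreterKind::kPathLookup;
}
```

Note that an empty value falls into the PATH-lookup branch, so a launcher would probably want to reject an unsubstituted or empty %python_binary% before getting this far.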

I tested this with the following BUILD file:

load("@rules_python//python:defs.bzl", "py_binary")
load("@rules_oci//oci:defs.bzl", "oci_image", "oci_load")
load("@rules_pkg//:pkg.bzl", "pkg_tar")
load("@rules_cc//cc:defs.bzl", "cc_binary")

genrule(
    name = "loader_src",
    srcs = ["loader.cc.tmpl"],
    outs = ["loader.cc"],
    # requires `--@rules_python//python/config_settings:bootstrap_impl=script` to create the stage2 bootstrap
    cmd = 'sed -e "s:%stage2_bootstrap%:_main/zz/_zz_stage2_bootstrap.py:" -e "s:%python_binary%:rules_python~~python~python_3_11_x86_64-unknown-linux-gnu/bin/python3:" "$<" > "$@"',
    local = 1,
)

cc_binary(
    name = "loader",
    srcs = ["loader.cc"],
    deps = [
        "@bazel_tools//tools/cpp/runfiles",
    ],
)

py_binary(
    name = "zz",
    srcs = ["zz.py"],
)

pkg_tar(
    name = "zz_layer",
    srcs = [
        "loader",
        ":zz",
    ],
    include_runfiles = True,
    strip_prefix = "/",
)

oci_image(
    name = "zz_image",
    base = "@distroless_base",
    entrypoint = ["/zz/loader"],
    tars = [":zz_layer"],
    workdir = "/",
)

oci_load(
    name = "zz_image.tar",
    image = ":zz_image",
    repo_tags = ["zz/zz_image:latest"],
)

@rickeylev
Collaborator

cc @groodt who I think also liked the idea of a native binary to launch things

@groodt
Collaborator

groodt commented Dec 11, 2024

I made a proposal a while ago, but nothing has really progressed: bazelbuild/proposals#275

I'm supportive of the idea, I'm just concerned about teams having to bring additional toolchains for compiling native launchers. Ideally it would be out of the box with bazel, since I think there are many interpreted languages that could benefit from a native launcher, but that arguably is more challenging to solve than solving it in rules_python.

Now that the rules are fully extracted out of bazelbuild/bazel, I imagine this could be something that is tackled eventually. But it's probably simpler to have a small docker image with python in it, or an approach like the one posted above, which is a neat solution to the problem.

@mering
Author

mering commented Dec 11, 2024

> I'm supportive of the idea, I'm just concerned about teams having to bring additional toolchains for compiling native launchers. Ideally it would be out of the box with bazel, since I think there are many interpreted languages that could benefit from a native launcher, but that arguably is more challenging to solve than solving it in rules_python.

As of #1929 we have the flag --@rules_python//python/config_settings:bootstrap_impl, so my suggestion would be to just add an additional option there and make it optional in the beginning. So the additional toolchain will only be used if requested explicitly. Also we would not need to add zip support in the beginning but could also add it later. If we move the launcher somewhere else later, this is only an implementation detail.

> But it's probably simpler to have a small docker image with python in it

The problem is that it requires Python twice, once in the image to bootstrap and then additionally packaged as part of the runfiles. A full Python runtime is not "small".

@rickeylev
Collaborator

re: code: That looks pretty promising!

The main case I think is missing is the zip case. I guess statically link zlib into it (and we don't necessarily have to use zip, could use another format)?

For prototyping this, having the py_executable macro create a cc_binary is probably the easiest thing to do. Wiring it in is probably going to be hacky, but such is a prototype.

For the final code, though, I see two options:
(1) Calling the cc APIs (cc_common et al); I'm not entirely sure on how to do that, but there's enough prior art that we can figure it out. All we really need to do is copy/paste some core part of how cc_binary performs compiling and linking.

(2) Use cc_binary as-is, but modifying it after-the-fact, similar to the windows launcher. If we had a way to modify the contents of a binary (to perform the string replacements necessary), then this seems preferable to (1). The windows launcher does some sort of trick to append a couple extra lines onto the binary, which works, but also seems a bit hacky. Being able to e.g. stick the paths in a special elf section or something seems much more appealing.
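
One way to picture option (2) is a prebuilt launcher that reserves a fixed-size slot (a marker followed by NUL padding) which a later build step overwrites with the real path. The sketch below only shows that patching step on an in-memory copy of the binary. The function name patch_slot, the marker, and the slot layout are assumptions made for illustration; this is not how the Windows launcher works and not an existing rules_python mechanism.

```cpp
#include <cstddef>
#include <string>

// Overwrites a fixed-size slot inside `blob` (the bytes of a prebuilt
// launcher) with `value`, padded with NULs. The slot is located by `marker`,
// which is assumed to sit at the start of the slot. The blob's size never
// changes, so offsets elsewhere in the binary stay valid.
bool patch_slot(std::string& blob, const std::string& marker,
                const std::string& value, std::size_t slot_size) {
  const std::size_t pos = blob.find(marker);
  if (pos == std::string::npos) return false;       // marker not present
  if (pos + slot_size > blob.size()) return false;  // slot runs past the end
  if (value.size() >= slot_size) return false;      // keep room for a NUL

  std::string slot(slot_size, '\0');
  slot.replace(0, value.size(), value);  // value followed by NUL padding
  blob.replace(pos, slot_size, slot);    // same length in, same length out
  return true;
}
```

The launcher would then read the slot as an ordinary C string at runtime. A dedicated ELF section would make the slot easier to locate reliably than a byte search, at the cost of platform-specific tooling.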

Also, this doesn't have to use C++. Anything that produces a native, standalone executable would suffice (e.g., Rust is all the rage now).

> I also noticed that Windows already uses a launcher executable:

Yeah, it does, except it's primitive and out of our control, so it's more of a headache than a help for us. I wanted to replace it with a PowerShell-based thing after bootstrap=script is made the default, to reduce the number of ways we bootstrap programs.

> have a separate bootstrap_impl value

Yep! That's one of the reasons I made that flag a string instead of boolean :)

@groodt
Collaborator

groodt commented Dec 11, 2024

> The problem is that it requires Python twice

There are sneaky things that can be done with the shebang so that it uses the hermetic interpreter, but I'll need to dig it out.

Overall, I agree though. Just noting that there are mechanisms to work around this at the moment if desperate.

@rickeylev
Collaborator

> sneaky things that can be done with the shebang

With the script-based bootstrap, you can probably more easily just use a custom stage1 bootstrap. This avoids the problem of fitting a whole program into the single line that shebang processing accepts.

Also, in Marcel's case, that may not work anyway -- he says his image doesn't have any shells available at all.

Hm, actually, I wonder if you could stick a prebuilt binary in as the stage1 bootstrap file. This might make it easier to prototype a native launcher, at the least.

It'll get passed through ctx.actions.expand_template, though, and I'm not sure if that handles binary data. Worst case, some flag somewhere to disable calling expand_template on it.

> including python twice

An alternative is to use something like the runtime env toolchain or a platform runtime.

The runtime env toolchain's "interpreter" is simply a shell script that basically does exec python3 $@. You could, alternatively, point it to a prebuilt binary that did the same.

A platform runtime is when a fixed path is used, i.e. setting `py_runtime.interpreter_path = "/usr/bin/python"`.
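
As a rough illustration of the "prebuilt binary that did the same" idea for the runtime env toolchain, the core of such a wrapper is building a NULL-terminated argv that puts the interpreter name first and forwards the caller's arguments. This is a minimal sketch, and the helper name build_exec_argv is hypothetical.

```cpp
#include <string>
#include <vector>

// Builds the argv for an exec-style call: `interpreter` becomes argv[0], the
// caller's arguments (everything after the wrapper's own argv[0]) follow, and
// the array ends with the null pointer that execvp requires.
// The returned pointers alias `interpreter` and `argv`; both must outlive the
// returned vector.
std::vector<char*> build_exec_argv(const char* interpreter, int argc,
                                   char** argv) {
  std::vector<char*> out;
  // execvp takes char* const[] but does not modify the strings.
  out.push_back(const_cast<char*>(interpreter));
  for (int i = 1; i < argc; ++i) {
    out.push_back(argv[i]);
  }
  out.push_back(nullptr);
  return out;
}
```

A wrapper's main would call `build_exec_argv("python3", argc, argv)` and then `execvp(args[0], args.data())` from `<unistd.h>`, reporting `errno` if the call returns. Because execvp searches PATH for a name without a slash, this matches the behavior of the shell one-liner without needing a shell in the image.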

@aignas
Collaborator

aignas commented Dec 12, 2024

For completeness, if we want to distribute binaries together with rules_python releases, I think Aspect has a blogpost on something related: https://blog.aspect.build/releasing-bazel-rulesets-rust

They are using rust to create their launcher that builds a venv on the fly (if I remember correctly) but having checked their code it seems that they are still depending on a shell bootstrap: https://github.com/aspect-build/rules_py/blob/main/py/private/run.tmpl.sh.

Since we already depend on rules_cc, I think using C/C++ could be the way to go. We should not be doing anything fancy enough to warrant Rust. Zig could also be an option because it is easy to cross-compile, but I am not sure if bazel has good support for it.

@mering
Author

mering commented Dec 12, 2024

> Also, this doesn't have to use C++. Anything that produces a native, standalone executable would suffice (e.g., Rust is all the rage now).

While I would have liked to pick another language, I chose C++ for the following reasons:

  • The toolchain is most likely already available in a Bazel workspace
  • It has a decent Bazel runfiles library
  • It doesn't require a language runtime, unlike Go

> including python twice
>
> An alternative is to use something like the runtime env toolchain or a platform runtime.
>
> The runtime env toolchain's "interpreter" is simply a shell script that basically does exec python3 $@. You could, alternatively, point it to a prebuilt binary that did the same.
>
> A platform runtime is when a fixed path is used, i.e. setting `py_runtime.interpreter_path = "/usr/bin/python"`.

We would prefer to just use the runtime as part of the runfiles in order to avoid knowing or figuring out where in the image the runtime is. Also we would like to be independent of the base image and not require others to add Python and configure the paths correctly.

> For prototyping this, having the py_executable macro create a cc_binary is probably the easiest thing to do. Wiring it in is probably going to be hacky, but such is a prototype.
>
> For the final code, though, I see two options: (1) Calling the cc APIs (cc_common et al); I'm not entirely sure on how to do that, but there's enough prior art that we can figure it out. All we really need to do is copy/paste some core part of how cc_binary performs compiling and linking.
>
> (2) Use cc_binary as-is, but modifying it after-the-fact, similar to the windows launcher. If we had a way to modify the contents of a binary (to perform the string replacements necessary), then this seems preferable to (1). The windows launcher does some sort of trick to append a couple extra lines onto the binary, which works, but also seems a bit hacky. Being able to e.g. stick the paths in a special elf section or something seems much more appealing.

I only briefly checked the rules and it looks like there is no real macro as part of py_binary() which could be used to instantiate an additional rule. There is create_executable_rule() which returns a rule. All the other macros seem to be called as part of the rule implementation and also cannot just instantiate a cc_binary.
Interesting insights about the Windows launcher modifying the binary.

@rickeylev
Collaborator

Ah, right, there isn't a common macro for both binaries and tests. Each has its own macro that calls its own rule (this isn't for a particular reason, just something that organically happened). python/private/py_binary_macro.bzl and python/private/py_test_macro.bzl are the macros for binaries and tests, respectively. Those would be the spots to modify to introduce additional targets.

@mering
Author

mering commented Dec 12, 2024

> Ah, right, there isn't a common macro for both binaries and tests. Each has its own macro that calls its own rule (this isn't for a particular reason, just something that organically happened). python/private/py_binary_macro.bzl and python/private/py_test_macro.bzl are the macros for binaries and tests, respectively. Those would be the spots to modify to introduce additional targets.

This is what I tried but getting the values for %python_binary% and %stage2_bootstrap% seems quite involved.
