RFE: support for "output only" builds #183

Open
smoser opened this issue Jul 6, 2021 · 6 comments
Comments

@smoser
Contributor

smoser commented Jul 6, 2021

Recently in stacker usage, I've been building a lot of "output only" layers. Each of these containers builds something, writes the result to /output, and is then consumed by other layers via an import.

See the example stacker.yaml below. The benefits of this approach are:

  • the 'build-only' layer does not need to be rebuilt when changes are made to layers that use its output.
  • clearer separation of inputs and outputs.
  • the runtime container (use-stubby) does not need the build dependencies, only what is required to consume the output.

The one significant cost of this approach is space: the cached 'build-stubby' layer contains the whole build filesystem, even though only /output is ever needed. In this example, / would be on the order of ~700M (enough for a C toolchain) while /output would be on the order of 1M or less.

One potential solution would be to let the layer definition declare an output-dir. The build process would then remove all other directories after the build finishes. It could potentially use some clever tricks with mounts or tmpfs to avoid a big, slow 'rm' of / at the end of the layer build.

```yaml
build-stubby:
  from:
    type: built
    tag: build
  import:
    - https://github.com/puzzleos/stubby/archive/refs/tags/v1.0.0.tar.gz
  run: |
    #!/bin/bash -ex
    mkdir /build
    cd /build
    tar -xvf /stacker/v1.0.0.tar.gz
    cd stubby
    make
    mkdir /output
    tar -cvzf /output/stubby-bin.tar.gz stubby.efi stubby-smash
```

```yaml
use-stubby:
  from:
    type: built
    tag: minbase
  import:
    - stacker://build-stubby/output/stubby-bin.tar.gz
  run: |
    #!/bin/bash -ex
    mkdir /stubby
    tar -C /stubby -xf /stacker/stubby-bin.tar.gz
```
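Under the proposal, the build layer might instead be declared something like the following; note that `output-dir` is a hypothetical key sketched for illustration, not an existing stacker field:

```yaml
build-stubby:
  from:
    type: built
    tag: build
  import:
    - https://github.com/puzzleos/stubby/archive/refs/tags/v1.0.0.tar.gz
  # Hypothetical key: after 'run' completes, stacker would discard everything
  # outside this directory before caching the layer's filesystem.
  output-dir: /output
  run: |
    #!/bin/bash -ex
    mkdir /build /output
    cd /build
    tar -xvf /stacker/v1.0.0.tar.gz
    cd stubby
    make
    tar -cvzf /output/stubby-bin.tar.gz stubby.efi stubby-smash
```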
@tych0
Collaborator

tych0 commented Jul 6, 2021

> the 'build-only' layer does not need to be rebuilt when changes are made to layers that use its output.

At least for this one, that sounds like a bug. build_only layers shouldn't be rebuilt unless something in their set of {layer definition, base layer, imports} changes.

> The one significant cost of this approach is space: the cached 'build-stubby' layer contains the whole build filesystem, even though only /output is ever needed. In this example, / would be on the order of ~700M (enough for a C toolchain) while /output would be on the order of 1M or less.
>
> One potential solution would be to let the layer definition declare an output-dir. The build process would then remove all other directories after the build finishes. It could potentially use some clever tricks with mounts or tmpfs to avoid a big, slow 'rm' of / at the end of the layer build.

This sounds like a per-layer cache choice, which seems a little confusing to me. I'd rather just implement a --dont-cache option so we don't cache things everywhere, since either you care about disk space or you don't.

But to the meat of your proposal: it might be nice to have a /stacker-output or something where people can write stuff and it'll "automagically" show up somewhere else (probably the host rootfs, but then it would be easier to import). Or did you have something else in mind?
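As a sketch of that idea (entirely speculative; `/stacker-output` does not exist in stacker today), a layer could write artifacts to a well-known directory that stacker copies back out to the host once the run section finishes:

```yaml
build-stubby:
  from:
    type: built
    tag: build
  import:
    - https://github.com/puzzleos/stubby/archive/refs/tags/v1.0.0.tar.gz
  run: |
    #!/bin/bash -ex
    tar -xf /stacker/v1.0.0.tar.gz
    cd stubby && make
    # Hypothetical: stacker would copy this directory out to the host
    # (where other layers could import from it) after 'run' exits.
    cp stubby.efi stubby-smash /stacker-output/
```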

@smoser
Contributor Author

smoser commented Jul 6, 2021

> > the 'build-only' layer does not need to be rebuilt when changes are made to layers that use its output.
>
> At least for this one, that sounds like a bug. build_only layers shouldn't be rebuilt unless something in their set of {layer definition, base layer, imports} changes.

Right, and that works correctly: when I change the use-stubby layer, we don't have to (and do not) rebuild the build-stubby layer. That is working as expected and is a benefit of breaking up the layers as I've done.

> > The one significant cost of this approach is space: the cached 'build-stubby' layer contains the whole build filesystem, even though only /output is ever needed. In this example, / would be on the order of ~700M (enough for a C toolchain) while /output would be on the order of 1M or less.
> >
> > One potential solution would be to let the layer definition declare an output-dir. The build process would then remove all other directories after the build finishes. It could potentially use some clever tricks with mounts or tmpfs to avoid a big, slow 'rm' of / at the end of the layer build.
>
> This sounds like a per-layer cache choice, which seems a little confusing to me. I'd rather just implement a --dont-cache option so we don't cache things everywhere, since either you care about disk space or you don't.

I don't think it is a per-layer cache choice. I'm not saying that I want the layer cached or not (I do), but the only content I care about is in /output. That is all that needs to be saved.

> But to the meat of your proposal: it might be nice to have a /stacker-output or something where people can write stuff and it'll "automagically" show up somewhere else (probably the host rootfs, but then it would be easier to import). Or did you have something else in mind?

I had considered that as a separate feature request. Definitely, copying content out of the container would be useful. Currently the only real way to do that is with a bind mount, but if you do that then you rebuild every time.

In this particular case that would only complicate things, though. Right now I'm telling stacker (via import stacker:) that the use-stubby layer depends on the build-stubby layer. If that were just a filesystem path, then stacker would not have any dependency information.

I think what I'm asking for is different, though. I just want to tell stacker that its cache does not need to keep the full layer, only the specific locations. Normally stacker has to keep the entire layer, as it might be used via import or via from, but in this case it does not.

@tych0
Collaborator

tych0 commented Jul 6, 2021

> I don't think it is a per-layer cache choice. I'm not saying that I want the layer cached or not (I do), but the only content I care about is in /output. That is all that needs to be saved.

It is a per-layer cache choice, since other layers would have their rootfses cached, but this special kind would not. Worse yet, only certain directories would be cached. I think if you're concerned about disk space, you don't want any of the layers cached, so we should implement an option for that if this is a concern. Selective caching is a recipe for bugs and confusion, IMO.

> I think what I'm asking for is different, though. I just want to tell stacker that its cache does not need to keep the full layer, only the specific locations. Normally stacker has to keep the entire layer, as it might be used via import or via from, but in this case it does not.

I guess I don't understand why this is important. Either you care about disk space or you don't, and if you don't, why not cache the extracted rootfs so you don't have to burn the cpu to re-extract it again in the future?

@tych0
Collaborator

tych0 commented Jul 6, 2021

> I had considered that as a separate feature request. Definitely, copying content out of the container would be useful. Currently the only real way to do that is with a bind mount, but if you do that then you rebuild every time.

Yeah, using binds sucks. So much so that perhaps we should drop them in favor of an output location that gets cleaned every time or something, so that it wouldn't force people to rebuild.

@smoser
Contributor Author

smoser commented Jul 6, 2021

> I guess I don't understand why this is important. Either you care about disk space or you don't, and if you don't, why not cache the extracted rootfs so you don't have to burn the cpu to re-extract it again in the future?

I just don't think it's a boolean. You might say I don't care about water usage because I have a leaky faucet, but I don't just leave the sprinkler running all day long.

I have lots of these little "build stuff" layers. They're very useful. Correct me if I'm wrong, but as an example, say I have 10 "build stuff" layers and 1 "assemble stuff" layer. Each "build stuff" layer builds from a "dev" environment that is 20G and produces 100M of content in /output. Each "build stuff" layer rebuilds when its input changes, and the "assemble stuff" layer rebuilds when any of the "build stuff" layers change. "assemble stuff" is really just IO-limited.

My cache would then cost me 200G (10 build-stuff layers * 20G). That can quickly fill up my NVME when all that was necessary to cache was the output of those build-stuff layers, which is 1G (10 * 100M). If you change the number of "build stuff" layers from 10 to 100 the cache would become infeasible.
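The numbers work out as follows, assuming (as this scenario does) that no data is shared between the cached rootfses:

```python
# Back-of-the-envelope cache sizes for the scenario above, assuming no
# sharing at all between the cached rootfses.
layers = 10
rootfs_gb = 20.0   # full "dev" build environment per layer
output_gb = 0.1    # ~100M of /output content per layer

full_cache_gb = layers * rootfs_gb    # caching whole rootfses
output_only_gb = layers * output_gb   # caching only /output

print(full_cache_gb, output_only_gb)  # 200.0 1.0
```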

@tych0
Collaborator

tych0 commented Jul 6, 2021

> My cache would then cost me 200G (10 build-stuff layers * 20G).

Yes, this is true if you have 10 layers, each with a different base and different layer hashes totaling 20G each. But I doubt that's really the case; if you have one big "this is my build env" image, presumably all the "build stuff" layers use that same base, so the data will be shared. Put another way, if a hash is present in two different base layers, the data is not duplicated in stacker's cache.
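The sharing argument can be illustrated with a toy content-addressed store (a simplification of how OCI-style tooling caches blobs by digest; none of this is stacker's actual code):

```python
import hashlib

# Toy content-addressed store: blobs are keyed by their sha256 digest,
# so a base layer referenced by many images is stored exactly once.
store = {}

def put_blob(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data  # storing identical data again is a no-op
    return digest

base = b"pretend this is the 20G 'build env' base layer"
image_a = [put_blob(base), put_blob(b"delta: build foo")]
image_b = [put_blob(base), put_blob(b"delta: build bar")]

# Two images reference the base, but the store holds only 3 unique blobs.
print(len(store))  # 3
```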

> That can quickly fill up my NVME when all that was necessary to cache was the output of those build-stuff layers, which is 1G (10 * 100M). If you change the number of "build stuff" layers from 10 to 100 the cache would become infeasible.

I'm not sure I want to write and maintain a bunch of code that is worried about users who want to take as input 2TB of different bits. Gunzipping 2TB (assuming a 50 MB/s gunzip rate) would take about 11 hours, which I'm sure people would also complain about, and want caching for.
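The 11-hour figure checks out, given the assumed 50 MB/s single-stream gunzip throughput:

```python
# Time to gunzip 2TB at an assumed 50 MB/s throughput.
total_bytes = 2 * 10**12   # 2 TB
rate_bps = 50 * 10**6      # 50 MB/s

hours = total_bytes / rate_bps / 3600
print(round(hours, 1))  # 11.1
```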
