PyBi: consider tar instead of zip #1
Comments
Hey, how's it going :-)
Zip files do have some kind of headers on each file to let you process them in a single pass: https://docs.rs/zip/latest/zip/read/fn.read_zipfile_from_stream.html
Also, I think symlinks are equally native in both zips and tars, i.e., they both have a standard way to represent them that the normal command-line tools already support. The draft spec goes into detail because historically wheel tools like pip haven't bothered implementing this and it's slightly annoying to go look up the details, but it's not a big deal.
It's also very handy that pybis support random access, e.g. so you can extract the METADATA file before installing. Also, this lets you do some cute tricks to fetch METADATA without downloading the whole file, which will hopefully stop being useful once https://peps.python.org/pep-0658/ is deployed, but for now it's kind of important.
The worst part about zip files is the poor compression ratios. The spec allows fancier algorithms like zstd or lzma, but in practice most tools don't support this, and even if they did, the compression ratio would still be poor compared to tarballs, b/c of how each file is compressed separately. The best of both worlds would be if the zip spec had a way to include a zstd dictionary to use when decompressing, so you can still do random access but with whole-file-like compression ratios... but alas, this isn't standardized at all.
Anyway... overall I feel like both options have some advantages, but none of them are overwhelming; both ways can work. And given that wheels are already committed to the zip format, and we have lots of existing tooling around that, I think it's best to keep things consistent. And if we want to come up with a better format for both wheels and pybis, then that might be a great idea, but probably better to factor it off into its own project instead of trying to do everything at once :-)
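
For anyone following along, here is a minimal sketch of the random-access point using only Python's standard `zipfile` module. The archive is built in memory, and the member names and metadata contents (`bin/python`, `pybi-info/METADATA`, `Name: example`) are made up for illustration rather than taken from the spec.

```python
import io
import zipfile


def build_example_archive() -> bytes:
    """Build a small zip in memory. Member names and contents are illustrative only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # A large-ish payload that we do NOT want to decompress just to read metadata.
        zf.writestr("bin/python", b"\0" * 100_000)
        zf.writestr("pybi-info/METADATA", "Name: example\nVersion: 0.0\n")
    return buf.getvalue()


def read_member(archive: bytes, name: str) -> bytes:
    """Look the member up in the central directory and decompress only that member."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


archive = build_example_archive()
metadata = read_member(archive, "pybi-info/METADATA").decode()
print(metadata)
```

`ZipFile` needs a seekable file object here, which is exactly the trade-off under discussion: random access comes from the directory at the end of the file.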
I strongly suspect any bottlenecks on installation are going to be something other than the lack of streaming. For instance, we had massive performance challenges with Rustup on Windows until we both moved all syscalls to a threadpool and also got rustup whitelisted by MS Defender to avoid thrashing the CPU during doc extraction. Rustup installs from tar files FWIW, and our current performance challenge is tar's requirement for serial processing - our packages are size optimised. So I'd suggest looking at reading the directory then parallel unpacking all the files from the archive, and looking closely at IO effects and the like.
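
A rough sketch of the "read the directory, then unpack in parallel" suggestion, in Python with the standard library only. This is one possible shape, not how Rustup or any PyBi tool does it: the function name `extract_parallel`, the worker count, and the example file names are all hypothetical, and it assumes a trusted archive because member names are joined onto the destination without sanitizing. Whether it actually helps depends on the IO effects mentioned above, so it would need measuring.

```python
import os
import shutil
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor


def extract_parallel(archive_path, dest, workers=4):
    """Read the zip directory once, then unpack members on a thread pool.

    Assumes a trusted archive: member names are not sanitized.
    """
    with zipfile.ZipFile(archive_path) as zf:
        members = zf.infolist()

    # Create every directory up front so workers never race on mkdir.
    files = []
    for info in members:
        target = os.path.join(dest, info.filename)
        if info.is_dir():
            os.makedirs(target, exist_ok=True)
        else:
            os.makedirs(os.path.dirname(target), exist_ok=True)
            files.append((info, target))

    def unpack(chunk):
        # One archive handle per worker, so threads do not share a file position.
        with zipfile.ZipFile(archive_path) as zf:
            for info, target in chunk:
                with zf.open(info) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
        return len(chunk)

    # Deal the files out round-robin, one chunk per worker.
    chunks = [files[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(unpack, chunks))


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "example.zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(20):
            zf.writestr(f"lib/pkg{i % 4}/mod{i}.py", f"x = {i}\n")
    out = os.path.join(tmp, "out")
    count = extract_parallel(archive_path, out)
    with open(os.path.join(out, "lib/pkg3/mod7.py")) as f:
        sample = f.read()

print(count, repr(sample))
```

This only works because the zip directory gives every member's offset up front; a plain tar stream has no such index, which is the serial-processing limit described above.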
PyBi artifacts are specified to be zip files.
One problem with zip files is that the TOC is at the end of the file, which means that they do not support streaming decompression/extraction. This will limit how fast they can be installed even on fast networks.
Tar files are designed for streaming extraction and don't have this problem. They also support symlinks natively instead of needing the workaround in the spec. And they use stream-level compression which means they support arbitrary compression schemes.
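
One consequence of stream-level compression is that redundancy shared between files can be exploited, whereas a zip compresses each member on its own. A small Python sketch to make that concrete, with two caveats: gzip stands in for zstd so that only the standard library is needed, and the input is synthetic (200 nearly identical made-up files), which exaggerates the gap compared to a real interpreter distribution.

```python
import io
import tarfile
import zipfile

# Synthetic input: many small, similar files. Names and contents are made up.
files = {
    f"lib/module_{i}.py": (
        f"# module {i}\n" + "def helper(value):\n    return value + 1\n" * 20
    ).encode()
    for i in range(200)
}


def zip_size(files):
    """Size of a zip archive in which every member is deflated separately."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_gz_size(files):
    """Size of a tar archive compressed as one gzip stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


print("zip    :", zip_size(files), "bytes")
print("tar.gz :", tar_gz_size(files), "bytes")
```

On this input the single-stream archive comes out smaller because each file can be encoded as references to the previous one; real-world ratios will differ.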
(Full disclosure: I just implemented something very similar to PyBi internally at my firm and I get sub-second installs with something like `curl https://.../ | zstd -d | tar x`. Zstd is right for us with on-prem caches, but over the Internet small sizes are preferable.)
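
For comparison, a Python sketch of the same single-pass idea as the shell pipeline above, using `tarfile` in stream mode. Assumptions: gzip stands in for zstd so that only the standard library is needed, the `ForwardOnly` class is a hypothetical stand-in for a network response that can only be read forwards, and the sketch handles regular files from a trusted archive only (no symlinks, no path sanitizing).

```python
import io
import os
import shutil
import tarfile
import tempfile


class ForwardOnly:
    """Expose bytes through read() only, like a socket or pipe: no seek, no tell."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size=-1):
        return self._inner.read(size)


def extract_stream(stream, dest):
    """Unpack a gzip-compressed tar from a forward-only stream, one member at a time."""
    names = []
    # The "|" in the mode tells tarfile to read sequentially and never seek backwards.
    with tarfile.open(fileobj=stream, mode="r|gz") as tf:
        for member in tf:
            if not member.isfile():
                continue  # this sketch skips directories, symlinks, and so on
            target = os.path.join(dest, member.name)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with tf.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            names.append(member.name)
    return names


def build_tar_gz():
    """Build a tiny example archive in memory. File names are made up."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in [("bin/tool", b"#!/bin/sh\n"), ("lib/a.txt", b"hello\n")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    names = extract_stream(ForwardOnly(build_tar_gz()), tmp)
    with open(os.path.join(tmp, "lib/a.txt"), "rb") as f:
        content = f.read()

print(names, content)
```

Each member is written as soon as its bytes arrive, so extraction can overlap with the download instead of waiting for the end of the file.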