Support parallel extraction for zip files #366

Open
tillig opened this issue Jul 31, 2024 · 0 comments

We have an unusual situation: when nodeenv downloads Node and then calls extractall, the extraction can take three to four hours. This is caused by the security scanners we are required to run, combined with the fact that extractall extracts files synchronously, one at a time. Note that this is on Windows, so there currently isn't an option to use the system Node (which would also solve a lot of problems).

Specifically, we're running into this with pre-commit, which sets up a separate Node environment using nodeenv for each Node-based hook. With four or five Node-based hooks, it can take up to a day to get pre-commit initialized, and when it's time to update a hook... be prepared to spend some time.

To test the difference, I downloaded a single version of the Node.js zip file locally. I then replicated download_node_src (basically) and had it extract the files the way nodeenv does now.

Running this script takes three hours to finish extracting for me.

import operator
import re
import zipfile

def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"
    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
        archive.extractall(src_dir, extract_list)

if __name__ == '__main__':
    main()
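To see where the time goes in a serial extraction like the one above, each member can be extracted and timed individually. The following is a minimal sketch, not part of nodeenv: `time_members` and `demo_timing` are hypothetical helpers, and the small archive built in the demo is a stand-in for the real Node.js zip.

```python
import os
import tempfile
import time
import zipfile


def time_members(zip_path, dest, names):
    """Extract members one at a time and record how long each one takes."""
    timings = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in names:
            start = time.perf_counter()
            archive.extract(name, dest)
            timings.append((time.perf_counter() - start, name))
    return timings


def demo_timing():
    """Build a small throwaway archive and time its serial extraction."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "sample.zip")
        with zipfile.ZipFile(zip_path, "w") as archive:
            for i in range(5):
                archive.writestr(f"pkg/file{i}.txt", "x" * 100)
        with zipfile.ZipFile(zip_path) as archive:
            names = archive.namelist()
        return time_members(zip_path, os.path.join(tmp, "out"), names)


if __name__ == "__main__":
    results = demo_timing()
    # Show the slowest members first.
    for seconds, name in sorted(results, reverse=True)[:3]:
        print(f"{seconds:.4f}s  {name}")
```

If most members take roughly the same time regardless of size, that would suggest a fixed per-file cost (such as a scanner inspecting each new file) rather than decompression itself.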

I found an interesting blog article that explains how to use ThreadPoolExecutor to unzip in parallel. With that approach, the extraction takes three minutes, because the security scanner can work on many files in parallel alongside the thread pool. The example below uses 100 threads. Increasing that to 200 threads cuts the time in half, to about 90 seconds.

import operator
import re
import zipfile
from concurrent.futures import ThreadPoolExecutor

def unzip_file(handle, filename, path):
    handle.extract(filename, path)

def main():
    ctx = zipfile.ZipFile('node-v22.4.1-win-x64.zip', 'r')
    members = operator.methodcaller('namelist')
    member_name = lambda s: s  # noqa: E731
    args_node = "22.4.1"
    src_dir = "C:\\Users\\username\\temp\\extract-destination"
    with ctx as archive:
        node_ver = re.escape(args_node)
        rexp_string = r"node-v%s[^/]*/(README\.md|CHANGELOG\.md|LICENSE)"\
            % node_ver
        extract_list = [
            member
            for member in members(archive)
            if re.match(rexp_string, member_name(member)) is None
        ]
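        with ThreadPoolExecutor(100) as exe:
            futures = [exe.submit(unzip_file, archive, m, src_dir) for m in extract_list]
        # Surface any extraction error instead of silently dropping it.
        for future in futures:
            future.result()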

if __name__ == '__main__':
    main()
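One possible shape for a reusable helper is sketched below. It is an illustration under stated assumptions, not nodeenv's code: `extract_parallel`, `extract_chunk`, `make_directories`, and the `max_workers` default are all hypothetical. Unlike the script above, each task opens its own ZipFile handle, which avoids relying on a single handle being safe to share across threads. Directories are created up front because, depending on the Python version, concurrent `extract` calls may race when creating the same parent directory.

```python
import os
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor


def make_directories(dest, names):
    """Create every target directory serially before any thread starts."""
    for name in names:
        target = os.path.join(dest, name)
        directory = target if name.endswith("/") else os.path.dirname(target)
        os.makedirs(directory, exist_ok=True)


def extract_chunk(zip_path, names, dest):
    """Extract the given members using a handle private to this task."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in names:
            archive.extract(name, dest)
    return len(names)


def extract_parallel(zip_path, dest, names, max_workers=8):
    """Split names into up to max_workers chunks and extract them in threads."""
    make_directories(dest, names)
    chunks = [names[i::max_workers] for i in range(max_workers)]
    chunks = [chunk for chunk in chunks if chunk]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_chunk, zip_path, c, dest) for c in chunks]
        # result() re-raises any exception raised inside a worker thread.
        return sum(future.result() for future in futures)


def demo():
    """Build a small throwaway archive and extract it in parallel."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "sample.zip")
        with zipfile.ZipFile(zip_path, "w") as archive:
            for i in range(20):
                archive.writestr(f"pkg/file{i}.txt", f"content {i}")
        with zipfile.ZipFile(zip_path) as archive:
            names = archive.namelist()
        dest = os.path.join(tmp, "out")
        count = extract_parallel(zip_path, dest, names, max_workers=4)
        extracted = os.listdir(os.path.join(dest, "pkg"))
        return count, len(extracted)


if __name__ == "__main__":
    print(demo())
```

Making the worker count a parameter would let it be tuned per environment, since the best value reported above (100 to 200 threads) reflects one machine with one scanner setup and may not carry over elsewhere.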

Would this project be interested in a pull request that updates the zip file extraction to work in parallel? I'm not much of a Python developer, but I'd be happy to give it a go.
