
Concurrent local extraction of binaries fails sometimes #457

Open
travisdowns opened this issue Nov 8, 2024 · 5 comments · May be fixed by #498

travisdowns (Contributor) commented Nov 8, 2024

When the node_exporter role is used to install the exporter on more than one host, we see occasional failures of binary extraction after download, like so:

TASK [prometheus.prometheus._common : Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz] ***
changed: [52.27.162.6 -> localhost]
changed: [34.212.107.78 -> localhost]
changed: [35.160.16.87 -> localhost]
changed: [34.211.30.255 -> localhost]
fatal: [35.91.204.184 -> localhost]: FAILED! => {"changed": true, "dest": "/tmp/node_exporter-linux-arm64/1.8.2", "extract_results": {"cmd": ["/usr/bin/tar", "--extract", "-C", "/tmp/node_exporter-linux-arm64/1.8.2", "-z", "--show-transformed-names", "--strip-components=1", "-f", "/root/.ansible/tmp/ansible-tmp-1730343705.1365232-1179-87748719038915/source"], "err": "", "out": "", "rc": 0}, "gid": 0, "group": "root", "handler": "TgzArchive", "mode": "0755", "msg": "Unexpected error when accessing exploded file: [Errno 2] No such file or directory: b'/tmp/node_exporter-linux-arm64/1.8.2/node_exporter'", "owner": "root", "size": 81, "src": "/root/.ansible/tmp/ansible-tmp-1730343705.1365232-1179-87748719038915/source", "state": "directory", "uid": 0}
changed: [54.212.243.52 -> localhost]
changed: [35.91.104.132 -> localhost]

It's probably less than 1% of total extraction executions, but with many hosts the overall failure rate is quite high.

We delegate the download & extraction to localhost, so this runs in parallel N times for N hosts, but the paths are the same in every case: it's concurrently extracting to the same destination path, which I guess causes this. E.g., from the tar manual:

> When extracting files, if tar discovers that the extracted file already exists, it normally replaces the file by removing it before extracting it, to prevent confusion in the presence of hard or symbolic links.

So there will be brief periods where an already-written file (in this case, the top-level dir) is deleted and replaced by a concurrent extraction, and if an earlier writer tries to list it during that window, it will see it as missing.

Besides the race, it's also inefficient to extract the same file N times.

I guess run_once would fix this if it weren't for the complication of multi-arch. Maybe serial execution + skip-if-already-exists would work with little refactoring of the existing approach?

I think this race existed before the big refactor in 0.20.0, but in the refactor list_files was added which may be the thing that triggers the specific failure above.

travisdowns (Contributor, Author) commented

Actually maybe run_once would work? Because the TASK text above indicates that it is called for a single binary (arm64 here), so maybe all the per-host invocations will have the same result?
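
Sketched, that would just mean adding run_once to the unpack task, something like this (the src/dest paths here are illustrative, not the role's exact ones):

```yaml
- name: Unpack binary archive
  ansible.builtin.unarchive:
    src: "/tmp/node_exporter-{{ version }}.linux-{{ arch }}.tar.gz"  # illustrative path
    dest: "/tmp/node_exporter-linux-{{ arch }}/{{ version }}"
    remote_src: true  # the archive is already on localhost
    extra_opts:
      - --strip-components=1
  delegate_to: localhost
  run_once: true  # extract once for the whole play instead of once per host
```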

gardar (Member) commented Nov 13, 2024

The problem with run_once is that it will take the ansible_architecture fact from the first host in the play and then skip the rest of the hosts. So if you have multiple architectures in your play, you will not get the binaries for all of them.

But a possible solution to this issue might be to use the creates parameter of the unarchive module as it would make the extraction skip completely if the binaries already exist.
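
Roughly like this (a sketch; the paths and variable names are illustrative rather than the role's actual ones):

```yaml
- name: Unpack binary archive
  ansible.builtin.unarchive:
    src: "/tmp/node_exporter-{{ version }}.linux-{{ arch }}.tar.gz"  # illustrative path
    dest: "/tmp/node_exporter-linux-{{ arch }}/{{ version }}"
    remote_src: true
    extra_opts:
      - --strip-components=1
    # Skip the whole task if the binary has already been extracted.
    creates: "/tmp/node_exporter-linux-{{ arch }}/{{ version }}/node_exporter"
  delegate_to: localhost
```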

travisdowns (Contributor, Author) commented

> The problem with run_once is that it will take the ansible_architecture fact from the first host in the play and then skip the rest of the hosts. So if you have multiple architectures in your play, you will not get the binaries for all of them.

Right, but what I'm confused about is that the task title already embeds the architecture as in my log above:

TASK [prometheus.prometheus._common : Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz] ***

It already says arm64, so is the arch not already baked in? If I had a mix of host types, would this title include multiple binaries? I'm only a casual observer though, so don't pay this too much heed.

> But a possible solution to this issue might be to use the creates parameter of the unarchive module as it would make the extraction skip completely if the binaries already exist.

This still seems racy, maybe? The first host starts unpacking, so the root dir specified in creates will exist immediately while the unpacking is still ongoing; then a second host comes in right after, sees the root existing, and proceeds to skip the task. I guess that's fine with the default execution strategy, which holds all hosts from proceeding until all have finished the task ("linear"), but if someone is using strategy: free the 2nd host could proceed and use invalid unpacked binaries.

Maybe it should be creates + serial: 1?
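
(serial is a play-level keyword, so as far as I can tell it would have to come from the calling playbook, something like the sketch below, with an illustrative host group; the downside is that it serializes the entire play, not just the unpack step.)

```yaml
- hosts: monitoring_targets  # illustrative group name
  serial: 1  # process one host at a time through *every* task in the play
  roles:
    - prometheus.prometheus.node_exporter
```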

travisdowns (Contributor, Author) commented

> It already says arm64, so is the arch not already baked in? If I had a mix of host types, would this title include multiple binaries? I'm only a casual observer though, so don't pay this too much heed.

To answer my own question: yes, this task does run with different variable values for different architectures, but the task name in the output just uses the variable values from the first host, so it merely appears that the binaries for one architecture are downloaded while under the covers we are downloading for all architectures.

travisdowns added a commit to travisdowns/prometheus-ansible that referenced this issue Dec 20, 2024
During the common install tasks, we download and unpack the binary to
install on localhost and eventually upload it to each host (among many
other things). Currently the "unpack" step, which extracts the gzipped
tar archive, is performed N times if there are N hosts in the inventory,
but the target directory (something like
/tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with
the same architecture.

This means that all the unarchive tasks are extracting in parallel in an
unsynchronized manner to the same directory. It's a miracle that this
works at all; usually it seems like it does, but sometimes tar will
complain about a missing file, failing the install. I've confirmed
locally that this is due to the race described above.

To fix this, we use `throttle: 1` on the unpack task. This means that
the first task (for each architecture) will do the unpack and the other
tasks are effectively no-ops; they are reported as "ok" rather than
changed in the output.

Fixes prometheus-community#457.
travisdowns linked a pull request Dec 20, 2024 that will close this issue
travisdowns (Contributor, Author) commented

I have pushed a patch that fixes this issue.
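
For reference, the shape of the change (sketched with illustrative src/dest values; see the linked PR for the real task):

```yaml
- name: Unpack binary archive
  ansible.builtin.unarchive:
    src: "/tmp/node_exporter-{{ version }}.linux-{{ arch }}.tar.gz"  # illustrative path
    dest: "/tmp/node_exporter-linux-{{ arch }}/{{ version }}"
    remote_src: true
    extra_opts:
      - --strip-components=1
  delegate_to: localhost
  # Run at most one instance of this task at a time across all hosts; later
  # runs find the files already in place and report "ok" instead of "changed".
  throttle: 1
```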
