Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate checksum after successful import #627

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

gndrmnn
Copy link
Contributor

@gndrmnn gndrmnn commented Aug 10, 2023

Use the new 'verify_checksum' hash value in the yaml files to verify the image integrity after it has been successfully imported. Show a warning, if either the hash algorithm or the hash value does not match the expected fields.

  • Deactivate broken image / deactivate cycle process
  • unit test

The SHA512 values used here are computed by the CI pipeline if and only if the image is updated.

Fixes #340

@gndrmnn gndrmnn force-pushed the fix_340 branch 3 times, most recently from adab6da to 4b8a8a3 Compare August 10, 2023 10:00
@gndrmnn gndrmnn marked this pull request as draft August 10, 2023 13:01
@frittentheke
Copy link

frittentheke commented Aug 11, 2023

With the use of the web-download mechanism of Glance images are usually not downloaded by the image-manager but by Glance itself. I'd refrain from adding any code that would download images and requiring enough bandwidth or even disk space to hold them temporarily.

@gndrmnn did you see my comment at #340 (comment) ? in short: Glance does already produce hash values, it's just that this is done with a single fixed hash algo for all images and it cannot be chosen per image.

I suggest to try and extend the import call of the image API (https://docs.openstack.org/api-ref/image/v2/index.html?expanded=import-an-image-detail#import-an-image) to optionally give it a hash algo to use for the checksum. Then you get the image downloaded by Glance and the hash created by Glance and can simply compare it by fetching the image details.

@gndrmnn
Copy link
Contributor Author

gndrmnn commented Aug 14, 2023

@frittentheke Hi,

With the use of the web-download mechanism of Glance images are usually not downloaded by the image-manager but by Glance itself. I'd refrain from adding any code that would download images and requiring enough bandwidth or even disk space to hold them temporarily.

As far as I understand, the update.py and the corresponding call will only ever be run by a workflow here on GitHub. The change in that file will result in all ymls containing a sha-512 going forward, but they will be "pre-computed" for the average user.

@gndrmnn did you see my comment at #340 (comment) ? in short: Glance does already produce hash values, it's just that this is done with a single fixed hash algo for all images and it cannot be chosen per image.

Yes, I read this. In fact, we compare the (in the future existing) SHA-512 contained in yaml files with the hash that glance computes when importing the image (i think that is how it works?). The imported_image in this call is already the representation of the Image object in openstack, which contains the hash that glance presumably calculated.

I suggest to try and extend the import call of the image API (https://docs.openstack.org/api-ref/image/v2/index.html?expanded=import-an-image-detail#import-an-image) to optionally give it a hash algo to use for the checksum. Then you get the image downloaded by Glance and the hash created by Glance and can simply compare it by fetching the image details.

Yes, that would also be possible. It was discussed with @berendt that we simply recompute the hash values for all image yml files at "image update time". But let's keep the PR on hold until @berendt is back from vacation and gives some input.

@gndrmnn
Copy link
Contributor Author

gndrmnn commented Aug 30, 2023

@frittentheke This was discussed with @berendt and it was decided that we implement this PR as is. The hash values in the yaml files will be pre-computed by the GitHub/Zuul Pipeline and not on the user side. The main advantage of this method is that we can implement it right now.

However, your suggestion to enhance the upstream API can still be applied later as a separate feature. That would allow Glance to throw away any corrupt image before it even "reaches" the backend. I will open a separate issue for that.

See #643

@mbuechse
Copy link
Contributor

mbuechse commented Sep 7, 2023

I'm wondering whether the file should be downloaded AFTER the filename is determined (after the whole HEREBE DRAGONS stuff). Or am I missing something? Also, I would actually keep track of the original hash as well, so that the file need not be downloaded every time the Github action runs. Naturally, I'm volunteering to make the changes I propose, and also to rebase to the outcome of #647 :)

@gndrmnn
Copy link
Contributor Author

gndrmnn commented Sep 7, 2023

I'm wondering whether the file should be downloaded AFTER the filename is determined (after the whole HEREBE DRAGONS stuff). Or am I missing something?

If I understand you correctly, it should make no meaningful difference at which of those two points in time the file is downloaded and hashed.

Also, I would actually keep track of the original hash as well, so that the file need not be downloaded every time the Github action runs

Right, that - or some other mechanism - would make sense to safe some bandwidth. We will consider how to best implement that.

Naturally, I'm volunteering to make the changes I propose, and also to rebase to the outcome of #647 :)

No, it's fine we got this. We might come back to ask some questions about #647 if need be later on though :)

@mbuechse
Copy link
Contributor

mbuechse commented Sep 7, 2023

@gndrmnn

If I understand you correctly, it should make no meaningful difference at which of those two points in time the file is downloaded and hashed.

Maybe I got it wrong myself, but in my understanding, the filename is not even known in advance for certain distros.

The latest_url for CentOS Stream 8, for instance, is given as https://cloud.centos.org/centos/8-stream/x86_64/images/CentOS-Stream-GenericCloud-8-HEREBE\d+\.\dDRAGONS.x86_64.qcow2. When I try do download that, I get a HTTP response code 404. The actual filename has to be determined using the checksums file first to arrive at https://cloud.centos.org/centos/8-stream/x86_64/images/CentOS-Stream-GenericCloud-8-20230904.0.x86_64.qcow2, which can then be downloaded.

@gndrmnn
Copy link
Contributor Author

gndrmnn commented Sep 7, 2023

@gndrmnn

If I understand you correctly, it should make no meaningful difference at which of those two points in time the file is downloaded and hashed.

Maybe I got it wrong myself, but in my understanding, the filename is not even known in advance for certain distros.

The latest_url for CentOS Stream 8, for instance, is given as https://cloud.centos.org/centos/8-stream/x86_64/images/CentOS-Stream-GenericCloud-8-HEREBE\d+.\dDRAGONS.x86_64.qcow2. When I try do download that, I get a HTTP response code 404. The actual filename has to be determined using the checksums file first to arrive at https://cloud.centos.org/centos/8-stream/x86_64/images/CentOS-Stream-GenericCloud-8-20230904.0.x86_64.qcow2, which can then be downloaded.

Ah right, yes. That indeed needs to be taken care of. Thanks for pointing that out.

@gndrmnn gndrmnn force-pushed the fix_340 branch 8 times, most recently from 2b561dd to 513edbc Compare September 21, 2023 10:44
@gndrmnn gndrmnn changed the title Validate provided checksum after successful import Validate checksum after successful import Sep 21, 2023
@gndrmnn gndrmnn requested a review from mbuechse September 22, 2023 13:47
@gndrmnn
Copy link
Contributor Author

gndrmnn commented Sep 22, 2023

@mbuechse I would kindly ask you to review this Merge Request as you are most familiar with the refactored update.py :)

The current iteration of this MR adds an additional field to the yaml schema which contains the sha512 of the actual image. Therefore, we can use the other existing hash to check if a new version was released. Also the filenames are resolved first now. I'd say it makes sense to review the commits individually.

@mbuechse
Copy link
Contributor

@gndrmnn I've been ill. I will certainly have a look one of the next few days.

Copy link
Contributor

@mbuechse mbuechse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By and large good, but with minor formatting issues. Also, I think the additional code path for hashing archives should be revisited if possible.

openstack_image_manager/manage.py Outdated Show resolved Hide resolved
openstack_image_manager/manage.py Outdated Show resolved Hide resolved
openstack_image_manager/update.py Show resolved Hide resolved
openstack_image_manager/update.py Show resolved Hide resolved
openstack_image_manager/update.py Show resolved Hide resolved
else:
assert download_filename not in ["", ".", " ", "/", ".."]
logger.info("Decompressing '%s'" % download_filename)
patoolib.extract_archive(download_filename, outdir=".")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pity that we need a dedicated code path for the archives. There are several aspects to be considered

  • It is probably "more correct" to hash the extracted file instead of the archive, even though the archive shouldn't change if the content doesn't. The question is if the this "added correctness" is worth the price of the additional code path.
  • We could at least extract the file on-the-fly while downloading it; this is not too hard to achieve with bz2, gzip, and xz; for zip, this might be harder (maybe bordering on impossible). The question is whether zip is really needed here.
  • It should be noted that patool is licensed under GPLv3, which might be a problem; see Check licenses of dependencies #663. If we extract on the fly as suggested above, then patool would be out of the picture anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is probably "more correct" to hash the extracted file instead of the archive, even though the archive shouldn't change if the content doesn't. The question is if the this "added correctness" is worth the price of the additional code path.

We can only compare hashes that the glance backend provides us, which is afaik the extracted hash. So hashing the archive would be pointless

  • We could at least extract the file on-the-fly while downloading it; this is not too hard to achieve with bz2, gzip, and xz; for zip, this might be harder (maybe bordering on impossible). The question is whether zip is really needed here.

Alright i will investigate that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. Of course you are right!

openstack_image_manager/update.py Outdated Show resolved Hide resolved
openstack_image_manager/update.py Outdated Show resolved Hide resolved
logger.info(f"Image {name} change detected. Downloading Image...")

headers, verify_checksum, extracted_file = download_and_hash(current_url)
if verify_checksum == None or extracted_file in ["", ".", " ", "/", ".."]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP-8: Do comparisons with None using is or is not instead of == or !=, respectively.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, the second condition (extracted_file in ["", ".", " ", "/", ".."]) is really smelly (there must be a more certain way to rule out failure), and I'm also pretty sure that it wouldn't occur in this way anyway, because download_and_hash uses this name to open a file, and that should hopefully fail if the name were any of the things listed).

Use the 'verify_checksum' hash value in the yaml files to
verify the image integrity after it has been successfully
imported. Show a warning, if either the hash algorithm
or the hash value does not match the expected fields.

Fixes osism#340

Signed-off-by: Gondermann <[email protected]>
Since we want to check the hash against the openstack backend,
we need to have the SHA512. Sadly, most images creators do not
provide us with that hash pre-computed. That means we will
compute the SHA512 for every new image update in the CI worker
by downloading the image.

Signed-off-by: Gondermann <[email protected]>
@gndrmnn
Copy link
Contributor Author

gndrmnn commented Sep 26, 2023

Why on earth did the PEP8 issues not trigger the Flake8 test?? 🤔

@mbuechse
Copy link
Contributor

Why on earth did the PEP8 issues not trigger the Flake8 test?? 🤔

Maybe your fork somehow didn't include the CI triggers?

@gndrmnn
Copy link
Contributor Author

gndrmnn commented Sep 26, 2023

Why on earth did the PEP8 issues not trigger the Flake8 test?? 🤔

Maybe your fork somehow didn't include the CI triggers?

Nope, the flake8 test in osism/zuul-jobs did not have the recurse flag enabled for the "Find .py files" job. Which means that in all osism repos only py files located in the repo root are being tested with flake8 :D

Working on a fix fo osism/zuul-jobs now...

@fkr
Copy link

fkr commented Sep 26, 2023

Thanks @mbuechse and thanks @gndrmnn for following up to find why this was not caught by CI. With that result this is a very useful finding.

@gndrmnn
Copy link
Contributor Author

gndrmnn commented Oct 5, 2023

recheck

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Glance import task should check supplied checksum of image
4 participants