[Feature] Using Etag + Last-Modified headers to detect new releases #442

Open
Arcitec opened this issue Oct 26, 2024 · 20 comments

@Arcitec

Arcitec commented Oct 26, 2024

I can't find this on the project page or in search engines.

Sometimes, the upstream publishes a link to a single file which changes on the server, but doesn't publish the version number on a website anywhere. So there's no version number to scrape.

What can be done in that case?

The ideal scenario for me would be:

  • Look at the ETag if the server sends it.
  • Otherwise look at Last-Modified.
  • Otherwise look at Content-Length, the file's size. This is the least reliable since it's possible two different versions can have the exact same size. But it's INSANELY UNLIKELY that two versions would have the exact same size down to the exact byte length, so it's a good fallback check method too.
  • If none of those headers exists, abort with failure.
  • If a new file ETag/Last-Modified/Content-Length was detected, download it, calculate the file checksum, and tag it as a new release. Then update the Flatpak manifest.
  • As for versioning... uh... maybe use the Last-Modified date in UTC/GMT timezone as the version? Like 2024-10-23 (formatted without v or . periods, to not be confused with real version numbers)? That would be super helpful for identifying what version someone is running!
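
The header priority in the list above can be sketched in a few lines of Python. This is only an illustration of the proposal, not code from flatpak-external-data-checker; the names HEADER_PRIORITY and pick_change_header are made up for this example.

```python
from typing import Mapping, Optional, Tuple

# Strongest first, as proposed above: ETag > Last-Modified > Content-Length.
HEADER_PRIORITY = ("ETag", "Last-Modified", "Content-Length")


def pick_change_header(headers: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (name, value) of the strongest usable header, or None if none exists.

    Hypothetical helper: the caller is expected to abort with a failure
    when None is returned.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in HEADER_PRIORITY:
        value = lowered.get(name.lower())
        if value:
            return name, value
    return None
```

With only a Content-Length present the helper falls back to it; with an empty header set it returns None, which corresponds to the "abort with failure" step.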

This is the metadata that the new "type": "http-header-check", would need to keep track of:

  • Which type of detection was used during the last check: etag > last-modified > content-length (only one of them, but prioritized in that order).
  • Checksum of the fully downloaded file last time.
  • Version number based on Last-Modified as UTC if that was available in the headers, formatted as 2024-10-23. Could be disabled/enabled via "auto-date-version": true, if someone doesn't want that feature. But there's really no other way to get versions from unversioned URLs, unless we do analysis of the package contents (such as what your rotating-url does for AppImages).

Here's an example URL that would need this kind of detector:

https://launcher.mojang.com/download/Minecraft.tar.gz

They do not publish its version number anywhere.

@Arcitec
Author

Arcitec commented Oct 26, 2024

Here are the current headers being given by https://launcher.mojang.com/download/Minecraft.tar.gz:

HTTP/1.1 200 OK
Date: Sat, 26 Oct 2024 15:18:16 GMT
Content-Type: application/x-gzip
Content-Length: 1102510
Connection: keep-alive
Last-Modified: Wed, 09 Oct 2024 21:12:36 GMT
ETag: 0x8DCE8A71945C432
x-ms-request-id: 3bc547b0-701e-0059-6484-273c69000000
x-ms-version: 2009-09-19
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
x-azure-ref: 20241026T151816Z-1569d8b7f85s7m6mdz0k7pzgcc00000001h000000001v75g
Cache-Control: public, max-age=1209600
x-fd-int-roxy-purgeid: 78064276
X-Cache: TCP_HIT
X-Cache-Info: L1_T2
Access-Control-Allow-Origin: *
Accept-Ranges: bytes
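
For reference, here is a minimal sketch of how the Last-Modified value above could be turned into the proposed date-style identifier. It uses only the Python standard library; the function name date_version is an assumption for this example and does not exist in x-checker.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def date_version(last_modified: str) -> str:
    """Format an HTTP Last-Modified value as YYYY-MM-DD in UTC (hypothetical helper)."""
    moment = parsedate_to_datetime(last_modified)
    if moment.tzinfo is None:
        # HTTP dates are defined in GMT; treat a zone-less value as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d")


print(date_version("Wed, 09 Oct 2024 21:12:36 GMT"))  # prints 2024-10-09
```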

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

You can use the rotating url checker https://github.com/flathub-infra/flatpak-external-data-checker?tab=readme-ov-file#url-checker

Don't set a pattern, just set the url to the unversioned download link and checksums will be automatically updated.

@Arcitec
Author

Arcitec commented Oct 26, 2024

@bbhtt Ah, thank you for the help.

At first, rotating-url seemed to only be intended for URLs that redirect to the real file, but the Minecraft launcher is not a redirect (they just replace the file itself on the server every time they release a new version).

The description of that checker says "If the upstream vendor has an URL that redirects to the latest version of the application".

But I am not sure about this... It says that it extracts the version number from AppImages or URLs. But what does it do in my case when neither of those exist?

It also has the issue that it downloads the entire file every time to check the hash, which causes problems for other people.

Would it be possible to implement a "type": "http-header-check", instead, which runs the algorithm in my first post? That would be the most efficient update check method, and it would only need to download files for checksumming if they've really changed, and would also calculate a version number/identifier.

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

No, it can also be a fully static URL that always points to the latest version. It may or may not redirect.

About versioning: the version is extracted from the URL pattern. Since you have none, that will not work.

But why do you need to set an arbitrary time based version to an unversioned thing? That will confuse people because inside the source there might be a totally different version set which will not match with manifest.

If the distributor sets up unversioned sources the problem is on their end.

> It also has the issue that it downloads the entire file every time to check the hash, which causes problems for other people.

There is a PR for it, so wait for it to be fixed.

@Arcitec
Author

Arcitec commented Oct 26, 2024

> No it is a fully or partially static URL that points to the latest version always. It may redirect or may not.

Ah right, that's a nice feature then. Unfortunately it always downloads the file to checksum it every time. For Minecraft it's only 1 megabyte, but for other projects it could be half a gigabyte or more, which slows down your Flathub infrastructure.

An implementation of "http-header-check" would solve that. I edited the first post with a bit more details about how that could work.

> About versioning, version is extracted from the URL pattern. But since you have none it will not work.
>
> But why do you need to set an arbitrary time based version to an unversioned thing? That will confuse people because inside the source there might be a totally different version set which will not match with manifest.
>
> If the distributor sets up unversioned sources the problem is on their end.

Think of it like the "this is the Flatpak release date/version" rather than the "app version". It's a good solution for unversioned upstream files. It's not really that confusing, and it's much better than no version info at all. :D

It could be set via a flag, like "auto-date-version": true,. And if the URL is a .AppImage it could still use the rotating-url technique of extracting the real version from the file itself.

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

Having false version info is worse than having none, at least with none people will know to look for it elsewhere.

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

If they get added to app data, people will start reporting bugs or asking about nonexistent versions.

@Arcitec
Author

Arcitec commented Oct 26, 2024

> Having false version info is worse than having none, at least with none people will know to look for it elsewhere.

That's a matter of personal opinion. But date-based auto-versioning should definitely be an opt-in flag that is off by default. And static extraction such as .AppImage details should be used and take precedence when possible.

> If they get added to app data, people will start reporting bugs or asking about nonexistent versions.

Nah, I think that worry is way overblown. If I download "v2024.10.23" from Flathub and the app itself is "v1.0.3", I won't think "oh Flatpak is bugged". I'll think "Okay, the app was released on 2024-10-23 and it is v1.0.3". Let's not underestimate the intelligence of Linux users. :D

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

vYYYY are actual version strings often used by projects, it's a problem to generate random versions.

@Arcitec
Author

Arcitec commented Oct 26, 2024

> vYYYY are actual version strings often used by projects, it's a problem to generate random versions.

Yeah some projects use vYYYY.MM.DD format, which is why it should be an opt-in feature and be off by default so that it's not used where inappropriate.

Now let's turn the argument around completely for a moment to show it from the other perspective:

  • Your concern is that v2024.10.23 is bad and will lead to bug reports if the app version is actually v1.0.3. I don't think Linux users are that unintelligent, but either way...
  • If automatic versioning is not possible, then a package would forever be saying it's v1.0.0 on Flathub while really being v5.3.7 internally, and that is BY FAR a bigger annoyance and source of confusion, making people think they are looking at an extremely outdated Flatpak which is not worth downloading at all from Flathub since it's ancient.

@Arcitec
Author

Arcitec commented Oct 26, 2024

By the way, is there another date format that could be used for automatic versioning which doesn't look like real app version numbers?

Perhaps the version can be set to 2024-10-23 without a v in front and no . periods, to make it clearer that it's an auto-generated date and not a version.

@Arcitec
Author

Arcitec commented Oct 26, 2024

And considering that a large portion of Flatpaks are in low-maintenance mode, where the owner basically just auto-merges auto-checker PRs and doesn't maintain the metadata manually, we cannot expect everyone to be manually fixing the latter problem (the "forever v1.0.0, oh my god Flathub is absolutely ancient" issue). Not to mention the hassle of having to fix that post-merge, or by manually editing every single bot merge request, which triggers lots of extra Flatpak builds and releases with version numbers that jump around before being fixed, if the Flatpak maintainer is even around to fix it at all. People leave, die, suffer natural disasters, go on vacation, etc.

But it seems to me like the format 2024-10-23 would fix the concerns and no longer be confused with version numbers. Of course, augmented with automatic version extraction (which takes precedence) if the file is an .AppImage, or any other recognizable format.

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

I'm not going to go on about the versioning. That seems like a bad idea to me and too much complexity for very little gain and imo makes things worse.

Also re checking ETag/Last-Modified, how would x-checker know what the last value was? It runs stateless, so it cannot know the previous header values to tell whether a page changed or to compare them with the current values.

@Arcitec
Author

Arcitec commented Oct 26, 2024

> I'm not going to go on about the versioning. That seems like a bad idea to me.

Yeah. I think not having dates is a bad idea since it leads to great confusion and makes Flathub's static version number look ancient/outdated, leading people to not install the Flatpak at all. But let's not talk about it for now, and focus on the implementation details you brought up. :)

> Also re checking ETag/Last-Modified, how would x-checker know what the last value was? It runs stateless, it cannot know the previous header values to know if a page changed.

x-checker stores lots of state in the manifest already, "sha256", "size", "url", "tag", "commit". It stores the last result/"state" there, and even adds the fields if they are missing.

Of course, it seems to only be using fields that are reserved by Flatpak itself already, but if any extra fields are needed beyond what can already be stored, I would suggest extending "x-checker-data": {} with a sub-dictionary that stores extra state. Storing it as a sub-dictionary keeps it away from the main Flatpak manifest metadata.

Only used when a checker needs to know about extra state, of course. This might be the only checker that would need that at the moment, but it still seems like something that could be a useful feature for x-checker, just to have the option of remembering important state such as what would be needed for this "fast header comparison" update check method.

@Arcitec
Author

Arcitec commented Oct 26, 2024

One strong motivation for adding an "extra state" dictionary to x-checker is that it solves the Flathub server infrastructure strain that the current rotating-url checker causes. If an upstream app is 2 GB, Flathub's server has to download that full data every single time to be able to hash it. If it instead compares "old vs new HTTP information headers", it could do the update check with only a few bytes of data transfer. In fact, that's the exact purpose of the ETag and Last-Modified HTTP headers: checking for changes without needing to download anything if nothing changed.
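
As a side note, HTTP also lets the server do this comparison itself through conditional requests: the client sends the stored validator back and the server answers 304 Not Modified when nothing changed. This is a variation on the "compare old vs new headers" idea, not what this thread proposes and not existing x-checker behaviour. Below is a rough standard-library sketch; build_conditional_head is a made-up name, and whether a given server honours an unquoted ETag such as the Mojang one is an open assumption.

```python
import urllib.request
from typing import Optional


def build_conditional_head(
    url: str,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
) -> urllib.request.Request:
    """Build a HEAD request carrying the previously stored validators (sketch)."""
    request = urllib.request.Request(url, method="HEAD")
    if etag:
        # The server may reply 304 Not Modified if its current ETag matches.
        request.add_header("If-None-Match", etag)
    if last_modified:
        # The server may reply 304 if the file has not changed since this date.
        request.add_header("If-Modified-Since", last_modified)
    return request


# Sending it is left to the caller, for example:
#   urllib.request.urlopen(build_conditional_head(url, etag=stored_etag))
# Note that urllib reports a 304 reply by raising urllib.error.HTTPError,
# so the caller would have to catch that and inspect its .code attribute.
```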

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

Flathub server does not download them, it runs on GitHub Actions and they have a large bandwidth. 90% of the time the tool is going to be run in CI where this is not a problem.

Anyway, optionally extending x-checker-data with a new field might be possible.

@Arcitec
Author

Arcitec commented Oct 26, 2024

> it runs on GitHub Actions

Ahh okay! Glad to hear that. Microsoft is paying for it then. :')

> Anyway, optionally extending x-checker-data with a new field might be possible.

How does something like this look?

{
  "sources": [
    {
      "type": "extra-data",
      "filename": "Minecraft.tar.gz",
      "url": "https://launcher.mojang.com/download/Minecraft.tar.gz",
      "sha256": "cd9f0b44fc9cec42829cb2e71145ee599f3d34c7715b55963514d0a8d36214ab",
      "size": 1102510,
      "x-checker-data": {
        "type": "http-headers",
        "url": "https://launcher.mojang.com/download/Minecraft.tar.gz",
        "checker-metadata": {
          "header-type": "ETag",
          "header-value": "0x8DCE8A71945C432"
        }
      }
    }
  ]
}

Possible alternative names:

  • checker-state
  • checker-meta
  • But I think checker-metadata is the clearest name.
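
To make the proposed layout concrete, here is a small sketch of how a checker could read and write that sub-dictionary. The "checker-metadata", "header-type" and "header-value" keys are the ones proposed in this comment, not fields that x-checker supports today, and both helper functions are hypothetical.

```python
import json
from typing import Optional, Tuple

# Trimmed copy of the manifest fragment proposed above.
MANIFEST_TEXT = """
{
  "sources": [
    {
      "type": "extra-data",
      "filename": "Minecraft.tar.gz",
      "url": "https://launcher.mojang.com/download/Minecraft.tar.gz",
      "x-checker-data": {
        "type": "http-headers",
        "checker-metadata": {
          "header-type": "ETag",
          "header-value": "0x8DCE8A71945C432"
        }
      }
    }
  ]
}
"""


def stored_header(source: dict) -> Optional[Tuple[str, str]]:
    """Return the (name, value) remembered from the last check, if any."""
    metadata = source.get("x-checker-data", {}).get("checker-metadata")
    if not metadata:
        return None
    return metadata["header-type"], metadata["header-value"]


def remember_header(source: dict, name: str, value: str) -> None:
    """Store the header used by this check, creating the sub-dictionary if needed."""
    checker_data = source.setdefault("x-checker-data", {})
    checker_data["checker-metadata"] = {"header-type": name, "header-value": value}


source = json.loads(MANIFEST_TEXT)["sources"][0]
print(stored_header(source))  # prints ('ETag', '0x8DCE8A71945C432')
```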

Algorithm for the "http-headers" checker:

  • Perform an HTTP HEAD request to only get the headers. Follow all redirects to reach the actual file, since some servers throw you around between a bunch of CDN redirects!
  • Look for the exact same header as last time, if "checker-metadata" exists.
    • If that header exists in the new HTTP response, compare the values.
      • If they differ, treat it as a changed file.
      • Otherwise, abort since no update exists.
      • Edit: Actually... if a BETTER-quality header now exists, then it should immediately upgrade the checker-metadata to the better header and its new value, and download the file itself again to ensure the hash and file size are still the same. That way it would upgrade from Content-Length -> Last-Modified -> ETag in that order (later values in that list = stronger quality) when it detects that the server has improved its header responses.
    • Otherwise, if that exact header is now missing from the new HTTP response, treat it as a potentially changed file.
      • Download it again to compare hashes.
      • Then update the checker-metadata to the new header's value (such as demoting from using ETag to Last-Modified instead).
      • Trigger a manifest update even if there's no actual version update, to avoid having to re-hash the file on every update check because the previous header is missing.
  • Checking for headers is always done in this order, prioritizing the first that is found:
    • ETag because it's the strongest field for checking versions of files.
    • Last-Modified which is the file modification timestamp on the server.
    • Content-Length which is the file size. Most applications are compressed to save server storage space, so even a single byte of difference in the original files usually leads to a different file size, because the archive's internal dictionary gets restructured to accommodate the change. That makes it a fairly reliable check, but not as reliable as the others. I suggest having a "disable-content-length": true flag that can disable this check if it's inappropriate for some application. Still, the chance that two versions of an app have exactly the same size is close to 0%, because every new version of an app adds/changes things in the code.
  • If none of the headers are found, the checker shall emit an error and fail.
  • If an update is found, update this manifest metadata:
    • "url"
    • "sha256"
    • "size"
    • "x-checker-data" -> "checker-metadata": Store the HTTP header name and value that was used during this check.
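
The decision part of the algorithm above can be sketched as one pure function that takes the stored state and the fresh response headers and says what to do next. This is an interpretation of the proposal, not x-checker code: the action names ("error", "unchanged", "changed", "reverify") and the function decide are invented for this example, and treating a missing checker-metadata as "reverify" is an assumption the list above does not spell out.

```python
from typing import Mapping, Optional, Tuple

# Weakest -> strongest, matching the upgrade order described above.
STRENGTH = ("Content-Length", "Last-Modified", "ETag")

State = Tuple[str, str]  # (header name, header value)


def decide(stored: Optional[State], headers: Mapping[str, str]) -> Tuple[str, Optional[State]]:
    """Return (action, new_state) for one update check.

    "error"     - no usable header at all, the check fails
    "unchanged" - same header, same value: nothing to do
    "changed"   - same header, new value: download, hash, update the manifest
    "reverify"  - header set changed (or no stored state): download, compare
                  hashes, then store the new header in checker-metadata
    """
    # Header names are case-insensitive; drop empty values.
    lowered = {k.lower(): v for k, v in headers.items() if v}
    best = next(
        ((name, lowered[name.lower()]) for name in reversed(STRENGTH) if name.lower() in lowered),
        None,
    )
    if best is None:
        return "error", None
    if stored is None:
        return "reverify", best  # assumption: first run behaves like a re-check
    stored_name, stored_value = stored
    current = lowered.get(stored_name.lower())
    if current is None:
        return "reverify", best  # old header vanished: demote to what is left
    if current != stored_value:
        return "changed", best
    if STRENGTH.index(best[0]) > STRENGTH.index(stored_name):
        return "reverify", best  # a stronger header appeared: upgrade
    return "unchanged", stored
```

Keeping the decision free of network and file I/O makes each branch of the list easy to unit-test; the HEAD request, the download and the manifest rewrite would live in the caller.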

As for the disputed feature of generating a 2024-10-23 "version-ish" tag from Last-Modified (or from the Flatpak build date, perhaps): Are there any alternatives to doing that? What happens if a Flatpak manifest doesn't provide a version at all? Is there some other mechanism for Flathub and client apps to show when the application's Flatpak was built instead of having a version number?

Thinking about it again, I actually think that basing it on the Flathub build date (instead of Last-Modified) and using 2024-10-23 (no . or v prefix) is sufficient to differentiate it from actual versions, while still providing very valuable information about which Flatpak version a user is running and showing users that the Flatpak is being updated.

Actually, I even see that Freedesktop sets their version to freedesktop-sdk-23.08.24 (they use YY.MM.DD), so we can clearly add more to the field. We could use flatpak-2024-10-23 as the version, for example. Then there's no potential for confusion even among easily confused people. :')

As for multiple releases on the same day: Not really an issue. The flatpak update check will still download the updates. But if wanted, it could increment some extra info such as flatpak-2024-10-23-build2, etc, when that scenario happens.
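
A minimal sketch of that tagging scheme, assuming the flatpak-YYYY-MM-DD[-buildN] format floated above. The function date_tag and its previous parameter (the tag currently in the manifest) are hypothetical.

```python
from datetime import date
from typing import Optional


def date_tag(build_date: date, previous: Optional[str] = None) -> str:
    """Return a date-based tag, bumping a -buildN suffix for same-day releases."""
    base = f"flatpak-{build_date.isoformat()}"
    if previous is None or not previous.startswith(base):
        return base  # first release on this date
    suffix = previous[len(base):]
    if suffix == "":
        return f"{base}-build2"  # second release on the same day
    if suffix.startswith("-build") and suffix[len("-build"):].isdigit():
        return f"{base}-build{int(suffix[len('-build'):]) + 1}"
    return base  # unrecognised suffix: start over with the plain date


print(date_tag(date(2024, 10, 23)))                        # prints flatpak-2024-10-23
print(date_tag(date(2024, 10, 23), "flatpak-2024-10-23"))  # prints flatpak-2024-10-23-build2
```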

And of course: Always prioritize .AppImage version metadata and other extraction methods over automatic generation. Same as what the rotating-url checker is currently doing for formats it recognizes. And automatic generation would be a manual opt-in via a flag in "x-checker-data" -> "use-date-as-version": true,.

This is the only checker that would need that feature, since there's no extractable version information anywhere else.

@bbhtt
Contributor

bbhtt commented Oct 26, 2024

No need to introduce a new type. This currently only makes sense for rotating URLs which are static. The checkers already skip any downloads if the URLs remain unchanged.

@Arcitec
Author

Arcitec commented Oct 27, 2024

Yeah, I was considering proposing it as an improvement for rotating-url instead, but wasn't sure if it makes sense to mix the algorithms.

But I am sure there's some neat way to merge the two different update-check behaviors, such as:

  • If pattern is included, just look at the new filename and extract the next version's URL that way. Quick and easy. No need to download anything unless the URL has changed.
  • If no pattern is included, we are in "static URL mode" which doesn't know the version or whether the URL points to an update, so automatically use the header algorithm instead. Which is an improvement over the old "download and hash the whole file every time". So it makes sense to add this improvement to rotating-url checker.
  • In "static URL mode", we can leverage rotating-url's existing AppImage version check support. And as an opt-in fallback when there's zero versioning info available, I would really love something like the proposed flatpak-2024-10-23[-build3] tagging, or another naming scheme that makes sense.
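
In code, the split between the two modes could be as small as the sketch below. It assumes the rotating-url checker data carries an optional "pattern" key, as described earlier in this thread; choose_mode and the returned mode names are invented for illustration.

```python
def choose_mode(checker_data: dict) -> str:
    """Pick the update-check strategy for a rotating-url source (sketch)."""
    if checker_data.get("pattern"):
        # A pattern exists: the resolved URL itself reveals the new version.
        return "pattern"
    # No pattern: static URL mode, fall back to comparing HTTP headers.
    return "headers"
```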

So yeah, it seems like it would be very neat to merge these two algorithms. 👍

@bbhtt
Contributor

bbhtt commented Oct 27, 2024

The first one is also how it works currently. The version is extracted from the pattern if it exists.
