
Add caching functionality (v2) #443

Merged: 26 commits into master on Jan 14, 2022

Conversation

@mre (Member) commented Jan 4, 2022

A while ago, caching was removed due to some issues (see #349).

This is a new implementation with the following improvements:

  • Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race conditions when running multiple collect_links instances, which was probably an issue before.
  • Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
  • Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).
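For illustration, a minimal sketch of what this two-column CSV cache could look like. CacheStatus and the DashMap-based cache type are named in this PR, but the variants, the string encoding, and the store helper below are assumptions, not the PR's actual code:

use std::path::Path;

use anyhow::Result;
use dashmap::DashMap;

// Hypothetical variants; the real CacheStatus in this PR may differ.
#[derive(Debug, Clone)]
pub enum CacheStatus {
    Ok(u16),
    Error(Option<u16>),
    Excluded,
}

impl CacheStatus {
    // Render the status as a plain string so the cache file stays grep/sed-able.
    fn as_csv_field(&self) -> String {
        match self {
            CacheStatus::Ok(code) => format!("ok:{}", code),
            CacheStatus::Error(Some(code)) => format!("error:{}", code),
            CacheStatus::Error(None) => "error".to_string(),
            CacheStatus::Excluded => "excluded".to_string(),
        }
    }
}

// Write the in-memory cache out as a two-column CSV: URI, status.
fn store(cache: &DashMap<String, CacheStatus>, path: &Path) -> Result<()> {
    let mut writer = csv::Writer::from_path(path)?;
    for entry in cache.iter() {
        let status = entry.value().as_csv_field();
        writer.write_record(&[entry.key().as_str(), status.as_str()])?;
    }
    writer.flush()?;
    Ok(())
}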

TODO

Open points that evolved during our discussion:

  • Disable writing cache to disk by default. Add a --cache flag to enable.
  • Don't cache filesystem access.
  • Support different cache formats, e.g. a compact format using bloom filters / xor filters. Can't get it to work yet; see the comment further down in this PR (#443).

Follow-up after merge

  • Add caching example to the GitHub Action.

Future improvements

  • Requests for the same URI can get fired in parallel, resulting in the URI getting checked multiple times — rendering the cache worthless. I think this is an acceptable trade-off for the moment because it allows us to avoid additional locking/synchronizing. In the future we might add website-based sharding. This will also help to avoid overloading the server with requests and can be nicely combined with rate-limiting.
  • The request handler could be refactored to use a custom #[cache] attribute.
  • We can track the number of duplicate links and show the info when --format=detailed is activated.

Fixes #348

@MichaIng @lebensterben @pawroman @untitaker fyi

@MichaIng (Member) commented Jan 4, 2022

Do I see it right that the cache is automatically stored as a dedicated .lycheecache file? IMO, tools like lychee shouldn't automatically create files on the filesystem if no explicit flag was set, especially since it's unknown whether write permissions are actually given 😉. It should be possible to keep the cache in memory for a single lychee call by default, and have a dedicated file (which is indeed nice when multiple calls are done) created only with an opt-in flag?

pub(crate) type Cache = DashMap<Uri, CacheStatus>;

pub(crate) trait StoreExt {
    fn store<T: AsRef<Path>>(&self, path: T) -> Result<()>;
}

Member commented on the diff above:

Since this is internal API, we don't need to make it generic (unless it's necessary).
I guess we can use &Path instead.

The potential benefit is faster compilation.
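For comparison, the non-generic variant suggested here might look like the sketch below (anyhow::Result is used as a stand-in for the crate's own Result alias):

use std::path::Path;

use anyhow::Result; // stand-in for the crate's own Result alias

// Non-generic variant: callers pass a plain &Path, so `store` is not
// monomorphized once per path-like type that happens to be used.
pub(crate) trait StoreExt {
    fn store(&self, path: &Path) -> Result<()>;
}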

Member Author replied:

True. Although I "benchmarked" it with a few compiles and I couldn't find a difference. Probably because the rest takes longer. 😆

@lebensterben (Member) commented:

IMO tools like lychee shouldn't automatically create files on the filesystem, if no explicit flag was set, especially since it's unknown whether write permissions are actually given

Agreed.

@lebensterben (Member) commented:

Storing the cache results in CSV seems a straightforward choice, since we have a mature and performant serializer/deserializer and it's human-readable.

But if we choose some other format such as Apache Arrow, it can be faster, and the non-readability may even be desired by some users.

@untitaker (Collaborator) commented:

After a quick glance I don't understand how cache invalidation works. You're invalidating on a per-file level, but the cached requests can be really old, no?

I agree that the caching is a surprising default. If the caching is automatically set up to work in the gh action, it could be nice to have it enabled by default though

in CI, file size matters, as the cache is downloaded/uploaded. Perhaps you can store bloom filters instead.

I would never cache filesystem access (not sure if you're doing that)

otherwise lgtm

@mre (Member Author) commented Jan 5, 2022

After a quick glance I don't understand how cache invalidation works. You're invalidating on a per-file level, but the cached requests can be really old, no?

Yes indeed. We could add an individual timestamp for each cached request. It would be quite repetitive I guess. Any alternative ideas?

If the caching is automatically set up to work in the gh action, it could be nice to have enabled by default though

Good idea. We should enable it and add an example on how to use caching with the cache action.

in CI, filesize matters, as the cache is downloaded/uploaded. Perhaps you can store bloom filters instead.

Yeah that's true and I agree with you and @lebensterben here. We should make the cache generic. In the future I would also like to support Redis for cluster setups. 👍 We'll probably start with CSV and maybe bloom filters / xor filters and later add more variants.

I would never cache filesystem access (not sure if you're doing that)

Yeah I do, but I shouldn't. FS access is pretty fast and volatile so that was an oversight on my end. Nice catch!

Thanks for all the great feedback @MichaIng, @lebensterben, and @untitaker. ❤️

Conclusions so far:

  • Disable writing cache to disk by default. Add a --cache flag to enable.
  • Support different cache formats, e.g. a compact format using bloom filters / xor filters.
  • Don't cache filesystem access (a small sketch of this check follows below).
  • Follow-up: Add caching example to the GitHub Action.

I'll add them as TODOs to the PR description.
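As referenced in the list above, a small sketch of skipping the cache for filesystem targets; the scheme check via url::Url is an assumption about how the URI type is inspected, and the helper names are illustrative:

use dashmap::DashMap;
use url::Url;

// File URIs are skipped: filesystem checks are fast and local files change
// often, so caching them buys little and goes stale quickly.
fn should_cache(url: &Url) -> bool {
    url.scheme() != "file"
}

fn record_result(cache: &DashMap<Url, String>, url: &Url, status: String) {
    if should_cache(url) {
        cache.insert(url.clone(), status);
    }
}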

@untitaker (Collaborator) commented Jan 5, 2022

Yes indeed. We could add an individual timestamp for each cached request. It would be quite repetitive I guess. Any alternative ideas?

if stored as bloom filter, you need to store a bloom filter per (status, timestamp) tuple. if timestamp is rounded down to days or weeks, this might be feasible

Good idea. We should enable it and add an example on how to use caching with the cache action.

I think the entire performance tradeoff will hinge on CI, as having a cache in CI is really costly.

I don't think it's necessary to support storing the cache in multiple formats or data stores. what do you get from redis support? do you want to do partial cache updates? then use a cache dir. why support both bloom and xor? is there a tradeoff I'm missing that prevents you from picking the universally best format+datastore?

@mre (Member Author) commented Jan 5, 2022

what do you get from redis support?

The idea was that if you run a cluster of lychees you could have a shared cache. This can be helpful when lychee is running behind an API endpoint, which I'm planning to do.

do you want to do partial cache updates? then use a cache dir.

Yeah, partial updates. You might be right actually. We should try to get as far as possible with simple file-based caching.

why support both bloom and xor? is there a tradeoff I'm missing that prevents you from picking the universally best format+datastore?

No I meant either bloom or xor, not both. 😉 On top of that I would still like to keep CSV because removing cache entries can be done with a simple editor and it's great for debugging (and compresses well). Does that make sense?

@mre (Member Author) commented Jan 5, 2022

The only issue with CSV is the redundant timestamp storage.
We could get around that by using a slightly more sophisticated format like TOML, which is still human-readable.

[[success]]
time = 1979-05-27T07:32:00-08:00
requests = [ ... ]

[[success]]
time = 1980-05-27T07:32:00-08:00
requests = [ ... ]

But then it's not easily "sed-able" anymore. Is there a different format that comes to mind?

@untitaker (Collaborator) commented:

The idea was that if you run a cluster of lychees you could have a shared cache. This can be helpful when lychee is running behind an API endpoint, which I'm planning to do.

that's fair. it's still possible to do it with filesystem (see how clustering with afl works), but redis seems much easier.

@mre (Member Author) commented Jan 6, 2022

@untitaker, your comment got me thinking:

if stored as bloom filter, you need to store a bloom filter per (status, timestamp) tuple. if timestamp is rounded down to days or weeks, this might be feasible

For CSV, we could store the cache files like so

.lycheecache
├── 2022-01-05T07:32:00-08:00.csv
└── 2022-01-05T08:32:00-08:00.csv

The upside would be that we only have to store the timestamp once — in the filename. I think that was also @MichaIng's idea IIRC. The downside would be that we have to manage multiple cache files and think about cleaning them up to not waste disk space. Also we'd have to merge the files on startup, so maybe it's more trouble than it's worth.

@MichaIng (Member) commented Jan 6, 2022

Note that due to block sizes, multiple files are usually orders of magnitude larger than redundant timestamps. Not sure if it's a big issue to have a few KiB more?

@mre (Member Author) commented Jan 6, 2022

I'm okay with adding the timestamps to every entry. If we use Unix timestamps, they are quite compact anyway. If we use full RFC3339 dates, then they are a little longer, but also human-readable. Then again I don't think many people will edit the file manually, let alone based on the request time, so it shouldn't matter.
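A sketch of what a per-entry Unix timestamp could look like; the field names and the freshness check are illustrative, not the PR's final format:

// One cache row, e.g.: https://example.com,200,1641456000
struct CacheEntry {
    uri: String,
    status: u16,
    timestamp: u64, // Unix timestamp in seconds (UTC)
}

// An entry is reused only if it is younger than `max_age_secs`.
fn is_fresh(entry: &CacheEntry, now_secs: u64, max_age_secs: u64) -> bool {
    now_secs.saturating_sub(entry.timestamp) <= max_age_secs
}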

@untitaker (Collaborator) commented Jan 6, 2022

how about a single file that contains tuples/csv rows:

(status, timestamp * NUM_BUCKETS / cache_expiry_seconds, [xor filter of urls])
  • cache_expiry_seconds is a user-defined value that determines how old cache items can be

  • timestamp is unix timestamp in seconds

  • NUM_BUCKETS is a hardcoded value that will be important later

a cache row is expired when (timestamp * NUM_BUCKETS / cache_expiry_seconds) (i.e. the second column) is smaller than (current_timestamp - cache_expiry_seconds) * NUM_BUCKETS / cache_expiry_seconds

since cache_expiry_seconds controls how the second column is formatted, the setting has to be written to the cache file. if lychee finds that the setting in the cache file is different from its own config, it can either 1) throw away the cache file or 2) do more math to recover the cache file. on the upside, the second column is now much smaller than a timestamp

this puts a theoretical upper bound on the number of rows: there can only be NUM_BUCKETS rows. we know how large each row is, depending on whether we use fixed-width integers and xor filters, we know either an upper bound or the exact row size. so we know either the maximum or exact file size.

this is not for the sake of microoptimization, but to get rid of the file-size question. because things are now microoptimized, you can now choose NUM_BUCKETS freely to define the file size (or, again, an upper bound). you could make it dependent on file system block size, or make it a user setting, however then you also need to write it into the file to do proper cache invalidation (similarly to cache_expiry_second)
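Written out as code, the bucket arithmetic described above could look like this sketch (NUM_BUCKETS, its value, and the helper names are placeholders):

const NUM_BUCKETS: u64 = 64; // hardcoded; bounds the number of rows in the file

// The second column of a row: the timestamp collapsed into a bucket index.
fn bucket(timestamp_secs: u64, cache_expiry_seconds: u64) -> u64 {
    timestamp_secs * NUM_BUCKETS / cache_expiry_seconds
}

// A row is expired when its bucket value is smaller than the bucket of
// (current time - cache_expiry_seconds), as described above.
fn is_expired(row_bucket: u64, now_secs: u64, cache_expiry_seconds: u64) -> bool {
    row_bucket < bucket(now_secs.saturating_sub(cache_expiry_seconds), cache_expiry_seconds)
}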

@lebensterben (Member) commented:

Since you mentioned timestamps and clusters, it's worth mentioning that a time format such as

2022-01-05T07:32:00-08:00

isn't robust.

  • Epoch time in UTC is not location-specific, but it is not human-friendly.
  • ISO 8601 time + time zone seems better, but it is not robust either, because sometimes a location changes its time zone, which results in local time discontinuities.

@mre (Member Author) commented Jan 6, 2022

Guess a UNIX timestamp would be the easiest as it's guaranteed to be UTC.

@mre (Member Author) commented Jan 12, 2022

Question: if --cache is not set but the .lycheecache file exists, should we still load it? I guess not, but I wanted to hear people's thoughts.

@untitaker (Collaborator) commented Jan 12, 2022 via email:

I think the existence of --cache (an option to pass cache path) pretty much implies there's no magic filenames that lychee picks up automatically

@MichaIng (Member) commented Jan 12, 2022

I think the existence of --cache (an option to pass cache path) pretty much implies there's no magic filenames that lychee picks up automatically

--cache is a flag, telling lychee to use .lycheecache in current working dir (?). And I agree that without this flag set, .lycheecache should be neither written nor read 👍.
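A sketch of how the opt-in flag could be declared, assuming a clap-derive style CLI (the real lychee option definitions may differ):

use clap::Parser;

#[derive(Parser)]
struct Opts {
    /// Use .lycheecache in the current working directory.
    /// Without this flag, the cache file is neither written nor read.
    #[clap(long)]
    cache: bool,
}

fn main() {
    let opts = Opts::parse();
    if opts.cache {
        // load the cache file if it exists, and write it back on exit
    }
}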

@mre (Member Author) commented Jan 12, 2022

Added the timestamp functionality we talked about. @untitaker, your approach is superior and quite clever, but I decided to keep it simple for now and just add a timestamp to every entry. It's wasteful, but quite easy to explain and most people wouldn't use a CSV file anyway if they had space constraints.

As for the bloomfilter / xor filter, I quite frankly didn't get far.

StoreExt would have to be modified to have a

fn load<T: AsRef<Path>>(path: T, max_age_secs: u64) -> Result<dyn Cache>;

however, the trait object doesn't have a size. impl Trait is not allowed in traits at the moment, and I don't want to introduce a Box here to avoid the runtime overhead (it also doesn't feel very elegant).
I played around with an associated type like the Handler in this blog post about tower, but in the end it always wants a concrete type from me. ¯\_(ツ)_/¯
If someone has an idea I'd welcome PRs. 😆
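For reference, the associated-type idea could look roughly like the sketch below; it sidesteps the unsized dyn Cache return value, but as noted above it forces a concrete cache type per store implementation:

use std::path::Path;

use anyhow::Result; // stand-in for the crate's own Result alias

// Each store backend names the concrete cache type it loads, so no
// trait object (and no Box) is needed at the call site.
trait Store {
    type Cache;
    fn load(path: &Path, max_age_secs: u64) -> Result<Self::Cache>;
}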

Other than that I'd say we could merge this from my side. It's an optional feature anyway, so let's not sweat it. If anyone has the time to give it a spin, I'd be happy for any final comments.

@mre mentioned this pull request on Jan 13, 2022
@lebensterben (Member) commented:

looks good enough to merge. no need to finish all features in this PR.

@mre merged commit ac490f9 into master on Jan 14, 2022
@mre deleted the cache-3 branch on January 14, 2022
Linked issue (#348): Fix URI cache to cache URIs across multiple files