
Add caching functionality (v2) #443

Merged: 26 commits into master on Jan 14, 2022

Conversation

@mre (Member) commented Jan 4, 2022

A while ago, caching was removed due to some issues (see #349).

This is a new implementation with the following improvements:

  • Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race conditions when running multiple collect_links instances, which was probably an issue before.
  • Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
  • Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).
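For illustration, a minimal sketch of what this two-column CSV cache could look like. CacheStatus and the DashMap-based cache type are named in this PR, but the variants, the string encoding, and the store helper below are assumptions, not the PR's actual code:

use std::path::Path;

use anyhow::Result;
use dashmap::DashMap;

// Hypothetical variants; the real CacheStatus in this PR may differ.
#[derive(Debug, Clone)]
pub enum CacheStatus {
    Ok(u16),
    Error(Option<u16>),
    Excluded,
}

impl CacheStatus {
    // Render the status as a plain string so the cache file stays grep/sed-able.
    fn as_csv_field(&self) -> String {
        match self {
            CacheStatus::Ok(code) => format!("ok:{}", code),
            CacheStatus::Error(Some(code)) => format!("error:{}", code),
            CacheStatus::Error(None) => "error".to_string(),
            CacheStatus::Excluded => "excluded".to_string(),
        }
    }
}

// Write the in-memory cache out as a two-column CSV: URI, status.
fn store(cache: &DashMap<String, CacheStatus>, path: &Path) -> Result<()> {
    let mut writer = csv::Writer::from_path(path)?;
    for entry in cache.iter() {
        let status = entry.value().as_csv_field();
        writer.write_record(&[entry.key().as_str(), status.as_str()])?;
    }
    writer.flush()?;
    Ok(())
}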

TODO

Open points that evolved during our discussion:

  • Disable writing cache to disk by default. Add a --cache flag to enable.
  • Don't cache filesystem access.
  • Support different cache formats, e.g. a compact format using bloom filters / xor filters. Can't get it to work yet; see the comment further down in this PR (#443).

Follow-up after merge

  • Add caching example to the GitHub Action.

Future improvements

  • Requests for the same URI can get fired in parallel, resulting in the URI getting checked multiple times — rendering the cache worthless. I think this is an acceptable trade-off for the moment because it allows us to avoid additional locking/synchronizing. In the future we might add website-based sharding. This will also help to avoid overloading the server with requests and can be nicely combined with rate-limiting.
  • The request handler could be refactored to use a custom #[cache] attribute.
  • We can track the number of duplicate links and show the info when --format=detailed is activated.

Fixes #348

@MichaIng @lebensterben @pawroman @untitaker fyi

@MichaIng (Member) commented Jan 4, 2022

Do I see it right that the cache is automatically stored as a dedicated .lycheecache file? IMO, tools like lychee shouldn't automatically create files on the filesystem if no explicit flag was set, especially since it's unknown whether write permissions are actually given 😉. It should be possible to keep the cache in memory for a single lychee call by default, and have a dedicated file (which is indeed nice when multiple calls are done) created only with an opt-in flag?

pub(crate) type Cache = DashMap<Uri, CacheStatus>;

pub(crate) trait StoreExt {
    fn store<T: AsRef<Path>>(&self, path: T) -> Result<()>;
}

Member commented on the diff above:

Since this is internal API, we don't need to make it generic (unless it's necessary).
I guess we can use &Path instead.

The potential benefit is faster compilation.
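For comparison, the non-generic variant suggested here might look like the sketch below (anyhow::Result is used as a stand-in for the crate's own Result alias):

use std::path::Path;

use anyhow::Result; // stand-in for the crate's own Result alias

// Non-generic variant: callers pass a plain &Path, so `store` is not
// monomorphized once per path-like type that happens to be used.
pub(crate) trait StoreExt {
    fn store(&self, path: &Path) -> Result<()>;
}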

Member Author replied:

True. Although I "benchmarked" it with a few compiles and I couldn't find a difference. Probably because the rest takes longer. 😆

@lebensterben (Member) commented:

IMO tools like lychee shouldn't automatically create files on the filesystem, if no explicit flag was set, especially since it's unknown whether write permissions are actually given

Agreed.

@lebensterben (Member) commented:

Storing the cache results in CSV seems a straightforward choice, since we have a mature and performant serializer/deserializer and it's human-readable.

But if we choose some other format such as Apache Arrow, it can be faster, and the non-readability may even be desired by some users.

@untitaker (Collaborator) commented:

After a quick glance I don't understand how cache invalidation works. You're invalidating on a per-file level, but the cached requests can be really old, no?

I agree that the caching is a surprising default. If the caching is automatically set up to work in the gh action, it could be nice to have it enabled by default though

in CI, file size matters, as the cache is downloaded/uploaded. Perhaps you can store bloom filters instead.

I would never cache filesystem access (not sure if you're doing that)

otherwise lgtm

@mre (Member Author) commented Jan 5, 2022

After a quick glance I don't understand how cache invalidation works. You're invalidating on a per-file level, but the cached requests can be really old, no?

Yes indeed. We could add an individual timestamp for each cached request. It would be quite repetitive I guess. Any alternative ideas?

If the caching is automatically set up to work in the gh action, it could be nice to have enabled by default though

Good idea. We should enable it and add an example on how to use caching with the cache action.

in CI, filesize matters, as the cache is downloaded/uploaded. Perhaps you can store bloom filters instead.

Yeah that's true and I agree with you and @lebensterben here. We should make the cache generic. In the future I would also like to support Redis for cluster setups. 👍 We'll probably start with CSV and maybe bloom filters / xor filters and later add more variants.

I would never cache filesystem access (not sure if you're doing that)

Yeah I do, but I shouldn't. FS access is pretty fast and volatile so that was an oversight on my end. Nice catch!

Thanks for all the great feedback @MichaIng, @lebensterben, and @untitaker. ❤️

Conclusions so far:

  • Disable writing cache to disk by default. Add a --cache flag to enable.
  • Support different cache formats, e.g. a compact format using bloom filters / xor filters.
  • Don't cache filesystem access (a small sketch of this check follows below).
  • Follow-up: Add caching example to the GitHub Action.

I'll add them as TODOs to the PR description.
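As referenced in the list above, a small sketch of skipping the cache for filesystem targets; the scheme check via url::Url is an assumption about how the URI type is inspected, and the helper names are illustrative:

use dashmap::DashMap;
use url::Url;

// File URIs are skipped: filesystem checks are fast and local files change
// often, so caching them buys little and goes stale quickly.
fn should_cache(url: &Url) -> bool {
    url.scheme() != "file"
}

fn record_result(cache: &DashMap<Url, String>, url: &Url, status: String) {
    if should_cache(url) {
        cache.insert(url.clone(), status);
    }
}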

@untitaker (Collaborator) commented Jan 5, 2022

Yes indeed. We could add an individual timestamp for each cached request. It would be quite repetitive I guess. Any alternative ideas?

if stored as bloom filter, you need to store a bloom filter per (status, timestamp) tuple. if timestamp is rounded down to days or weeks, this might be feasible

Good idea. We should enable it and add an example on how to use caching with the cache action.

I think the entire performance tradeoff will hinge on CI, as having a cache in CI is really costly.

I don't think it's necessary to support storing the cache in multiple formats or data stores. what do you get from redis support? do you want to do partial cache updates? then use a cache dir. why support both bloom and xor? is there a tradeoff I'm missing that prevents you from picking the universally best format+datastore?

@mre (Member Author) commented Jan 5, 2022

what do you get from redis support?

The idea was that if you run a cluster of lychees you could have a shared cache. This can be helpful when lychee is running behind an API endpoint, which I'm planning to do.

do you want to do partial cache updates? then use a cache dir.

Yeah, partial updates. You might be right actually. We should try to get as far as possible with simple file-based caching.

why support both bloom and xor? is there a tradeoff I'm missing that prevents you from picking the universally best format+datastore?

No I meant either bloom or xor, not both. 😉 On top of that I would still like to keep CSV because removing cache entries can be done with a simple editor and it's great for debugging (and compresses well). Does that make sense?

@mre (Member Author) commented Jan 5, 2022

The only issue with CSV is the redundant timestamp storage.
We could get around that by using a slightly more sophisticated format like TOML, which is still human-readable.

[[success]]
time = 1979-05-27T07:32:00-08:00
requests = [ ... ]

[[success]]
time = 1980-05-27T07:32:00-08:00
requests = [ ... ]

But then it's not easily "sed-able" anymore. Is there a different format that comes to mind?

@untitaker (Collaborator) commented:

The idea was that if you run a cluster of lychees you could have a shared cache. This can be helpful when lychee is running behind an API endpoint, which I'm planning to do.

that's fair. it's still possible to do it with filesystem (see how clustering with afl works), but redis seems much easier.

@mre (Member Author) commented Jan 6, 2022

@untitaker, your comment got me thinking:

if stored as bloom filter, you need to store a bloom filter per (status, timestamp) tuple. if timestamp is rounded down to days or weeks, this might be feasible

For CSV, we could store the cache files like so

.lycheecache
├── 2022-01-05T07:32:00-08:00.csv
└── 2022-01-05T08:32:00-08:00.csv

The upside would be that we only have to store the timestamp once — in the filename. I think that was also @MichaIng's idea IIRC. The downside would be that we have to manage multiple cache files and think about cleaning them up to not waste disk space. Also we'd have to merge the files on startup, so maybe it's more trouble than it's worth.

@MichaIng (Member) commented Jan 6, 2022

Note that due to block sizes, multiple files are usually orders of magnitude larger than redundant timestamps. Not sure if it's a big issue to have a few KiB more?

@mre (Member Author) commented Jan 6, 2022

I'm okay with adding the timestamps to every entry. If we use Unix timestamps, they are quite compact anyway. If we use full RFC3339 dates, then they are a little longer, but also human-readable. Then again I don't think many people will edit the file manually, let alone based on the request time, so it shouldn't matter.
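A sketch of what a per-entry Unix timestamp could look like; the field names and the freshness check are illustrative, not the PR's final format:

// One cache row, e.g.: https://example.com,200,1641456000
struct CacheEntry {
    uri: String,
    status: u16,
    timestamp: u64, // Unix timestamp in seconds (UTC)
}

// An entry is reused only if it is younger than `max_age_secs`.
fn is_fresh(entry: &CacheEntry, now_secs: u64, max_age_secs: u64) -> bool {
    now_secs.saturating_sub(entry.timestamp) <= max_age_secs
}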

@untitaker (Collaborator) commented Jan 6, 2022

how about a single file that contains tuples/csv rows:

(status, timestamp * NUM_BUCKETS / cache_expiry_seconds, [xor filter of urls])
  • cache_expiry_seconds is a user-defined value that determines how old cache items can be

  • timestamp is unix timestamp in seconds

  • NUM_BUCKETS is a hardcoded value that will be important later

a cache row is expired when (timestamp * NUM_BUCKETS / cache_expiry_seconds) (i.e. the second column) is smaller than (current_timestamp - cache_expiry_seconds) * NUM_BUCKETS / cache_expiry_seconds

since cache_expiry_seconds controls how the second column is formatted, the setting has to be written to the cache file. if lychee finds that the setting in the cache file is different from its own config, it can either 1) throw away the cache file or 2) do more math to recover the cache file. on the upside, the second column is now much smaller than a timestamp

this puts a theoretical upper bound on the number of rows: there can only be NUM_BUCKETS rows. we know how large each row is, depending on whether we use fixed-width integers and xor filters, we know either an upper bound or the exact row size. so we know either the maximum or exact file size.

this is not for the sake of microoptimization, but to get rid of the file-size question. because things are now microoptimized, you can now choose NUM_BUCKETS freely to define the file size (or, again, an upper bound). you could make it dependent on file system block size, or make it a user setting, however then you also need to write it into the file to do proper cache invalidation (similarly to cache_expiry_second)
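Written out as code, the bucket arithmetic described above could look like this sketch (NUM_BUCKETS, its value, and the helper names are placeholders):

const NUM_BUCKETS: u64 = 64; // hardcoded; bounds the number of rows in the file

// The second column of a row: the timestamp collapsed into a bucket index.
fn bucket(timestamp_secs: u64, cache_expiry_seconds: u64) -> u64 {
    timestamp_secs * NUM_BUCKETS / cache_expiry_seconds
}

// A row is expired when its bucket value is smaller than the bucket of
// (current time - cache_expiry_seconds), as described above.
fn is_expired(row_bucket: u64, now_secs: u64, cache_expiry_seconds: u64) -> bool {
    row_bucket < bucket(now_secs.saturating_sub(cache_expiry_seconds), cache_expiry_seconds)
}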

@lebensterben (Member) commented:

Since you mentioned timestamps and clusters, it's worth mentioning that a time format such as

2022-01-05T07:32:00-08:00

isn't robust.

  • Epoch time in UTC is not location-specific, but it is not human-friendly.
  • ISO 8601 time + time zone seems better, but it is not robust either, because sometimes a location changes its time zone, which results in local time discontinuities.

@mre (Member Author) commented Jan 6, 2022

Guess a UNIX timestamp would be the easiest as it's guaranteed to be UTC.

@mre (Member Author) commented Jan 12, 2022

Question: if --cache is not set but the .lycheecache file exists, should we still load it? I guess not, but I wanted to hear people's thoughts.

@untitaker (Collaborator) commented Jan 12, 2022 via email:

I think the existence of --cache (an option to pass cache path) pretty much implies there's no magic filenames that lychee picks up automatically

@MichaIng (Member) commented Jan 12, 2022

I think the existence of --cache (an option to pass cache path) pretty much implies there's no magic filenames that lychee picks up automatically

--cache is a flag, telling lychee to use .lycheecache in current working dir (?). And I agree that without this flag set, .lycheecache should be neither written nor read 👍.
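A sketch of how the opt-in flag could be declared, assuming a clap-derive style CLI (the real lychee option definitions may differ):

use clap::Parser;

#[derive(Parser)]
struct Opts {
    /// Use .lycheecache in the current working directory.
    /// Without this flag, the cache file is neither written nor read.
    #[clap(long)]
    cache: bool,
}

fn main() {
    let opts = Opts::parse();
    if opts.cache {
        // load the cache file if it exists, and write it back on exit
    }
}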

@mre (Member Author) commented Jan 12, 2022

Added the timestamp functionality we talked about. @untitaker, your approach is superior and quite clever, but I decided to keep it simple for now and just add a timestamp to every entry. It's wasteful, but quite easy to explain and most people wouldn't use a CSV file anyway if they had space constraints.

As for the bloomfilter / xor filter, I quite frankly didn't get far.

StoreExt would have to be modified to have a

fn load<T: AsRef<Path>>(path: T, max_age_secs: u64) -> Result<dyn Cache>;

however, the trait object doesn't have a size. impl Trait is not allowed in traits at the moment, and I don't want to introduce a Box here to avoid the runtime overhead (it also doesn't feel very elegant).
I played around with an associated type like the Handler in this blog post about tower, but in the end it always wants a concrete type from me. ¯\_(ツ)_/¯
If someone has an idea I'd welcome PRs. 😆
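For reference, the associated-type idea could look roughly like the sketch below; it sidesteps the unsized dyn Cache return value, but as noted above it forces a concrete cache type per store implementation:

use std::path::Path;

use anyhow::Result; // stand-in for the crate's own Result alias

// Each store backend names the concrete cache type it loads, so no
// trait object (and no Box) is needed at the call site.
trait Store {
    type Cache;
    fn load(path: &Path, max_age_secs: u64) -> Result<Self::Cache>;
}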

Other than that I'd say we could merge this from my side. It's an optional feature anyway, so let's not sweat it. If anyone has the time to give it a spin, I'd be happy for any final comments.

@mre mentioned this pull request on Jan 13, 2022
@lebensterben (Member) commented:

looks good enough to merge. no need to finish all features in this PR.

@mre merged commit ac490f9 into master on Jan 14, 2022
@mre deleted the cache-3 branch on January 14, 2022
Linked issue (#348): Fix URI cache to cache URIs across multiple files