Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend repitition test to span across two files #349

Merged
merged 13 commits into from
Oct 7, 2021
Merged

Extend repitition test to span across two files #349

merged 13 commits into from
Oct 7, 2021

Conversation

MichaIng
Copy link
Member

A first step to address: #348

@MichaIng
Copy link
Member Author

MichaIng commented Sep 30, 2021

Strange, the test succeeds:

test cli::test_repetition ... ok

with contradicts my tests where in this case the URLs were checked and counted once per file. Did I do the syntax right to add the two input files?

Or the issue was fixed in the meantime between my tests and this commit? 🤔
I'll repeat my tests now with lychee compiled from latest master.

lychee-bin/tests/cli.rs Outdated Show resolved Hide resolved
@MichaIng
Copy link
Member Author

Okay, "great", no it fails as expected 🙂.

@MichaIng
Copy link
Member Author

Here is the part which should filter out cached links. Mandatory for this to work of course is that there is only a single Collector instance:

// Filter out already cached links (duplicates)
links.retain(|l| !self.cache.contains(&l.uri));

@mre
Copy link
Member

mre commented Sep 30, 2021

I wonder why we even have that line at all. Given that links is a HashSet<Request>, it should just work if we update it as we do in the line after that.

@MichaIng
Copy link
Member Author

MichaIng commented Sep 30, 2021

Hmm, you mean identical URIs then have an identical hash (=key) so that duplicates shouldn't be possible? I also see only a single let links = Collector::new( Collector instance (outside of examples and tests) in main.rs > async fn run(, so at least there is no obvious reason why multiple instances could exist.

Ah wait, link checks start already (async) after one file has been processed, right? Is it possible that a link is checked before the duplicate has been parsed, and when being parsed a second time the cached result is overridden or so?

@MichaIng
Copy link
Member Author

MichaIng commented Sep 30, 2021

In test_glob_ignore_case it also needs to expect only a single check, I guess, same as in test_glob?
As an alternative, probably there is a way to have different URIs for the mocked servers? Not sure what makes most sense there, whether we want to include testing the URI cache or explicitly want to test whether two different URIs/servers are checked independently.

Otherwise, awesome, the cache now seems to work 👍. Probably links.retain can also be skipped if links.extend + update_cache excludes duplicates already?
Or probably it can be made a little more efficient, to skip/remove duplicates only from new_links instead of checking this of all links on every iteration for the handle loop?

@MichaIng
Copy link
Member Author

Failing tests:

  • test_collect_links
  • test_require_https

@MichaIng
Copy link
Member Author

MichaIng commented Sep 30, 2021

Okay, I think something has gone wrong. test_collect_links shows that finally only a single URL was returned in responses while there are definitely 5 different URIs added. Probably the test_glob* tests should still show two checked URIs (and hence the two mocked servers have different URIs) and only one is checked for a similar reason than in test_collect_links.

In test_require_https the URL needs to stay plain HTTP as is is explicitly about testing the --require-https option. I'll revert this.


@mre
Okay I think I found the reason for the lost URIs:

  • links.retain(|l| !self.cache.contains(&l.uri)); now removes the URIs which were just added in the previous loop iteration, as they are now synced to the cache immediately. So only the last URI remains.
  • Instead of removing everything from links which is in the cache, we need to do this for new_links only. So in every iteration of the loop, only URIs which are not in the cache yet are added to the function-wide links, but nothing which was added to links already must be removed.
  • Also instead of adding all links to the cache on every loop iteration, we could do this with new_links only, which should reduce overhead a little. I'm just not sure how to adjust update_cache argument and handling when passing new_links.
  • Now it is also clear why the additional retain is still required. links and cache are both hash sets, so they cannot contain duplicates naturally. But we want to avoid that collect_links > client instances check the same links each. So links itself cannot contain duplicates, but it also must not contain any URI which was added by another collect_links instance to the cache before. Did I understand this right? 😅

lychee-lib/src/client.rs Outdated Show resolved Hide resolved
@MichaIng
Copy link
Member Author

MichaIng commented Oct 1, 2021

I tried to fix the issue that each loop clears links. I hope links and new_links are the same variable type, else it would fail and needs fix by someone who is able to fix e.g. update_cache argument definitions accordingly.
EDIT: Okay does not work as new_links is no HashSet. Shall we make it one or shall we update the methods accordindly?

The whole issue with the cache, why it didn't work as expected, seems to be due to the race condition between multiple collect_links instances when checking their links against the cache and then updating the cache afterwards. Updating the cache directly within the loop and directly after already cached URLs have been removed from new links, minimises this race condition, but does not fully eliminate it. In theory the new links would need to be "locked" in the cache first before being filtered, so that in the meantime a second instance cannot add any of those to the cache. But I have no good idea how achieve this best.
EDIT: Hmm, actually reviewing the previous code, I am not 100% sure anymore whether the race condition really is the issue. I mean links was filtered against cache directly before they were added to the cache. I cannot imagine that there is a notable amount of time between those two steps so that a race condition is relevant? But I have no other explanation why it didn't/doesn't work either. We'll see when having syntax of current state fixed, so that links are filtered and cache updated with a smaller amount of links, potentially reducing the time between both steps.

@mre
Copy link
Member

mre commented Oct 2, 2021

Adding the cache to the collector was a mistake. The collector should have a single responsibility: collecting links. If anything then we should have a cache for requests. There it would be a simple lookup if a URL was already requested with the added benefit of tracking the number of duplicate links in the future. In conclusion I'd vote for removing the cache entirely for now and add it back later. 😉

@MichaIng
Copy link
Member Author

MichaIng commented Oct 2, 2021

Just for my understanding:

  • Based on max concurrency (and number of links, of course), there is a number of concurrent collect_links instances, which return links, so I guess there is the same number of client side request handlers attached to those?
  • One input cannot be handled by multiple collect instances but basically the inputs are sorted into multiple collect instances?

So there needs to be one global cache, links or requests pool, and it shouldn't matter which one it is, as long as duplicates across all concurrent instances are filtered at some point, in theory the earlier the better (less overhead). I'm not sure why it doesn't work currently (side of the currently false syntax 😉), so if you don't see an easy way to fix it the way it was intended initially, I agree to remove it for now, especially since I'm looking forward for a release with local files support 🙂.

Wouldn't it be possible that all collect instances feed a global links pool, being a hash set to avoid duplicates in the first place, and to have all request handlers then directly pick from this global pool (locking/tagging the entry first so that it is not handled a second time)? That way we don't need a separate cache to check against.
EDIT: Actually you had this idea already: #55
And would address as well: #163
Probably there is a way to store the links/URIs and their response in the same HashSet, so that it is a single pool for everything, which naturally excludes dupes on both ends.

@MichaIng
Copy link
Member Author

MichaIng commented Oct 5, 2021

Okay how shall we proceed? The cache does not work reliably, but if really concurrency/race condition is the issue, then it should at least work partly when using multiple and larger inputs, and I do not see a general issue to filter it in the collector already, at least until we find time for a better implementation. I wonder whether #330 will help 🤔. Probably we could leave things for now and rebase the two-files test extension afterwards.

If anyone knows how to fix the current state so that new_links is filtered against the cache, we could at least see whether it helps. That retain(|l| !self.cache.contains(&l.uri)) generally works has been proven by the previous CI test results, where links was "correctly" cleared as all contained links were found in the cache already: #349 (comment)

But otherwise it is trivial to remove the parts with the cache for now and revert the tests to check for filtered repetitions in a single file only.

lychee-lib/src/collector.rs Outdated Show resolved Hide resolved
@mre
Copy link
Member

mre commented Oct 6, 2021

I think there is an option to just add a #[cached] derive to the client check method similar to the one we already have for the path look up. 🤔 That should solve our issues and be way more elegant imho.

@MichaIng
Copy link
Member Author

MichaIng commented Oct 6, 2021

Sounds like a more native solution 👍. Btw, shall I outsource the HTTP => HTTPS changes into an own PR?

mre and others added 9 commits October 7, 2021 12:58
Reqwest comes with its own request pool, so there's no need in adding
another layer of indirection. This also gets rid of a lot of allocs.
Reqwest comes with its own request pool, so there's no need in adding
another layer of indirection. This also gets rid of a lot of allocs.
Signed-off-by: MichaIng <[email protected]>
@mre
Copy link
Member

mre commented Oct 7, 2021

I've removed caching across multiple files as discussed.
Was hoping to add back the caching to client with a #[cached] macro as we do for paths, but no luck because #[cached] is not supported for methods yet. Did some refactoring around the client while I was at it, though. Namely, I removed the deadpool because it's not needed (reqwest comes with its own request pool). So while this is not strictly related to the goal of this PR, I'd still like to keep the changes, as performance is not worse in my tests (probably faster if anything) and we get rid of a dependency and quite a few allocations (~2% in my tests for the sentry docs example).

@MichaIng
Copy link
Member Author

MichaIng commented Oct 7, 2021

How is the collector now tied to the clients? I had the impression that one collector iteration was tied to one client, but when there is no client pool now but a single instance with internal threading/pool, may this even solve the problem of concurrent collector iterations?

@mre
Copy link
Member

mre commented Oct 7, 2021

Not sure I understand. The collector and clients were always independent (?). The collector would return a HashSet of requests and those would be fed to the client pool via a channel. So there was no coupling between the two. So there shouldn't have been any race condition. Now that we removed the client pool, we feed the requests to the same client instance. Apart from that nothing changed.

Sorry, I'm pretty sure I misunderstood. 😅

@mre
Copy link
Member

mre commented Oct 7, 2021

Apart from that, this is ready to be merged from my point of view.

@MichaIng
Copy link
Member Author

MichaIng commented Oct 7, 2021

Ah okay, was my misunderstanding then. I was just wondering where this Ok(links) of the collector is going to, as this is assured to not contain dupes. Found it now here: https://github.com/lycheeverse/lychee/blob/251332e/lychee-bin/src/main.rs#L210-L213

Actually I don't understand where the concurrency of the link collection comes from, to me it looks like there is a single Collector => collect_links => Ok(links), while only filling the links HashSet is done by concurrent handles. So when there is only a single links HashSet filled and returned, how is it possible that links are checked multiple times? Is the status of a link check stored in this HashSet as well and may links.extend(new_links?); reset this status when one of the new_links is present already?

@mre
Copy link
Member

mre commented Oct 7, 2021

Is the status of a link check stored in this HashSet as well and may links.extend(new_links?); reset this status when one of the new_links is present already?

Yeah something like that. TBH I don't really care because the current cache impl is broken and too entangled with the collector. We'll find a better option like using a XOR filter, which is very memory-efficient and fast. But that's for another PR. 😉 Will merge this one now unless anyone has any final comments.

@mre mre merged commit 961f12e into lycheeverse:master Oct 7, 2021
@mre
Copy link
Member

mre commented Oct 7, 2021

@MichaIng I've merged the changes now. Let's build a better caching layer in the future. As always, thanks for the initial PR and the feedback during the development. 😃

@mre
Copy link
Member

mre commented Oct 7, 2021

Meh, just tested master and it's super slow now. Looks like I misunderstood reqwest's concurrency model. It feels more like it's requesting every page sequentially. Maybe it's a simple fix without having to add back deadpool. 🤔
@MichaIng can you check?

@MichaIng
Copy link
Member Author

MichaIng commented Oct 7, 2021

Yes I can confirm, on our website, with latest master, execution time raised from ~8 seconds to 2 minutes, on the docs from 13 seconds to 13 minutes (!). That client pool definitely needs to stay until another way of concurrency is found.

@mre mre mentioned this pull request Oct 8, 2021
@mre
Copy link
Member

mre commented Oct 8, 2021

Created a PR and will merge it when it gets green. That should take care of it.

@dcroote dcroote mentioned this pull request Oct 22, 2021
@mre mre mentioned this pull request Jan 4, 2022
4 tasks
mre added a commit that referenced this pull request Jan 14, 2022
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:

 * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version.    Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).

This is an optional feature. Caching only gets used if the `--cache` flag is set.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants