-
-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recursion Support (closes #55, #78) #165
Conversation
Could you let me know what this task is about? The title does give me a hint, but I fail to understand the context? |
Sorry, forgot to link the issue. Here it is: #78. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Great to see this implemented.
I have left a few code comments and I have one generic comment: I think we should add at least one simple test for the recursive link checking.
@@ -174,6 +175,8 @@ OPTIONS: | |||
-b, --base-url <base-url> Base URL to check relative URLs | |||
--basic-auth <basic-auth> Basic authentication support. E.g. `username:password` | |||
-c, --config <config-file> Configuration file to use [default: ./lychee.toml] | |||
--depth <depth> Stop link checking beyond this maximum recursion depth. (Recommended for | |||
large inputs.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should mention that depth less than zero also means infinite.
|
||
/// Stop link checking beyond this maximum recursion depth. (Recommended for large inputs.) | ||
#[structopt(long)] | ||
pub depth: Option<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd consider allowing this to be signed (e.g. isize
), so that users could explicitly express the need for infinite recursion with numbers less than zero (e.g. -1
).
}, | ||
Response { | ||
uri: website("http://example.org/redirect"), | ||
status: Status::Redirected(http::StatusCode::PERMANENT_REDIRECT), | ||
source: Input::Stdin, | ||
recursion_level: 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For convenience and to cover the "most common case", it might be worth considering adding an alternative constructor for Response
which would set recursion_level
to 0.
Or even have ::new
constructor which would have this behavior (recursion = 0) and another one named something like ::with_recursion
which would allow setting recursion level.
* } | ||
* ``` | ||
*/ | ||
//! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to make this a doc-comment!
But isn't a blank line here creating a newline in the rendered docs? Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't check yet, but I assumed cargo would be smart enough to trim empty lines at the beginning and the end.
if !response.status.is_success() { | ||
return Ok(0); | ||
} | ||
if cache.contains(response.uri.as_str()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: I'm a bit skeptical of the name cache
here. We're not really caching, we're merely "checking if this URI was checked before, and should therefore be skipped".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about seen
?
if cache.contains(response.uri.as_str()) { | ||
return Ok(0); | ||
} | ||
cache.insert(response.uri.to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're allowing infinite recursion, this data structure can really take up the entire memory with sufficient number of links. Depending on the expected maximum number of links checked in recursive mode, we might consider a bloom filter here:
- HashSet is probably fine if the number of links checked is in the order of millions or fewer
- Bloom filter might be a good idea if the number of links checked is in the order of tens of millions or more
Super-rough calculations below. (Note that I'm just calculating the size of this one structure, there's probably going to be more memory overheads from various bits of the program).
Assumptions:
- 64 bit machine, e.g. 8 bytes per pointer
- 64 bytes per link on average
- 1 million links
Calculations:
- Link string storage: 64 MB
- String overhead (24 bytes per string): 24 MB
source: https://cheats.rs/#owned-strings - HashSet overhead is 1 byte per entry: 1MB
source: https://github.com/rust-lang/hashbrown#features
So, HashSet with 1 million unique links will take up ~64+24+1 = ~89 MB (~85 MiB)
For comparison, a Bloom Filter with ~1e-7 false positive probability is expected to take up:
- ~4MiB for 1 million links: https://hur.st/bloomfilter/?n=1M&p=1.0E-7&m=&k=
- ~80MiB for 20 million liks: https://hur.st/bloomfilter/?n=20M&p=1.0E-7&m=&k=
- ~4GiB for 1 billion links: https://hur.st/bloomfilter/?n=1G&p=1.0E-7&m=&k=
So, to be honest, HashSet should be absolutely fine for the most users. But we should make a remark in the README that memory usage when using recursive mode with tens millions of links might cause RAM usage issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, would it be merged soon? and Docker image updated with the change to make GH action use the latest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I'm not super happy with the current impl anymore as I count the links in the queue and then close the channel after all links got checked. It can lead to subtle bugs I think. |
We can have a dedicated
|
let mut curr = 0; | ||
|
||
while curr < total_requests { | ||
curr += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lebensterben, I'm not too happy about the above three lines. Is there a better way to handle the request queue compared to counting the number of total requests like I did?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tokio-stream crate has a wrapper type for Reciever of the channel.
That should work without counting the number.
It's like Future trait, you can poll it.
It's also like Iterator trait, next
returns None when it's over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's great. Sounds like what we need.
Yup, that's correct. |
Will put this on hold once again as we started implementing a stream-based approach in #330, which might supersede this branch soon. Sorry to everyone waiting on recursion support to land, but I'd like to get this right instead of merging a buggy solution prematurely. |
master
then and make adjustments in this PR (or create new PR). Closing this for now.This is a super basic recursion implementation of recursion as discussed in #78.
It doesn't use an extractor pool as discussed in #55. I think it's not needed at this point.
The current implementation just checks the responses, filters out websites, and pushes them back into the request queue.
I deliberately avoided to add the recursion support into the client. It would break separation of concerns. Also didn't want to add a separate recursion handler to the library as it I wouldn't expect any gains from that. Recursion handling is surprisingly tricky, because there are quite a few design decisions to make. (Recurse indefinitely or stop at a predefined recursion depth? Recurse on all domains or just the set of input domains?)
So most people using the library would have to make the decisions based on their use-case. The current recursion depth is part of the request/response struct, so it won't have to be wrapped in another struct at least.
TODO: