Recursion Support (closes #55, #78) #165

Closed. Wants to merge 27 commits.
Changes shown from 21 of the 27 commits.

Commits (27):
a365f6a
wip
mre Jan 28, 2021
ef564d6
wip
mre Jan 28, 2021
f65543a
wip
mre Feb 1, 2021
9c6735f
wip
mre Feb 26, 2021
1d47d2b
Merge branch 'master' of github.com:lycheeverse/lychee into simple-re…
mre Feb 28, 2021
5be75b7
update
mre Feb 28, 2021
756e5b4
Replace own cache type with simple hashset
mre Feb 28, 2021
a5a68e8
Move recursion to separate function
mre Feb 28, 2021
0b86f1a
Fix documentation
mre Feb 28, 2021
5310e8b
Add recursion flag
mre Feb 28, 2021
e2a8624
Add logic to terminate the program with recursion
mre Mar 2, 2021
aa9c88b
Rename flag from "recursion" to "recursive"
mre Mar 2, 2021
148fbcd
Add support for max recursion level
mre Mar 2, 2021
23743ed
Simplify loop
mre Mar 2, 2021
8d42cbb
Make code more idiomatic
mre Mar 2, 2021
b73f709
Update comments
mre Mar 2, 2021
15ec139
Formatting
mre Mar 2, 2021
0f73fcb
Formatting
mre Mar 2, 2021
0544799
Mention recursion support in readme
mre Mar 2, 2021
1d73636
Change max-recursion param to depth
mre Mar 3, 2021
c6611ea
Paint progress bar in lychee color
mre Mar 9, 2021
7516dc7
Merge branch 'master' of github.com:lycheeverse/lychee into simple-re…
mre Mar 14, 2021
5be722d
Merge branch 'master' of github.com:lycheeverse/lychee into simple-re…
mre Mar 15, 2021
3c4e646
Add local link extraction test
mre Mar 15, 2021
363cf63
Merge branch 'master' of github.com:lycheeverse/lychee into simple-re…
mre Mar 18, 2021
eb42714
Accept localhost in recursion
mre Mar 19, 2021
b94f4ef
WIP integration test for recursion
mre Mar 19, 2021
11 changes: 7 additions & 4 deletions README.md
@@ -20,8 +20,8 @@ use
| -------------------- | ------- | ------------- | -------- | --------------------- | ------------ | ------------- | --------------------- | ------ |
| Language | Rust | Ruby | Go | JS | TypeScript | Python | JS | PHP |
| Async/Parallel | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] |
| JSON output | ![yes] | ![no] | ![yes] | ![yes] | ![yes] | ![maybe]<sup>1</sup> | ![yes] | ![yes] |
| Static binary | ![yes] | ![no] | ![yes] | ![no] | ![no] |![no] | ![no] | ![no] |
| JSON output | ![yes] | ![no] | ![yes] | ![yes] | ![yes] | ![maybe]<sup>1</sup> | ![yes] | ![yes] |
| Static binary | ![yes] | ![no] | ![yes] | ![no] | ![no] |![no] | ![no] | ![no] |
| Markdown files | ![yes] | ![yes] | ![no] | ![no] | ![no] | ![yes] |![yes] | ![no] |
| HTML files | ![yes] | ![no] | ![no] | ![yes] | ![yes] | ![no] | ![yes] | ![no] |
| Text files | ![yes] | ![no] | ![no] | ![no] | ![no] | ![no] | ![no] | ![no] |
@@ -32,7 +32,7 @@ use
| Custom user agent | ![yes] | ![no] | ![no] | ![yes] | ![no] | ![yes] | ![no] | ![no] |
| Relative URLs | ![yes] | ![yes] | ![no] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] |
| Skip relative URLs | ![yes] | ![no] | ![no] | ![maybe] | ![no] | ![no] | ![no] | ![no] |
| Include patterns | ![yes]| ![yes] | ![no] | ![yes] | ![no] | ![no] | ![no] | ![no] |
| Include patterns | ![yes] | ![yes] | ![no] | ![yes] | ![no] | ![no] | ![no] | ![no] |
| Exclude patterns | ![yes] | ![no] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] |
| Handle redirects | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] | ![yes] |
| Ignore insecure SSL | ![yes] | ![yes] | ![yes] | ![no] | ![no] | ![yes] | ![no] | ![yes] |
@@ -51,7 +51,7 @@ use
| [Use as library] | ![yes] | ![yes] | ![no] | ![yes] | ![yes] | ![no] | ![yes] | ![no] |
| Quiet mode | ![yes] | ![no] | ![no] | ![no] | ![yes] | ![yes] | ![yes] | ![yes] |
| Config file | ![yes] | ![no] | ![no] | ![no] | ![yes] | ![yes] | ![yes] | ![no] |
| Recursion | ![no] | ![no] | ![no] | ![yes] | ![yes] | ![yes] | ![yes] | ![no] |
| Recursion | ![yes] | ![no] | ![no] | ![yes] | ![yes] | ![yes] | ![yes] | ![no] |
| Amazing lychee logo | ![yes] | ![no] | ![no] | ![no] | ![no] | ![no] | ![no] | ![no] |

[awesome_bot]: https://github.com/dkhamsing/awesome_bot
@@ -165,6 +165,7 @@ FLAGS:
-i, --insecure Proceed for server connections considered insecure (invalid TLS)
-n, --no-progress Do not show progress bar. This is recommended for non-interactive shells (e.g. for
continuous integration)
-r, --recursive Enable recursion (make sub-requests for detected links)
--skip-missing Skip missing input files (default is to error if they don't exist)
-V, --version Prints version information
-v, --verbose Verbose program output
@@ -174,6 +175,8 @@ OPTIONS:
-b, --base-url <base-url> Base URL to check relative URLs
--basic-auth <basic-auth> Basic authentication support. E.g. `username:password`
-c, --config <config-file> Configuration file to use [default: ./lychee.toml]
--depth <depth> Stop link checking beyond this maximum recursion depth. (Recommended for
large inputs.)
Member comment: I believe we should mention that a depth less than zero also means infinite.
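That suggestion could look like the following sketch. The helper name and the signed type are hypothetical illustrations, not code from this PR (which uses `Option<usize>` and treats a missing flag as unlimited):

```rust
// Hypothetical helper illustrating a signed --depth flag, where any
// negative value means "recurse without limit". Not part of this PR.
fn within_depth(depth: Option<i64>, recursion_level: i64) -> bool {
    match depth {
        None => true,              // flag not given: no limit
        Some(d) if d < 0 => true,  // negative: explicitly infinite
        Some(d) => recursion_level <= d,
    }
}

fn main() {
    assert!(within_depth(None, 100));
    assert!(within_depth(Some(-1), 100));
    assert!(within_depth(Some(2), 2));
    assert!(!within_depth(Some(2), 3));
}
```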

--exclude <exclude>... Exclude URLs from checking (supports regex)
-f, --format <format> Output file format of status report (json, string) [default: string]
--github-token <github-token> GitHub API token to use when checking github.com links, to avoid rate
125 changes: 117 additions & 8 deletions src/bin/lychee/main.rs
@@ -1,4 +1,4 @@
use anyhow::{anyhow, Context, Result};
use anyhow::{anyhow, bail, Context, Result};
use headers::authorization::Basic;
use headers::{Authorization, HeaderMap, HeaderMapExt, HeaderName};
use indicatif::{ProgressBar, ProgressStyle};
@@ -8,15 +8,18 @@ use stats::color_response;
use std::{collections::HashSet, time::Duration};
use std::{fs, str::FromStr};
use structopt::StructOpt;
use tokio::sync::mpsc;
use tokio::sync::mpsc::{self, Sender};

mod options;
mod stats;

use crate::options::{Config, LycheeOptions};
use crate::stats::ResponseStats;

use lychee::collector::{self, Input};
use lychee::{
collector::{self, Input},
Cache, Request,
};
use lychee::{ClientBuilder, ClientPool, Response};

/// A C-like enum that can be cast to `i32` and used as process exit code.
@@ -86,6 +89,20 @@ fn fmt(stats: &ResponseStats, format: &Format) -> Result<String> {
})
}

// Get the set of input domains
// This is needed for supporting recursion
fn input_domains(inputs: Vec<Input>) -> HashSet<String> {
let mut domains = HashSet::new();
for input in inputs {
if let Input::RemoteUrl(url) = input {
if let Some(domain) = url.domain() {
domains.insert(domain.to_string());
}
}
}
domains
}
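A simplified, self-contained version of the helper above, using a local stand-in for lychee's `Input` type (the real `RemoteUrl` variant holds a parsed `url::Url`, and `domain()` is called on it; here a plain string stands in for the domain):

```rust
use std::collections::HashSet;

// Local stand-in for lychee::collector::Input, reduced for illustration.
enum Input {
    RemoteUrl(String), // the real variant holds a parsed url::Url
    Stdin,
}

// Mirrors the input_domains() helper: only remote-URL inputs contribute
// a domain; other input kinds (stdin, files) are skipped.
fn input_domains(inputs: Vec<Input>) -> HashSet<String> {
    let mut domains = HashSet::new();
    for input in inputs {
        if let Input::RemoteUrl(url) = input {
            domains.insert(url);
        }
    }
    domains
}

fn main() {
    let domains = input_domains(vec![
        Input::RemoteUrl("example.org".into()),
        Input::Stdin,
    ]);
    assert!(domains.contains("example.org"));
    assert_eq!(domains.len(), 1);
}
```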

async fn run(cfg: &Config, inputs: Vec<Input>) -> Result<i32> {
let mut headers = parse_headers(&cfg.headers)?;
if let Some(auth) = &cfg.basic_auth {
@@ -118,21 +135,26 @@ async fn run(cfg: &Config, inputs: Vec<Input>) -> Result<i32> {
.accepted(accepted)
.build()?;

// Create link cache to keep track of seen links
let mut cache = Cache::new();

let links = collector::collect_links(
&inputs,
cfg.base_url.clone(),
cfg.skip_missing,
max_concurrency,
)
.await?;
let mut total_requests = links.len();

let pb = match cfg.no_progress {
true => None,
false => {
let bar = ProgressBar::new(links.len() as u64)
.with_style(ProgressStyle::default_bar().template(
"{spinner:.red.bright} {pos}/{len:.dim} [{elapsed_precise}] {bar:25} {wide_msg}",
));
"{spinner:.red.bright} {pos}/{len:.dim} [{elapsed_precise}] {bar:25.magenta.bright/white} {wide_msg}",
)
.progress_chars("██"));
bar.enable_steady_tick(100);
Some(bar)
}
@@ -144,12 +166,13 @@ async fn run(cfg: &Config, inputs: Vec<Input>) -> Result<i32> {
let mut stats = ResponseStats::new();

let bar = pb.clone();
let sr = send_req.clone();
tokio::spawn(async move {
for link in links {
if let Some(pb) = &bar {
pb.set_message(&link.to_string());
};
send_req.send(link).await.unwrap();
sr.send(link).await.unwrap();
}
});

@@ -160,9 +183,33 @@ async fn run(cfg: &Config, inputs: Vec<Input>) -> Result<i32> {
clients.listen().await;
});

while let Some(response) = recv_resp.recv().await {
let input_domains: HashSet<String> = input_domains(inputs);

// We keep track of the total number of requests
// and exit the loop once we reach it.
// Otherwise the sender would never be dropped and
// we'd be stuck indefinitely.
let mut curr = 0;

while curr < total_requests {
curr += 1;
Member Author comment: @lebensterben, I'm not too happy about the above three lines. Is there a better way to handle the request queue compared to counting the number of total requests like I did?

Member comment: The tokio-stream crate has a wrapper type for the Receiver of the channel. That should work without counting the number. It's like the Future trait: you can poll it. It's also like the Iterator trait: next returns None when it's over.

Member Author comment: Ah, that's great. Sounds like what we need.

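The termination issue discussed above (run() keeps a `Sender` clone alive for recursion, so a plain receive loop would never observe the channel closing) can be reproduced in miniature with `std::sync::mpsc`. This only illustrates the counting workaround, not the PR's tokio code:

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Receives exactly `n` messages -- the counting workaround for a
/// channel whose last Sender is intentionally kept alive.
fn drain_n(rx: &Receiver<u32>, n: usize) -> Vec<u32> {
    (0..n).map(|_| rx.recv().unwrap()).collect()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Mirrors `send_req.clone()` kept alive for recursive sub-requests:
    // while this clone exists, waiting for channel closure would hang.
    let _keepalive = tx.clone();

    let total_requests: u32 = 3;
    thread::spawn(move || {
        for i in 0..total_requests {
            tx.send(i).unwrap();
        }
    });

    assert_eq!(drain_n(&rx, total_requests as usize), vec![0, 1, 2]);
}
```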
let response = recv_resp.recv().await.context("Receive channel closed")?;

show_progress(&pb, &response, cfg.verbose);
stats.add(response);
stats.add(response.clone());

if cfg.recursive {
let count = recurse(
response,
&mut cache,
&input_domains,
&cfg,
&pb,
send_req.clone(),
)
.await?;
total_requests += count;
}
}

// Note that print statements may interfere with the progress bar, so this
@@ -184,6 +231,68 @@ async fn run(cfg: &Config, inputs: Vec<Input>) -> Result<i32> {
}
}

async fn recurse(
response: Response,
cache: &mut Cache,
input_domains: &HashSet<String>,
cfg: &Config,
pb: &Option<ProgressBar>,
send_req: Sender<Request>,
) -> Result<usize> {
let recursion_level = response.recursion_level + 1;

if let Some(depth) = cfg.depth {
if recursion_level > depth {
// Maximum recursion depth reached; stop link checking.
return Ok(0);
}
}

if !response.status.is_success() {
return Ok(0);
}
if cache.contains(response.uri.as_str()) {
Member comment: Nitpick: I'm a bit skeptical of the name cache here. We're not really caching, we're merely "checking if this URI was checked before, and should therefore be skipped".

Member Author comment: How about seen?

return Ok(0);
}
cache.insert(response.uri.to_string());
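Incidentally, the contains/insert pair above can be collapsed, since `HashSet::insert` already reports whether the value was newly added. A sketch using the `seen` name proposed above (the helper is illustrative, not part of the PR):

```rust
use std::collections::HashSet;

/// Returns true when the URI had not been seen before.
fn check_and_mark(seen: &mut HashSet<String>, uri: &str) -> bool {
    // HashSet::insert combines the contains() check and the insertion:
    // it returns true only when the value was newly added.
    seen.insert(uri.to_string())
}

fn main() {
    let mut seen = HashSet::new();
    assert!(check_and_mark(&mut seen, "https://example.org/ok"));
    assert!(!check_and_mark(&mut seen, "https://example.org/ok"));
    assert_eq!(seen.len(), 1);
}
```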
Member comment: Since we're allowing infinite recursion, this data structure can really take up the entire memory with a sufficient number of links. Depending on the expected maximum number of links checked in recursive mode, we might consider a bloom filter here:

  • HashSet is probably fine if the number of links checked is in the order of millions or fewer
  • A bloom filter might be a good idea if the number of links checked is in the order of tens of millions or more

Super-rough calculations below. (Note that I'm just calculating the size of this one structure; there's probably going to be more memory overhead from various bits of the program.)

Assumptions:

  • 64-bit machine, e.g. 8 bytes per pointer
  • 64 bytes per link on average
  • 1 million links

Calculations:

So, a HashSet with 1 million unique links will take up ~64+24+1 = ~89 MB (~85 MiB)

For comparison, a bloom filter with ~1e-7 false positive probability is expected to take up a small fraction of that.

So, to be honest, HashSet should be absolutely fine for most users. But we should make a remark in the README that using recursive mode with tens of millions of links might cause high RAM usage.
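The reviewer's estimate can be checked against the standard bloom-filter sizing formula, m = -n·ln(p)/(ln 2)² bits for n items at false-positive rate p. This sketch just does the arithmetic and is not code from the PR:

```rust
// Standard bloom filter sizing: m bits for n items at false-positive rate p.
fn bloom_bits(n: f64, p: f64) -> f64 {
    -n * p.ln() / 2f64.ln().powi(2)
}

fn main() {
    let bits = bloom_bits(1_000_000.0, 1e-7);
    let mib = bits / 8.0 / (1024.0 * 1024.0);
    // Roughly 33.5 million bits, i.e. about 4 MiB -- far below the
    // ~85 MiB estimated for a HashSet of one million URIs.
    assert!(mib > 3.0 && mib < 5.0);
}
```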

Comment: Hi, would this be merged soon? And will the Docker image be updated with the change so the GH Action uses the latest version?

Member Author comment: @pawroman: Good idea with the bloom filter. I'll add that as soon as #208 is merged.
@lenisha: Hopefully yes. I don't like the current implementation anymore (see comment below), but I think it's not going to be a blocker either way.


if let lychee::Uri::Website(url) = response.uri {
let input = collector::Input::RemoteUrl(url.clone());

match url.domain() {
None => bail!("Cannot find domain in url: {}", url),
Some(domain) => {
if !input_domains.contains(domain) {
return Ok(0);
}
}
}

let links = collector::collect_links(
&[input],
cfg.base_url.clone(),
cfg.skip_missing,
cfg.max_concurrency,
)
.await?;
let count = links.len();

let bar = pb.clone();
tokio::spawn(async move {
for mut link in links {
link.recursion_level = recursion_level;
if let Some(pb) = &bar {
pb.inc_length(1);
pb.set_message(&link.to_string());
};
send_req.send(link).await.unwrap();
}
});
return Ok(count);
};
Ok(0)
}

fn read_header(input: &str) -> Result<(String, String)> {
let elements: Vec<_> = input.split('=').collect();
if elements.len() != 2 {
10 changes: 10 additions & 0 deletions src/bin/lychee/options.rs
@@ -245,6 +245,15 @@ pub struct Config {
#[structopt(short, long, default_value = "string")]
#[serde(default)]
pub format: Format,

/// Enable recursion (make sub-requests for detected links)
#[structopt(short, long)]
#[serde(default)]
pub recursive: bool,

/// Stop link checking beyond this maximum recursion depth. (Recommended for large inputs.)
#[structopt(long)]
pub depth: Option<usize>,
Member comment: I'd consider allowing this to be signed (e.g. isize), so that users could explicitly express the need for infinite recursion with numbers less than zero (e.g. -1).

}

impl Config {
@@ -299,6 +308,7 @@ impl Config {
skip_missing: false;
glob_ignore_case: false;
output: None;
recursive: false;
}
}
}
5 changes: 5 additions & 0 deletions src/bin/lychee/stats.rs
@@ -127,16 +127,19 @@ mod test_super {
uri: website("http://example.org/ok"),
status: Status::Ok(http::StatusCode::OK),
source: Input::Stdin,
recursion_level: 0,
});
stats.add(Response {
uri: website("http://example.org/failed"),
status: Status::Failed(http::StatusCode::BAD_GATEWAY),
source: Input::Stdin,
recursion_level: 0,
});
stats.add(Response {
uri: website("http://example.org/redirect"),
status: Status::Redirected(http::StatusCode::PERMANENT_REDIRECT),
source: Input::Stdin,
recursion_level: 0,
});
let mut expected_map = HashMap::new();
expected_map.insert(
@@ -146,11 +149,13 @@
uri: website("http://example.org/failed"),
status: Status::Failed(http::StatusCode::BAD_GATEWAY),
source: Input::Stdin,
recursion_level: 0,
},
Response {
uri: website("http://example.org/redirect"),
status: Status::Redirected(http::StatusCode::PERMANENT_REDIRECT),
source: Input::Stdin,
recursion_level: 0,
Member comment: For convenience and to cover the most common case, it might be worth adding an alternative constructor for Response which sets recursion_level to 0.

Or even have a ::new constructor with this behavior (recursion = 0) and another named something like ::with_recursion which allows setting the recursion level.
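The suggested constructor pair might look like this. The trimmed-down struct is hypothetical; the real Response also carries uri, status, and source fields:

```rust
// Trimmed-down stand-in for lychee's Response, illustrating the
// constructors suggested above. Field set is reduced for the sketch.
struct Response {
    recursion_level: usize,
}

impl Response {
    /// Covers the common case: a response from a top-level input.
    fn new() -> Self {
        Self { recursion_level: 0 }
    }

    /// Explicitly sets the recursion level for sub-requests.
    fn with_recursion(recursion_level: usize) -> Self {
        Self { recursion_level }
    }
}

fn main() {
    assert_eq!(Response::new().recursion_level, 0);
    assert_eq!(Response::with_recursion(3).recursion_level, 3);
}
```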

},
]
.into_iter()
16 changes: 13 additions & 3 deletions src/client.rs
@@ -169,7 +169,12 @@ impl Client {
Err(_e) => bail!("Invalid URI"),
};
if self.filter.excluded(&request) {
return Ok(Response::new(request.uri, Status::Excluded, request.source));
return Ok(Response::new(
request.uri,
Status::Excluded,
request.source,
request.recursion_level,
));
}
let status = match request.uri {
Uri::Website(ref url) => self.check_website(&url).await,
@@ -181,7 +186,12 @@ impl Client {
}
}
};
Ok(Response::new(request.uri, status, request.source))
Ok(Response::new(
request.uri,
status,
request.source,
request.recursion_level,
))
}

pub async fn check_website(&self, url: &Url) -> Status {
@@ -264,7 +274,7 @@ impl Client {
}
}

/// A convenience function to check a single URI
/// A convenience function to check a single URI.
/// This is the most simple link check and avoids having to create a client manually.
/// For more complex scenarios, look into using the `ClientBuilder` instead.
pub async fn check<T: TryInto<Request>>(request: T) -> Result<Response> {
4 changes: 2 additions & 2 deletions src/client_pool.rs
@@ -21,11 +21,11 @@ impl ClientPool {
}

pub async fn listen(&mut self) {
while let Some(req) = self.rx.recv().await {
while let Some(uri) = self.rx.recv().await {
let client = self.pool.get().await;
let tx = self.tx.clone();
tokio::spawn(async move {
let resp = client.check(req).await.expect("Invalid URI");
let resp = client.check(uri).await.expect("Invalid URI");
tx.send(resp)
.await
.expect("Cannot send response to channel");
6 changes: 3 additions & 3 deletions src/collector.rs
@@ -223,14 +223,14 @@ pub async fn collect_links(
drop(contents_tx);

// extract links from input contents
let mut extract_links_handles = vec![];
let mut extract_link_handles = vec![];

while let Some(result) = contents_rx.recv().await {
for input_content in result? {
let base_url = base_url.clone();
let handle =
tokio::task::spawn_blocking(move || extract_links(&input_content, base_url));
extract_links_handles.push(handle);
extract_link_handles.push(handle);
}
}

@@ -240,7 +240,7 @@
// a lot of inputs and/or the inputs are large (e.g. big files).
let mut collected_links: HashSet<Request> = HashSet::new();

for handle in extract_links_handles {
for handle in extract_link_handles {
let links = handle.await?;
collected_links.extend(links);
}
3 changes: 2 additions & 1 deletion src/extract.rs
@@ -157,7 +157,7 @@ pub(crate) fn extract_links(
for link in links {
match Uri::try_from(link.as_str()) {
Ok(uri) => {
requests.insert(Request::new(uri, input_content.input.clone()));
requests.insert(Request::new(uri, input_content.input.clone(), 0));
}
Err(_) => {
if !Path::new(&link).exists() {
@@ -166,6 +166,7 @@
requests.insert(Request::new(
Uri::Website(new_url),
input_content.input.clone(),
0,
));
}
}