Scope queries via API #8

anjackson · 2019-11-21T14:57:09Z

For external parties to know which URLs we can crawl, and hence what is worth posting to the save endpoint or what requires a new W3ACT record, we should allow the current permissible crawl scope to be queried.

Essentially, GET /in-scope?url=http://test.url returns true/false.
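As a rough illustration of that request/response shape, here is a minimal sketch using only the Python standard library. The names `IN_SCOPE_PREFIXES`, `is_in_scope` and `handle_in_scope` are hypothetical, and the plain string-prefix check is only a placeholder for whatever the real crawl scope rules turn out to be.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Hypothetical scope list for illustration only; the real scope
# would come from the crawler configuration / W3ACT records.
IN_SCOPE_PREFIXES = ("http://test.url", "https://example.org/")


def is_in_scope(url):
    """Placeholder scope check: a plain string-prefix match."""
    return url.startswith(IN_SCOPE_PREFIXES)


def handle_in_scope(request_target):
    """Map a request target such as '/in-scope?url=...' to (status, body)."""
    parts = urlsplit(request_target)
    if parts.path != "/in-scope":
        return 404, json.dumps({"error": "not found"})
    urls = parse_qs(parts.query).get("url")
    if not urls:
        return 400, json.dumps({"error": "missing url parameter"})
    # json.dumps renders Python booleans as JSON true/false
    return 200, json.dumps(is_in_scope(urls[0]))


if __name__ == "__main__":
    print(handle_in_scope("/in-scope?url=http://test.url"))
```

In practice the `url` value should be percent-encoded by the client, since an unencoded `&` or `#` in it would cut the parameter short.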

n.b. this is similar to: ukwa/ukwa-heritrix#37

anjackson · 2020-02-15T23:07:20Z

This Python trie implementation would make a good 'backbone' for this kind of scope oracle. Given a URL, and using urlcanon to generate the SSURTs, it can find the matching prefixes and use them to map URLs to scope rules.
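To make the prefix-matching idea concrete, here is a small pure-Python sketch. It assumes a simplified, SSURT-like key (reversed host labels followed by the path, scheme ignored) produced by a hypothetical `scope_key` helper. This is not urlcanon's actual SSURT output, and the dict-based `ScopeTrie` is a stand-in for a real trie library; the rule names are made up.

```python
from urllib.parse import urlsplit


def scope_key(url):
    """Simplified SSURT-like key (an assumption, not urlcanon output):
    'http://example.org/news/' -> 'org,example,/news/'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))) + "," + (parts.path or "/")


class ScopeTrie:
    """Character trie mapping key prefixes to scope rules."""

    def __init__(self):
        self.root = {}

    def add(self, prefix_url, rule):
        node = self.root
        for ch in scope_key(prefix_url):
            node = node.setdefault(ch, {})
        # 'rule' is a multi-character key, so it cannot clash with
        # the single-character child keys.
        node["rule"] = rule

    def longest_match(self, url):
        """Return the rule of the longest stored prefix of url, or None."""
        node, best = self.root, None
        for ch in scope_key(url):
            if ch not in node:
                break
            node = node[ch]
            best = node.get("rule", best)
        return best


if __name__ == "__main__":
    trie = ScopeTrie()
    trie.add("http://example.org/", "domain-crawl")
    trie.add("http://example.org/news/", "daily-crawl")
    print(trie.longest_match("http://example.org/news/today"))
    print(trie.longest_match("http://example.com/"))
```

A URL counts as in scope when `longest_match` returns a rule. Putting the host labels in reversed order keeps URLs from the same domain under a shared prefix, which is what makes a trie lookup a natural fit here.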

anjackson · 2021-02-28T13:51:51Z

This seems like a better fit for tries: https://pypi.org/project/datrie/

https://pypi.org/project/urlcanon/
