Scope queries via API #8

Open
anjackson opened this issue Nov 21, 2019 · 2 comments
Comments

anjackson commented Nov 21, 2019

For external parties to know which URLs we can crawl, and hence what is worth posting to the save endpoint or what requires a new W3ACT record, we should allow the current permissible crawl scope to be queried.

Essentially, GET /in-scope?url=http://test.url returns true/false.
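A rough sketch of what such an endpoint might look like, using only the Python standard library. The `ALLOWED_PREFIXES` tuple and the `is_in_scope` helper are hypothetical placeholders, not part of any existing service; a real implementation would consult the current crawl scope rather than a hard-coded list, and the JSON boolean response is only one possible reading of "returns true/false".

```python
import json
from urllib.parse import parse_qs

# Hypothetical placeholder scope: a real service would load the current
# permissible crawl scope instead of using a hard-coded prefix list.
ALLOWED_PREFIXES = ("http://test.url",)


def is_in_scope(url):
    """Naive stand-in check: plain string-prefix match against the list."""
    return url.startswith(ALLOWED_PREFIXES)


def app(environ, start_response):
    """Minimal WSGI app answering GET /in-scope?url=... with a JSON boolean."""
    if (environ.get("REQUEST_METHOD") != "GET"
            or environ.get("PATH_INFO") != "/in-scope"):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]

    urls = parse_qs(environ.get("QUERY_STRING", "")).get("url")
    if not urls:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"missing url parameter"]

    body = json.dumps(is_in_scope(urls[0])).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]
```

For local experimentation this could be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.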

N.B. This is similar to ukwa/ukwa-heritrix#37.

@anjackson
Contributor Author

This Python trie implementation would make a good 'backbone' for this kind of scope oracle. Given a URL, and using urlcanon to generate its SSURT, it can find the matching prefixes and use them to map the URL to scope rules.
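To make the idea concrete, here is a minimal sketch of longest-prefix matching over a trie. Everything in it is an assumption for illustration: `scope_key` is a simplified, SSURT-like stand-in (reversed host labels followed by the path) rather than the SSURT that urlcanon would produce, `ScopeTrie` is a plain dict-based trie rather than a library one, and the rule names are made up.

```python
from urllib.parse import urlsplit

_RULE = object()  # sentinel dict key marking "a rule ends at this node"


def scope_key(url):
    """Simplified SSURT-like key: reversed host labels, then the path.

    Illustrative stand-in only; not the real SSURT format from urlcanon.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))) + "," + (parts.path or "/")


class ScopeTrie:
    """Character trie mapping key prefixes to scope rules."""

    def __init__(self):
        self._root = {}

    def add(self, prefix, rule):
        node = self._root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[_RULE] = rule

    def longest_match(self, key):
        """Return the rule for the longest stored prefix of key, or None."""
        node = self._root
        best = node.get(_RULE)
        for ch in key:
            node = node.get(ch)
            if node is None:
                break
            if _RULE in node:
                best = node[_RULE]
        return best


# Hypothetical rules: the whole host is in scope, one subtree is excluded.
trie = ScopeTrie()
trie.add(scope_key("http://test.url/"), "in-scope")
trie.add(scope_key("http://test.url/private/"), "excluded")

print(trie.longest_match(scope_key("http://test.url/page")))
print(trie.longest_match(scope_key("http://test.url/private/x")))
print(trie.longest_match(scope_key("http://other.example/")))
```

Because the host labels are reversed, a shorter prefix such as `url,test,` would cover the domain and all its subdomains, which is the property that makes SURT-style keys a natural fit for prefix-based scope rules.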

@anjackson
Contributor Author

This seems better for tries: https://pypi.org/project/datrie/

urlcanon: https://pypi.org/project/urlcanon/
