Handle recursive sitemaps #7
Found this interesting package. My code snippet to get a list of all URLs from all sitemaps on a website:

```python
from usp.tree import sitemap_tree_for_homepage

def get_all_urls_from_all_sitemaps(website):
    tree = sitemap_tree_for_homepage(website)
    return [page.url for page in tree.all_pages()]

get_all_urls_from_all_sitemaps('https://www.dailythanthi.com/')
```

This gave me around 7000 article links. Pretty cool actually!
The above code first looks for the sitemap locations declared in robots.txt, and then falls back to well-known sitemap paths.
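For anyone curious, here is a minimal sketch of that robots.txt discovery step. The helper name `find_sitemaps_in_robots_txt` is hypothetical and not part of usp; this is only an approximation of the behavior described above:

```python
from urllib.parse import urljoin
from urllib.request import urlopen

def find_sitemaps_in_robots_txt(homepage):
    """Return sitemap URLs declared via 'Sitemap:' lines in robots.txt."""
    robots_url = urljoin(homepage, '/robots.txt')
    with urlopen(robots_url) as response:
        body = response.read().decode('utf-8', errors='replace')
    return [
        line.split(':', 1)[1].strip()
        for line in body.splitlines()
        if line.lower().startswith('sitemap:')
    ]
```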
This seems like a useful thing, but I was wondering in which cases it would help, since we have decided to use recursive crawls now. I have not seen many sources where the first-level sitemap is fine but the next levels are broken. Do you reckon this can be useful?
Sure, we'll close it then. But we'll also do the following:
Some sitemaps recursively contain other sitemaps. For instance:
https://www.dailythanthi.com/Sitemap/Sitemap.xml
However, these nested sitemaps may or may not comply with the sitemap format.
An example of a recursive sitemap that does comply with the sitemap format:
https://hindi.news18.com/sitemap.xml
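For illustration, a hand-rolled sketch of following such recursion by distinguishing `<sitemapindex>` documents from `<urlset>` documents. usp handles this internally, so treat this as an assumption-laden approximation that only covers the standard sitemaps.org namespace:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def collect_page_urls(sitemap_url, seen=None):
    """Recursively collect page URLs, following <sitemapindex> entries."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemap loops
        return []
    seen.add(sitemap_url)
    with urlopen(sitemap_url) as response:
        root = ET.fromstring(response.read())
    if root.tag == NS + 'sitemapindex':
        # A sitemap index: each <loc> points at another sitemap, so recurse.
        urls = []
        for loc in root.iter(NS + 'loc'):
            urls.extend(collect_page_urls(loc.text.strip(), seen))
        return urls
    # Otherwise assume a <urlset> whose <loc> entries are page URLs.
    return [loc.text.strip() for loc in root.iter(NS + 'loc')]
```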
Todo:
We should properly extract the (http) URLs from these links, as sketched below.
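A possible starting point for that todo: a hedged sketch that pulls http(s) links out of a sitemap-like document even when it does not comply with the sitemap format. The regex is a deliberately loose assumption, not a spec-grade URL parser:

```python
import re
from urllib.request import urlopen

# Loose pattern: anything starting with http(s):// up to whitespace or XML markup.
URL_PATTERN = re.compile(r'https?://[^\s<>"\']+')

def extract_http_urls(sitemap_url):
    """Return every http(s) URL found in the raw document text."""
    with urlopen(sitemap_url) as response:
        text = response.read().decode('utf-8', errors='replace')
    return URL_PATTERN.findall(text)
```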