Handle recursive sitemaps #7
Found this interesting package. My code snippet to get a list of all URLs from all sitemaps on a website:

```python
from usp.tree import sitemap_tree_for_homepage

def get_all_urls_from_all_sitemaps(website):
    tree = sitemap_tree_for_homepage(website)
    return [page.url for page in tree.all_pages()]

get_all_urls_from_all_sitemaps('https://www.dailythanthi.com/')
```

This gave me around 7000 article links. Pretty cool actually!
The above code first looks for the sitemap locations declared in robots.txt, and then falls back to well-known sitemap paths.
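For anyone curious, here is a minimal sketch of that robots.txt discovery step. The helper name `find_sitemaps_in_robots_txt` is hypothetical and not part of usp; this is only an approximation of the behavior described above:

```python
from urllib.parse import urljoin
from urllib.request import urlopen

def find_sitemaps_in_robots_txt(homepage):
    """Return sitemap URLs declared via 'Sitemap:' lines in robots.txt."""
    robots_url = urljoin(homepage, '/robots.txt')
    with urlopen(robots_url) as response:
        body = response.read().decode('utf-8', errors='replace')
    return [
        line.split(':', 1)[1].strip()
        for line in body.splitlines()
        if line.lower().startswith('sitemap:')
    ]
```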
This seems like a useful thing, but I was wondering in which cases it would help, since we have decided to use recursive crawls now. I have not seen many sources where the first-level sitemap is fine but the next levels are broken. Do you reckon this can be useful?
Sure, we'll close it then. But we'll also do the following:
Some sitemaps recursively contain other sitemaps. For instance:
https://www.dailythanthi.com/Sitemap/Sitemap.xml
However, these nested sitemaps may or may not comply with the sitemap format.
An example of a recursive sitemap that does comply with the sitemap format:
https://hindi.news18.com/sitemap.xml
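For illustration, a hand-rolled sketch of following such recursion by distinguishing `<sitemapindex>` documents from `<urlset>` documents. usp handles this internally, so treat this as an assumption-laden approximation that only covers the standard sitemaps.org namespace:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def collect_page_urls(sitemap_url, seen=None):
    """Recursively collect page URLs, following <sitemapindex> entries."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemap loops
        return []
    seen.add(sitemap_url)
    with urlopen(sitemap_url) as response:
        root = ET.fromstring(response.read())
    if root.tag == NS + 'sitemapindex':
        # A sitemap index: each <loc> points at another sitemap, so recurse.
        urls = []
        for loc in root.iter(NS + 'loc'):
            urls.extend(collect_page_urls(loc.text.strip(), seen))
        return urls
    # Otherwise assume a <urlset> whose <loc> entries are page URLs.
    return [loc.text.strip() for loc in root.iter(NS + 'loc')]
```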
Todo:
We should properly extract the (http) URLs from these links, as sketched below.
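A possible starting point for that todo: a hedged sketch that pulls http(s) links out of a sitemap-like document even when it does not comply with the sitemap format. The regex is a deliberately loose assumption, not a spec-grade URL parser:

```python
import re
from urllib.request import urlopen

# Loose pattern: anything starting with http(s):// up to whitespace or XML markup.
URL_PATTERN = re.compile(r'https?://[^\s<>"\']+')

def extract_http_urls(sitemap_url):
    """Return every http(s) URL found in the raw document text."""
    with urlopen(sitemap_url) as response:
        text = response.read().decode('utf-8', errors='replace')
    return URL_PATTERN.findall(text)
```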