
Automating runner.py via cron

Sean O'Donnell edited this page Sep 14, 2019 · 4 revisions

I wrote this script mainly to automate the collection of content from RSS feeds via cron. This document explains some of the considerations to keep in mind when configuring the rssbot runner for crontab automation.

Feed-scraping automation strategy considerations

There are various factors to consider when determining the best interval to execute your rssbot via crontab automation.

  • How many feeds am I indexing?
  • How much bandwidth am I paying for on my server/vm/cloud/etc.?
  • How many concurrent processes will my rssbot consume?
  • How many times do I really need to 'ping' RSS Feeds?
  • How much storage space is the rssbot going to consume in my database?
  • How much memory will rssbot consume while it runs?

These are all valid questions, and there is no single perfect answer. It depends on how, where, and when you use rssbot.
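To put rough numbers on the bandwidth question, the following minimal sketch estimates daily requests and download volume. The function name `estimate_daily_load` and the 100 KB average feed size are assumptions made for illustration and are not part of rssbot; measure your own feeds for a realistic figure.

```python
def estimate_daily_load(feeds_per_run, runs_per_day, avg_feed_kb):
    """Return (requests per day, megabytes downloaded per day)."""
    requests = feeds_per_run * runs_per_day
    megabytes = requests * avg_feed_kb / 1024
    return requests, megabytes


# 250 feeds per run (the runner.py default batch size), 4 runs a day,
# and an ASSUMED average download of 100 KB per feed.
requests, megabytes = estimate_daily_load(250, 4, 100)
print(f"{requests} requests/day, about {megabytes:.1f} MB/day")
```

Doubling the number of runs doubles both figures, so the cron interval is the easiest knob to turn if bandwidth is a concern.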

Inventory capacity planning and performance considerations

Small RSS Feed Inventory

If you have fewer than 1000 feeds in your RSS feed inventory, you probably don't need to run this script more than a few times a day.

The right number depends on how often the feeds you're indexing publish new content.
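As an illustration of "a few times a day", this sketch builds a crontab entry that runs a command at evenly spaced hours. The helper `cron_line` is hypothetical, and the interpreter and script paths are placeholders; whether runner.py needs any arguments is not covered here, so adjust the command to match your install.

```python
def cron_line(runs_per_day, command):
    """Build a crontab entry that runs `command` at evenly spaced hours."""
    if not 1 <= runs_per_day <= 24 or 24 % runs_per_day != 0:
        raise ValueError("runs_per_day must divide 24 evenly")
    step = 24 // runs_per_day
    # Hour field, e.g. "0,8,16" for three runs a day.
    hours = ",".join(str(hour) for hour in range(0, 24, step))
    # Fields: minute, hour, day of month, month, day of week, command.
    return f"0 {hours} * * * {command}"


# Placeholder paths: replace them with your own python3 and rssbot locations.
print(cron_line(3, "/usr/bin/python3 /path/to/rssbot/runner.py"))
```

The printed line can be pasted into `crontab -e`. Three runs a day lands on 00:00, 08:00, and 16:00.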

Large RSS Feed Inventory

If you have more than 1000 feeds in your RSS feed inventory, then you'll need to consider developing your own strategy to best manage the content you're consuming.

By default, the runner.py script will scrape 250 randomly selected feeds from your RSS Feed Inventory, so there really is no out-of-the-box solution if you're scraping thousands of feeds.
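To see why random batches of 250 fall short for a large inventory, here is a rough sketch of how much of the inventory a given number of runs can be expected to touch. It assumes each run picks its 250 feeds uniformly at random and independently of earlier runs; that is an assumption about the selection logic, not something confirmed from the runner.py source. The function names are hypothetical.

```python
import math


def expected_coverage(total_feeds, runs, batch_size=250):
    """Expected fraction of the inventory scraped at least once."""
    if total_feeds <= batch_size:
        return 1.0
    # Chance that a given feed is skipped by a single run.
    miss = 1 - batch_size / total_feeds
    return 1 - miss ** runs


def runs_for_coverage(total_feeds, target, batch_size=250):
    """Smallest number of runs whose expected coverage reaches `target`."""
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1 (exclusive)")
    if total_feeds <= batch_size:
        return 1
    miss = 1 - batch_size / total_feeds
    return math.ceil(math.log(1 - target) / math.log(miss))


# With 5000 feeds, 24 hourly runs reach roughly 71% of the inventory,
# and about 59 runs are needed to reach 95%.
print(f"{expected_coverage(5000, 24):.0%}")
print(runs_for_coverage(5000, 0.95))
```

Under these assumptions, even an hourly schedule leaves a sizeable share of a 5000-feed inventory untouched on any given day, which is why a custom strategy is worth the effort.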

Inventory content considerations

Personal Blogs

Since most personal blogs publish limited amounts of content per day, you don't need to keep hammering at their feeds every hour. Once or twice a day is generally fine.

Mainstream News/Blogs

Corporate/mainstream news/blog sites generally publish dozens of articles per day, so this is something to consider when determining your crontab interval.

If the majority of your feeds are commercial content like this, you may want to index their content multiple times a day. Once an hour is generally reasonable for such a case, but keep in mind that this means 24 possible hits per day to the same feed.
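The 24 hits per day figure is an upper bound. As a rough sketch, the average number of hits a single feed receives also depends on the inventory size, assuming each run selects 250 feeds uniformly at random (an assumption about the default behavior). The function name is hypothetical.

```python
def expected_hits_per_feed(total_feeds, runs_per_day, batch_size=250):
    """Average number of times one feed is requested per day."""
    # Share of the inventory that a single run touches, capped at 100%.
    share = min(1.0, batch_size / total_feeds)
    return runs_per_day * share


print(expected_hits_per_feed(200, 24))   # small inventory: every run hits every feed
print(expected_hits_per_feed(1000, 24))  # larger inventory: hits are spread out
```

For an inventory of 250 feeds or fewer, an hourly schedule requests every feed on every run, so that is the case where the interval matters most to the sites you're indexing.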
