Andy Jackson edited this page May 7, 2014 · 6 revisions

A critical aspect of w3act is to decide whether a given URL is in scope. In particular, although any URL is allowed to be entered into w3act, the user interface MUST make it clear to the user whether the Target is in scope, and the Crawl Feeds MUST NOT contain Targets that are not in scope.

Current Scoping Rules

A Target is in scope if any of the following statements is true:

  • Under Legal Deposit, all URLs for this Target meet at least one of the following criteria:
    • UK Web Domain:
      • The authority of the URI (i.e. the hostname) ends with '.uk'.
    • UK Hosting:
      • The IP address associated with the URI is geo-located in the UK (using this GeoIP2 database, in a manner similar to our H3 GeoIP module).
      • DEPRECATED: The Target is known to be hosted in the UK (manual boolean field).
    • UK Postal Address:
      • The Target features a page that specifies a UK postal address (a manual boolean field plus a text field to hold a specific URL that contains the address).
    • UK Publication (via Correspondence):
      • The Target is known to be a UK publication, according to correspondence with a curator (a manual boolean field plus a text field to hold details of the correspondence).
    • UK Publication (via Professional Judgement):
      • The Target is known to be a UK publication, in the professional judgement of a curator (a manual boolean field plus a text field to hold the justification).
  • or by permission:
    • The Target is one for which we have a license that gives us permission to crawl the site (and make it available), even if the Target does not fall under any Legal Deposit criteria.

This is a policy matter, and so may change in the future. Therefore, the code that implements this logic must be declared once in a well-specified location and re-used throughout.
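As an illustration only, the rules above could be gathered into a single function along the following lines. This is a minimal sketch, not the w3act implementation: the `Target` class, its field names, and the `country_of` lookup are all hypothetical, and the GeoIP lookup is passed in as a plain callable rather than wired to a real database. It also assumes one reading of the policy: the manual fields apply to the Target as a whole, so they cover every URL, while the domain and hosting checks are made per URL.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional
from urllib.parse import urlsplit


@dataclass
class Target:
    """Hypothetical Target record; field names are illustrative only."""
    urls: List[str] = field(default_factory=list)
    uk_hosting_manual: bool = False       # DEPRECATED manual flag
    uk_postal_address: bool = False       # manual flag
    via_correspondence: bool = False      # manual flag
    professional_judgement: bool = False  # manual flag
    has_licence: bool = False             # permission to crawl


# Hypothetical lookup: hostname -> ISO country code, or None if unknown.
CountryLookup = Callable[[str], Optional[str]]


def url_meets_automatic_criteria(url: str, country_of: CountryLookup) -> bool:
    """UK Web Domain or UK Hosting check for a single URL."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if not host:
        return False  # no usable hostname, e.g. a URL without a scheme
    return host.endswith(".uk") or country_of(host) == "GB"


def is_in_scope(target: Target,
                country_of: CountryLookup = lambda host: None) -> bool:
    """Single shared definition of the scoping rule (a sketch)."""
    if target.has_licence:
        return True  # in scope by permission
    if (target.uk_hosting_manual or target.uk_postal_address
            or target.via_correspondence or target.professional_judgement):
        return True  # a Target-level Legal Deposit criterion covers all URLs
    # Otherwise every URL must pass an automatic criterion.
    return bool(target.urls) and all(
        url_meets_automatic_criteria(u, country_of) for u in target.urls)
```

Keeping the policy in one function like this means a future rule change touches one place, and both the user interface and the Crawl Feed export can call the same code.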

Upon updating any of the relevant fields, this code should be re-run and used to populate an 'is in scope' field. This can be used to give immediate feedback to users as to whether further information is needed.
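The update-then-recompute behaviour might look something like the sketch below. The `TargetRecord` class, the `SCOPE_FIELDS` set, and the field names are assumptions made for illustration; the scoping rule itself is passed in as a callable so that this code does not duplicate the policy.

```python
from typing import Any, Callable, Dict

# Hypothetical names of the fields that can affect the scoping decision.
SCOPE_FIELDS = {
    "urls", "uk_hosting_manual", "uk_postal_address",
    "via_correspondence", "professional_judgement", "has_licence",
}


class TargetRecord:
    """Illustrative record that caches an 'is in scope' value."""

    def __init__(self, scope_rule: Callable[[Dict[str, Any]], bool],
                 **fields: Any) -> None:
        self._scope_rule = scope_rule
        self.fields: Dict[str, Any] = dict(fields)
        self.is_in_scope: bool = scope_rule(self.fields)

    def update(self, **changes: Any) -> bool:
        """Apply changes; re-run the rule only if a relevant field changed."""
        self.fields.update(changes)
        if SCOPE_FIELDS.intersection(changes):
            self.is_in_scope = self._scope_rule(self.fields)
        return self.is_in_scope


# Usage with a deliberately simplified stand-in rule:
rule = lambda f: bool(f.get("has_licence") or f.get("uk_postal_address"))
record = TargetRecord(rule, title="Example", has_licence=False)
record.update(uk_postal_address=True)  # 'is in scope' is refreshed here
```

Because `update` returns the refreshed value, a form handler could use it directly to tell the user whether further information is still needed.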

When looking up URLs, or when editing a Target, the user should be given rapid feedback as to whether the current set of URLs meets the automatic criteria for inclusion (UK Web Domain or UK Hosting).
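One possible shape for that feedback is a per-URL report naming which automatic criterion, if any, each URL satisfies. The function names and the `country_of` callable below are hypothetical; a real lookup would resolve the host and query the GeoIP database, which this sketch does not attempt.

```python
from typing import Callable, Dict, Iterable, Optional
from urllib.parse import urlsplit


def automatic_criterion(
        url: str,
        country_of: Callable[[str], Optional[str]] = lambda host: None,
) -> Optional[str]:
    """Name the automatic criterion a URL meets, or None if it meets neither."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if host.endswith(".uk"):
        return "UK Web Domain"
    if host and country_of(host) == "GB":
        return "UK Hosting"
    return None


def url_feedback(
        urls: Iterable[str],
        country_of: Callable[[str], Optional[str]] = lambda host: None,
) -> Dict[str, Optional[str]]:
    """Map each URL to the criterion it meets, for display next to the URL."""
    return {url: automatic_criterion(url, country_of) for url in urls}


def all_automatic(feedback: Dict[str, Optional[str]]) -> bool:
    """True when there is at least one URL and every URL met a criterion."""
    return bool(feedback) and all(v is not None for v in feedback.values())
```

URLs mapped to `None` are the ones for which a curator would need to supply a manual justification or a licence.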