How to Manage Searchgov Domains & indexed content

Martha Thompson edited this page Jun 3, 2021 · 45 revisions

How to index a new domain

  • With Resque running, add a new domain on the Search.gov Domains super admin page.
  • The SearchgovDomainPreparerJob will run in the background to set the domain scheme and status, and enqueue the SitemapIndexerJob to index the sitemap urls. (Note, when testing, it helps to test a domain with a limited number of sitemap URLs, such as search.gov: https://search.gov/sitemap.xml)
  • Refresh the "Search.gov Domains" page to verify that the "URLs count" for your domain has increased. The "Unfetched URLs Count" will decrease as the URLs are indexed.

How to update a domain's status

If a Searchgov domain is unavailable due to a 403 block or a temporary issue, any indexing or crawling jobs for it will be blocked. The SearchgovDomain#check_status method checks and updates the status for a given domain:

> sd = SearchgovDomain.find_by(domain: 'www.census.gov')
=> #<SearchgovDomain id: 929, domain: "www.census.gov", clean_urls: true, status: "403 Forbidden", urls_count: 24654, unfetched_urls_count: 24654, created_at: "2018-06-27 14:06:17", updated_at: "2018-06-27 14:06:18", scheme: "http">
# update the status
> sd.check_status
[Document Fetch] {"domain":"www.census.gov","time":"2018-07-02 16:07:40","type":"searchgov_domain","url":"http://www.census.gov/"}
=> "200 OK"
# update the activity if the `sd.activity` is "indexing"
> sd.done_indexing!
# restart indexing
> sd.index_sitemaps

How to "Un-Stick" a Domain

If the domain indexing process dies due to a bug or other issue, you can often "un-stick" the stuck domain by performing the following steps:

pp sd = SearchgovDomain.find_by(domain: 'www.census.gov')

# Ensure that the domain is actually stuck: it returns '200 OK' and has unfetched URLs, but no URLs have been fetched for a suspiciously long time
if sd.check_status == '200 OK' && sd.activity == 'indexing' && sd.searchgov_urls.fetch_required.any?
  last_crawled_at = sd.searchgov_urls.maximum(:last_crawled_at)
  
  if last_crawled_at && last_crawled_at < 6.hours.ago
    sd.done_indexing!
    sd.index_urls
  end
end
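
The "stuck" test above can also be written over plain values, which makes the criteria easier to read and to check without touching the database. This is a minimal sketch, not code from the application: the `stuck?` helper and its parameter names are hypothetical, and the six-hour cutoff simply mirrors the `6.hours.ago` used above.

```ruby
# Hypothetical helper (not part of the codebase): the "stuck" criteria
# expressed over plain values instead of ActiveRecord objects.
STUCK_THRESHOLD_SECONDS = 6 * 60 * 60 # mirrors the 6.hours.ago cutoff above

def stuck?(status:, activity:, unfetched_count:, last_crawled_at:, now: Time.now)
  return false unless status == '200 OK'        # domain must be reachable
  return false unless activity == 'indexing'    # and marked as indexing
  return false unless unfetched_count.positive? # with URLs still waiting
  return false if last_crawled_at.nil?          # nothing crawled yet: inconclusive

  (now - last_crawled_at) > STUCK_THRESHOLD_SECONDS
end

# Possible console usage (attribute names as shown in the record above):
#   stuck?(status: sd.check_status, activity: sd.activity,
#          unfetched_count: sd.unfetched_urls_count,
#          last_crawled_at: sd.searchgov_urls.maximum(:last_crawled_at))
```

Passing `sd.check_status` rather than `sd.status` keeps the behavior of the snippet above, which refreshes the status before deciding.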

How to bulk index a list of URLs

...the new-fashioned way, via Super Admin

Visit the Bulk Search.gov URL Upload page and follow the instructions.

...the old-fashioned way, via the console

  1. Ensure your CSV is properly delimited. If any urls include commas, they must be double-quoted.
  2. Add your CSV list of URLs to the /home/search/dev_stuff/crawls/ directory
  3. In the /home/search/usasearch/current directory, fire up the rails console:
$ bundle exec rails c
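
Before creating any records, it can help to sanity-check the file against step 1. The sketch below is an illustration only: `precheck_urls` is a hypothetical helper, not part of the application, and it assumes the URLs are in the first column. It separates rows that parse as http(s) URLs from rows that need attention, and tallies hosts so you can confirm which domain(s) the file covers.

```ruby
require 'csv'
require 'uri'

# Hypothetical pre-check (not part of the codebase): reads one column of CSV
# text and separates usable http(s) URLs from rows that need attention.
def precheck_urls(csv_text, column: 0)
  valid = []
  invalid = []

  CSV.parse(csv_text).each do |row|
    value = row[column].to_s.strip
    next if value.empty? # skip blank lines

    begin
      uri = URI.parse(value)
      # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes
      if uri.is_a?(URI::HTTP) && uri.host
        valid << value
      else
        invalid << value
      end
    rescue URI::InvalidURIError
      invalid << value
    end
  end

  hosts = valid.map { |url| URI.parse(url).host }.tally
  { valid: valid, invalid: invalid, hosts: hosts }
end

# Possible console usage:
#   report = precheck_urls(File.read('/path/to/your/file/file.csv'))
#   pp report[:invalid], report[:hosts]
```

Note that an unquoted URL containing a comma still parses as a URL, just a truncated one, so this check does not replace the quoting rule in step 1.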

Create and index the SearchgovUrl records:

filepath = '/path/to/your/file/file.csv'
domain = 'whatever.gov' # the domain of the URLs in the file to be indexed

CSV.foreach(filepath) do |row|
  url = row[0] # or the appropriate column
  SearchgovUrl.create!(url: url)
  puts "Created #{url}".green
rescue => e
  if e.message == "Validation failed: Url has already been taken"
    puts "Already exists: #{url}"
  else
    puts "Error processing #{url}: #{e}".red
  end
end

# kick off indexing those URLs
sd = SearchgovDomain.find_by(domain: domain)

# if the sd.activity is already 'indexing', you are done;
# the new URLs will automatically be indexed.
# If the 'activity' is 'idle', kick off the indexing:
sd.index_urls

How to re-index all URLs for a domain

We automatically re-fetch and re-index all URLs every 30 days. However, sometimes a site admin will make a bulk change to the pages on their site, such as changing metadata or page structure, and they will request that we re-index their site. The "Reindex" button on the Searchgov Domains page in Super Admin will:

  • set all their indexed URLs to be re-fetched (enqueued_for_reindex = true)
  • pull new sitemap URLs
  • kick off indexing

It does NOT delete any URLs. Any URLs that now redirect or return 404s will automatically be removed from our search index in the process of re-fetching. (If for some reason you need to force a mass-deletion of URLs, see below.)

How to promote a list of URLs

  1. Add your CSV list of URLs to the /home/search/crawls/ directory
  2. In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/promotion_list.csv]

How to demote a list of URLs

  1. Add your CSV list of URLs to the /home/search/crawls/ directory
  2. In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/demotion_list.csv,false]

How to Index Sitemaps

Index a sitemap for one domain

Note: Super admins can now add domains via the Search.gov Domains page. Newly added domains will automatically have their sitemaps indexed if the sitemaps are listed in robots.txt OR located at whatever.gov/sitemap.xml or whatever.gov/sitemap_index.xml.

> searchgov_domain = SearchgovDomain.find_by(domain: 'www.foo.gov')
> searchgov_domain.index_sitemaps
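
To see which sitemap locations the note above implies for a given domain, the discovery rule can be sketched in plain Ruby. This is an assumption-laden illustration, not the application's implementation: `candidate_sitemap_urls` and `DEFAULT_SITEMAP_PATHS` are hypothetical names, and returning the robots.txt entries together with both default locations is a guess at the behavior rather than something this page confirms.

```ruby
# Hypothetical helper (not the app's implementation): lists the places a
# domain's sitemaps would be looked for, per the note above.
DEFAULT_SITEMAP_PATHS = ['/sitemap.xml', '/sitemap_index.xml'].freeze

def candidate_sitemap_urls(domain:, robots_txt:, scheme: 'https')
  # Collect the value of every "Sitemap:" line (case-insensitive) in robots.txt
  listed = robots_txt.to_s.each_line.filter_map do |line|
    match = line.match(/\A\s*sitemap:\s*(\S+)/i)
    match && match[1]
  end

  defaults = DEFAULT_SITEMAP_PATHS.map { |path| "#{scheme}://#{domain}#{path}" }
  (listed + defaults).uniq
end

# Possible usage, with robots.txt content pasted in as a string:
#   candidate_sitemap_urls(domain: 'www.foo.gov', robots_txt: robots_body)
```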

Index sitemaps for all domains

SearchgovDomain.where(status: '200 OK').each do |sd|
  SitemapIndexerJob.perform_later(searchgov_domain: sd)
end

Caveat: If a sitemap contains redirected URLs, the sitemap indexing job may create additional new SearchgovUrl records. It does not currently fetch those automatically (although that is planned). For now, you can follow up a sitemap indexing process by indexing any unfetched URLs:

For one domain:

SearchgovDomain.find_by(domain: 'foo.gov').index_urls

For any domains with unfetched urls:

SearchgovDomain.where('unfetched_urls_count > 0').each{|sd| sd.index_urls }

Index URLs from any given sitemap

Sometimes we are asked to index a sitemap at a given url that's neither /sitemap.xml nor listed on robots.txt. We're discouraging sites from doing that, because it prevents us and other search engines from indexing those automatically. However, super admins can now add those sitemap URLs via the Search.gov Domains page.

If for some reason you need to extract the URLs from a sitemap manually, you can index this way:

sitemap_url = 'https://whatever.gov/whatever_sitemap.xml'

SitemapIndexer.new(sitemap_url: sitemap_url).index
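
If you only want to inspect which URLs a sitemap lists before indexing it, the `<loc>` values can be pulled out with REXML. This is a rough sketch under stated assumptions: `sitemap_locs` is a hypothetical helper, it is not how SitemapIndexer works internally, and it expects the sitemap XML to have been downloaded into a string already.

```ruby
require 'rexml/document'

# Hypothetical helper (not the app's SitemapIndexer): collects the text of
# every <loc> element in sitemap XML, regardless of namespace prefix.
def sitemap_locs(xml)
  doc = REXML::Document.new(xml)
  locs = []
  return locs if doc.root.nil? # empty input has no root element

  # each_recursive visits every descendant element of the root
  doc.root.each_recursive do |element|
    locs << element.text.to_s.strip if element.name == 'loc'
  end
  locs.reject(&:empty?)
end

# Possible usage:
#   puts sitemap_locs(File.read('whatever_sitemap.xml'))
```

For a sitemap index file, the `<loc>` values are the URLs of child sitemaps rather than pages, so the helper would need to be run again on each of those.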

How to delete a domain

Domain deletion may involve the deletion of hundreds of thousands of SearchgovUrl records, as well as the corresponding indexed documents in Elasticsearch. Deleting that data can take a very long time, so SearchgovDomain records should be deleted via a job:

> searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
> SearchgovDomainDestroyerJob.perform_later(searchgov_domain: searchgov_domain)

There are several ways to monitor a domain deletion job:

  • In the console, run searchgov_domain.reload.urls_count. For a while, that count will drop, and eventually it will error, because the domain will no longer exist.
  • Check the URLs count for that domain, or whether the domain still exists, at https://search.usa.gov/admin/searchgov_domains
  • Run the job synchronously in the console with perform_now instead of perform_later. (Note: you will see failures for many URLs. This is expected, as we make a DELETE request for every URL, not just the indexed ones, just in case something is out of sync between the Rails DB and the I14y index.)
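
The first monitoring option can be turned into a small polling loop so you do not have to re-run the count by hand. This is only a sketch: `watch_count` is a hypothetical helper, and the interval and check limit are arbitrary defaults. It relies on the block returning nil once the domain is gone, which `find_by` does (unlike `reload`, which raises).

```ruby
# Hypothetical helper (not part of the codebase): repeatedly calls the block,
# which should return the current URL count, or nil once the domain is gone.
def watch_count(interval: 60, max_checks: 120, out: $stdout)
  max_checks.times do
    count = yield
    if count.nil?
      out.puts 'Domain no longer exists; deletion finished.'
      return true
    end
    out.puts "#{Time.now.strftime('%H:%M:%S')} urls_count=#{count}"
    sleep interval
  end
  false # gave up before the domain disappeared
end

# Possible console usage:
#   watch_count { SearchgovDomain.find_by(domain: 'foo.gov')&.urls_count }
```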

The deletion job for very large domains (hundreds of thousands of URLs) may time out. If that occurs, you will need to run the destroyer job multiple times. A ticket to resolve this is in the backlog.

How to delete a given set of URLs

In general, we should avoid manually deleting URLs unless absolutely necessary. A fetch of an erroring URL automatically removes it from our search index. However, we occasionally want to mass delete a set of URLs (such as when a large domain updates their website, resulting in a prohibitive number of URLs 404'ing). In such cases, we can delete them via the console. (NOTE: searchgov_url.destroy includes a callback to remove the URL from our search index via the I14y API. This can be time-consuming, but it's safer to ensure we've removed the URL from our search index, even for not_ok URLs.) Example:

searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
searchgov_domain.searchgov_urls.where(last_crawl_status: '404').find_each{ |su| su.destroy }

How to pause indexing for a given domain

Hack alert! We do not yet have a clean way to pause indexing for a domain, so for now, the solution is to set the domain's status to anything other than 200 OK, and set the activity back to idle:

searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
searchgov_domain.update!(status: 'temporarily pausing indexing - Jane Doe, 8/5/20', activity: 'idle')
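
Because the status doubles as a note to whoever looks at the domain next, it helps to keep the wording consistent. The following is a small sketch of one way to build that string. `pause_status` is a hypothetical helper, and the format simply copies the example above.

```ruby
require 'date'

# Hypothetical helper (not part of the codebase): builds a pause note in the
# same style as the example status above, e.g. "... - Jane Doe, 8/5/20".
def pause_status(name, date = Date.today)
  "temporarily pausing indexing - #{name}, #{date.strftime('%-m/%-d/%y')}"
end

# Possible console usage:
#   searchgov_domain.update!(status: pause_status('Jane Doe'), activity: 'idle')
```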

How to check HTML/web document parsing

Need to verify what data we're extracting from a web page or document? You just need the URL and/or an HTML snippet:

Parse an HTML document

> url = 'https://search.gov/blog/sitemaps.html'
> doc = HtmlDocument.new(document: open(url).read, url: url)
> doc.title
"XML Sitemaps"
> doc.description
""
> puts doc.parsed_content
XML Sitemaps
An XML sitemap  (External link) is an XML formatted file...

# other methods
> ls doc
WebDocument#methods: changed  created  document  language  metadata  parsed_content  url
RobotsTaggable#methods: noindex?
HtmlDocument#methods: description  keywords  redirect_url  title
instance variables: @document  @html  @metadata  @parsed_content  @url

Parse an HTML snippet

> url = 'foo'
> html = '<meta property="article:modified_time" content="2018-11-20 00:00:00" />'
> doc = HtmlDocument.new(document: html, url: url)
> doc.changed
2018-11-20 00:00:00 -0500

Parse an Application Document (PDF, DOC, etc.)

First, be sure you've downloaded Apache Tika and are running the Tika REST server:

$ tika-rest-server
> url = 'https://search.gov/pdf/2014-04-11-search-big-data.pdf'
> doc = ApplicationDocument.new(document: open(url), url: url)
> doc.parsed_content
"Search Is the New Big Data Search Is..."
> doc.title
"Search Is the New Big Data"