
How to Manage Searchgov Domains & Indexed Content

Martha Thompson edited this page Aug 5, 2020 · 45 revisions

How to index a new domain

  • With Resque running, add a new domain on the Search.gov Domains super admin page.
  • The SearchgovDomainPreparerJob will run in the background to set the domain scheme and status, and enqueue the SitemapIndexerJob to index the sitemap URLs. (Note: when testing, it helps to use a domain with a limited number of sitemap URLs, such as search.gov: https://search.gov/sitemap.xml)
  • Refresh the "Search.gov Domains" page to verify that the "URLs count" for your domain has increased. The "Unfetched URLs Count" will decrease as the URLs are indexed.
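When picking a small domain to test with, it can help to check how many URLs its sitemap lists before adding it. A quick sketch with curl and grep (assumes a single, uncompressed sitemap file, not a sitemap index):

```shell
# Count <loc> entries in a sitemap to gauge how many URLs will be enqueued.
curl -s https://search.gov/sitemap.xml | grep -c '<loc>'
```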

How to update a domain's status

If a searchgov domain is unavailable due to a 403 block or a temporary issue, any indexing or crawling jobs will be blocked. SearchgovDomain#check_status will check and update the status for a given domain:

> sd = SearchgovDomain.find_by(domain: 'www.census.gov')
=> #<SearchgovDomain id: 929, domain: "www.census.gov", clean_urls: true, status: "403 Forbidden", urls_count: 24654, unfetched_urls_count: 24654, created_at: "2018-06-27 14:06:17", updated_at: "2018-06-27 14:06:18", scheme: "http">
# update the status
> sd.check_status
[Document Fetch] {"domain":"www.census.gov","time":"2018-07-02 16:07:40","type":"searchgov_domain","url":"http://www.census.gov/"}
=> "200 OK"
# update the activity if the `sd.activity` is "indexing"
> sd.done_indexing!
# restart indexing
> sd.index_sitemaps
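The gate itself appears to be a simple string comparison: jobs proceed only when the stored status is the literal "200 OK", as the bulk sitemap-indexing query later on this page suggests. A minimal standalone sketch, with `indexable?` as a hypothetical helper (not an actual app method):

```ruby
# Hypothetical helper illustrating the status gate described above:
# indexing/crawling jobs are blocked unless the domain's stored
# status string is exactly "200 OK".
def indexable?(status)
  status == '200 OK'
end

indexable?('200 OK')        # => true
indexable?('403 Forbidden') # => false
```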

How to bulk index a list of URLs

  1. Add your CSV list of URLs to the /home/search/dev_stuff/crawls/ directory
  2. In the /home/search/usasearch/current directory, fire up the rails console:
$ bundle exec rails c

Create and index the SearchgovUrl records:

require 'csv' # usually already loaded in the Rails console

filepath = '/path/to/your/file/file.csv'
domain = 'whatever.gov' # the domain of the URLs in the file to be indexed

CSV.foreach(filepath) do |row|
  url = row[0] #or the appropriate column
  SearchgovUrl.create!(url: url)
  puts "Created #{url}".green
rescue => e
  puts "Error processing #{url}: #{e}".red
end

# kick off indexing those URLs
sd = SearchgovDomain.find_by(domain: domain)

# if the sd.activity is already 'indexing', you are done;
# the new URLs will automatically be indexed.
# If the 'activity' is 'idle', kick off the indexing:
sd.index_urls

NOTE: There is a bulk_index rake task that should not be used until it's updated to simply create the URLs (as above), but not fetch them: https://www.pivotaltracker.com/story/show/157270274
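Client-supplied CSVs often contain duplicates, blank rows, or malformed entries, so it can save cleanup work to sanity-check the list before creating SearchgovUrl records. A standalone sketch of that pre-processing (the `valid_url?` and `load_urls` helpers are hypothetical, not part of the app):

```ruby
require 'csv'
require 'uri'
require 'tempfile'

# Hypothetical helper: true for well-formed http(s) URLs with a host.
def valid_url?(url)
  uri = URI.parse(url)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

# Read the first column of each row, then strip whitespace and drop
# blanks, duplicates, and malformed entries.
def load_urls(filepath)
  CSV.foreach(filepath).map { |row| row[0].to_s.strip }
     .reject(&:empty?)
     .uniq
     .select { |url| valid_url?(url) }
end

# Example run against a temporary file:
file = Tempfile.new(['urls', '.csv'])
file.write("https://whatever.gov/a\nhttps://whatever.gov/a\nnot a url\n")
file.close
puts load_urls(file.path).inspect # => ["https://whatever.gov/a"]
```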

How to re-index all URLs for a domain

We automatically re-fetch and re-index all URLs every 30 days. However, sometimes a site admin will make a bulk change to the pages on their site, such as changing metadata or page structure, and they will request that we re-index their site. To do this, we need to set all their indexed URLs to enqueued_for_reindex = true, and kick off indexing. Example:

> sd = SearchgovDomain.find_by(domain: 'oig.hhs.gov')
=> #<SearchgovDomain id: 1356, domain: "oig.hhs.gov", clean_urls: true, status: "200 OK", urls_count: 15203, unfetched_urls_count: 0, created_at: "2018-11-15 01:00:28", updated_at: "2019-10-25 13:23:35", scheme: "https", activity: "idle", canonical_domain: nil>

> sd.searchgov_urls.ok.in_batches.update_all(enqueued_for_reindex: true)
=> 14354

# Pick up their updated sitemap & kick off indexing
> sd.index_sitemaps

How to promote a list of URLs

  1. Add your CSV list of URLs to the /home/search/crawls/ directory
  2. In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/promotion_list.csv]

How to demote a list of URLs

  1. Add your CSV list of URLs to the /home/search/crawls/ directory
  2. In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/demotion_list.csv,false]

How to Index Sitemaps

Index a sitemap for one domain

Note: Super admins can now add domains via the Search.gov Domains page. Newly added domains will automatically have their sitemaps indexed, if they are listed on robots.txt OR located at whatever.gov/sitemap.xml or whatever.gov/sitemap_index.xml.

> searchgov_domain = SearchgovDomain.find_by(domain: 'www.foo.gov')
> searchgov_domain.index_sitemaps

Index sitemaps for all domains

SearchgovDomain.where(status: '200 OK').find_each do |sd|
  SitemapIndexerJob.perform_later(searchgov_domain: sd)
end

Caveat: If a sitemap contains redirected URLs, the sitemap indexing job may create additional new SearchgovUrl records. It does not currently fetch those new records automatically (though that is planned). For now, you can follow up a sitemap indexing run by indexing any unfetched URLs:

For one domain:

SearchgovDomain.find_by(domain: 'foo.gov').index_urls

For any domains with unfetched urls:

SearchgovDomain.where('unfetched_urls_count > 0').each{|sd| sd.index_urls }

Index URLs from any given sitemap

Sometimes we are asked to index a sitemap at a given url that's neither /sitemap.xml nor listed on robots.txt. We're discouraging sites from doing that, because it prevents us and other search engines from indexing those automatically. However, super admins can now add those sitemap URLs via the Search.gov Domains page.

If for some reason you need to extract the URLs from a sitemap manually, you can index this way:

sitemap_url = 'https://whatever.gov/whatever_sitemap.xml'
# extract the sitemap entries (the trailing `; nil` just keeps the
# console from printing the full entries array)
entries = Sitemaps.fetch(sitemap_url).entries ; nil

# create the SearchgovUrl records
entries.each do |entry|
  SearchgovUrl.find_or_create_by!(url: entry.loc.to_s).tap do |su|
    su.update(lastmod: entry.lastmod)
  end
  puts "Created or updated #{entry.loc}".green
rescue => e
  puts "Failed to create #{entry.loc}: #{e}".red
end

# trigger domain indexing
sd = SearchgovDomain.find_by(domain: 'whatever.gov')
sd.index_urls

For additional options, refer to the sitemaps gem documentation.
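If the sitemaps gem isn't available, the extraction step can also be done with Ruby's stdlib XML parser. A minimal standalone sketch (not the app's actual implementation) that pulls loc/lastmod pairs out of a sitemap string:

```ruby
require 'rexml/document'

# Parse a sitemap XML string into [url, lastmod] pairs; lastmod may be
# nil when the sitemap omits it. Matching on local element names keeps
# this working with the standard sitemap namespace.
def sitemap_entries(xml)
  doc = REXML::Document.new(xml)
  doc.root.elements.to_a.select { |el| el.name == 'url' }.map do |url_node|
    loc = nil
    lastmod = nil
    url_node.elements.each do |child|
      loc = child.text&.strip if child.name == 'loc'
      lastmod = child.text&.strip if child.name == 'lastmod'
    end
    [loc, lastmod]
  end
end

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://whatever.gov/page1</loc>
      <lastmod>2020-08-05</lastmod>
    </url>
    <url><loc>https://whatever.gov/page2</loc></url>
  </urlset>
XML

sitemap_entries(xml)
# => [["https://whatever.gov/page1", "2020-08-05"], ["https://whatever.gov/page2", nil]]
```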

How to delete a domain

Domain deletion may involve deleting hundreds of thousands of SearchgovUrl records, as well as the corresponding indexed documents in Elasticsearch. Deleting that data can take a very long time, so SearchgovDomain records should be deleted via a job:

> searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
> SearchgovDomainDestroyerJob.perform_later(searchgov_domain: searchgov_domain)

There are several ways to monitor a domain deletion job:

  • In the console, run searchgov_domain.reload.urls_count. For a while, that count will drop, and eventually it will error, because the domain will no longer exist.
  • Check the URLs count / domain existence for that domain in https://search.usa.gov/admin/searchgov_domains
  • Run the job synchronously in the console with perform_now instead of perform_later. (Note: you will see failures for many URLs. This is expected, as we make a DELETE request for every URL, not just the indexed ones, just in case something is out of sync between the Rails DB and the I14y index.)

The deletion job for very large domains (hundreds of thousands of URLs) may time out. If that occurs, you will need to run the destroyer job multiple times. A ticket to resolve this is in the backlog.

How to pause indexing for a given domain

Hack alert! We do not yet have a clean way to pause indexing for a domain, so for now, the solution is to set the domain's status to anything other than 200 OK, and set the activity back to idle:

sd = SearchgovDomain.find_by(domain: 'foo.gov')
sd.update!(status: 'temporarily pausing indexing - Jane Doe, 8/5/20', activity: 'idle')

How to check HTML/web document parsing

Need to verify what data we're extracting from a web page or document? You just need the URL and/or an HTML snippet:

Parse an HTML document

> require 'open-uri'
> url = 'https://search.gov/blog/sitemaps.html'
> doc = HtmlDocument.new(document: open(url).read, url: url)
> doc.title
"XML Sitemaps"
> doc.description
""
> puts doc.parsed_content
XML Sitemaps
An XML sitemap  (External link) is an XML formatted file...

# other methods
> ls doc
WebDocument#methods: changed  created  document  language  metadata  parsed_content  url
RobotsTaggable#methods: noindex?
HtmlDocument#methods: description  keywords  redirect_url  title
instance variables: @document  @html  @metadata  @parsed_content  @url

Parse an HTML snippet

> url = 'foo'
> html = '<meta property="article:modified_time" content="2018-11-20 00:00:00" />'
> doc = HtmlDocument.new(document: html, url: url)
> doc.changed
2018-11-20 00:00:00 -0500
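Under the hood, HtmlDocument is presumably reading the article:modified_time meta tag and parsing its value into a Time. A standalone sketch of that extraction using only the stdlib (a regex stand-in for illustration, not the app's real parser):

```ruby
require 'time'

# Pull the article:modified_time value out of an HTML string.
# The regex assumes the property attribute precedes content, as in
# the snippet above; a real parser would use an HTML library.
def modified_time(html)
  match = html.match(/<meta\s+property="article:modified_time"\s+content="([^"]+)"/)
  match && Time.parse(match[1])
end

html = '<meta property="article:modified_time" content="2018-11-20 00:00:00" />'
modified_time(html) # => 2018-11-20 00:00:00 (in the local time zone)
```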

Parse an Application Document (PDF, DOC, etc.)

First, be sure you've downloaded Apache Tika and are running the Tika REST server:

$ tika-rest-server
> url = 'https://search.gov/pdf/2014-04-11-search-big-data.pdf'
> doc = ApplicationDocument.new(document: open(url), url: url)
> doc.parsed_content
"Search Is the New Big Data Search Is..."
> doc.title
"Search Is the New Big Data"