How to Manage Searchgov Domains & Indexed Content

Martha Thompson edited this page Oct 4, 2019 · 45 revisions

How to index a new domain

  • With Resque running, add a new domain on the Search.gov Domains super admin page.
  • The SearchgovDomainPreparerJob will run in the background to set the domain scheme and status, and enqueue the SitemapIndexerJob to index the sitemap URLs. Note: when testing, use a domain with a limited number of sitemap URLs, such as search.gov (https://search.gov/sitemap.xml).
  • Refresh the "Search.gov Domains" page to verify that the "URLs count" for your domain has increased. The "Unfetched URLs Count" will decrease as the URLs are indexed.

How to update a domain's status

If a Searchgov domain is unavailable, for example because of a 403 block or a temporary issue, any indexing or crawling jobs for it will be blocked. SearchgovDomain#check_status checks and updates the status of a given domain:

> sd = SearchgovDomain.find_by(domain: 'www.census.gov')
=> #<SearchgovDomain id: 929, domain: "www.census.gov", clean_urls: true, status: "403 Forbidden", urls_count: 24654, unfetched_urls_count: 24654, created_at: "2018-06-27 14:06:17", updated_at: "2018-06-27 14:06:18", scheme: "http">
# update the status
> sd.check_status
[Document Fetch] {"domain":"www.census.gov","time":"2018-07-02 16:07:40","type":"searchgov_domain","url":"http://www.census.gov/"}
=> "200 OK"
# if `sd.activity` is still "indexing", mark indexing as done
> sd.done_indexing!
# restart indexing
> sd.index_sitemaps
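
To double-check a domain's status from outside the app, a rough stand-alone sketch like the one below may help. It is not the app's `check_status` implementation: `status_line` and `root_status` are hypothetical helpers, and the sketch assumes the stored status is simply the HTTP status code followed by the reason phrase, as in the "200 OK" and "403 Forbidden" values above. It does not follow redirects.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of the app): format a response the way the
# `status` column appears above, e.g. "200 OK" or "403 Forbidden".
def status_line(response)
  "#{response.code} #{response.message}"
end

# Hypothetical helper: send a HEAD request to the domain root and return its
# status line. Network errors are returned as their message, not raised.
def root_status(domain, scheme: 'http')
  uri = URI("#{scheme}://#{domain}/")
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head('/')
  end
  status_line(response)
rescue StandardError => e
  e.message
end
```

For example, `root_status('www.census.gov')` could be compared with the value stored in `sd.status` before and after running `sd.check_status`.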

How to bulk index a list of URLs

  1. Add your CSV list of URLs to the /home/search/crawls/ directory
  2. In the /home/search/usasearch/current directory, fire up the rails console:
$ bundle exec rails c

Create and index the SearchgovUrl records:

require 'csv' # harmless if the console has already loaded it

filepath = '/path/to/your/file/file.csv'

CSV.foreach(filepath) do |row|
  url = row[0] # or whichever column holds the URL
  SearchgovUrl.create!(url: url)
  puts "Created #{url}".green
rescue => e
  puts "Error processing #{url}: #{e}".red
end

# kick off indexing those URLs
sd = SearchgovDomain.find_by(domain: 'whatever.gov')

# if the sd.activity is already 'indexing', you are done;
# the new URLs will automatically be indexed.
# If the 'activity' is 'idle', kick off the indexing:
sd.index_urls
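
Because `create!` raises for every bad row, it can help to clean the list before creating records. The sketch below is one possible approach using only the Ruby standard library. `urls_from_csv` is a hypothetical helper, not part of the app, and it assumes that only unique, well-formed http(s) URLs are worth submitting.

```ruby
require 'csv'
require 'uri'

# Hypothetical helper: return the unique, well-formed http(s) URLs found in
# one column of a CSV file, in their original order.
def urls_from_csv(filepath, column: 0)
  urls = CSV.foreach(filepath).map { |row| row[column].to_s.strip }
  urls.uniq.select do |url|
    uri = URI.parse(url)
    # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes
    uri.is_a?(URI::HTTP) && !uri.host.nil?
  rescue URI::InvalidURIError
    false
  end
end
```

In the console, the cleaned list could then be fed to the same loop, for example `urls_from_csv(filepath).each { |url| SearchgovUrl.create!(url: url) }`.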

NOTE: There is a bulk_index rake task, but it should not be used until it has been updated to create the URLs (as above) without fetching them: https://www.pivotaltracker.com/story/show/157270274

How to promote a list of URLs

  1. Add your CSV list of URLs to the /home/search/crawls/ directory
  2. In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/promotion_list.csv]

How to demote a list of URLs

  1. Add your CSV list of URLs to the /home/search/crawls/ directory
  2. In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/demotion_list.csv,false]

How to index sitemaps

Index a sitemap for one domain

Note: Super admins can now add domains via the Search.gov Domains page. Newly added domains will automatically have their sitemaps indexed if the sitemaps are listed in robots.txt or located at whatever.gov/sitemap.xml or whatever.gov/sitemap_index.xml.

> searchgov_domain = SearchgovDomain.find_by(domain: 'www.foo.gov')
> searchgov_domain.index_sitemap
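
To preview which sitemap locations the note above implies for a domain, something like the following sketch may be useful. It is not the app's own discovery code: `sitemap_candidates` is a hypothetical helper that takes the text of a robots.txt file, collects its `Sitemap:` lines, and adds the two conventional locations.

```ruby
# Hypothetical helper: list the sitemap URLs to try for a domain, given the
# text of its robots.txt. The two conventional locations are always included.
def sitemap_candidates(domain, robots_txt, scheme: 'https')
  listed = robots_txt.each_line.map do |line|
    match = line.match(/\A\s*sitemap:\s*(\S+)/i)
    match && match[1]
  end.compact
  defaults = %w[sitemap.xml sitemap_index.xml].map do |path|
    "#{scheme}://#{domain}/#{path}"
  end
  (listed + defaults).uniq
end
```

A sitemap that appears in neither the robots.txt listing nor the default locations would have to be added by hand.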

Index sitemaps for all domains

SearchgovDomain.where(status: '200 OK').find_each do |sd|
  SitemapIndexerJob.perform_later(searchgov_domain: sd)
end

Caveat: If a sitemap contains redirected URLs, the sitemap indexing job may create additional new SearchgovUrl records. It does not yet fetch those automatically, although that is planned. For now, you can follow up a sitemap indexing process by indexing any unfetched URLs:

For one domain:

SearchgovDomain.find_by(domain: 'foo.gov').index_urls

For any domains with unfetched URLs:

SearchgovDomain.where('unfetched_urls_count > 0').each { |sd| sd.index_urls }

Index URLs from any given sitemap

Sometimes we are asked to index a sitemap at a URL that is neither /sitemap.xml nor listed in robots.txt. We discourage sites from doing that, because it prevents us and other search engines from indexing those sitemaps automatically. However, super admins can now add those sitemap URLs via the Search.gov Domains page.

If for some reason you need to extract the URLs from a sitemap manually, you can index them this way:

sitemap_url = 'https://whatever.gov/whatever_sitemap.xml'
# extract the sitemap entries
entries = Sitemaps.fetch(sitemap_url).entries ; nil

# create the SearchgovUrl records
entries.each do |entry|
  SearchgovUrl.find_or_create_by!(url: entry.loc.to_s).tap do |su|
    su.update(lastmod: entry.lastmod)
  end
  puts "Created or updated #{entry.loc}".green
rescue => e
  puts "Failed to create #{entry.loc}: #{e}".red
end

# trigger domain indexing
sd = SearchgovDomain.find_by(domain: 'whatever.gov')
sd.index_urls

For additional options, refer to the sitemaps gem documentation.
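
If the sitemaps gem is not available, for example when inspecting a downloaded sitemap file outside the app, the entries can also be read with REXML, which ships with Ruby. This is a swapped-in technique, not what the app does, and `sitemap_entries` and `element_children` are hypothetical helpers. The sketch handles a plain `urlset` only, not sitemap index files.

```ruby
require 'rexml/document'

# Hypothetical helper: the child elements of a node, skipping text nodes.
def element_children(node)
  node.children.grep(REXML::Element)
end

# Hypothetical helper: return [loc, lastmod] pairs from sitemap XML.
# lastmod is nil when an entry omits it.
def sitemap_entries(xml)
  root = REXML::Document.new(xml).root
  return [] unless root

  element_children(root).select { |node| node.name == 'url' }.map do |node|
    fields = element_children(node).to_h do |child|
      [child.name, child.text.to_s.strip]
    end
    [fields['loc'], fields['lastmod']]
  end
end
```

Each pair could then be passed to `SearchgovUrl.find_or_create_by!` and `update(lastmod: ...)` in the same way as the gem's entries above.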

How to delete a domain

Domain deletion may involve the deletion of hundreds of thousands of SearchgovUrl records, as well as the corresponding indexed documents in Elasticsearch. Deleting that data can take a very long time, so SearchgovDomain records should be deleted via a job:

> searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
> SearchgovDomainDestroyerJob.perform_later(searchgov_domain: searchgov_domain)

How to check HTML/web document parsing

Need to verify what data we're extracting from a web page or document? You just need the URL and/or an HTML snippet:

Parse an HTML document

> url = 'https://search.gov/blog/sitemaps.html'
> doc = HtmlDocument.new(document: URI.open(url).read, url: url)
> doc.title
"XML Sitemaps"
> doc.description
""
> puts doc.parsed_content
XML Sitemaps
An XML sitemap  (External link) is an XML formatted file...

# other methods
> ls doc
WebDocument#methods: changed  created  document  language  metadata  parsed_content  url
RobotsTaggable#methods: noindex?
HtmlDocument#methods: description  keywords  redirect_url  title
instance variables: @document  @html  @metadata  @parsed_content  @url

Parse an HTML snippet

> url = 'foo'
> html = '<meta property="article:modified_time" content="2018-11-20 00:00:00" />'
> doc = HtmlDocument.new(document: html, url: url)
> doc.changed
2018-11-20 00:00:00 -0500

Parse an Application Document (PDF, DOC, etc.)

First, be sure you've downloaded Apache Tika and are running the Tika REST server:

$ tika-rest-server

Then, in the Rails console:

> url = 'https://search.gov/pdf/2014-04-11-search-big-data.pdf'
> doc = ApplicationDocument.new(document: URI.open(url), url: url)
> doc.title
"Search Is the New Big Data"
> doc.parsed_content
"Search Is the New Big Data Search Is..."