How to Manage Searchgov Domains & indexed content
THIS DOCUMENT IS DEPRECATED! These how-tos have been moved to Confluence: https://cm.usa.gov/confluence/display/SRCH/How+to+Manage+Searchgov+Domains+and+Indexed+Content
- With Resque running, add a new domain on the Search.gov Domains super admin page.
- The SearchgovDomainPreparerJob will run in the background to set the domain scheme and status, and enqueue the SitemapIndexerJob to index the sitemap URLs. (Note: when testing, it helps to use a domain with a limited number of sitemap URLs, such as search.gov: https://search.gov/sitemap.xml)
- Refresh the "Search.gov Domains" page to verify that the "URLs count" for your domain has increased. The "Unfetched URLs Count" will decrease as the URLs are indexed.
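When picking a small test domain, you can eyeball how many URLs its sitemap contains before adding it. A minimal sketch using Ruby's stdlib REXML (the sample XML is illustrative; a real sitemap also declares the sitemap XML namespace, omitted here for brevity):

```ruby
require 'rexml/document'

# A miniature sitemap like the one at https://search.gov/sitemap.xml
sitemap_xml = <<~XML
  <urlset>
    <url><loc>https://search.gov/</loc></url>
    <url><loc>https://search.gov/blog/sitemaps.html</loc></url>
  </urlset>
XML

doc = REXML::Document.new(sitemap_xml)
# Collect the text of every <loc> element nested under a <url> element
urls = doc.get_elements('//url/loc').map(&:text)
puts "#{urls.size} sitemap URLs"
```

A domain whose sitemap yields only a handful of URLs here will index quickly, making it easy to watch the counts change on the Search.gov Domains page.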
If a searchgov domain is unavailable due to a 403 block or a temporary issue, any indexing or crawling jobs will be blocked. The SearchgovDomain#check_status method will check and update the status for a given domain:
> sd = SearchgovDomain.find_by(domain: 'www.census.gov')
=> #<SearchgovDomain id: 929, domain: "www.census.gov", clean_urls: true, status: "403 Forbidden", urls_count: 24654, unfetched_urls_count: 24654, created_at: "2018-06-27 14:06:17", updated_at: "2018-06-27 14:06:18", scheme: "http">
# update the status
> sd.check_status
[Document Fetch] {"domain":"www.census.gov","time":"2018-07-02 16:07:40","type":"searchgov_domain","url":"http://www.census.gov/"}
=> "200 OK"
# update the activity if the `sd.activity` is "indexing"
> sd.done_indexing!
# restart indexing
> sd.index_sitemaps
If the domain indexing process dies due to a bug or other issue, you can often "un-stick" the stuck domain by performing the following steps:
pp sd = SearchgovDomain.find_by(domain: 'www.census.gov')
# Ensure that the domain is actually stuck (returning a '200', has unfetched URLs, but no URLs have been fetched for a suspiciously long time)
if sd.check_status == '200 OK' && sd.activity == 'indexing' && sd.searchgov_urls.fetch_required.any?
last_crawled_at = sd.searchgov_urls.maximum(:last_crawled_at)
if last_crawled_at && last_crawled_at < 6.hours.ago
sd.done_indexing!
sd.index_urls
end
end
Visit the Bulk Search.gov URL Upload page and follow the instructions.
- Ensure your CSV is properly delimited. If any URLs include commas, they must be double-quoted.
- Add your CSV list of URLs to the /home/search/dev_stuff/crawls/ directory.
- In the /home/search/usasearch/current directory, fire up the Rails console:
$ bundle exec rails c
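Before running the import, you can confirm that the CSV parses as you expect — in particular, that any comma-containing URLs were double-quoted. A minimal stdlib check (the sample data is hypothetical):

```ruby
require 'csv'

# A URL containing a comma must be double-quoted,
# or CSV will split it into two fields.
csv_data = <<~CSV
  https://example.gov/page1
  "https://example.gov/search?terms=a,b"
CSV

# Each row is an array of fields; the URL is in the first column
urls = CSV.parse(csv_data).map { |row| row[0] }
```

If a URL comes back truncated at a comma, fix the quoting in the file before uploading it.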
Create and index the SearchgovUrl records:
filepath = '/path/to/your/file/file.csv'
domain = 'whatever.gov' # the domain of the URLs in the file to be indexed
CSV.foreach(filepath) do |row|
url = row[0] # or the appropriate column
SearchgovUrl.create!(url: url)
puts "Created #{url}".green
rescue => e
if e.message == "Validation failed: Url has already been taken"
puts "Already exists: #{url}"
else
puts "Error processing #{url}: #{e}".red
end
end
# kick off indexing those URLs
sd = SearchgovDomain.find_by(domain: domain)
# if the sd.activity is already 'indexing', you are done;
# the new URLs will automatically be indexed.
# If the 'activity' is 'idle', kick off the indexing:
sd.index_urls
We automatically re-fetch and re-index all URLs every 30 days. However, sometimes a site admin will make a bulk change to the pages on their site, such as changing metadata or page structure, and they will request that we re-index their site. The "Reindex" button on the Searchgov Domains page in Super Admin will:
- set all their indexed URLs to be re-fetched (enqueued_for_reindex = true)
- pull new sitemap URLs
- kick off indexing
It does NOT delete any URLs. Any URLs that now redirect or return 404s will automatically be removed from our search index in the process of re-fetching. (If for some reason you need to force a mass deletion of URLs, see below.)
- Add your CSV list of URLs to the /home/search/crawls/ directory (the crawls directory may need to be created, as it is periodically blown away).
- In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/promotion_list.csv]
- Add your CSV list of URLs to the /home/search/crawls/ directory (the crawls directory may need to be created, as it is periodically blown away).
- In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/demotion_list.csv,false]
Note: Super admins can now add domains via the Search.gov Domains page. Newly added domains will automatically have their sitemaps indexed if the sitemaps are listed on robots.txt or located at whatever.gov/sitemap.xml or whatever.gov/sitemap_index.xml.
> searchgov_domain = SearchgovDomain.find_by(domain: 'www.foo.gov')
> searchgov_domain.index_sitemaps
SearchgovDomain.where('status = "200 OK"').each do |sd|
SitemapIndexerJob.perform_later(searchgov_domain: sd)
end
Caveat: If a sitemap contains redirected URLs, the sitemap indexing job may create additional new SearchgovUrl records. It does not currently fetch those automatically (but it will). For now, you can follow up a sitemap indexing process by indexing any unfetched URLs:
For one domain:
SearchgovDomain.find_by(domain: 'foo.gov').index_urls
For any domains with unfetched urls:
SearchgovDomain.where('unfetched_urls_count > 0').each{|sd| sd.index_urls }
Sometimes we are asked to index a sitemap at a given URL that's neither /sitemap.xml nor listed on robots.txt. We discourage sites from doing that, because it prevents us and other search engines from indexing those sitemaps automatically. However, super admins can now add those sitemap URLs via the Search.gov Domains page.
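For context, the robots.txt side of that automatic discovery comes down to reading Sitemap: directives. A minimal illustrative sketch (not the app's actual implementation; the sample robots.txt is hypothetical):

```ruby
# "Sitemap:" directives in robots.txt are case-insensitive
# and each carries a full sitemap URL.
robots_txt = <<~TXT
  User-agent: *
  Disallow: /admin/
  Sitemap: https://whatever.gov/sitemap.xml
  sitemap: https://whatever.gov/sitemap_index.xml
TXT

# Match lines starting with "sitemap:" (any case) and capture the URL
sitemap_urls = robots_txt.scan(/^sitemap:\s*(\S+)/i).flatten
```

A sitemap served from a non-standard path with no such directive is invisible to this kind of discovery, which is why we push sites toward the standard locations.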
If for some reason you need to extract the URLs from a sitemap manually, you can index this way:
sitemap_url = 'https://whatever.gov/whatever_sitemap.xml'
SitemapIndexer.new(sitemap_url: sitemap_url).index
Domain deletion may involve the deletion of hundreds of thousands of SearchgovUrl records, as well as the corresponding indexed documents in Elasticsearch. Deletion of that data can take a very long time, so SearchgovDomain records should be deleted via a job:
> searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
> SearchgovDomainDestroyerJob.perform_later(searchgov_domain: searchgov_domain)
There are several ways to monitor a domain deletion job:
- In the console, run searchgov_domain.reload.urls_count. For a while, that count will drop, and eventually it will error, because the domain will no longer exist.
- Check the URLs count / domain existence of that domain in https://search.usa.gov/admin/searchgov_domains
- Run the job synchronously in the console with perform_now instead of perform_later. (Note: you will see failures for many URLs. This is expected, as we make a DELETE request for every URL, not just the indexed ones, just in case something is out of sync between the Rails DB and the I14y index.)
The deletion job for very large domains (hundreds of thousands of URLs) may time out. If that occurs, you will need to run the destroyer job multiple times. A ticket to resolve this is in the backlog.
In general, we should avoid manually deleting URLs unless absolutely necessary. A fetch of an erroring URL automatically removes it from our search index. However, we occasionally want to mass delete a set of URLs (such as when a large domain updates their website, resulting in a prohibitive number of URLs 404ing). In such cases, we can delete them via the console. (NOTE: searchgov_url.destroy includes a callback to remove the URL from our search index via the I14y API. This can be time-consuming, but it's safer to ensure we've removed the URL from our search index, even for not_ok URLs.) Example:
searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
searchgov_domain.searchgov_urls.where(last_crawl_status: '404').find_each{ |su| su.destroy }
Hack alert! We do not yet have a clean way to pause indexing for a domain, so for now, the solution is to set the domain's status to anything other than 200 OK and set the activity back to idle:
sd = SearchgovDomain.find_by(domain: 'foo.gov')
sd.update!(status: 'temporarily pausing indexing - Jane Doe, 8/5/20', activity: 'idle')
Need to verify what data we're extracting from a web page or document? You just need the URL and/or an HTML snippet:
> url = 'https://search.gov/blog/sitemaps.html'
> doc = HtmlDocument.new(document: open(url).read, url: url)
> doc.title
"XML Sitemaps"
> doc.description
""
> puts doc.parsed_content
XML Sitemaps
An XML sitemap (External link) is an XML formatted file...
# other methods
> ls doc
WebDocument#methods: changed created document language metadata parsed_content url
RobotsTaggable#methods: noindex?
HtmlDocument#methods: description keywords redirect_url title
instance variables: @document @html @metadata @parsed_content @url
> url = 'foo'
> html = '<meta property="article:modified_time" content="2018-11-20 00:00:00" />'
> doc = HtmlDocument.new(document: html, url: url)
> doc.changed
2018-11-20 00:00:00 -0500
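The changed value above is read from the article:modified_time meta tag. As a rough standalone illustration of that extraction (a regex stand-in, not the app's actual HTML parsing):

```ruby
require 'time'

html = '<meta property="article:modified_time" content="2018-11-20 00:00:00" />'

# Grab the content attribute of the article:modified_time meta tag
match = html.match(/property="article:modified_time"\s+content="([^"]+)"/)
modified_time = match && Time.parse(match[1])
```

If the tag is absent, match is nil and modified_time stays nil, which mirrors how a document without that metadata simply has no changed value.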
First, be sure you've downloaded Apache Tika and are running the Tika REST server:
$ tika-rest-server
> url = 'https://search.gov/pdf/2014-04-11-search-big-data.pdf'
> doc = ApplicationDocument.new(document: open(url), url: url)
> doc.parsed_content
"Search Is the New Big Data Search Is..."
> doc.title
"Search Is the New Big Data"