How to Manage Searchgov Domains & indexed content
- With Resque running, add a new domain on the Search.gov Domains super admin page.
- The SearchgovDomainPreparerJob will run in the background to set the domain scheme and status, and enqueue the SitemapIndexerJob to index the sitemap URLs. (Note: when testing, it helps to use a domain with a limited number of sitemap URLs, such as search.gov: https://search.gov/sitemap.xml)
- Refresh the "Search.gov Domains" page to verify that the "URLs count" for your domain has increased. The "Unfetched URLs Count" will decrease as the URLs are indexed.
If a searchgov domain is unavailable due to a 403 block or a temporary issue, any indexing or crawling jobs for it will be blocked. SearchgovDomain#check_status checks and updates the status for a given domain:
> sd = SearchgovDomain.find_by(domain: 'www.census.gov')
=> #<SearchgovDomain id: 929, domain: "www.census.gov", clean_urls: true, status: "403 Forbidden", urls_count: 24654, unfetched_urls_count: 24654, created_at: "2018-06-27 14:06:17", updated_at: "2018-06-27 14:06:18", scheme: "http">
# update the status
> sd.check_status
[Document Fetch] {"domain":"www.census.gov","time":"2018-07-02 16:07:40","type":"searchgov_domain","url":"http://www.census.gov/"}
=> "200 OK"
# update the activity if the `sd.activity` is "indexing"
> sd.done_indexing!
# restart indexing
> sd.index_sitemaps
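When several domains are blocked at once, the manual steps above can be wrapped in a small helper. The following is a minimal sketch, not part of the application: `recheck_blocked_domains` is a hypothetical name, and it assumes only the methods shown in the console session (status, check_status, activity, done_indexing!, index_sitemaps), with check_status returning the new status string as in the example output.

```ruby
# Sketch: re-check every domain whose status is not "200 OK" and restart
# sitemap indexing for the ones that have recovered.
# `domains` is any enumerable of objects that respond to the methods used
# below; in the Rails console these would be SearchgovDomain records.
def recheck_blocked_domains(domains, ok_status: '200 OK')
  recovered = []
  domains.each do |sd|
    next if sd.status == ok_status
    # check_status is assumed to return the freshly fetched status string
    next unless sd.check_status == ok_status

    # clear a stale "indexing" activity before restarting
    sd.done_indexing! if sd.activity == 'indexing'
    sd.index_sitemaps
    recovered << sd.domain
  end
  recovered
end
```

In the Rails console this could be called as `recheck_blocked_domains(SearchgovDomain.where.not(status: '200 OK'))`; the return value lists the domains that were restarted.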
- Add your CSV list of URLs to the /home/search/crawls/ directory.
- In the /home/search/usasearch/current directory, fire up the rails console:
$ bundle exec rails c
Create and index the SearchgovUrl records:
require 'csv'
filepath = '/path/to/your/file/file.csv'
domain = 'whatever.gov' # the domain of the URLs in the file to be indexed
CSV.foreach(filepath) do |row|
  url = row[0] # or the appropriate column
  SearchgovUrl.create!(url: url)
  puts "Created #{url}".green
rescue => e
  # report the failure and continue with the next row
  puts "Error processing #{url}: #{e}".red
end
# kick off indexing those URLs
sd = SearchgovDomain.find_by(domain: domain)
# if the sd.activity is already 'indexing', you are done;
# the new URLs will automatically be indexed.
# If the 'activity' is 'idle', kick off the indexing:
sd.index_urls
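Before creating records from a CSV like the one above, it can help to check that every row really belongs to the domain you are about to index. The following is a minimal sketch using only the Ruby standard library; `partition_urls` is a hypothetical helper, not part of the application, and it assumes the URLs are in the first column.

```ruby
require 'csv'
require 'uri'

# Sketch: split the first column of a CSV into URLs whose host matches the
# expected domain and rows that should be reviewed by hand.
def partition_urls(filepath, domain)
  valid = []
  rejected = []
  CSV.foreach(filepath) do |row|
    url = row[0].to_s.strip
    next if url.empty?

    host = begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil # unparseable rows fall through to the rejected list
    end
    if host && host.downcase == domain.downcase
      valid << url
    else
      rejected << url
    end
  end
  # drop exact duplicates from the list of URLs to create
  [valid.uniq, rejected]
end
```

Rows without a scheme (for example `whatever.gov/page`) have no parseable host and end up in the rejected list, so they can be corrected before the records are created.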
NOTE: There is a bulk_index rake task that should not be used until it's updated to simply create the URLs (as above), but not fetch them: https://www.pivotaltracker.com/story/show/157270274
We automatically re-fetch and re-index all URLs every 30 days. However, sometimes a site admin will make a bulk change to the pages on their site, such as changing metadata or page structure, and will ask us to re-index the site. To do this, set enqueued_for_reindex = true on all of the domain's indexed URLs, then kick off indexing. Example:
> sd = SearchgovDomain.find_by(domain: 'oig.hhs.gov')
=> #<SearchgovDomain id: 1356, domain: "oig.hhs.gov", clean_urls: true, status: "200 OK", urls_count: 15203, unfetched_urls_count: 0, created_at: "2018-11-15 01:00:28", updated_at: "2019-10-25 13:23:35", scheme: "https", activity: "idle", canonical_domain: nil>
irb(main):035:0> sd.searchgov_urls.ok.count
=> 14354
irb(main):036:0> sd.searchgov_urls.ok.in_batches.update_all(enqueued_for_reindex: true)
=> 14354
irb(main):037:0> sd.index_urls
Enqueued SearchgovDomainIndexerJob (Job ID: f10f6b56-a3d7-461c-a9d2-af015d1e3de2) to Resque(searchgov) with arguments: {:searchgov_domain=>#<GlobalID:0x000000000b4ce900 @uri=#<URI::GID gid://usasearch/SearchgovDomain/1356>>, :delay=>2}
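If re-index requests come in regularly, the two console steps above can be combined. The following is a minimal sketch: `enqueue_domain_reindex` is a hypothetical helper, not part of the application, and it relies only on the calls shown in the session above (searchgov_urls.ok.in_batches.update_all, activity, index_urls).

```ruby
# Sketch: flag a domain's successfully indexed URLs for re-indexing and
# start the indexer if it is not already running.
# `sd` is expected to behave like a SearchgovDomain record.
def enqueue_domain_reindex(sd)
  flagged = sd.searchgov_urls.ok.in_batches.update_all(enqueued_for_reindex: true)
  # an already-running indexer will pick the flagged URLs up on its own
  sd.index_urls if sd.activity == 'idle'
  flagged
end
```

The return value is whatever update_all reports, which in the session above is the number of URLs flagged.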
- Add your CSV list of URLs to promote to the /home/search/crawls/ directory.
- In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/promotion_list.csv]
- Add your CSV list of URLs to demote to the /home/search/crawls/ directory.
- In the /home/search/usasearch/current directory, run:
$ bundle exec rake searchgov:promote[/home/search/crawls/demotion_list.csv,false]
Note: Super admins can now add domains via the Search.gov Domains page. Newly added domains will automatically have their sitemaps indexed if the sitemaps are listed in robots.txt or located at whatever.gov/sitemap.xml or whatever.gov/sitemap_index.xml.
> searchgov_domain = SearchgovDomain.find_by(domain: 'www.foo.gov')
> searchgov_domain.index_sitemaps
SearchgovDomain.where('status = "200 OK"').each do |sd|
  SitemapIndexerJob.perform_later(searchgov_domain: sd)
end
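To see which sitemap locations a domain exposes through the conventions named in the note above, the following minimal sketch may help. `candidate_sitemap_urls` is a hypothetical helper, and the application's own discovery logic may differ; it assumes you already have the contents of robots.txt as a string.

```ruby
# Sketch: collect the sitemap URLs advertised in robots.txt, followed by
# the two conventional locations on the domain itself.
def candidate_sitemap_urls(domain, robots_txt, scheme: 'https')
  # robots.txt advertises sitemaps with "Sitemap: <url>" lines (case-insensitive)
  listed = robots_txt.scan(/^\s*sitemap:\s*(\S+)/i).flatten
  defaults = %w[/sitemap.xml /sitemap_index.xml].map do |path|
    "#{scheme}://#{domain}#{path}"
  end
  (listed + defaults).uniq
end
```

A sitemap that appears in neither place will not be picked up this way.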
Caveat: If a sitemap contains redirected URLs, the sitemap indexing job may create additional new SearchgovUrl records. It does not currently fetch those automatically, although that is planned. For now, you can follow up a sitemap indexing process by indexing any unfetched URLs:
For one domain:
SearchgovDomain.find_by(domain: 'foo.gov').index_urls
For any domains with unfetched urls:
SearchgovDomain.where('unfetched_urls_count > 0').each{|sd| sd.index_urls }
Sometimes we are asked to index a sitemap at a URL that is neither /sitemap.xml nor listed in robots.txt. We're discouraging sites from doing that, because it prevents us and other search engines from indexing those sitemaps automatically. However, super admins can now add those sitemap URLs via the Search.gov Domains page.
If for some reason you need to extract the URLs from a sitemap manually, you can index this way:
sitemap_url = 'https://whatever.gov/whatever_sitemap.xml'
# extract the sitemap entries
entries = Sitemaps.fetch(sitemap_url).entries ; nil
# create the SearchgovUrl records
entries.each do |entry|
  SearchgovUrl.find_or_create_by!(url: entry.loc.to_s).tap do |su|
    su.update(lastmod: entry.lastmod)
  end
  puts "Created or updated #{entry.loc}".green
rescue => e
  puts "Failed to create #{entry.loc}: #{e}".red
end
# trigger domain indexing
sd = SearchgovDomain.find_by(domain: 'whatever.gov')
sd.index_urls
For additional options, refer to the sitemaps gem documentation.
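For a quick look at what a sitemap contains without going through the sitemaps gem, the following minimal sketch parses sitemap XML that is already held in a string. It uses REXML rather than the gem shown above, `sitemap_entries` is a hypothetical helper, and it handles only plain `<urlset>` sitemaps, not sitemap index files.

```ruby
require 'rexml/document'
require 'time'

# Sketch: extract the <loc> and <lastmod> values from each <url> element.
def sitemap_entries(xml)
  root = REXML::Document.new(xml).root
  return [] if root.nil?

  # return the stripped text of the named child element, or nil if absent/blank
  text_of = lambda do |node, name|
    child = node.elements.find { |c| c.name == name }
    text = (child ? child.text : nil).to_s.strip
    text.empty? ? nil : text
  end

  entries = []
  root.elements.each do |node|
    next unless node.name == 'url'

    loc = text_of.call(node, 'loc')
    next if loc.nil? # skip entries without a location

    lastmod = text_of.call(node, 'lastmod')
    entries << { loc: loc, lastmod: lastmod ? Time.parse(lastmod) : nil }
  end
  entries
end
```

Each entry is a hash with `:loc` and `:lastmod` keys, mirroring the two fields used when creating SearchgovUrl records above.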
Domain deletion may involve the deletion of hundreds of thousands of SearchgovUrl records, as well as the corresponding indexed documents in Elasticsearch. Deletion of that data can take a very long time, so SearchgovDomain records should be deleted via a job:
> searchgov_domain = SearchgovDomain.find_by(domain: 'foo.gov')
> SearchgovDomainDestroyerJob.perform_later(searchgov_domain: searchgov_domain)
Need to verify what data we're extracting from a web page or document? You just need the URL and/or an HTML snippet:
> url = 'https://search.gov/blog/sitemaps.html'
> doc = HtmlDocument.new(document: URI.open(url).read, url: url)
> doc.title
"XML Sitemaps"
> doc.description
""
> puts doc.parsed_content
XML Sitemaps
An XML sitemap (External link) is an XML formatted file...
# other methods
> ls doc
WebDocument#methods: changed created document language metadata parsed_content url
RobotsTaggable#methods: noindex?
HtmlDocument#methods: description keywords redirect_url title
instance variables: @document @html @metadata @parsed_content @url
> url = 'foo'
> html = '<meta property="article:modified_time" content="2018-11-20 00:00:00" />'
> doc = HtmlDocument.new(document: html, url: url)
> doc.changed
2018-11-20 00:00:00 -0500
First, be sure you've downloaded Apache Tika and are running the Tika REST server:
$ tika-rest-server
> url = 'https://search.gov/pdf/2014-04-11-search-big-data.pdf'
> doc = ApplicationDocument.new(document: URI.open(url), url: url)
> doc.parsed_content
"Search Is the New Big Data Search Is..."
> doc.title
"Search Is the New Big Data"