Allow search engines to index archived website #96

nicolabingham · 2022-06-30T13:41:23Z

Please can we modify the public website robots.txt file so that this archived website: https://www.webarchive.org.uk/act/wayback/archive/20190313122106/http://www.europeandialogue.org/
can be indexed by search engines. This is a permission-cleared website which no longer exists on the live web, and the content owners would like it to be discoverable.
I will submit it to Google so that they can find it afterwards.

anjackson · 2022-06-30T13:58:35Z

Implemented; will roll out with the other updates to the website.

nicolabingham · 2022-06-30T14:12:43Z

Ah I'm so sorry, I've pasted the wrong URL into here, it should be https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/

anjackson · 2022-06-30T14:42:57Z

No worries, I realized that.

nicolabingham · 2022-07-14T17:18:40Z

Sorry, @anjackson can you help with the verification step for Google please?
I submitted the URL (https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/) to Google for indexing, but Google requires a verification step which I'm struggling to complete. Is it possible to download the code provided by Google and upload it to the website? Alternatively, could you complete one of the other verification methods? Thanks

anjackson · 2022-07-28T14:28:57Z

We should be able to use the Google Analytics option, but it doesn't like the way the analytics have been installed. I'm rolling a ukwa/ukwa-pywb:2.6.7.3 release with the analytics code in the <head> of the page.
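To illustrate the point about placement: the Analytics verification method expects the tracking snippet to appear inside the page's <head>. The following is a minimal sketch of a check for that, using only the Python standard library. The marker string `googletagmanager.com/gtag/js` and the helper name `analytics_in_head` are assumptions for illustration; the snippet actually deployed on the site may differ.

```python
from html.parser import HTMLParser


class HeadScriptFinder(HTMLParser):
    """Records whether a <script> containing `marker` appears inside <head>."""

    def __init__(self, marker: str):
        super().__init__()
        self.marker = marker
        self.in_head = False
        self.in_script = False
        self.found_in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "script":
            self.in_script = True
            # External snippets carry the marker in the src attribute.
            self._record(dict(attrs).get("src") or "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Inline snippets carry the marker in the script body.
        if self.in_script:
            self._record(data)

    def _record(self, text: str):
        if self.in_head and self.marker in text:
            self.found_in_head = True


def analytics_in_head(html: str, marker: str = "googletagmanager.com/gtag/js") -> bool:
    """Return True if a script mentioning `marker` is found within <head>.

    The default marker is an assumed value, not taken from the site.
    """
    finder = HeadScriptFinder(marker)
    finder.feed(html)
    finder.close()
    return finder.found_in_head
```

Running this against a saved copy of a rendered page would show whether the snippet ended up in the <head> or further down in the <body>.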

anjackson · 2022-07-28T16:32:54Z

Er, in trying to fix this, ended up doing the registration. Tried to give you access too!

anjackson · 2022-09-22T12:29:00Z

Is this all done now?

nicolabingham · 2022-09-30T15:10:22Z

No, sorry, it hasn't been indexed.

anjackson · 2023-01-13T10:53:40Z

That addition to robots.txt got lost when we switched over to the new site system. I'll look into deploying it.

anjackson · 2023-01-13T11:45:17Z

I used the Google robots.txt tester on the BETA version and at least that part is working.

anjackson · 2023-01-13T12:12:29Z

Okay, https://www.webarchive.org.uk/robots.txt is now updated. Looking it up in the search console the item is crawled but not indexed: https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=Ab4uuoLvcNyEFYNN6aDV3w&hl=en

"The page was crawled by Google but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling."

Not sure what this means. Perhaps it already picked up the change to robots.txt but hasn't got to indexing it yet. Worth re-trying in a day or two.

anjackson · 2023-02-22T08:40:28Z

Ah, I was missing some subtleties in the robots.txt tester (the test path has to start with a leading slash, e.g. /wayback/... rather than wayback/...), and having two separate Allow sections seems to confuse things. Updating to fix that and also allow sub-paths of this site to be indexed:

Allow: /wayback/archive/*/http://www.europeandialogue.org/*
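For anyone wanting to check a rule like this offline, here is a minimal sketch of wildcard matching as described in the Robots Exclusion Protocol (RFC 9309): `*` matches any run of characters, a trailing `$` anchors the end, the longest matching pattern wins, and Allow wins a tie. Python's built-in `urllib.robotparser` does not handle `*` wildcards, hence the hand-rolled matcher. The `Disallow: /wayback/` rule in the usage note is an assumption for illustration only; the site's real robots.txt is not shown in this thread.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal segments and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """True if `path` is covered by `pattern` (matching starts at the first character)."""
    return pattern_to_regex(pattern).match(path) is not None


def is_allowed(path: str, rules) -> bool:
    """Apply (directive, pattern) rules: longest match wins, Allow wins ties.

    A path that matches no rule is allowed by default.
    """
    best = None
    for directive, pattern in rules:
        if pattern and matches(pattern, path):
            key = (len(pattern), directive.lower() == "allow")
            if best is None or key > best:
                best = key
    return True if best is None else best[1]
```

With a hypothetical rule set of `[("Disallow", "/wayback/"), ("Allow", "/wayback/archive/*/http://www.europeandialogue.org/*")]`, a path such as `/wayback/archive/20190313122106/http://www.europeandialogue.org/` is allowed because the Allow pattern is the longer match. The same path without the leading slash matches neither rule, which is the kind of subtlety that can make a tester result misleading.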

anjackson · 2023-02-22T09:53:11Z

Okay, finally able to request indexing. Should hopefully turn up at https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=v-0hCfCrroprt3LcRIOgdw&alt_id=8syXgdFjnRsyHzfUNoQsfQ&hl=en before too long...

anjackson · 2023-03-01T13:46:01Z

Hmm, it's taking a while. Perhaps we need to encourage it by linking to it from somewhere?

anjackson · 2023-03-14T09:38:13Z

Tried blogging to encourage indexing (https://anjackson.net/2023/03/09/letting-search-engines-into-the-archive/), and it may help but it's not done much yet.

So, I'm looking at ensuring the ukwa-site sitemap gets indexed and adding a page intended for search engines that links to the specific sites we want indexed: ukwa/ukwa-site@dd457d5

nicolabingham · 2023-03-14T10:18:58Z

Thanks for pursuing this one. Fingers crossed it will get indexed.

anjackson · 2023-05-04T09:49:00Z

BTW, I've also started some changes that allow us to link such sites into the main site's sitemap. These are part of the changes on BETA, so when we're okay to move forward with that, we can see if that helps this issue.
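As a rough illustration of what linking an archived site into a sitemap involves, the sketch below builds a sitemap document in the standard sitemaps.org format with Python's standard library. It is not the ukwa-site implementation; the function name `build_sitemap` and the choice of URLs in the usage note are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> str:
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    # Serialise with the sitemap namespace as the default (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

For example, `build_sitemap(["https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/"])` would produce a single-entry sitemap pointing at the URL submitted earlier in this thread.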

anjackson · 2023-05-17T14:12:42Z

One unexpected outcome from the IIPC conference was Daniel from PWA telling me that we should very much NOT do this! Apparently, the PWA got blocked as a dangerous website by Chrome, because the clever URL mashing that PyWB does for playback sets off some kind of alarm when crawled by Google. These lists of bad sites get passed around, and it seems they only managed to get off the list because they have a good relationship with Malwarebytes.

We should perhaps consider whether the right approach is to resurrect the idea of each Target having a public web page on the site. We then allow that to be indexed, which lets people find a link to the website rather than the site itself.

Or, perhaps it is possible to offer a different version of the website to crawlers, which does not do fancy re-writing.

nicolabingham · 2023-05-17T14:16:09Z

Ah crikey! Good that you found this out from Daniel. I would favour the approach of each Target having a public web page on the site, if that's possible.
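The per-Target page idea could look something like the sketch below: a plain, indexable HTML page that describes the site and links to the archived copy, so that search engines index the description rather than the rewritten playback pages. The function name, the field names, and the page layout are all hypothetical; nothing here reflects how ukwa-site actually renders pages.

```python
from html import escape


def render_target_page(title: str, original_url: str, archive_url: str) -> str:
    """Render a minimal landing page for one archived Target (hypothetical layout)."""
    safe_title = escape(title)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        f"<title>{safe_title} - archived website</title>\n"
        "</head>\n"
        "<body>\n"
        f"<h1>{safe_title}</h1>\n"
        f"<p>Original address: {escape(original_url)}</p>\n"
        # quote=True also escapes quotes so the value is safe inside href="...".
        f'<p><a href="{escape(archive_url, quote=True)}">View the archived copy</a></p>\n'
        "</body>\n"
        "</html>\n"
    )
```

Called with the domain as a placeholder title, `http://www.europeandialogue.org/` as the original address, and the wayback URL from this thread as the archive link, this yields a static page that robots.txt can allow without exposing the playback URLs themselves to crawlers.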

nicolabingham assigned anjackson and nicolabingham Jun 30, 2022

anjackson added a commit that referenced this issue Jun 30, 2022

Allow an archived site to be indexed by search engines, see #96.

e39facb

anjackson added a commit to ukwa/ukwa-site that referenced this issue Jan 13, 2023

Add robots allow for a single site

3bbd506

As per ukwa/ukwa-services#96

anjackson added a commit that referenced this issue Jan 13, 2023

Update to latest PyWB and robots.txt changes for #96.

6795392

anjackson added a commit to ukwa/ukwa-site that referenced this issue Feb 22, 2023

Allow whole site indexing, for ukwa/ukwa-services#96

f7f64a4

anjackson added a commit to ukwa/ukwa-site that referenced this issue May 25, 2023

Removing Allow from ukwa/ukwa-services#96 due to concerns about block…

0358640

…ing.
