Allow search engines to index archived website #96

Open
nicolabingham opened this issue Jun 30, 2022 · 19 comments

@nicolabingham

Please can we modify the public website robots.txt file so that this archived website can be indexed by search engines: https://www.webarchive.org.uk/act/wayback/archive/20190313122106/http://www.europeandialogue.org/
This is a permission-cleared website which no longer exists on the live web, and the content owners would like it to be discoverable.
I will submit it to Google afterwards so that they can find it.

@anjackson
Contributor

Implemented, will roll out with the other updates to the website.

@nicolabingham
Author

Ah I'm so sorry, I've pasted the wrong URL into here, it should be https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/

@anjackson
Contributor

No worries, I realized that.

@nicolabingham
Author

Sorry, @anjackson can you help with the verification step for Google please?
I submitted the URL (https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/) to Google for indexing, but Google requires a verification step which I'm struggling to complete. Is it possible to download the code provided by Google and upload it to the website? Alternatively, could you complete one of the other verification methods? Thanks

@anjackson
Contributor

We should be able to use the Google Analytics option, but it doesn't like the way the analytics have been installed. I'm rolling a ukwa/ukwa-pywb:2.6.7.3 release with the analytics code in the <head> of the page.

@anjackson
Contributor

Er, in trying to fix this, ended up doing the registration. Tried to give you access too!

@anjackson
Contributor

Is this all done now?

@nicolabingham
Author

No, sorry, it hasn't been indexed.

anjackson added a commit to ukwa/ukwa-site that referenced this issue Jan 13, 2023
@anjackson
Contributor

That addition to robots.txt got lost when we switched over to the new site system. I'll look into deploying it.

@anjackson
Contributor

anjackson commented Jan 13, 2023

I used the Google robots.txt tester on the BETA version, and at least that part is working.

@anjackson
Contributor

anjackson commented Jan 13, 2023

Okay, https://www.webarchive.org.uk/robots.txt is now updated. Looking it up in the search console the item is crawled but not indexed: https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=Ab4uuoLvcNyEFYNN6aDV3w&hl=en

The page was crawled by Google but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.

Not sure what this means. Perhaps it already picked up the change to robots.txt but hasn't got to indexing it yet. Worth re-trying in a day or two.

@anjackson
Contributor

anjackson commented Feb 22, 2023

Ah, I was missing some subtleties in the robots.txt tester (the test path has to start with a slash, e.g. /wayback/... rather than wayback/...), and having two separate Allow sections seems to confuse things. Updating to fix that and also allow sub-paths of this site to be indexed:

Allow: /wayback/archive/*/http://www.europeandialogue.org/*
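To illustrate the leading-slash subtlety, here is a minimal sketch of wildcard matching for a robots.txt path rule. It assumes the conventions Google documents for its own crawler (`*` matches any run of characters, a trailing `$` anchors the end, and the rule is otherwise a prefix match). The function name `rule_matches` and the `about/index.html` sub-path are made up for the example; this is not the tester's actual implementation.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the path, and otherwise the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, giving prefix semantics.
    return re.match(regex, path) is not None


RULE = "/wayback/archive/*/http://www.europeandialogue.org/*"

# Path with the leading slash, as the tester expects it.
with_slash = "/wayback/archive/20190313122106/http://www.europeandialogue.org/"
# Same path without the leading slash: does not match the rule.
without_slash = with_slash.lstrip("/")
# Hypothetical sub-path of the archived site.
sub_path = with_slash + "about/index.html"

for candidate in (with_slash, without_slash, sub_path):
    print(rule_matches(RULE, candidate), candidate)
```

Under these assumptions, only the paths that start with `/wayback/...` match the rule, and sub-paths of the archived site are covered by the trailing `*`.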

anjackson added a commit to ukwa/ukwa-site that referenced this issue Feb 22, 2023
@anjackson
Contributor

Okay, finally able to request indexing. Should hopefully turn up at https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=v-0hCfCrroprt3LcRIOgdw&alt_id=8syXgdFjnRsyHzfUNoQsfQ&hl=en before too long...

@anjackson
Contributor

Hmm, it's taking a while. Perhaps we need to encourage it by linking to it from somewhere?

@anjackson
Contributor

Tried blogging to encourage indexing (https://anjackson.net/2023/03/09/letting-search-engines-into-the-archive/), and it may help but it's not done much yet.

So, I'm looking at ensuring the ukwa-site sitemap gets indexed and adding a page intended for search engines that links to the specific sites we want indexed: ukwa/ukwa-site@dd457d5

@nicolabingham
Author

Thanks for pursuing this one. Fingers crossed it will get indexed.

@anjackson
Contributor

BTW, I've also started some changes that allow us to link such sites into the main site's sitemap. These are part of the changes on BETA, so when we're okay to move forward with that, we can see if that helps this issue.
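As a rough illustration of what linking an archived site into a sitemap could look like, here is a sketch that builds a sitemaps.org-format document with the Python standard library. How ukwa-site actually generates its sitemap is not shown in this thread, so the `build_sitemap` helper is hypothetical; the URL is the one submitted to Google above.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    # Serialise with the sitemap namespace as the default (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Archived sites we would like search engines to discover.
archived_sites = [
    "https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/",
]

sitemap_xml = build_sitemap(archived_sites)
print(sitemap_xml)
```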

@anjackson
Contributor

One unexpected outcome from the IIPC conference was Daniel from PWA telling me that we should very much NOT do this! Apparently, the PWA got blocked as a dangerous website by Chrome, because the clever URL mashing that PyWB does for playback sets off some kind of alarm when crawled by Google. These lists of bad sites get passed around, and it seems they only managed to get off the list because they have a good relationship with Malwarebytes.

We should perhaps consider whether the right approach is to resurrect the idea of each Target having a public web page on the site. We then allow that to be indexed, which lets people find a link to the website rather than the site itself.

Or, perhaps it is possible to offer a different version of the website to crawlers, which does not do fancy re-writing.

@nicolabingham
Author

Ah crikey! Good that you found this out from Daniel. I would favour the approach of each Target having a public web page on the site, if that's possible.

anjackson added a commit to ukwa/ukwa-site that referenced this issue May 25, 2023