Allow search engines to index archived website #96
Comments
Implemented; this will roll out with the other updates to the website. |
Ah I'm so sorry, I've pasted the wrong URL into here, it should be https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/ |
No worries I realized that. |
Sorry, @anjackson can you help with the verification step for Google please? |
We should be able to use the Google Analytics option, but it doesn't like the way the analytics have been installed. I'm rolling a |
Er, in trying to fix this, I ended up doing the registration myself. Tried to give you access too! |
Is this all done now? |
No, sorry, it hasn't been indexed. |
That addition to robots.txt got lost when we switched over to the new site system. I'll look into deploying it. |
I used the Google robots.txt tester on the BETA version, and at least that part is working. |
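For anyone wanting to repeat this kind of check locally, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` are hypothetical (the actual robots.txt content isn't quoted in this thread); only the archived URL is taken from the original request. Note that Python's parser applies the first matching rule and does not support wildcards, whereas Google uses the most specific match, so this is a rough approximation of Google's tester rather than a substitute for it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real robots.txt.
# The Allow line comes first because Python's parser uses first-match order.
ROBOTS_TXT = """\
User-agent: *
Allow: /act/wayback/archive/20190313122106/http://www.europeandialogue.org/
Disallow: /act/wayback/
"""

# Archived URL from the original request in this issue.
ARCHIVED_URL = (
    "https://www.webarchive.org.uk/act/wayback/archive/"
    "20190313122106/http://www.europeandialogue.org/"
)
# Hypothetical URL for some other archived site that should stay blocked.
OTHER_URL = (
    "https://www.webarchive.org.uk/act/wayback/archive/"
    "20190313122106/http://www.example.org/"
)

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("Googlebot", ARCHIVED_URL)
blocked = not parser.can_fetch("Googlebot", OTHER_URL)

print("archived site crawlable:", allowed)
print("other archived sites blocked:", blocked)
```

To test the deployed file instead, `parser.set_url("https://www.webarchive.org.uk/robots.txt")` followed by `parser.read()` would fetch the live rules, with the same caveat about matching semantics.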
Okay, https://www.webarchive.org.uk/robots.txt is now updated. Looking it up in the Search Console, the item is crawled but not indexed: https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=Ab4uuoLvcNyEFYNN6aDV3w&hl=en
Not sure what this means. Perhaps it already picked up the change to robots.txt but hasn't got to indexing it yet. Worth re-trying in a day or two. |
Ah, I was missing some subtleties in the robots.txt tester (not starting the test path at
|
Okay, finally able to request indexing. Should hopefully turn up at https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=v-0hCfCrroprt3LcRIOgdw&alt_id=8syXgdFjnRsyHzfUNoQsfQ&hl=en before too long... |
Hmm, it's taking a while. Perhaps we need to encourage it by linking to it from somewhere? |
Tried blogging to encourage indexing (https://anjackson.net/2023/03/09/letting-search-engines-into-the-archive/), and it may help but it's not done much yet. So, I'm looking at ensuring the |
Thanks for pursuing this one. Fingers crossed it will get indexed. |
BTW, I've also started some changes that allow us to link such sites into the main site's sitemap. These are part of the changes on BETA, so when we're okay to move forward with that, we can see if that helps this issue. |
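As a rough illustration of the sitemap idea, the sketch below builds a minimal sitemap document containing the archived URL, using the standard sitemaps.org namespace. The `build_sitemap` helper is hypothetical and is not how the BETA site actually generates its sitemap; it only shows the shape of the entry a crawler would see.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL.

    Hypothetical helper for illustration; not the site's real implementation.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    "https://www.webarchive.org.uk/act/wayback/archive/"
    "20190313122106/http://www.europeandialogue.org/",
])
print(sitemap_xml)
```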
One unexpected outcome from the IIPC conference was Daniel from PWA telling me that we should very much NOT do this! Apparently, the PWA got blocked as a dangerous website by Chrome, because the clever URL mashing that PyWB does for playback sets off some kind of alarm when crawled by Google. These lists of bad sites get passed around, and it seems they only managed to get off the list because they have a good relationship with Malwarebytes. We should perhaps consider whether the right approach is to resurrect the idea of each Target having a public web page on the site. We then allow that to be indexed, which lets people find a link to the website rather than the site itself. Or, perhaps it is possible to offer a different version of the website to crawlers, which does not do fancy re-writing. |
Ah crikey! Good that you found this out from Daniel. I would favour the approach of each Target having a public web page on the site, if that's possible. |
Please can we modify the public website robots.txt file so that this archived website: https://www.webarchive.org.uk/act/wayback/archive/20190313122106/http://www.europeandialogue.org/
can be indexed by search engines. This is a permission-cleared website which no longer exists on the live web, and the content owners would like it to be discoverable.
I will submit it to Google so that they can find it afterwards.