Globally added X-Robots-Tag: noindex HTTP response header #324
Conversation
Thanks for submitting a PR.
Make sure you have looked at CONTRIBUTING guidelines.
E2E Test Results
DACCS-iac Pipeline Results: Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1524/ (Result: success)
BIRDHOUSE_DEPLOY_BRANCH: x-robots-tag, DACCS_CONFIGS_BRANCH: master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master, PAVICS_SDI_BRANCH: master, DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-118.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results: Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1105/ (NOTEBOOK TEST RESULTS)
To make this non-breaking for everyone, it should be defined as an optional-component. In this component, an nginx conf would be added and mounted in the proxy docker-compose service, such that it can be loaded by the include /etc/nginx/conf.d/*.conf; step. The instances that want to make use of this component will have to add it explicitly in EXTRA_CONF_DIRS of their env.local file.
Before working further on this PR, let's first see if anyone actually wishes to be indexed by public search engines.
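As a sketch of what such an optional component might contain (the file path and name below are hypothetical, not taken from this PR), a small nginx snippet dropped into the proxy's conf.d could set the header for every response:

```nginx
# Hypothetical file: optional-components/x-robots-tag-header/conf.d/x-robots-tag.conf
# Loaded by the proxy's `include /etc/nginx/conf.d/*.conf;` step, so it applies
# in the http context, i.e. to every response the proxy serves.
add_header X-Robots-Tag "noindex, nofollow";
```

Instances opting in would then list the component directory in EXTRA_CONF_DIRS of their env.local, as described above.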
Not sure I understand 100%, but I don't see why we'd want the staging and dev instances to be indexed.
For staging and dev instances, I agree that indexing is most probably irrelevant. However, this definition can also be used for prod. So we must allow toggling the option in case indexing should be permitted for some use case.
I think that indexing is a problem even in prod.
Maybe in the case of CRIM's instances even prod indexing is not relevant because of our use cases, but other organizations could be taking advantage of it. For example, the https://pavics.ouranos.ca/ entrypoint has a lot of documentation for which indexing could facilitate search/matches. It could eventually be the same for U-Toronto, which intends to swap the Jupyter Notebook interface for their own. For this reason, it should be an "opt-in" feature.
I think for Ouranos, we want the homepage of PAVICS to be indexed so we can be discovered. Not sure I want all the data in Thredds and Geoserver to be indexed though. And no indexing for staging and test servers, of course. Is there a way to exclude some paths from being indexed? @huard @tlogan2000 Any problem with having our data on Thredds and Geoserver indexed by Google?
Oh geez! In which log files do we see this, and what kind of log pattern? I'd like to check if we've been crawled too.
I think I would prefer only the homepage, as you suggest, but have no real idea how much trouble it is to avoid indexing on geoserver/thredds. Any ballpark estimate of the amount of effort involved?
I got the URL list from a report in the Google Search Console. If you want to look in the logs, you can always filter for specific user-agents, such as "Googlebot". Is the PAVICS homepage deployed using birdhouse-deploy too? If that's the case, then the nginx design makes things easy:
So, for example, in the location block of the homepage, all we have to do is to add
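The directive elided above is presumably an add_header call; as a sketch (the paths below are hypothetical, not the actual PAVICS config), per-location control of indexing could look like:

```nginx
# Sketch only: allow indexing of the homepage, deny it for a data service.
location / {
    # Homepage: let crawlers index the landing page.
    add_header X-Robots-Tag "all";
}

location /thredds/ {
    # Data service: keep crawlers out.
    add_header X-Robots-Tag "noindex, nofollow";
}
```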
Isn't there support for a
Note the existence of https://developers.google.com/search/docs/appearance/structured-data/dataset. So one option would be to turn off THREDDS indexing, but add formatted metadata information to our landing page describing the datasets we want to be indexed.
There's a lot of overlap between the JSON-LD Dataset definition and the STAC definitions.
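A minimal schema.org Dataset snippet embedded in the landing page, per the Google documentation linked above, might look like this (all values below are illustrative, not real PAVICS metadata):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example climate dataset (illustrative)",
  "description": "Daily temperature data served by the platform.",
  "url": "https://pavics.ouranos.ca/"
}
</script>
```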
The homepage is immediately at the root, so I think overriding the root location is required. However,
this means we should not forget to add this header for each new location directive we add to the proxy. Not so ideal, but not sure what the other solutions are, since the homepage is at the root. Another solution is to move the homepage off the root. In all cases, I think this X-Robots-Tag should probably be optional and opt-in, unless we all agree that allowing crawling of Thredds and Geoserver is a bad idea.
Unfortunately, I don't know how the homepage is configured, but even if it's at the root, it should be within a location block.
Root location block here: birdhouse-deploy/birdhouse/config/proxy/conf.d/all-services.include.template, lines 1 to 3 in 1981a1d
Default value when homepage not activated: birdhouse-deploy/birdhouse/config/proxy/default.env, lines 9 to 11 in 1981a1d
Value when homepage activated: birdhouse-deploy/birdhouse/env.local.example, line 219 in 1981a1d
But given this is the root location directive, doesn't it recursively set the value for all the sub-location directives? It's possible I misunderstood how multiple location directives work together.
I added the X-Robots-Tag header in the http block, which is parent to all location blocks, so if any location block has its own add_header directive, it will cancel the one from the http block.
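This relies on documented nginx behavior: add_header directives are inherited from the enclosing level only if a block defines none of its own. A minimal sketch of that inheritance (paths and values illustrative):

```nginx
http {
    # Inherited by every location that declares no add_header itself.
    add_header X-Robots-Tag "noindex, nofollow";

    server {
        location /homepage/ {
            # Declaring ANY add_header here cancels all inherited add_header
            # directives, including the X-Robots-Tag from the http block.
            add_header Cache-Control "no-cache";
        }
    }
}
```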
That seems to work! @perronld Please document this opt-out for the homepage and describe this change in the CHANGES.md file (usually some form of copy/paste of the PR description). @mishaschwartz @eyvorchuk any concern from the UofT and PCIC side about globally denying search engine indexing of PAVICS instances? Note an opt-out exists for the homepage. Before:
After:
@fmigneault does this global X-Robots-Tag impact the later implementation of #325? We want to advertise data on Thredds to various web crawlers/indexers?
Each location block defining its own add_header cancels the inherited ones.
Not sure why the example before, in the case of the homepage, needs a different approach.
I don't really like the solution of injecting a random header to cancel another one. It seems like a very obscure workaround that most won't understand intuitively. It seems like it would be easier to only add
What if we used the default PROXY_ROOT_LOCATION with X-Robots-Tag: all instead?
The all value is documented here: https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives
That makes more sense, since we will be working with the same X-Robots-Tag header.
Should we be extra sure and add all the "deny" values there?
I think
@perronld @huard @tlvu @mishaschwartz @perronld
EXTRA_CONF_DIRS='
<...>
optional-components/x-robots-tags-header
'
E2E Test Results
DACCS-iac Pipeline Results: Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1998/ (Result: failure)
BIRDHOUSE_DEPLOY_BRANCH: x-robots-tag, DACCS_CONFIGS_BRANCH: master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master, PAVICS_SDI_BRANCH: master, DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-67.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results: Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1279/ (NOTEBOOK TEST RESULTS)
E2E Test Results
DACCS-iac Pipeline Results: Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1999/ (Result: failure)
BIRDHOUSE_DEPLOY_BRANCH: x-robots-tag, DACCS_CONFIGS_BRANCH: master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master, PAVICS_SDI_BRANCH: master, DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-166.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results: Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1280/ (NOTEBOOK TEST RESULTS)
E2E Test Results
DACCS-iac Pipeline Results: Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2000/ (Result: failure)
BIRDHOUSE_DEPLOY_BRANCH: x-robots-tag, DACCS_CONFIGS_BRANCH: master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master, PAVICS_SDI_BRANCH: master, DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-149.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results: Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1281/ (NOTEBOOK TEST RESULTS)
E2E Test Results
DACCS-iac Pipeline Results: Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2003/ (Result: failure)
BIRDHOUSE_DEPLOY_BRANCH: x-robots-tag, DACCS_CONFIGS_BRANCH: master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master, PAVICS_SDI_BRANCH: master, DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-133.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results: Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1282/ (NOTEBOOK TEST RESULTS)
I really like this idea. I just have one question about the way it's set up:
Is there a reason why this needs to be configured in an optional component? Why not just make it an environment variable in the proxy component? That seems like it would be simpler, since this setting only changes a single line in the proxy configuration.
Or are you thinking that this could be expanded later to update additional header-related settings that are not specifically for the proxy?
Using only an environment variable would work as well:
export X_ROBOTS_TAGS='add_header X-Robots-Tags "noindex, nofollow";'
I felt it was less error-prone to abstract away the need for users to write the exact nginx syntax. For now, the component only overrides
+1 for me as well. Please document the opt-out procedure in the optional-components/README.rst.
For example, to allow crawling for the homepage, we would do this in env.local:
export PROXY_ROOT_LOCATION="alias /data/homepage/; add_header X-Robots-Tags: all;"
E2E Test Results
DACCS-iac Pipeline Results: Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2010/ (Result: failure)
BIRDHOUSE_DEPLOY_BRANCH: x-robots-tag, DACCS_CONFIGS_BRANCH: master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master, PAVICS_SDI_BRANCH: master, DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-46.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results: Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1287/ (NOTEBOOK TEST RESULTS)
E2E Test Results
DACCS-iac Pipeline Results: Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2011/ (Result: failure)
BIRDHOUSE_DEPLOY_BRANCH: x-robots-tag, DACCS_CONFIGS_BRANCH: master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master, PAVICS_SDI_BRANCH: master, DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-46.rdext.crim.ca
Infrastructure deployment failed. Instance has not been destroyed. @matprov
## Overview
Please include a summary of the changes and which issues are fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
## Changes
**Non-breaking changes**
- Fix incorrect tag ``X-Robots-Tags`` header to appropriate ``X-Robots-Tag`` (no final ``s``) name for ``optional-components/x-robots-tags-header``.
**Breaking changes**
- n/a
## Related Issue / Discussion
- Relates to #324
Overview
Search engines currently index dev/staging installations of Birdhouse. This may create SEO issues for organizations.
Changes
Add optional-components/x-robots-tags-header and the X_ROBOTS_TAGS_HEADER variable to allow setting the desired header value server-wide.
Breaking changes