Avoid excess load of bots going into search facet links on entity pages #2709
Thanks @bram-atmire! I can imagine this is a huge load (crawling search and browse as well) and an obvious win for bots that respect robots.txt. I'm wondering whether Google's interpretation of the Robots Exclusion Protocol supports wildcards like this after path elements. It seems it might? Have you tried it on a live site? As a sysadmin I'd block these patterns in Apache / nginx just to be sure; as the Russian saying goes: "trust, but verify". Side note: we have several patterns with trailing wildcards that will be ignored by Googlebot.
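To illustrate the "block these patterns in Apache / nginx just to be sure" idea, here is a minimal nginx sketch. It is not from the issue; the bot names, variable names, and regexes are illustrative assumptions, and the rule denies entity-page requests carrying a facet query parameter only for matched user agents:

```nginx
# Hypothetical sketch: deny facet-query requests on entity pages to known
# crawlers at the web-server layer, complementing the robots.txt rule.
# The map must live at the http{} level; bot patterns are illustrative.
map $http_user_agent $facet_bot {
    default      0;
    ~*googlebot  1;
    ~*bingbot    1;
}

server {
    location /entities/ {
        set $deny $facet_bot;
        # Only deny when the query string carries a facet parameter (f.*)
        if ($args !~ "(^|&)f\.") {
            set $deny 0;
        }
        if ($deny = 1) {
            return 403;
        }
        # ... usual proxy_pass to the UI ...
    }
}
```

The two sequential `if` blocks emulate an AND condition, since nginx does not support nested or compound `if` expressions.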
@alanorth As far as I can see, as long as the wildcard isn't trailing, it shouldn't be ignored. The change in this ticket came out of an email exchange with a representative from Google Scholar. One site where we have it in production: https://repository.upenn.edu/robots.txt
Wouldn't it be useful to (additionally) add the
@hutattedonmyarm if we use
@alanorth Not the whole page, I was only talking about the links in
@hutattedonmyarm oh yes, I was confusing the
Describe the bug
We're seeing in Search Console for several of our clients that bots follow facet links on entity pages. Since this doesn't contribute to indexing quality (bots shouldn't be going there) and processing these requests is resource-intensive, we'd better avoid this behaviour altogether.
To Reproduce
Steps to reproduce the behavior:
1. Visit an entity page with a facet query, e.g. `entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang`
Expected behavior
Robots should be blocked from crawling these facet links.
Proposed solution
Add the following disallow directive to robots.txt:

```
Disallow: /entities/*?f
```
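As a sanity check of the proposed pattern, here is a small Python sketch of the wildcard matching described in Google's robots.txt documentation (`*` matches any run of characters, `$` anchors the end of the URL). Note this is a hand-rolled illustration, not Python's `urllib.robotparser`, which does not implement `*` wildcards:

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt path pattern using
    Google-style wildcard semantics: '*' matches any character run,
    '$' anchors the end; everything else matches literally."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    # Robots patterns match from the start of the path
    return re.match(regex, path) is not None

# The faceted entity URL from the bug report is matched (blocked)...
print(robots_pattern_matches(
    "/entities/*?f",
    "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang"))  # True
# ...while the plain entity page is not, so it stays crawlable.
print(robots_pattern_matches(
    "/entities/*?f",
    "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e"))  # False
```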
Related work
Previously incorrectly created in the back-end Git repo as DSpace/DSpace#9227