Avoid excess load of bots going into search facet links on entity pages #2709

bram-atmire · 2023-12-12T10:14:12Z

Describe the bug
In Search Console, we're seeing for several of our clients that bots follow facet links on entity pages. Since this doesn't contribute to the quality of the indexing (i.e. bots shouldn't be going there) and processing these requests is resource-intensive, we had better avoid this behaviour altogether.

To Reproduce
Steps to reproduce the behavior:

Look at Search Console for an actively indexed DSpace 7 site that has entities enabled
Look in the reports of crawled URLs for patterns like:

entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang

Expected behavior
Robots should be blocked from crawling these facet links.

Proposed solution
Add the following Disallow directive to robots.txt:

Disallow: /entities/*?f
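
To illustrate which URLs this directive is meant to catch, here is a minimal sketch of wildcard matching as described in RFC 9309 (`*` matches any run of characters, a trailing `$` anchors the end of the URL). The helper names `robotsPatternToRegExp` and `isDisallowed` are hypothetical and not part of DSpace, and real crawlers may differ in details, so treat this as an approximation rather than how any particular bot behaves. It also assumes paths are matched from the site root with a leading slash.

```typescript
// Hypothetical helper (not part of DSpace): turns a robots.txt path pattern
// into a RegExp, assuming RFC 9309 style wildcards:
//   "*" matches any sequence of characters
//   a trailing "$" anchors the match at the end of the URL
function robotsPatternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    // Escape regex metacharacters so "?" and "." are treated literally.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  // Patterns always match from the start of the path.
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// Returns true when the path (including query string) matches the pattern.
function isDisallowed(pattern: string, pathAndQuery: string): boolean {
  return robotsPatternToRegExp(pattern).test(pathAndQuery);
}

const rule = "/entities/*?f";
const entityPage = "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e";

console.log(isDisallowed(rule, entityPage + "?f.author=Wang")); // true
console.log(isDisallowed(rule, entityPage)); // false
console.log(isDisallowed(rule, entityPage + "?page=2&f.author=Wang")); // false
```

Under these assumptions the plain entity page stays crawlable while the facet link is blocked. The last call shows a limitation of the pattern: it only matches when the query string starts with `f`, so a URL where another parameter precedes the facet parameter would not be covered.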

Related work

Previously incorrectly created in the back-end Git repo as DSpace/DSpace#9227

alanorth · 2024-01-12T05:56:55Z

Thanks @bram-atmire! I can imagine this is a huge load (like crawling search and browse as well) and an obvious win for bots that respect robots.txt. I'm wondering if Google's interpretation of the robot exclusion protocol supports wildcards such as this after path elements. It seems maybe? Have you tried it on a live site?

As a sysadmin I'd block these patterns in Apache / nginx just to be sure—as the Russian saying goes: "trust, but verify".

Side note: we have several patterns with trailing wildcards that will be ignored by Googlebot.

bram-atmire · 2024-02-05T11:01:12Z

@alanorth As far as I can see, as long as the wildcard isn't trailing, it shouldn't be ignored.

The change in this ticket came up in an email dialogue with a representative from Google Scholar.

One site where we have it in prod: https://repository.upenn.edu/robots.txt

hutattedonmyarm · 2024-03-05T12:44:18Z

Wouldn't it be useful to (additionally) add the rel="nofollow" attribute to the anchor tags in the search filters? This way we don't have to rely on how wildcards are handled by crawlers.

alanorth · 2024-03-06T05:36:34Z

@hutattedonmyarm if we use rel="nofollow" on search pages it would be a sign for bots to not crawl them, but they still have to load the page to read the anchor tags. In theory the robots.txt method should be better because bots can read it before requesting any page.

hutattedonmyarm · 2024-03-06T06:42:15Z

@alanorth Not the whole page, I was only talking about the links in search-filters.component. So the checkboxes which check/uncheck all the filters in the search results sidebar. These are implemented as links. Currently, crawlers follow them, because they're part of an entities page. But they only lead to search results

alanorth · 2024-03-07T06:12:39Z

@hutattedonmyarm oh yes, I was confusing the rel=nofollow with other robot instructions in head meta tags. I think you are right that we should make those links rel=nofollow.

bram-atmire added the bug and needs triage labels Dec 12, 2023

dspace-bot added this to DSpace Backlog Dec 12, 2023

github-project-automation bot moved this to 🆕 Triage in DSpace Backlog Dec 12, 2023

bram-atmire mentioned this issue Dec 12, 2023

Avoid excess load of bots going into search facet links on entity pages DSpace/DSpace#9227

Closed

bram-atmire added a commit that referenced this issue Dec 12, 2023

Update robots.txt.ejs

fbd3529

Fix for issue #2709

bram-atmire mentioned this issue Dec 12, 2023

Update robots.txt.ejs #2710

Merged

tdonohue removed this from DSpace Backlog Dec 12, 2023

tdonohue added this to DSpace 8.x and 7.6.x Maintenance Dec 12, 2023

github-project-automation bot moved this to 📋 To Do in DSpace 8.x and 7.6.x Maintenance Dec 12, 2023

tdonohue moved this from 📋 To Do to 🏗 In Progress in DSpace 8.x and 7.6.x Maintenance Dec 12, 2023

tdonohue assigned bram-atmire Dec 12, 2023

tdonohue added the component: SEO label and removed the needs triage label Dec 12, 2023

tdonohue added this to the 7.6.2 milestone Dec 12, 2023

tdonohue closed this as completed in #2710 Apr 29, 2024

github-project-automation bot moved this from 🏗 In Progress to ✅ Done in DSpace 8.x and 7.6.x Maintenance Apr 29, 2024

github-actions bot pushed a commit that referenced this issue Apr 29, 2024

Update robots.txt.ejs

6447c37

Fix for issue #2709 (cherry picked from commit fbd3529)

tdonohue added the performance / caching label Apr 29, 2024
