
Extend bot rules #48

Open · wants to merge 1 commit into master
Conversation

@jocel1 commented Dec 4, 2022

  • simplify / use more generic bot rules
  • add extra bots (ia_archiver, gtmetrix, lighthouse)

@jocel1 changed the title from "extend bot rules" to "Extend bot rules" on Dec 4, 2022
@dridi (Member) commented Dec 5, 2022

Bonjour @jocel1,

Since your change is doing two distinct things, I would rather see two commits. There's also no explanation or justification for why we should generalize certain rules. Not being a historical maintainer of this project, I can't tell why choices were made and whether it's a good idea to challenge them.

One thing you could do, for example, is share a list of user agents to add test coverage, so we can make sure we don't break previous expectations.
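Such coverage could be sketched as a simple table of user agents and expected classifications. The patterns and cases below are illustrative only, not the project's actual rule set:

```python
import re

# Hypothetical rule table (illustrative, not the project's real rules).
RULES = [
    ("bot", re.compile(r"(?i)bot")),
    ("crawler", re.compile(r"(?i)(web)?crawler")),
]

# Hypothetical coverage cases: (user agent, expected category or None).
CASES = [
    ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "bot"),
    ("Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)", "bot"),
    ("Mozilla/5.0 (Windows NT 10.0) Firefox/108.0", None),
]

def classify(ua):
    # Return the first rule name whose pattern matches the user agent.
    for name, rx in RULES:
        if rx.search(ua):
            return name
    return None

for ua, expected in CASES:
    assert classify(ua) == expected, (ua, classify(ua), expected)
```

A table like this would pin down the current behavior before any rule is generalized, so regressions show up as failing cases.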

@jocel1 (Author) commented Dec 5, 2022

Hi @dridi!

For the first one: (?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot is strictly equivalent to (?i)bot, since the trailing "|" adds an empty alternative that lets the group match nothing.
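The equivalence is easy to check mechanically; here is a small Python sketch (the sample user agents are illustrative, not taken from the project's test suite):

```python
import re

# The trailing "|" in the alternation adds an empty alternative, so the
# prefix group can match the empty string and the whole pattern degenerates
# to plain "bot".
specific = re.compile(r"(?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot")
generic = re.compile(r"(?i)bot")

agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "TelegramBot (like TwitterBot)",
    "Mediapartners-Google",
    "Mozilla/5.0 (Windows NT 10.0) Firefox/108.0",
]

# Both patterns agree on every input: anything containing "bot" matches both,
# anything without it matches neither.
for ua in agents:
    assert bool(specific.search(ua)) == bool(generic.search(ua))
```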

The main reason to add "google" is to cover the Google AdSense user agent, Mediapartners-Google. I also checked that Google Pixel phones don't have "google" in their user agent, but we could perhaps match just that one instead.

For spider, I often discover new bots like Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/), Mozilla/5.0 (compatible; seoscanners.net/1; [email protected]) or CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html), so having a generic "spider" rule was easier, and it seems as safe as "bot".

ia_archiver is a common bot: https://user-agents.net/string/ia-archiver

I also changed facebook to match
user-agent: facebookcatalog/1.0

For the last one: (?i)(web)crawler reads as if (?i)(web)?crawler was intended, to match for example:

user-agent: Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
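The difference between the two forms can be checked against that user agent:

```python
import re

# With "?", the "web" prefix becomes optional, so plain "crawler" matches too.
# Without it, the pattern requires the literal substring "webcrawler".
optional = re.compile(r"(?i)(web)?crawler")
mandatory = re.compile(r"(?i)(web)crawler")

blexbot = "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

# "webmeup-crawler.com" contains "crawler" but not "webcrawler":
assert optional.search(blexbot) is not None
assert mandatory.search(blexbot) is None
```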

For gtmetrix / lighthouse I don't know whether we should treat them as bots; perhaps we could create a new category for them, like "synthetic-bot"? (We could also add "Synthetic" to it, to match Dynatrace as well.)
