Every time I check my nginx logs it's more scrapers than I can count, and I couldn't find any good open-source solutions.
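For triage, a quick way to see which user agents hit you hardest, assuming nginx's default "combined" log format where the user agent is the last quoted field. The sample data below is made up for illustration; in practice you'd pipe in /var/log/nginx/access.log instead.

```shell
# Sample lines in nginx's "combined" log format (hypothetical data);
# in a real setup, read /var/log/nginx/access.log instead.
sample_log='1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"
5.6.7.8 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.0"
9.9.9.9 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"'

# The user agent is the 6th double-quote-delimited field in the combined
# format; count occurrences per agent and sort descending.
printf '%s\n' "$sample_log" | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

The heaviest hitters end up at the top, which makes it easy to spot which bots are worth blocking first.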
Anubis is the name of the tool. Also, Cloudflare just announced they have something against AI scrapers.
I've been using Anubis; my only issue is I'd have to run more than one instance, and I don't like Cloudflare personally.
If nginx, here’s an open-source blocker/honeypot: https://github.com/raminf/RoboNope-nginx
If you have it set up to be proxied or hosted by Cloudflare, they have their own solution: https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/
I wonder why RoboNope doesn't just make a fail2ban entry for anything that accesses a disallowed URL and drop it entirely.
Actually, this looks like it does something similar, then dumps them to fail2ban after they re-access the honeypot page too many times: https://petermolnar.net/article/anti-ai-nepenthes-fail2ban/
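The fail2ban side of that setup can be sketched roughly like this. The filter/jail names, honeypot path, and thresholds below are illustrative placeholders, not taken from the article; adapt them to your own layout.

```ini
# --- /etc/fail2ban/filter.d/ai-honeypot.conf (hypothetical filename) ---
# Match any nginx access-log line requesting the honeypot path.
[Definition]
failregex = ^<HOST> .* "(GET|POST) /honeypot/.* HTTP/.*"
ignoreregex =

# --- /etc/fail2ban/jail.d/ai-honeypot.local (hypothetical filename) ---
# Ban a client after three hits on the honeypot, for 24 hours.
[ai-honeypot]
enabled  = true
port     = http,https
filter   = ai-honeypot
logpath  = /var/log/nginx/access.log
maxretry = 3
bantime  = 86400
```

The honeypot path itself would be listed as disallowed in robots.txt, so only crawlers that ignore robots.txt ever trip the filter.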
I'll check RoboNope out, seems promising.
Weren't there a few AI maze projects in the works? I wonder if running one of those for a bit would get you added to an ignore list; clearly they don't respect your robots.txt file.
Tar pits, I think, is the term for the tools that pollute AI training data.
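The core of a tar pit is just deterministic link generation: every fake page links to more fake pages, so a crawler that ignores robots.txt wanders the maze forever. A minimal sketch in shell; the /maze/ path and page shape are invented here for illustration, this is not how Nepenthes or any specific project implements it.

```shell
# Generate one fake HTML page whose links are derived from the requested
# path, so each page leads to five more unique pages in the maze.
tarpit_page() {
  path=$1
  out="<html><body>"
  for i in 1 2 3 4 5; do
    # Hash path+index so child links are stable but effectively endless.
    child=$(printf '%s/%s' "$path" "$i" | sha256sum | cut -c1-12)
    out="$out <a href=\"/maze/$child\">$child</a>"
  done
  printf '%s</body></html>\n' "$out"
}

tarpit_page /maze/entrance
```

A real deployment would serve this from a CGI script or tiny daemon behind an nginx `location /maze/` block, ideally throttled so the crawler also wastes time, not just requests.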
I just realized an interesting thing: if I use Gemini and tell it to do deep research, it actually goes to the websites it knows/finds and looks up the content to provide up-to-date answers. So some of those AI crawlers are actually not crawlers, but actual users who just use AI instead of coming directly to the site.
Soo… blocking AI completely could also potentially reduce exposure, especially as more and more people use AI to basically do searches instead of browsing themselves. That would also explain the number of daily requests: it could simply be different users using AI to research some topic.
Point is, you should evaluate if the AI requests are just proxies of real users, and blocking AI blocks real users from knowing your site exists.
some of those AI crawlers are actually not crawlers, but actual users who just use AI instead of coming directly to the site. Soo… blocking AI completely could also potentially reduce exposure.
Normally, websites want users to come to their site, instead of an AI search engine “stealing” the content and presenting it as its own. Yes, AI search engines are more convenient for the user, but in the end this will discourage website creators and thereby cut off the AI's own “food supply”.
Yeah I’d consider blocking out both the bots and AI-users a win-win lmao
We all understand that. But if those sites keep insisting on giving everyone their life story and current opinion on world politics before giving us the bread recipe we came for, they can fade away.
I understand, but the shift in user behaviour is significant and I think websites are not taking it into account. If users move more and more to AI, and since Google introduced AI mode it's only a question of time until it becomes the default, we will see more and more of what we think are AI crawlers and fewer and fewer organic users.
AI seems to be the new middleman between you and the user, and if you block the middleman, you block the user. For hobby websites or established sites it may make sense, because people either already know about them or more exposure isn't a goal; for everyone else, it will be painful.
So, what I’m reading is, if your “users” are bad (or bots), just get better users.
Sounds like a net win.
This doesn't really apply in my case because I run some frontends, so there isn't really any information that AI needs.
I’ve seen people mention Anubis, the other one I heard about in a blog post that’s maybe worth looking into is go-away.
I don't have open-source solutions, but Cloudflare had some news last week about a system that I didn't read about (saw two headlines). Dunno if it works or not.