Every time I check my nginx logs it's more scrapers than I can count, and I couldn't find any good open-source solutions.
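For triage, a quick way to see which user agents hit you hardest, assuming nginx's default "combined" log format where the user agent is the last quoted field. The sample data below is made up for illustration; in practice you'd pipe in /var/log/nginx/access.log instead.

```shell
# Sample lines in nginx's "combined" log format (hypothetical data);
# in a real setup, read /var/log/nginx/access.log instead.
sample_log='1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"
5.6.7.8 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.0"
9.9.9.9 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"'

# The user agent is the 6th double-quote-delimited field in the combined
# format; count occurrences per agent and sort descending.
printf '%s\n' "$sample_log" | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

The heaviest hitters end up at the top, which makes it easy to spot which bots are worth blocking first.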
Anubis is the name of the tool. Also, Cloudflare just announced they have something against AI scrapers.
I've been using Anubis; my only issue is I'd have to run more than one instance, and I don't like Cloudflare personally.
If nginx, here’s an open-source blocker/honeypot: https://github.com/raminf/RoboNope-nginx
If you have it set up to be proxied or hosted by Cloudflare, they have their own solution: https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/
I wonder why RoboNope doesn't just make a fail2ban entry for anything that accesses a disallowed URL and drop it entirely.
Actually, this looks like it does something similar, then dumps them to fail2ban after they re-access the honeypot page too many times: https://petermolnar.net/article/anti-ai-nepenthes-fail2ban/
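The fail2ban side of that setup can be sketched roughly like this. The filter/jail names, honeypot path, and thresholds below are illustrative placeholders, not taken from the article; adapt them to your own layout.

```ini
# --- /etc/fail2ban/filter.d/ai-honeypot.conf (hypothetical filename) ---
# Match any nginx access-log line requesting the honeypot path.
[Definition]
failregex = ^<HOST> .* "(GET|POST) /honeypot/.* HTTP/.*"
ignoreregex =

# --- /etc/fail2ban/jail.d/ai-honeypot.local (hypothetical filename) ---
# Ban a client after three hits on the honeypot, for 24 hours.
[ai-honeypot]
enabled  = true
port     = http,https
filter   = ai-honeypot
logpath  = /var/log/nginx/access.log
maxretry = 3
bantime  = 86400
```

The honeypot path itself would be listed as disallowed in robots.txt, so only crawlers that ignore robots.txt ever trip the filter.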
I'll check RoboNope out, seems promising.
Weren't there a few AI maze projects in the works? I wonder if running one of those for a bit would get you added to an ignore list; clearly they don't respect your robots.txt file.
Tar pits, I think, is the term for the tools that pollute AI training data.
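The core of a tar pit is just deterministic link generation: every fake page links to more fake pages, so a crawler that ignores robots.txt wanders the maze forever. A minimal sketch in shell; the /maze/ path and page shape are invented here for illustration, this is not how Nepenthes or any specific project implements it.

```shell
# Generate one fake HTML page whose links are derived from the requested
# path, so each page leads to five more unique pages in the maze.
tarpit_page() {
  path=$1
  out="<html><body>"
  for i in 1 2 3 4 5; do
    # Hash path+index so child links are stable but effectively endless.
    child=$(printf '%s/%s' "$path" "$i" | sha256sum | cut -c1-12)
    out="$out <a href=\"/maze/$child\">$child</a>"
  done
  printf '%s</body></html>\n' "$out"
}

tarpit_page /maze/entrance
```

A real deployment would serve this from a CGI script or tiny daemon behind an nginx `location /maze/` block, ideally throttled so the crawler also wastes time, not just requests.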
I just realized an interesting thing: if I use Gemini and tell it to do deep research, it actually goes to the websites it knows/finds and looks up the content to provide up-to-date answers. So some of those AI crawlers are actually not crawlers, but actual users who just use AI instead of coming directly to the site.
Soo… blocking AI completely could also potentially reduce exposure, especially as more and more people use AI to basically do searches instead of browsing themselves. That would also explain the number of daily requests: it could simply be different users using AI to research some topic.
Point is, you should evaluate if the AI requests are just proxies of real users, and blocking AI blocks real users from knowing your site exists.
some of those AI crawlers are actually not crawlers, but actual users who just use AI instead of coming directly to the site. Soo… blocking AI completely could also potentially reduce exposure.
Normally, websites want users to come to their site, instead of an AI search engine “stealing” the content and presenting it as its own. Yes, AI search engines are more convenient for the user, but in the end this will discourage website creators and thereby cut off the AI's own “food supply”.
Yeah I’d consider blocking out both the bots and AI-users a win-win lmao
We all understand that. But if those sites keep insisting on giving everyone their life story and current opinion on world politics before giving us the bread recipe we came for, they can fade away.
I understand, but the shift in user behaviour is significant and I think websites are not taking it into account. If users move more and more to AI, and since Google introduced AI mode it's only a question of time until it becomes the default, we will see more and more of what we think are AI crawlers and fewer and fewer organic users.
AI seems to be the new middleman between you and the user, and if you block the middleman, you block the user. For hobby websites or established sites it may make sense, because people either already know about them or more exposure isn't a goal; for everyone else, it will be painful.
So, what I’m reading is, if your “users” are bad (or bots), just get better users.
Sounds like a net win.
This doesn't really apply in my case because I run some frontends, so there isn't really any information that AI needs.
I’ve seen people mention Anubis, the other one I heard about in a blog post that’s maybe worth looking into is go-away.
I don't have open-source solutions, but Cloudflare had some news last week about a system that I didn't read about (saw two headlines). Dunno if it works or not.