Aggressive AI scrapers are making it kinda suck to run wikis

tofu@lemmy.nocturnal.garden · 7 days ago

Aggressive AI scrapers are making it kinda suck to run wikis

poVoq@slrpnk.net · 7 days ago

Not only wikis, sadly. Anything with public-facing deep links that trigger extensive database operations is being hammered by these bots, and few servers can take the load.

thatsnothowyoudoit@lemmy.ca · edited 6 days ago

We use NGINX’s 444 response A LOT.

In coordination with careful rate-limiting, it’s been a dramatic improvement.

The worst of the bots don’t advertise their User Agent (or worse, attempt to pretend they’re a normal user while making 100s of requests a second) but there’s lots of low-hanging fruit.
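
For anyone who hasn’t seen this pattern: 444 is a non-standard NGINX code that closes the connection without sending any response, and rate limiting there is usually done with `limit_req_zone` / `limit_req`. Below is a rough Python sketch of the same decision logic, not the actual setup described above. The `Gatekeeper` class, the blocklist entries, and the thresholds are all made up for illustration, and the 429 for rate-limited clients is an assumption.

```python
import time
from collections import defaultdict, deque

# Hypothetical user-agent substrings to drop outright (the "low-hanging fruit").
BLOCKED_UA_SUBSTRINGS = ("examplebot", "scrapy")


class Gatekeeper:
    """Sliding-window rate limiter plus a user-agent blocklist (illustrative only)."""

    def __init__(self, max_requests=10, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of served requests

    def check(self, ip, user_agent, now=None):
        """Return 444 (drop silently), 429 (rate limited), or 200 (serve)."""
        now = time.monotonic() if now is None else now
        ua = (user_agent or "").lower()
        # Missing or blocklisted user agent: drop, mirroring NGINX's 444.
        if not ua or any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            return 444
        hits = self._hits[ip]
        # Forget timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429
        hits.append(now)
        return 200
```

This only catches bots that identify themselves or hammer from a single IP; scrapers that spoof a browser User Agent and spread requests across many addresses would need other signals.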

grrgyle@slrpnk.net · 7 days ago

> We use NGINX’s 444 response A LOT.

Hmm, interesting. I wasn’t aware of this one.

Tiresia@slrpnk.net · 7 days ago

On the plus side, this isn’t a problem with AI; this is a problem with AI companies having more investment money than they know what to do with. The moment the hype fades and they don’t want to hemorrhage money scraping every wiki on the internet thousands of times per day, this traffic will go back to a far more sane amount.

rasterweb@fedia.io · 4 days ago

So we should make more wikis filled with poison data… got it! ;)

poVoq@slrpnk.net · 7 days ago

It will probably go down, but the process itself is kind of unavoidable for training LLMs, so I doubt things will go back to how they were before.