Lately, I've been seeing a lot of traffic from undeclared crawlers, aka "bots". These aren't the usual kind of crawlers (Google, Bing, etc.), in that these bots disguise themselves as regular users by sending a standard, browser-like User-Agent string.
They're also not the usual spambots, which attempt to find a weakness in the defenses by continually trying elaborate sets of parameters designed to trigger a failure in the forum software.
Nor are they the usual high-speed scrapers that breeze in and attempt to read hundreds or thousands of pages in under a minute. We have pretty good defenses against those (though not perfect) and usually stop them within about 10 requests.
This new breed of bot is fairly well behaved, making requests at reasonable rates and acting very much like the mature search engine crawlers (e.g. Google and Bing) -- except that, unlike those crawlers, they don't declare who they are. In fact, these are almost certainly mature data-gathering entities of some kind, though it's not clear to what end.
When I spot-check the IP addresses, I see a wide spectrum of possible culprits. Huawei Cloud showed up earlier today and had read hundreds or maybe thousands of pages from a single IP address, but paced itself enough not to hit the over-limit (high-speed) threshold that would get it booted. I've seen a couple of cases where Comcast appeared to be scanning, but that is almost certainly someone doing something from a residential IP address. And then of course there are always plenty that resolve to well-known data centers (Azure, AWS, and even Google Cloud).
I'm okay with Google crawling the site, obviously. And even Bing. It's a mutually beneficial relationship. I allow a wide variety of other, smaller entities to crawl the site, even though the potential benefits are intangible. Pinterest? Sure. Why not. Apple? Well, I'm biased, so sure.
Then there are all the Fully Buzzword Compliant SEO Marketing companies that want to build their database and improve their SEO-juicing product, presumably so that they can try to sell it back to the site owners that provided the data to begin with. Screw those guys. I generally block them. At least they have the decency to declare their crawlers properly, though.
But the bots I'm really thinking about here are the stealthy ones, the ones that don't declare themselves but otherwise don't break any obvious rules. Though none of them are doing any tangible harm to the site, in aggregate they are a drain on resources. It's not going to break the bank; I don't think it's costing us a lot to serve these undeclared bots. But it's also not nothing.
Really, though, my point is that Modern Vespa is for the humans, not for the bots. Allowing Google and Bing to crawl is also for the humans. Anything else is just an electronic vampire.
And rather than just block IP addresses of well-known data centers (which I have done on occasion, and am doing right now for Huawei), I think it's time to develop some heuristics to filter out the bots based on the fact that they don't behave the way humans do.
I have some ideas. Let's see which of us is smarter -- the bots or the humans.
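To give a flavor of what I mean, here's a rough Python sketch of the kind of scoring I'm picturing. None of this is what actually runs on the site; the signals (asset fetches, pacing regularity, session depth) and the thresholds are placeholders, just to illustrate the idea of scoring behavior rather than blocking IP ranges.

```python
# Rough sketch only: placeholder signals and thresholds, not the site's real defenses.
from dataclasses import dataclass, field
from statistics import pstdev


@dataclass
class ClientStats:
    """Per-IP (or per-session) counters gathered from the access log."""
    page_requests: int = 0                          # HTML page views
    asset_requests: int = 0                         # CSS/JS/image fetches
    intervals: list = field(default_factory=list)   # seconds between page requests


def bot_score(c: ClientStats) -> int:
    """Add a point for each bot-like signal; higher score = more bot-like."""
    score = 0

    # A human's browser pulls stylesheets and images; headless scrapers often don't.
    if c.page_requests >= 20 and c.asset_requests == 0:
        score += 1

    # Humans read at wildly uneven speeds; crawlers pace themselves evenly.
    if len(c.intervals) >= 20 and pstdev(c.intervals) < 2.0:
        score += 1

    # Nobody human reads hundreds of pages in one sitting.
    if c.page_requests > 300:
        score += 1

    return score


# Example: 400 evenly paced page views and never a single asset fetch.
suspect = ClientStats(page_requests=400, asset_requests=0, intervals=[5.0] * 399)
print(bot_score(suspect))  # 3 -> worth a closer look, or a block
```

In practice there would be more signals (time of day, whether the "user" ever logs in or posts, crawl order versus human browsing patterns) and some weighting, but that's the general shape of it.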