This seems like an unsustainable approach for the bots. There are, after all, a finite number of IP addresses available. But I've amassed a list of 87,000 IP addresses that have each made a single page request and then (importantly) not executed the JavaScript embedded in that page -- which strongly suggests that those 87,000 different IP addresses were bots.
Unfortunately, I don't know that until after I've served them a page.
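To make that heuristic concrete, here is a minimal sketch in Python. It is not the actual code behind the site: the `Hit` record, the `ran_js` flag, and the function name are all hypothetical, and it assumes the embedded JavaScript reports back in some way that can be logged per request.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    ip: str        # client IP address (hypothetical log field)
    ran_js: bool   # True if the page's embedded JavaScript reported back


def suspected_bot_ips(hits):
    """Return IPs that made exactly one request and never ran the JavaScript."""
    request_count = defaultdict(int)
    ever_ran_js = defaultdict(bool)
    for hit in hits:
        request_count[hit.ip] += 1
        ever_ran_js[hit.ip] = ever_ran_js[hit.ip] or hit.ran_js
    return {
        ip for ip, n in request_count.items()
        if n == 1 and not ever_ran_js[ip]
    }


# Made-up sample log using documentation-only IP ranges.
hits = [
    Hit("203.0.113.5", False),   # one request, no JavaScript: flagged
    Hit("198.51.100.7", True),   # one request, JavaScript ran: not flagged
    Hit("192.0.2.9", False),     # two requests: not the single-request pattern
    Hit("192.0.2.9", False),
]
flagged = suspected_bot_ips(hits)
```

As the sketch shows, an address can only be flagged after its one request has already been served.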
Fortunately, I can track all the pertinent details about their IP address and the client software they are using to help make an informed analysis about the next request. Typically, that would mean screening the statistically suspicious requests based on past behavior, making the visitor answer an "Are you human?" question that (surprisingly) thwarts most bots.
I don't want to make every new visitor go through that process, but if some humans see it that's probably okay. I'm currently getting it right about 80% of the time, with the remainder split roughly evenly between over-screening humans and under-screening bots.
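A rough sketch of what a screening rule like this could look like, and of how the "right about 80% of the time" figure breaks down. Everything here is an assumption for illustration: the `min_sessions` and `threshold` values, the function names, and the idea of deciding from per-group counts are not the actual rules in use.

```python
def should_challenge(validated, unvalidated, min_sessions=20, threshold=0.5):
    """Decide whether to show the "Are you human?" question.

    validated / unvalidated are past session counts for the group this
    request belongs to (an IP range or client signature, say).
    The cutoffs are hypothetical placeholders, not tuned values.
    """
    total = validated + unvalidated
    if total < min_sessions:
        return False  # too little history: let new visitors through
    return unvalidated / total > threshold


def score(decisions):
    """Summarise (challenged, was_bot) pairs into the three outcomes."""
    decisions = list(decisions)
    if not decisions:
        raise ValueError("no decisions to score")
    n = len(decisions)
    correct = sum(1 for challenged, was_bot in decisions if challenged == was_bot)
    over = sum(1 for challenged, was_bot in decisions if challenged and not was_bot)
    under = sum(1 for challenged, was_bot in decisions if was_bot and not challenged)
    return {
        "accuracy": correct / n,
        "over_screened_humans": over / n,
        "under_screened_bots": under / n,
    }


# Made-up outcomes shaped like the split described above:
# 8 correct calls, 1 human challenged, 1 bot let through.
sample = (
    [(True, True)] * 4 + [(False, False)] * 4
    + [(True, False)] + [(False, True)]
)
summary = score(sample)
```

Raising `threshold` trades over-screened humans for under-screened bots, which is the balance being described.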
Anyway, this is just a very long introduction to the point I was really trying to make.
One of the details I have available about each IP address is its country of origin. This isn't too useful, except for a handful of notoriously bot-friendly countries -- China and Hong Kong being the most extreme examples. These are mostly unsophisticated bots, however -- they don't seem to use the one-request-per-IP-address approach.
But I graphed the data anyway, even though I'm not using it to make screening decisions. The results were... curious, to say the least.
The graph below is ordered by overall number of sessions, so the US is where the majority of our traffic comes from, with the UK in second place. The validated:unvalidated column is the interesting one -- this roughly translates to humans vs bots, and is the best gauge I have available to make screening decisions when applied to groups of IP addresses or client software signatures. Here, though, it seems to show that a disproportionate number of these bots are using IP addresses in Commonwealth countries plus Ireland.
(Edited to add: these are, for the most part, residential IP addresses, used by consumers on consumer networks. These are not data centers, which I screen out wholesale in a separate process.)
Which is weird. And not at all what I expected. I would have expected the human:bot split to be roughly similar to that of the US and UK.
I don't really have a very good explanation for why the split is so lopsided for these specific countries.
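For anyone who wants to run the same kind of breakdown on their own logs, here is a minimal sketch of the per-country validated:unvalidated tally, ordered by overall sessions. The input shape (a country code plus a validated flag per session) and the function name are assumptions, and the sample data is made up purely to show the output format; it is not my traffic.

```python
from collections import Counter


def country_breakdown(sessions):
    """Tally sessions per country.

    sessions: iterable of (country_code, validated) pairs, where
    validated is True if the session ran the embedded JavaScript.
    Returns rows of (country, total, validated, unvalidated),
    ordered by total sessions, largest first.
    """
    validated = Counter()
    unvalidated = Counter()
    for country, ok in sessions:
        if ok:
            validated[country] += 1
        else:
            unvalidated[country] += 1
    rows = [
        (c, validated[c] + unvalidated[c], validated[c], unvalidated[c])
        for c in set(validated) | set(unvalidated)
    ]
    # Sort by total descending; break ties alphabetically for stable output.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Made-up sample, only to illustrate the shape of the result.
sample = (
    [("US", True)] * 3 + [("US", False)]
    + [("GB", True)] * 2
    + [("IE", True)] + [("IE", False)] * 2
)
rows = country_breakdown(sample)
for country, total, v, u in rows:
    print(f"{country}: {total} sessions, validated:unvalidated = {v}:{u}")
```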
⚠️ Last edited by jess on UTC; edited 1 time
