besupa wrote:
Maybe the internet is just growing, getting faster, and becoming more populated by hungry bots and scrapers; maybe we could get you enough cups of coffee to buy a bigger hammer to deal with it all?
Yeah, that's a possibility. The demand for data seems to be insatiable these days. We've been able to get by on modest hardware for years, but I'm wondering if that is sustainable.
besupa wrote:
Traffic segregation could be another way, like shunting all non-logged-in pages to static/cached versions or something, refreshed at low frequency. Are there certain places the bots are getting that are particularly stressing the server, some kind of expensive page, or is it just too much overall?
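Shunting anonymous traffic to cached copies is worth thinking about. If it were done in the application rather than at a proxy, it would look roughly like the sketch below (Python/Flask purely for illustration; that's not what the forum runs on, and every name here is made up):

```python
# Rough sketch of the "serve anonymous visitors from a cache" idea.
# Hypothetical names throughout; not the forum's actual stack.
import time
from flask import Flask, request

app = Flask(__name__)

CACHE_TTL = 300          # refresh the anonymous copy every 5 minutes
_page_cache = {}         # path -> (rendered_html, cached_at)

def render_topic(path):
    """Placeholder for the real (database-heavy) topic renderer."""
    return f"<html>rendered copy of {path}</html>"

@app.route("/topic/<path:path>")
def topic(path):
    logged_in = "session_id" in request.cookies
    if logged_in:
        return render_topic(path)           # members always get a fresh page
    cached = _page_cache.get(path)
    if cached and time.time() - cached[1] < CACHE_TTL:
        return cached[0]                    # anonymous visitors get the stored copy
    html = render_topic(path)
    _page_cache[path] = (html, time.time())
    return html
```

Logged-in members would still get freshly rendered pages; only the anonymous copy would go stale for a few minutes.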
As for where they're getting to: most of the requests are for topics, though there are some bots that seem intent on reading the wiki incessantly and others that really like to aggregate lists of member profile pages. Oh, and some that hit the topic search pages on their very first request, though I put in a defense against that just today.
We already cache the rendered posts (as HTML) in the database and reassemble each topic from those pre-rendered posts. That saves processing power (BBCode rendering is expensive) but it's still database-intensive, and the database is where the bottleneck is right now. Unsurprisingly, it's also already our largest single expense, so increasing its capacity is going to be pricey.
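In case the shape of that isn't obvious, here's a minimal sketch of the render-once-and-store idea (Python and SQLite purely for illustration; not our actual code, and the table and function names are invented):

```python
# Render each post's BBCode once, store the HTML, and reuse it on every later request.
import html
import sqlite3

db = sqlite3.connect("forum.db")
db.execute("CREATE TABLE IF NOT EXISTS post_html (post_id INTEGER PRIMARY KEY, html TEXT)")

def render_bbcode(raw):
    """Stand-in for the real (expensive) BBCode renderer."""
    return "<p>" + html.escape(raw) + "</p>"

def rendered_post(post_id, raw):
    """Return the cached HTML for a post, rendering and storing it on first use."""
    row = db.execute("SELECT html FROM post_html WHERE post_id = ?", (post_id,)).fetchone()
    if row:
        return row[0]                      # cheap path: HTML is already in the database
    rendered = render_bbcode(raw)          # expensive path: render once...
    db.execute("INSERT INTO post_html VALUES (?, ?)", (post_id, rendered))
    db.commit()
    return rendered                        # ...and reuse it from now on

def rendered_topic(posts):
    """Reassemble a topic page from its already-rendered posts: (post_id, raw_bbcode) pairs."""
    return "\n".join(rendered_post(pid, raw) for pid, raw in posts)
```

The rendering cost goes away, but every topic view still means a round of database reads, which is why the database stays the bottleneck.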
The server itself -- the raw processing -- seems to be up to snuff. When it doesn't have to hit the database, it can process a surprising volume of requests without breaking a sweat. Just a few minutes ago I watched it being pummeled by requests from the Alibaba data center (all of which are blocked by IP address) and it handled several thousand requests in a minute or so (see picture below straight from the logs).
The bots that have been problematic lately are very swarm-oriented -- everything will be humming along fine, and then all of a sudden we get hit with requests from all over the world at once. I've been considering rate-limiting the number of new anonymous sessions across the board and going into a protective mode at some yet-to-be-defined threshold -- kind of a temporary shield that would block additional anonymous sessions for a minute or two, long enough for the database to catch its breath.
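If it helps to picture it, the logic I have in mind is roughly the sketch below (Python purely for illustration; the threshold and timings are placeholders, not decided yet):

```python
# Count new anonymous sessions in a sliding window; once a threshold is crossed,
# refuse further ones for a cooldown period so the database can catch its breath.
import time
from collections import deque

WINDOW_SECONDS = 60        # how far back we count new anonymous sessions
THRESHOLD = 500            # placeholder; the real value is yet to be defined
SHIELD_SECONDS = 120       # how long the shield stays up once triggered

_new_sessions = deque()    # timestamps of recently created anonymous sessions
_shield_until = 0.0

def allow_new_anonymous_session(now=None):
    """Return True if a new anonymous session may be created right now."""
    global _shield_until
    now = time.time() if now is None else now

    if now < _shield_until:
        return False                           # shield is up: refuse

    # Drop timestamps that have slid out of the counting window.
    while _new_sessions and now - _new_sessions[0] > WINDOW_SECONDS:
        _new_sessions.popleft()

    if len(_new_sessions) >= THRESHOLD:
        _shield_until = now + SHIELD_SECONDS   # raise the shield for a while
        return False

    _new_sessions.append(now)
    return True
```

The check would only sit where new anonymous sessions get created, so logged-in members and existing sessions wouldn't notice a thing.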
I'll investigate tomorrow. That shouldn't be too hard to implement.