jess wrote:
There's no BBCode processing on attachment titles. Emoticons are not expected to work there.
Atypical Canadian
2009 Vespa S50(LX150 motor swap), 2006 Vespa GTS250ie
Joined: UTC
Posts: 2319 Location: Toronto, Canada |
UTC
quote
jess wrote: There's no BBCode processing on attachment titles. Emoticons are not expected to work there. |
OP
|
UTC
quote
jess wrote: I have some ideas. Let's see which of us is smarter -- the bots or the humans. So, day 1: the bots are winning. |
Hooked
GTS 300 HPE (2020); V-Strom 650 XT (2019)
Joined: UTC
Posts: 184 Location: SF Bay Area, California |
UTC
quote
jess wrote: So, day 1: the bots are winning. If you don't mind me asking, since you're using CloudFront elsewhere, could that be used to deal with some of this? It looks like the live site is separate from the assets (static), but a strategy for database load might be to join them a little. I think either an Apache-first or CF-first could work. It wouldn't really be fixing the problem directly, but may be piling sandbags more efficiently. |
OP
|
UTC
quote
besupa wrote: Would it be fair to say that it's the load on the database that's the problem, or are you looking for an overall solution to bad actors? If you don't mind me asking, since you're using CloudFront elsewhere, could that be used to deal with some of this? It looks like the live site is separate from the assets (static), but a strategy for database load might be to join them a little. I think either an Apache-first or CF-first could work. It wouldn't really be fixing the problem directly, but may be piling sandbags more efficiently. CloudFront is not nearly as obvious of a win for dynamic content, though. I know it's possible but... I just would really prefer that the main server is the only source for the actual posts and topics themselves. I've spent years optimizing the living daylights out of the original forum software, and generally it's acceptable. The main server is only lightly loaded, and the database (which is a separate server) is similarly unchallenged. Both are running on burstable classes of hardware (T4g, specifically), and that's generally been sufficient for most of the use cases. The main traffic problems in terms of potential overwhelming traffic loads are generally the high-speed scrapers, which we can (mostly) detect and thwart before they do too much damage. We see them show up frequently, but they don't get very far. So really, I've got no reason for this bot-hunting adventure. Except... well, I'm a bit paranoid. I can see that someone is using residential VPN addresses to slowly and methodically scrape the site, and that worries me. I don't know if they're intending to publish an imposter site, or it is some kind of litigation-hungry compliance service, or...? I just really don't know. But it makes me queasy. Also, it's a technical challenge, and now that I'm retired, I need the mental exercise of solving these kinds of puzzles. |
OP
|
UTC
quote
There are three "official" session classes on Modern Vespa: registered users, known crawlers (e.g. Google), and anonymous users. In theory, the anonymous class should be mostly humans who just haven't registered accounts. But in practice, I am guessing that more than 75% of the anonymous traffic is inauthentic. By that I mean, not actually human.
This statistic keeps me up at night. |
OP
|
UTC
quote
In the continuing saga of Operation FdaBots, I've been watching the logs and doing a lot of cross-referencing of IP addresses. I can see a persistent pattern of requests, but they're nearly impossible to prevent. A single request, every 10-20 seconds, from a different residential IP address each time. Never any additional navigation around the site, only the one request. Oh, and always claiming to be referred by a Google search.
Blocking these would be suicide for the site. And yet, I can't escape the idea that these are inauthentic requests. I actually put a novel defense in place this afternoon to try to slow down this specific kind of request, and it did... SFA. Either the request is genuinely from a human, or the bots are smart enough to navigate through the additional requirements. Either way, on day 2, the score is: Bots 2, jess 0. |
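For anyone who wants to look for the same pattern in their own logs, here is a minimal sketch of the cross-referencing described above. It assumes the log has already been parsed into hypothetical `(ip, path, referer)` tuples, and it only recognizes `google.com` referrers, which is a simplification. It measures the pattern; it does not decide whether a request is human.

```python
from collections import defaultdict
from urllib.parse import urlparse


def one_off_google_referrals(requests):
    """Return IPs that made exactly one request and claimed a Google referrer.

    `requests` is an iterable of (ip, path, referer) tuples; this tuple
    layout is a hypothetical stand-in for whatever the real log provides.
    """
    by_ip = defaultdict(list)
    for ip, path, referer in requests:
        by_ip[ip].append((path, referer))

    flagged = []
    for ip, hits in by_ip.items():
        if len(hits) != 1:
            continue  # more than one request: not the one-off pattern
        host = urlparse(hits[0][1] or "").hostname or ""
        # Simplification: country-specific Google domains are ignored.
        if host == "google.com" or host.endswith(".google.com"):
            flagged.append(ip)
    return sorted(flagged)


sample = [
    ("198.51.100.7", "/forum/topic1", "https://www.google.com/"),
    ("198.51.100.8", "/forum/topic2", "https://www.google.com/"),
    ("198.51.100.8", "/forum/topic3", ""),
    ("198.51.100.9", "/forum/topic4", ""),
]
flagged = one_off_google_referrals(sample)
```

Only the first address is flagged here: the second made two requests, and the third sent no referrer.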
Atypical Canadian
2009 Vespa S50(LX150 motor swap), 2006 Vespa GTS250ie
Joined: UTC
Posts: 2319 Location: Toronto, Canada |
OP
|
UTC
quote
adri wrote: Too seldom used?
Also, generally speaking, text that isn't in the message body (i.e. posts) doesn't get BBCode formatting. Topic titles don't get formatting. Your list of scooters doesn't get formatting. The one exception besides posts is signatures. I also don't even want to think about what happens when someone includes an inline image or a YouTube video in an attachment comment.
And finally, BBCode formatting comes at a processing cost. The code is surprisingly gnarly, and performs a ton of operations. Consider this MV-exclusive feature, where a link to a topic:
https://modernvespa.com/forum/topic187082
Turns into a shorthand BBCode:
[topic187082]
And then is displayed as a fully-formed topic link inline in the post:
[NSR] What's Pissing you off Today? III
All that is done as part of the BBCode processing. The original forum software re-rendered every single post for every single topic on every single topic page view, which was a nightmare for server load. I've added caching of the post-processed HTML, which sacrifices database storage size to achieve fast page loads. Signatures are similarly rendered to HTML only once (or whenever you edit your sig) and kept in the database both as the original string and the post-processed HTML.
The complexity level of all of this extra plumbing is high. I do not especially want to extend that mechanism to another field, especially given how little use that field gets. |
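To make the render-once idea concrete, here is a minimal sketch of the topic-link shorthand and the cached HTML. This is not the forum's actual code: the regular expressions, the `PostStore` class, and the in-memory dictionary standing in for the database are all assumptions for illustration.

```python
import html
import re

# Hypothetical patterns, modelled on the example URL and tag in the post.
TOPIC_URL = re.compile(r"https://modernvespa\.com/forum/topic(\d+)")
TOPIC_TAG = re.compile(r"\[topic(\d+)\]")


def to_shorthand(text):
    """Replace full topic URLs with the [topicNNN] shorthand."""
    return TOPIC_URL.sub(r"[topic\1]", text)


def render(text, titles):
    """Expand [topicNNN] tags into HTML links; unknown ids are left as-is."""
    def expand(match):
        topic_id = int(match.group(1))
        title = titles.get(topic_id)
        if title is None:
            return match.group(0)
        return f'<a href="/forum/topic{topic_id}">{html.escape(title)}</a>'
    return TOPIC_TAG.sub(expand, text)


class PostStore:
    """Keeps the original source and the rendered HTML side by side."""

    def __init__(self, titles):
        self.titles = titles
        self.rows = {}          # post_id -> (source, rendered_html)
        self.render_count = 0   # how many times the renderer actually ran

    def save(self, post_id, source):
        source = to_shorthand(source)
        self.render_count += 1  # render on save or edit only
        self.rows[post_id] = (source, render(source, self.titles))

    def html_for(self, post_id):
        # Page views read the cached HTML; nothing is re-rendered here.
        return self.rows[post_id][1]


store = PostStore({187082: "[NSR] What's Pissing you off Today? III"})
store.save(1, "See https://modernvespa.com/forum/topic187082 for more")
page_views = [store.html_for(1) for _ in range(3)]
```

The trade-off matches the one described above: each post is stored twice, and in exchange a page view costs a lookup rather than a render.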
Moderaptor
The Hornet (GT200, aka Love Bug) and 'Dimples' - a GTS 300
Joined: UTC
Posts: 44337 Location: Pleasant Hill, CA |
UTC
quote
I have a feature request - having entered a search, and received the results, I'd like to post a link to that search, so that others could see the same results.
|
OP
|
UTC
quote
jimc wrote: I have a feature request - having entered a search, and received the results, I'd like to post a link to that search, so that others could see the same results.
|
OP
|
UTC
quote
I spent the day watching the logs again. It's about as much fun as watching paint dry, if there were a chance the paint would catch fire suddenly without warning. You'll just be sitting there, zoning out, watching the slow and steady crawl of text on the screen, when all of a sudden BAM BAM! BAM! BAM BAM BAM1!1!! and 100 requests go by in 3 seconds, usually from some shady network operator in SE Asia.
I've made a little bit of progress on Operation FdaBots, but not against the bots, per se. I've managed to identify some of the anonymous traffic that (a) makes a single request at a time at odd intervals, and (b) doesn't read like a human. It turns out that Google has a fleet of "prefetch proxy" servers whose sole job it is to fetch the content of search result links that might be clicked by a user. They do this from a proxy server in order to preserve the privacy of whoever is searching. No cookies are sent, and in fact they won't do the prefetch if the user has any cookies for the site in question. So these particular requests are still very much anonymous, and not entirely human, but probably genuine nonetheless. I managed to track down the header values that Google adds to these prefetch requests that ostensibly inform the site that it's a prefetch request, though there is very little that I can do in real time to validate that it is in fact Google -- mostly, I just have to take their word for it, and it's probably spoofable. In any case, once I subtract out these prefetch proxy requests, I'm left with slightly fewer requests that seem suspicious -- but only slightly. So, let's see... Day 3: Bots 4, jess 1.
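A rough sketch of what such a header check could look like. The post does not name the headers, so this assumes the `Sec-Purpose: prefetch;anonymous-client-ip` request header that Chrome's private prefetch proxy documentation describes; treat both the header name and its value as assumptions to verify. As noted above, the header is trivially spoofable, so this is only good for subtracting likely prefetches from the statistics.

```python
def looks_like_private_prefetch(headers):
    """Guess whether a request is a cookieless proxy prefetch.

    `headers` is a plain dict of request headers. The Sec-Purpose header
    and its tokens are an assumption, not something stated in the post.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    if "cookie" in normalized:
        # The post says prefetches are only made when no cookies exist.
        return False
    purpose = normalized.get("sec-purpose", "")
    tokens = [token.strip().lower() for token in purpose.split(";")]
    return "prefetch" in tokens and "anonymous-client-ip" in tokens


prefetch = looks_like_private_prefetch(
    {"Sec-Purpose": "prefetch;anonymous-client-ip"}
)
ordinary = looks_like_private_prefetch({"User-Agent": "Mozilla/5.0"})
```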
|
OP
|
UTC
quote
One of the obvious ways to thwart some of the mysteriously non-human anonymous requests would be to check the incoming IP address against a list of known-problem-proxy IPs -- in particular, IP addresses that are acting as a VPN egress point but are in a residential setting. I honestly don't care if people use VPNs, but the legitimate ones generally egress from a data center. VPNs that egress from a residential IP seem (to me) quite a bit shadier.
The problem with checking IP addresses in real time, though, is that it adds a lot of latency to every anonymous request. Or, at least, the first anonymous request we see from that IP address. Attempting to do reverse-DNS (to find the hostname) or using an IP proxy identification service would add tens or hundreds of milliseconds to the first request. For reference, I am pretty alarmed when any request on the site takes more than about 50ms to fulfill, and I'd prefer something below 30ms. As an experiment, though, I am considering doing a post-processing step on these mysterious one-off anonymous requests to gather some information and see what percentage of them are egressing from residential VPN IP addresses. That might at least give me some sense of how big of a problem I actually have. |
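The post-processing experiment could be sketched roughly like this: collect the one-off IPs from the logs, resolve them offline so no live request pays the latency, and tally what comes back. The hostname hints are a hypothetical heuristic, and reverse DNS alone cannot tell whether an address is a VPN egress point; that would need the kind of proxy identification service mentioned above, which this sketch does not model.

```python
import socket
from collections import Counter

# Hypothetical substrings that often appear in consumer ISP hostnames.
RESIDENTIAL_HINTS = ("dsl", "cable", "dyn", "pool", "broadband")


def reverse_dns(ip):
    """Return the PTR hostname for an IP, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def classify(hostname):
    """Bucket a hostname using the (hypothetical) residential hints."""
    if hostname is None:
        return "unknown"
    lowered = hostname.lower()
    if any(hint in lowered for hint in RESIDENTIAL_HINTS):
        return "residential-looking"
    return "other"


def summarize(ips, resolver=reverse_dns):
    """Count unique IPs per bucket; runs offline, after the requests."""
    counts = Counter()
    for ip in set(ips):
        counts[classify(resolver(ip))] += 1
    return counts


# A fake resolver keeps the example runnable without network access.
fake_ptr = {
    "203.0.113.5": "cpc1-pool-5.cable.example.net",
    "203.0.113.6": "server1.example.com",
}
summary = summarize(
    ["203.0.113.5", "203.0.113.5", "203.0.113.6", "203.0.113.7"],
    resolver=fake_ptr.get,
)
```

Because the lookup happens after the fact, a slow or failed resolution costs nothing on the request path.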
Atypical Canadian
2009 Vespa S50(LX150 motor swap), 2006 Vespa GTS250ie
Joined: UTC
Posts: 2319 Location: Toronto, Canada |
UTC
quote
jess wrote: Seldom used, and also part of an archaic attachment system that is unlikely to survive the year before being replaced with something slimmer. Also, generally speaking, text that isn't in the message body (i.e. posts) doesn't get BBCode formatting. Topic titles don't get formatting. Your list of scooters doesn't get formatting. The one exception besides posts is signatures. I also don't even want to think about what happens when someone includes an inline image or a YouTube video in an attachment comment. And finally, BBCode formatting comes at a processing cost. The code is surprisingly gnarly, and performs a ton of operations. Consider this MV-exclusive feature, where a link to a topic: https://modernvespa.com/forum/topic187082 Turns into a shorthand BBCode: [topic187082] And then is displayed as a fully-formed topic link inline in the post: [NSR] What's Pissing you off Today? III All that is done as part of the BBCode processing. The original forum software re-rendered every single post for every single topic on every single topic page view, which was a nightmare for server load. I've added caching of the post-processed HTML, which sacrifices database storage size to achieve fast page loads. Signatures are similarly rendered to HTML only once (or whenever you edit your sig) and kept in the database both as the original string and the post-processed HTML. The complexity level of all of this extra plumbing is high. I do not especially want to extend that mechanism to another field, especially given how little use that field gets. |
OP
|
UTC
quote
The latest threat actor that seems intent on attacking Modern Vespa is originating from Alibaba's data center, in the following CIDR ranges:
47.76.0.0/17, 47.76.128.0/17, 47.102.0.0/15, 47.116.128.0/17
They are using a large number of IP addresses to avoid detection, but the requests are all grouped together within the same 1-2 second span, and all in similar IP ranges, so it's kind of obvious. They are attempting to access (and being rejected from) a bunch of resources that are explicitly forbidden by robots.txt, including the login page, the PM page, the posting page, and so on. Clearly whatever entity is using Alibaba's hosting is not a legitimate crawler -- not even close. Fuck those guys. |
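For reference, checking an address against those ranges is straightforward with Python's standard `ipaddress` module. This is a minimal sketch, not how the forum does it; the function name is made up, and only the four CIDR ranges come from the post.

```python
import ipaddress

# The four ranges listed in the post.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "47.76.0.0/17",
        "47.76.128.0/17",
        "47.102.0.0/15",
        "47.116.128.0/17",
    )
]


def in_blocked_range(ip):
    """True if the address falls inside any of the listed ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_RANGES)


# Adjacent ranges can be merged: the two 47.76.x.x /17 blocks
# together make up a single /16.
collapsed = list(ipaddress.collapse_addresses(BLOCKED_RANGES))
```

Collapsing the list first keeps the number of comparisons per request down when the list grows.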
OP
|
UTC
quote
Here's a snippet from the logs, showing the Alibaba datacenter request. The action they are attempting (the "posting.php" part) is definitely NOT allowed for unregistered users, and the fact that they tried the same sequence twice in a row tells me that they know fuck-all about what they are doing.
|
UTC
Molto Verboso
Gina, 1965 Vespa 180SS, Bella,1968 Vespa 150 Super, Mia, 2017 Vespa Primavera 70th Anniversary 150ie, Gabriella, 2017 GTS300 ABS
Joined: UTC
Posts: 1931 Location: Hamilton/Kirikiriroa, NZ |
|
UTC
quote
For the first time ever that I have seen, we are showing as having 0 guests on the forum. Is this a problem or an odd coincidence?
|
OP
|
UTC
quote
pigletpilot wrote: For the first time ever that I have seen, we are showing as having 0 guests on the forum. Is this a problem or an odd coincidence? |
OP
|
UTC
quote
pigletpilot wrote: For the first time ever that I have seen, we are showing as having 0 guests on the forum. Is this a problem or an odd coincidence? And for finding (and most importantly reporting) this forum bug, I hereby bestow upon you the MV Entomologist Award. Wear it with pride.
|
|
UTC
quote
jess wrote: Here's a snippet from the logs, showing the Alibaba datacenter request. The action they are attempting (the "posting.php" part) is definitely NOT allowed by unregistered users, and the fact that they tried the same sequence twice in a row tells me that they know fuck-all about what they are doing. Looking there, as a couple of suggestions: add number formatting to make the large user counts more readable? Update timeseries heading from "recent activity" to "activity over past X hours"? |
OP
|
UTC
quote
berto wrote: Are these the "Blocked Bots" showing up in the page statistics section? berto wrote: Looking there, as a couple of suggestions: add number formatting to make the large user counts more readable? Update timeseries heading from "recent activity" to "activity over past X hours"? |
OP
|
UTC
quote
In the never-ending quest to understand which "guests" are human and which are bots, I've started to take a closer look at origins of the guests that show up and ask for only one (or sometimes two) pages, who don't read like humans, and who then never make another request.
I get these requests from all over the world, and the usual suspects are represented: China, Russia, and India. And some in the US as well. What's surprising is how many of these are coming from the UK. What's more surprising is that most of these random one-off guests from the UK appear to be originating from residential (i.e. not data center or VPN) hosts -- Virgin Media, BTCentral, and Sky Broadband seem to reappear quite a bit, each time at a fresh IP address. Hmmmm. |
Molto Verboso
2009 GTS250, Ducati Monster M900, KTM 390 Adventure, Honda CR125
Joined: UTC
Posts: 1695 Location: Oceanside, CA |
UTC
quote
jess wrote: What's surprising is how many of these are coming from the UK. What's more surprising is that most of these random one-off guests from the UK appear to be originating from residential (i.e. not data center or VPN) hosts -- Hmmmm. |
Moderaptor
The Hornet (GT200, aka Love Bug) and 'Dimples' - a GTS 300
Joined: UTC
Posts: 44337 Location: Pleasant Hill, CA |
UTC
quote
jess wrote: What's surprising is how many of these are coming from the UK. What's more surprising is that most of these random one-off guests from the UK appear to be originating from residential (i.e. not data center or VPN) hosts -- Virgin Media, BTCentral, and Sky Broadband seem to reappear quite a bit, each time at a fresh IP address. Hmmmm. |
OP
|
UTC
quote
jimc wrote: VM, BT, and Sky have the vast majority of broadband customers in the UK, over 75% between them, so that's not surprising. I'll add, the more clueful customers won't touch any of those with a bargepole. |
Moderaptor
The Hornet (GT200, aka Love Bug) and 'Dimples' - a GTS 300
Joined: UTC
Posts: 44337 Location: Pleasant Hill, CA |
UTC
quote
jess wrote: It's not the specific providers that are surprising, though -- it's that there are so many hits that don't register as human coming from those providers. The remaining ISPs are more niche, and tend to have the more technically minded customers, who may run their own servers etc.
|
OP
|
UTC
quote
Putting aside the UK queries for a moment -- and to be clear, they aren't overwhelming, they're just a drip-drip-drip in the background -- today's real attack vector appears to be originating out of Singapore.
Singapore shows up as a source of nuisance crawling from time to time. Today, I am getting overwhelmed by requests. Enough that I've had to put an emergency stopgap into place to keep the server from melting down. Guest clients arriving from Singapore (and a handful of other countries in the region) who immediately ask for a topic page (which is the usual target) will be asked to verify they are human in a very, very rudimentary (i.e. not very secure) fashion. Once they declare themselves human, we leave the session alone and let them browse in peace. So far, it seems to be keeping them at bay -- I'm still getting tons of requests, but they never make it past the "are you a human" page. Fingers crossed.
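The stopgap rule reads roughly like the sketch below. Everything concrete in it is an assumption: the country code set (the post only names Singapore plus unspecified neighbours), the `/forum/topicNNN` path test, and the `GuestSession` fields. The human check itself is deliberately left out, since the post describes it only as rudimentary.

```python
import re
from dataclasses import dataclass

# Only Singapore is named in the post; the other countries are unspecified.
CHALLENGE_COUNTRIES = {"SG"}

TOPIC_PATH = re.compile(r"^/forum/topic\d+")


@dataclass
class GuestSession:
    country: str
    requests_seen: int = 0
    challenged: bool = False
    declared_human: bool = False


def handle(session, path):
    """Return "serve" or "challenge" for a guest request."""
    first_request = session.requests_seen == 0
    session.requests_seen += 1

    if session.declared_human or session.country not in CHALLENGE_COUNTRIES:
        return "serve"  # verified sessions and other regions browse in peace
    if session.challenged or (first_request and TOPIC_PATH.match(path)):
        session.challenged = True  # keep asking until they declare
        return "challenge"
    return "serve"


bot = GuestSession(country="SG")
bot_results = [handle(bot, "/forum/topic187082") for _ in range(3)]

visitor = GuestSession(country="SG")
first = handle(visitor, "/forum/topic187082")
visitor.declared_human = True  # passed the "are you a human" page
second = handle(visitor, "/forum/topic187082")
```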
|
|
UTC
quote
jess wrote: Yes. I was catching up on Favorite threads, opening several (say 10) to new tabs in quick succession. After the first few, they all returned 403 Forbidden. Seems I got locked out? I'm not from Singapore, if that matters |
OP
|
UTC
quote
berto wrote: Am I a blocked bot?! I was catching up on Favorite threads, opening several (say 10) to new tabs in quick succession. After the first few, they all returned 403 Forbidden. Seems I got locked out? I'm not from Singapore, if that matters
You definitely triggered the overlimit protection, which is there for both anonymous clients and registered users, from any country. The exact thresholds are looser for registered users, but they still exist. The exact algorithm I am using is new, but if anything, it should be a bit more permissive than the old algorithm, as it is now counting unique requests from a client -- so reloading the same page repeatedly won't trigger the overlimit ban. If it does happen again, it will self-correct after some amount of time -- 10 minutes to an hour, depending on... reasons. |
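Counting unique requests with a ban that expires on its own could look something like this sketch. The class name, the limit of 20, and the flat 10-minute ban are placeholders; the real thresholds and whatever decides between 10 minutes and an hour are not stated in the post.

```python
from collections import deque


class UniqueRequestLimiter:
    """Counts distinct paths per client in a rolling window (a sketch)."""

    def __init__(self, limit=20, window=10.0, ban_seconds=600.0):
        self.limit = limit              # placeholder threshold
        self.window = window            # seconds
        self.ban_seconds = ban_seconds  # placeholder ban length
        self.recent = deque()           # (timestamp, path), one per path
        self.banned_until = 0.0

    def allow(self, path, now):
        if now < self.banned_until:
            return False  # still banned; clears itself once time passes

        # Forget entries that have aged out of the window.
        while self.recent and now - self.recent[0][0] >= self.window:
            self.recent.popleft()

        # A reload of a path already in the window is not counted again.
        if all(seen != path for _, seen in self.recent):
            self.recent.append((now, path))

        if len(self.recent) > self.limit:
            self.banned_until = now + self.ban_seconds
            self.recent.clear()
            return False
        return True


limiter = UniqueRequestLimiter(limit=3, window=10.0, ban_seconds=600.0)
reloads = [limiter.allow("/forum/topic1", now=0.0) for _ in range(50)]
```

Fifty reloads of one page count as a single unique request here, so they never trip the limit.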
|
UTC
quote
jess wrote: Heh. I was watching the logs when it happened. I've already un-blocked that browser/ip combo. You definitely triggered the overlimit protection, which is there for both anonymous clients and registered users, from any country. The exact thresholds are looser for registered users, but they still exist. The exact algorithm I am using is new, but if anything, it should be a bit more permissive than the old algorithm, as it is now counting unique requests from a client -- so reloading the same page repeatedly won't trigger the overlimit ban. If it does happen again, it will self-correct after some amount of time -- 10 minutes to an hour, depending on... reasons.
I typically use my mobile, where I assume this would never be a problem (fingers are slow!). But if I'm on my PC, it does happen that I'll open my Favorites synthetic forum and mouse-click all the ones I haven't read into new tabs. Then go and read through them one at a time. That's how I got to maybe 10 tabs in a few seconds. I've definitely never had the old algorithm lock me out before. (Not a complaint, just an observation.) |
OP
|
UTC
quote
berto wrote: But if I'm on my PC, it does happen that I'll open my Favorites synthetic forum and mouse-click all the ones I haven't read into new tabs. Then go and read through them one at a time. That's how I got to maybe 10 tabs in a few seconds. I've definitely never had the old algorithm lock me out before. (Not a complaint, just an observation.)
The old algorithm was a fixed-window algorithm, meaning it would count requests in a fixed time period -- in this case the window was 10 seconds. But if you were straddling the window -- opening LIMIT/2 in the last few seconds of one window and LIMIT/2+1 more in the next window -- you'd get away with it, as long as you didn't open many more in the remaining seconds of the second window.
The new algorithm uses a sliding window, which means that it is counting how many requests are made in any 10-second period. This means if you make LIMIT+1 requests within 10 seconds, you trigger the overlimit protection regardless of where the window boundaries fall.
In your case, you were clicking on "jump to first unread post" links. From your perspective, each click is one request. From the server's perspective, though, each of those is two requests, since the first request actually looks up the current status of the thread and then forwards you to the appropriate place in a second operation.
Which is all to say that I should probably revisit the specific thresholds I have set. I really don't want to have registered users triggering the overlimit protection unless it's egregious.
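The difference between the two algorithms is easy to see in a few lines. This is a sketch under assumptions: `LIMIT = 10` is a made-up threshold, and both functions work on a list of request timestamps in seconds rather than on live traffic.

```python
from collections import deque

LIMIT = 10  # hypothetical threshold; the real value is not stated


def fixed_window_trips(timestamps, limit, window=10.0):
    """Old style: count per fixed bucket [0, 10), [10, 20), and so on."""
    counts = {}
    for t in timestamps:
        bucket = int(t // window)
        counts[bucket] = counts.get(bucket, 0) + 1
        if counts[bucket] > limit:
            return True
    return False


def sliding_window_trips(timestamps, limit, window=10.0):
    """New style: count requests in the 10 seconds before each request."""
    recent = deque()
    for t in sorted(timestamps):
        recent.append(t)
        while t - recent[0] >= window:
            recent.popleft()
        if len(recent) > limit:
            return True
    return False


# LIMIT/2 requests just before the boundary at t=10, LIMIT/2+1 just after.
burst = [8.0, 8.2, 8.4, 8.6, 8.8, 10.0, 10.2, 10.4, 10.6, 10.8, 11.0]

old_result = fixed_window_trips(burst, LIMIT)    # False: 5 and 6 per bucket
new_result = sliding_window_trips(burst, LIMIT)  # True: 11 within 3 seconds
```

With the redirect described above, six quick clicks on "jump to first unread post" links are twelve requests from the server's side, which is how a modest burst of tabs can reach a limit sooner than expected.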
|
Ossessionato
1979 P150X, 1983 P200E, 1987 PK125XL Elestart, 1988 T5, 1995 PX200E, 2011 Yamaha Fazer 600 S2
Joined: UTC
Posts: 4495 Location: Veria, Greece |
UTC
quote
I think it was working before, but touching now on a pic will no longer get it in full screen on my iPad & iPhone (using Safari)…
|
OP
|
UTC
quote
SaFiS wrote: I think it was working before, but touching now on a pic will no longer get it in full screen on my iPad & iPhone (using Safari)… I don't have a good compromise here. |
Ossessionato
1979 P150X, 1983 P200E, 1987 PK125XL Elestart, 1988 T5, 1995 PX200E, 2011 Yamaha Fazer 600 S2
Joined: UTC
Posts: 4495 Location: Veria, Greece |
UTC
quote
jess wrote: Someone threw a fit back on page 14, and now that feature is disabled on mobile. It still works on desktop systems, unless said desktop system is touch screen, in which case it doesn't. I don't have a good compromise here. |
OP
|
Ossessionato
1979 P150X, 1983 P200E, 1987 PK125XL Elestart, 1988 T5, 1995 PX200E, 2011 Yamaha Fazer 600 S2
Joined: UTC
Posts: 4495 Location: Veria, Greece |
UTC
quote
jess wrote: Yeah, I tend to agree. |
OP
|
UTC
quote
SaFiS wrote: Could it maybe be done as OS specific instead of touch - no touch?? I concede that pinch-zoom, where available, does give more flexibility than tap-to-enlarge, which is why I've made tap-to-enlarge contingent on the lack of touch. |
OP
|
UTC
quote
besupa wrote: If you don't mind me asking, since you're using CloudFront elsewhere, could that be used to deal with some of this? It looks like the live site is separate from the assets (static), but a strategy for database load might be to join them a little. I think either an Apache-first or CF-first could work. It wouldn't really be fixing the problem directly, but may be piling sandbags more efficiently. I've got a test instance running behind CF right now (which I'm using to post this message) and it is working better than I expected it to. Thanks for mentioning it, I had dismissed it and you got me to reconsider it. |
Hooked
GTS 300 HPE (2020); V-Strom 650 XT (2019)
Joined: UTC
Posts: 184 Location: SF Bay Area, California |
UTC
quote
jess wrote: Thanks for mentioning it, I had dismissed it and you got me to reconsider it. Happy hunting out there--it looks like an uptick in "automated traffic" recently and the scramble to get snapshots of all the web for various training is probably just going to make it worse. |
OP
|
UTC
quote
The bots are annoying me. It's not just that they are constantly trying to scrape the site, though -- it's that they are so incredibly bad at it that I am offended by their incompetence at what is actually a very straightforward task.
I've taken to taunting them in the error messages out of sheer boredom and disgust. I know that there is approximately a 0% chance of a human ever seeing this message. Still, it vents some of my anger at these idiots just to type the message.
|