I have been in character set hell, on and off, for a couple months. A big database change I had intended was scuttled largely due to character set issues.
There are a bunch of related issues, but one of them is the very old software we started with, and the other is the relentless march of time. The original forum software (and database) used a simple latin character set (ISO-8859-1) which was completely reasonable at the time. It didn't handle extended unicode, but as the forum is primarily made up of English-speakers posting in English, it wasn't really an issue.
One of the problems that went unnoticed for a very long time, though, was the fact that Windows browsers, being authored largely by crazed chimps hopped up on adderall, started sneaking illegal characters into the forum database that were explicitly NOT part of the ISO-8859-1 character set. They were, instead, part of the almost-but-not-quite-identical CP-1252 character set. Both are 1-byte character sets, meaning all characters are encoded in a single byte. But the Windows CP-1252 character set defines a few extra characters in a range that ISO-8859-1 considers to be off-limits.
Most of the errant characters that made it into the MV database are so-called "smart quotes", by which I mean "@#$% Microsoft". There are a few other characters that show up as well, though: endash, emdash, ellipsis, and the Euro symbol. None of these have exact equivalents in ISO-8859-1, though reasonable substitutions can be made for most.
To be fair, the forum software should probably have rejected the incoming data encoded with CP-1252 characters. Even better would be if, every time a Windows user submitted a post with non-ISO-8859-1 characters, the forum would deliver a mild electric shock through the internet and into the user's hand through the mouse they clicked the "Submit" button with. Even better than that would be if, every time a Windows user submitted a post with non-ISO-8859-1 characters, the forum would cause a medium-sized turd to fall out of the sky onto the head of a random Microsoft programmer somewhere in the world.
But I digress.
After 16 years of accumulated cruft, I now have a fairly sizable mess to clean up. It's taken me a couple months to get around to actually fully understanding the magnitude of the problem, and while I feel like I've now got a pretty good handle on it, there remains the possibility that I haven't even scratched the surface.
I have just put in place a filter on all data coming into the forum that explicitly looks for the subset of CP-1252 characters that do not exist in ISO-8859-1, replacing some of these with close-enough versions. The Euro symbol is the most problematic, as there really aren't any good single-character substitutions in ISO-8859-1. In this case, the Euro symbol will be replaced with the word "Euro".
The long-term goal, though, is to NOT be limited to ISO-8859-1, but instead to allow the forum to accept input encoded as UTF-8. UTF-8 is widely regarded as the One True Character Set to support, as it largely eliminates all of the incompatibility by simply allowing every known character to be used. This means that you could, theoretically, post any character from any known language in the world, and the forum would happily accept it. And then Jim would unceremoniously delete that post on sight.
But I digress, again.
In order to get to the UTF-8 goal, I need the current MV database to be homogeneously ISO-8859-1 prior to conversion. That will mean writing new code to scrub every single post, every single post title, every single signature, and so on. I shudder to think how many usernames have random characters in them that don't conform to ISO-8859-1 -- those users may not be able to log in ever again. Hmmm. That would probably be bad.
I will then probably have to figure out how to re-index the search tables that MV uses for search. Or throw away the current search function and replace it with something better. Fun!
If you've made it this far through the technobabble, congratulations. I will leave you with one of my favorite jokes as a reward:
Quote:
The day that Microsoft makes a product that doesn't suck is the day they start making vacuum cleaners.
