jess wrote:
It's done. Well, the bulk of it, anyway.
Narrator: it wasn't actually done.
I have been chipping away at this project, on and off, attempting to convert the entire MV database to UTF8 (utf8mb4, to be precise, Oliver) and have made some headway. Turns out that the problem wasn't limited to CP-1252 and Unicode: as I've recently discovered, I had some low-ASCII control characters creep in as well. No idea how, but there were a few posts in particular (that seemed to be copied from Piaggio press releases) where the bulk of them came from.
The really nasty part is that some of these control characters ended up in database index keys. When I tried to convert the database to UTF8, the conversion would strip out the control characters, and keys that had differed only by those characters would then collide as duplicates. This seems to be the central issue that I've been hitting all along whenever I've tried to export and then re-import the database.
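To make the duplicate-key problem concrete, here is a minimal sketch in Python. It is not the actual MV code: the function name, the sample keys, and the exact set of characters the conversion strips are all assumptions for illustration. It groups keys by what they would look like after stripping and reports any group with more than one member.

```python
import re
from collections import defaultdict

# Assumption: the conversion drops C0 control characters other than
# Tab (0x09), LF (0x0A), and CR (0x0D). The real conversion may differ.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_key_collisions(keys):
    """Return {stripped_key: [original keys]} for keys that would collide.

    Two keys collide when they differ only by control characters, so
    they become identical once those characters are removed.
    """
    groups = defaultdict(list)
    for key in keys:
        groups[CONTROL_CHARS.sub("", key)].append(key)
    # Keep only the groups where stripping produced a duplicate.
    return {clean: originals
            for clean, originals in groups.items()
            if len(originals) > 1}


if __name__ == "__main__":
    # Hypothetical index keys; the second one hides a 0x01 byte.
    sample_keys = ["vespa", "ves\x01pa", "piaggio"]
    for clean, originals in find_key_collisions(sample_keys).items():
        print(repr(clean), "<-", [repr(k) for k in originals])
```

Running a check like this against a copy of the index before converting would show which rows need to be merged or renamed first, rather than finding out from a duplicate-key error halfway through the import.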
I've put a new filter in place to keep any more low-ASCII control characters from getting in (besides the few that are legit -- CR/LF/Tab) and I've written some new routines to scan the database for any remaining occurrences. I think I've finally gotten all of them now.
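For anyone curious what a filter and scan along these lines might look like, here is a rough sketch in Python. It is an illustration only, not the forum's actual code: the function names and the (row_id, text) row shape are made up, and it assumes "low-ASCII control characters" means bytes 0x00 through 0x1F with Tab, LF, and CR allowed through.

```python
import re

# Assumption: everything in 0x00-0x1F is unwanted except
# Tab (0x09), LF (0x0A), and CR (0x0D).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Input filter: remove disallowed control characters from new text."""
    return CONTROL_CHARS.sub("", text)


def scan_rows(rows):
    """Scan existing data for disallowed control characters.

    `rows` is any iterable of (row_id, text) pairs (a hypothetical
    shape; a real scan would read these from the database). Yields
    (row_id, offsets) for each row that contains at least one hit.
    """
    for row_id, text in rows:
        offsets = [m.start() for m in CONTROL_CHARS.finditer(text)]
        if offsets:
            yield row_id, offsets


if __name__ == "__main__":
    print(repr(strip_control_chars("press\x00 release\r\n")))
    for row_id, offsets in scan_rows([(1, "clean"), (2, "x\x07y\x1b")]):
        print(row_id, offsets)
```

The filter handles new posts on the way in, while the scan only reports where the old offenders are, so each one can be looked at before anything is changed.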
Narrator: ...
Shut up! @#$% narrator!
Anywho, I am currently doing another test conversion of a copy of the MV database. It is crunching through the posts themselves, by far the largest single portion of the database. I haven't hit the historically problematic tables yet (those would be the search index tables) but I am feeling fairly confident at this point.
Narrator: He...
SHUT UP!
Okay, where was I? Yeah. Test conversion. We'll see how it goes.