September 9, 2013

The Revenge of MySQL, Episode 1: Extravagant Gymnastics for a Small Data Problem

Episode -1: Background #

Consider thee the Right Single Quotation Mark (the “smart quote” used in words like “it’s”):

’	UTF-8	Latin1
’	`0xE2 0x80 0x99`	`0x92`

Notice that in the “Latin1” encoding (also called ISO-8859-1), the humble ’ is represented by one byte. In UTF-8, it’s represented by 3.

If you’re feeling like you need an Episode -2 to explain what the heck 0xE2 means, I’ve got you covered.

Episode 0: Introduction #

MySQL is partly popular because it’s so easy to get up and running with it.

At least, it makes you think so. But if you don’t know what you’re doing (and you used MySQL, so you probably don’t!), it may be secretly fucking you over, allowing you to live a terrible lie for years before laughing in your face when you notice all your corrupted data.

This is a three-part series documenting my year of encoding hell. I explain a disgustingly common MySQL pitfall, and the creative ways that I (and the teams of which I’ve been a part) worked through it.

I believe I’m writing in an approachable way even for near-beginners, but if you’re feeling totally lost in regards to this “encoding” nonsense, stop dallying and educate yo'self.

Episode 1: Extravagant gymnastics for a small-data problem #

While I was working for a previous employer, we attempted to upgrade from Ruby 1.8 to Ruby 1.9. Everything looked great, the app was running, so we deployed it. But soon! We got reports of things like “Mikeâ€™s” showing up instead of “Mike’s”.

Oh my goodness MySQL does this all the time. And here’s what happened:

Putting stuff into the database: #

Ruby 1.8 is a fairly dumb language, when it comes to character-encoding (I guess other languages are too, like PHP). It doesn’t really know anything about the bytes it’s throwing around.
Now I, being the modern individual I am, use a computer that IS smart about encodings, and thus uses UTF-8 (because UTF-8 can represent basically all the languages and even non-language characters out there, except maybe for some of Japan’s alphabets).
So I pass some UTF-8 bytes to the web app, where Ruby 1.8 takes my bytes and starts storing them in the database. And THIS IS WHERE IT ALL GOES WRONG.
The database is Latin1, because the schmucks that made MySQL thought it best to not bother n00bs with encoding worries, and instead delay all this pain until it’s 1000x worse. The really schmucky thing, they made the default Latin1!! Instead of UTF-8!! Ugh.
Okay, so now, MySQL says to Ruby, “Hey, I’m going to store these characters as Latin1. If they’re UTF-8 or something, I’ll convert them for you. What are they?”
And Ruby 1.8, being the idiot that it is, says “Uhhh… I don’t know, I guess they’re Latin1?”
MySQL says, “Great! I’ll store them as-is!”

BOOM. BOOOOOOM. I just ended up with UTF-8 bytes in a database that thinks they’re Latin1!

Ok, but Ruby 1.8 keeps being idiotic and customers never notice, because all of their computers are in UTF-8, and here’s what happens when…

Taking stuff out of the database: #

MySQL says to Ruby 1.8, “Hey, I’ve got Latin1 characters; if you need them in an intelligent encoding like UTF-8 I can do that for you.”
Ruby 1.8 says “Oh, boy, I really don’t know what the heck you’re talking about. How about you just give me the bytes as-is?”
MySQL obliges, which means Ruby 1.8 ends up with the good UTF-8 bytes it unknowingly stored in the first place.

Enter Ruby 1.9 #

Ruby 1.9 is smart about character encodings. Really freakin’ smart. It knows that the web app prefers UTF-8, so when it goes through the “putting stuff into the database” process, the UTF-8 bytes for ’ (0xE2 0x80 0x99) get properly converted to the Latin1 bytes (0x92).

BUT WHAT ABOUT WHEN OLD ’s COMES BACK OUT?

Exactly. You’re one step ahead of me. Let’s go through the “taking stuff out of the database” flow with Ruby 1.9:

MySQL says to Ruby 1.9, “Hey, I’ve got Latin1 characters; if you need them in an intelligent encoding like UTF-8 I can do that for you.”
Ruby 1.9 says, “As a matter of fact, I do need them converted to UTF-8!”
MySQL obliges, and does a conversion on bytes that are already UTF-8. It says “Let’s see, in Latin1, 0xE2 is ‘â’, 0x80 is ‘€’, and 0x99 is ‘™’.” And you end up with “Mikeâ€™s”!

So then, how does one upgrade to the obviously superior language, Ruby 1.9, without mangling data? #

One other caveat that we didn’t really cover: I am a modern individual, proudly using a computer that defaults to UTF-8. But it’s possible that other users of the website were using some other encoding.

That is, there are more than two encodings. Most of the world, these days, uses UTF-8. But there are still holdouts of older, crappier encodings, more analogous to Latin1.

Fortunately, at the company I worked at, all data on the website is entered in a controlled way by a limited set of people, and we knew that they had always been savvy enough to be using UTF-8. We knew that we had all UTF-8 data in our Latin1 database.

The ideal solution #

It would be really freakin’ cool if one could just reach into the settings of a MySQL database and turn a dial. Like, if there were a dial pointed at Latin1, and you just had to friggin’ turn it to UTF-8. Problem solved!

Perhaps such a happy path exists, but my searches for it turned up empty. Please inform me if you know better.

A damn dirty hack that works decently if you have a small amount of data #

The closest I found to turning-the-dial-to-point-at-Latin1 was this:

Convert each column to the binary encoding (if you just use CONVERT TO on a whole table, MySQL does stupid things)
Convert each column to the UTF-8 encoding

binary is an encoding that treats everything as bytes. When you ORDER BY in a binary database, “Zebra” ends up being before “apple” because “Z” is 0x5A and “a” is 0x61. When you do a database-backed search for “alfred” with a binary database, it will not find “Alfred”.

So converting each column to binary makes MySQL as clueless about what the bytes mean as Ruby 1.8 is. They’re all just bytes. When you get to the next step and convert each column to UTF-8, it therefore just leaves the bytes as-is.

This two-step conversion of each individual column seems like a buttload of gymnastics for something that, in my opinion, ought to be as simple as turning a dial. But it’s honestly the simplest thing I found. If you know of something easier, please share it!

Okay, so doing two conversions on each column manually is obviously out of the question. So I wrote a bash script that does it for you.

One gotcha #

A binary encoding is stupid enough to not know that “alfred” is the same as “Alfred”. Likewise, a UTF-8 encoding is smart enough to know that “Sophia” is the same as “Sophiá”. So your database-backed searches will now work way better for your non-English users! Great!

Yes, great, but when you’re doing this conversion, this newfound byte-intelligence could cause a problem. At the company I worked for, we had a “username” column that we told MySQL needed to be unique for each user. In Latin1, “Sophia” and “Sophiá” are different, so the uniqueness constraint was satisfied. But once we tried to switch to UTF-8, switching the “Sophiá” record to UTF-8 made MySQL notice, for the first time, that two usernames were the same. The uniqueness constrain was violated, and the conversion failed.

Here again, the small amount of data saved the day. We were able to determine that Sophiá had been a one-time visitor who hadn’t used the site in a couple of years. We decided to switch her username to “Sophiá2”, and then everything ran happily.

Again, this whole approach only worked because we had a small amount of data. In Episode 2 next week, I’ll discuss this problem writ large®, as is the case at PipelineDeals.

I hope you either learned a bit about character encodings and MySQL, or are enraged that I dealt with this in such a stupid manner. Either way, let me know in the comments!

Episode 2: PipelineDeals: Corrupted medium-data! Maybe just give up? #

Come back next week to see what happens when the database isn’t particularly small, and when thousands of users the world over enter data in a multitude of encodings!

Episode 3: Problem Child: How to handle bytes that think they’re supposed to look like “Mikeâ€™s” #

I got a database dump from an old database that no longer exists. Problem was, there were UTF-8 bytes in that Latin1 database, which ended up being needlessly transcoded and stored permanently in a file. There’s no longer any good version of the data in existence. What to do?

Kudos