As you may have noticed in our release history, the EXTENDED STRING data type was removed almost a year ago. To be precise, it was more a merge of the STRING and EXTENDED STRING data than the removal of the latter. This means now you can safely use STRING data type for any kind of characters (including emoji! ????) without worrying if the column would support it. In addition, the sorting order for multiple non-Latin languages has been fixed automatically.
Relieving developers from the hassle of two string types has always been one of our main goals, and this is another small step in that direction.
In this post, I’ll explain our reasoning for introducing the somewhat confusing EXTENDED STRING data type earlier and how we managed to finally solve the initial problem with no additional complication for the developers using our platform.
It all began with… emoji
As anyone would do many years later, when choosing a character set for our database, there was no doubt we’d use utf8 . That’s an obvious decision when you’re going to support multiple languages in your database, since the utf8 charset covers the whole Unicode. It was only recently that we found that actually MySQL’s utf8 (which is in effect utf8mb3 — it’s merely an alias) can only support 3 byte characters, while for the full Unicode support you need 4 bytes. Anyway, at that time it didn’t matter much, because almost every possible language character still fits in 3 bytes, therefore we had no issues with that for a long time. It’s usually the really special characters which require 4 bytes to be stored. Then, all of a sudden these really special characters became very widely used. I guess you already figured out what characters I refer to.
Accepting emojis in a Unicode supplementary plane was a secret plan to make sure developers handle non-BMP characters!
— Vincent Isambart (@vincentisambart) February 12, 2014
A few years ago our customers started to experience different emoji-related problems, which related to the impossibility of saving emoji characters to either STRING or TEXT data type columns. Since we understood that emojis wouldn’t be going anywhere anytime soon, we had to deal with it somehow. At the time, a migration from utf8mb3 to utf8mb4 wasn’t considered an option due to the different priorities and the lack of time. The migration still isn’t at all automatic and leaves a chance for some data loss. Also, we haven’t got to investigate this a lot, so we were a bit afraid of the storage overhead we might get by increasing the column size. So we ended up with a quick solution to add a special utf8mb4 -encoded data type, which we called EXTENDED STRING (by the way, there were a few discussions on the naming, including the EMOJI STRING). With a few additional complications in the code, this solution served us well for more than 2 years, though it has always been a bit offending to the eye.
A transparent solution
Initially, this post was only meant to tell a bit about character sets and encodings, and justify the need for the EXTENDED STRING column data type. But as the investigation proceeded, it became apparent that we don’t really need to complicate the choice of your string data column with different data types. It was decided that we can make any string an extended string, since the probably-1-byte per value overhead is pretty much negligible. Fortunately, this could be done by merely changing the default character-set-server and collation-server options in MySQL, so that the new application databases are created with a proper 4-byte charset. And of course a bunch of code clean-ups by our team behind the scenes to remove the EXTENDED STRING data type (with lots of complications attached to its processing) while keeping support for importing the old data exports.
The issue which remains is the migration of the existing databases. Since there is a chance that the data might be lost during automatic migration, we decided we won’t run it on the old apps. So in case you need to convert a string column in your existing app to support extended characters, please contact our support and we’ll happily perform a safe manual migration. In case you already had a column of type EXTENDED_STRING, then it will continue to work as expected with extended characters.