some utf8 problems and solutions

This week, I was tackling some character set issues in our CMS.

The biggest problem I had was that files uploaded and imported into MySQL were losing important characters through translation. The files included characters such as €, á, Á, etc.

The problem seemed simple at first – up until very recently, most servers were set up by default to use the Latin1 character set (ISO-8859-1). This character set is ideal for American use, and closely matches their ASCII character set.

ISO-8859-1 is a superset of ASCII, with western European characters added. This includes the accented characters such as á, é, í, etc, which are used in the Irish language.

Trouble starts, though, when you start considering more recent characters such as the Euro symbol, €.

The Euro, and other characters, were added to the Latin character set in ISO-8859-15. This involved removing some other characters, though.

The ISO-8859 character sets are all 8-bit sets, which means that each one can hold only 256 characters at the most. This might be okay when you only consider one language, but when you go beyond one language, you need to either start using other character sets, or consider using UTF8 or UTF16.
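To make the 256-character limit concrete, here is a small sketch (my own illustration, assuming PHP's iconv extension is available) showing that the same byte means different things in different 8-bit sets, and that UTF8 sidesteps the clash by spending more bytes per character:

```php
<?php
// The single byte 0xA4 is the generic currency sign in ISO-8859-1,
// but the Euro sign in ISO-8859-15.
$byte = "\xA4";

$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);   // currency sign
$asLatin9 = iconv('ISO-8859-15', 'UTF-8', $byte);  // Euro sign

// An accented letter: á is the byte 0xE1 in both of those sets.
$aAcute = iconv('ISO-8859-1', 'UTF-8', "\xE1");

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac
echo bin2hex($aAcute), "\n";   // c3a1
echo strlen($asLatin9), "\n";  // 3 (the Euro takes three bytes in UTF8)
```

If a string is stored as one set and read back as another, this is exactly the kind of silent substitution you get.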

Using many different character sets would be a nightmare undertaking, so it is easier to simply use UTF8 for everything.

This sounds simple, but there is unfortunately a lot of work needed in quite a few areas before UTF8 can be easily used for all purposes.

For example, I mentioned the trouble I had importing files into MySQL. This was happening because the default character set used by PHP’s MySQL binding is Latin1. You can see this by echoing the output of mysql_client_encoding() on any RPM-installed PHP.

This encoding is just one point in a chain of string handlers, though. A string must pass from PHP itself into the MySQL binding, and through it to the MySQL server. Each of these steps can have its own default character set. You can see how many different settings are involved by opening your console, entering a MySQL client, and typing show variables like "c%";. The “character” and “collation” values are all character set settings that are currently active.
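The same check can be done from PHP, which is handy because it shows the settings for the connection your script actually uses. This is only a sketch: the function name charset_variables() is made up for illustration, and it assumes you already have an open mysqli connection.

```php
<?php
// Hypothetical helper: collect the character set and collation
// variables as seen by the given connection.
function charset_variables(mysqli $db): array
{
    $settings = [];
    foreach (['character_set%', 'collation%'] as $pattern) {
        // SHOW VARIABLES returns two columns: Variable_name and Value.
        $result = $db->query("SHOW VARIABLES LIKE '$pattern'");
        while ($row = $result->fetch_assoc()) {
            $settings[$row['Variable_name']] = $row['Value'];
        }
    }
    return $settings;
}
```

Calling print_r(charset_variables($db)) on a connection opened with your own host, user and password should list values such as character_set_client, character_set_connection and character_set_results, which are the ones that matter for the chain described above.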

The manual page for mysql_client_encoding() hints that it is possible to change the connection character set to UTF8, but in practice, it is not simple.

A lot of further work revealed that the default mysql_* functions cannot handle character sets very well. Instead, you must use the mysqli_* functions. I found no simple way around this!

Luckily for me, this was pretty straightforward for my own CMS. I had begun work a long time ago on abstracting the database calls away, so it was a simple matter to change the abstraction (I use Pear::DB) to use ‘mysqli’ instead of ‘mysql’.

To get the ‘mysqli’ bindings to speak in only UTF8, I had to hack the Pear::DB library slightly. Within the connect() function, immediately after the ini_set('track_errors', $ini); line, I added these lines:

@mysqli_query('set names "utf-8"');
@mysqli_query('set character set "utf8"');
@mysqli_query('set character_set_server="utf8"');
@mysqli_query('set collation_connection="utf8_general_ci"');

That was enough to solve the file import problems. Suddenly, everything worked! Please note the difference between “utf-8” and “utf8” in the above – I’m too tired to go digging deeper into finding a single standard way – the above works, and I’m happy enough with that.

A knowledgeable person might notice that the function calls there are not strictly correct. Ideally, they should be similar to @mysqli_query($this->connection,'set character set "utf8"');, but that was causing errors which I haven’t discovered the cause of yet.
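For what it is worth, the mysqli extension also has a dedicated call for this, mysqli_set_charset(), which takes the connection as its first argument. The sketch below is not what I put into Pear::DB. The helper name use_utf8() is made up, and it assumes $link is an already-open mysqli connection. Note that MySQL spells the character set name without a hyphen ("utf8"), and that "utf8mb4" is the name to use if you need characters that take four bytes.

```php
<?php
// Hypothetical helper: switch an open mysqli connection to the given
// character set, and fail loudly instead of hiding errors with @.
function use_utf8(mysqli $link, string $charset = 'utf8'): void
{
    // mysqli_set_charset() returns false if the server rejects the name.
    if (!mysqli_set_charset($link, $charset)) {
        throw new RuntimeException(
            'Could not set character set: ' . mysqli_error($link)
        );
    }
}
```

Dropping the @ is deliberate here: if the name is wrong or the link is missing, you want to see the error rather than silently carry on in Latin1.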

Since I was working away on moving everything to UTF8, I decided to try sending all JavaScript as UTF8 as well.

To send all .js files as UTF8, simply add this line to your root .htaccess file:

AddCharset utf-8 .js

Hehe – funnily, to get a copy of that line (my memory isn’t perfect 😉 ), I ran a search in Google which returned one of my own pages as the top result.

This is fine for strictly JavaScript files. However, there is a very big gotcha when it comes to Ajax and UTF8… if you try to send results back to an IE6 or IE7 (but apparently not IE5) XMLHttpRequest object with UTF8 encoding (for example, using PHP’s header('Content-type: text/plain; Charset=utf-8');), IE will complain with a very confusing warning – “Error: System error: -1072896658” in IE6, or “msxml3.dll error ‘c00ce56e’” in IE7.

Long story short? IE still sucks.

Tired now. Think I’ll bring the kid to bed, then crack open a beer. I hope these notes are of use to someone!

5 Comments.

  1. Thanks, were useful for me.

    UTF-8 is a nightmare!!!!

  2. It isn’t UTF-8 that is the nightmare, it’s Internet Explorer (especially their latest version 7.0 — the beta worked better!)

  3. I got the same problem with the “Error: System error: -1072896658” in IE6 and “error ‘c00ce56e’” in IE7, but my Charset is not utf-8, it is iso-8859-1

  4. Hi Kae, is there any solution to the XMLHTTPRequest/c00ce56e error in IE6/7? I’ve tried converting all the data back to anything but UTF-8 to no avail. How could I work around this?

  5. Ted, I’m sorry to say that the answer does not occur to me. Even though I still work with UTF8 (see my KFM project, for example (http://kfm.verens.com/)), I have never encountered that problem again. I had totally forgotten about it.
