No. It’s not «just» strings

On Hacker News, I came across this rant about strings in Ruby 1.9, where a developer was complaining about the new string handling in Ruby. Now, I’m no Ruby developer by even a long shot, but I am really interested in strings and string encoding, which is why I posted the following comment, reprinted here because it’s too big to just be a comment:

Rants about strings and character sets that contain words in the following spirit are usually neither correct nor worthy of any further thought:

It’s a +String+ for crying out loud! What other language requires you to understand this
level of complexity just to work with strings?!

Clearly the author lives in an ivory tower of English-language environments where he is able to put the word “just” right next to “strings”, and he can probably also say that he “switched to UTF-8” without actually having done so, because the parts of UTF-8 he uses work exactly the same as the ASCII he used before.

But the rest of the world works differently.

Data can appear in all kinds of encodings and may need to end up in other encodings. Some of those conversions are possible; others aren’t.

Some Japanese encodings (Ruby’s creator is Japanese), for example, can’t be converted to a Unicode representation.

Nowadays, as a programming language, you have three options for handling strings:

1) pretend they are bytes.

This is what older languages have done and what Ruby 1.8 does. This of course means that your application has to keep track of encodings: for every string you keep in your application, you also need to keep track of what it is encoded in. When concatenating a string in encoding A with a string you already have in encoding B, you must do the conversion manually.

Additionally, because strings are bytes and the programming language doesn’t care about encoding, you basically can’t use any of the built-in string handling routines, because they assume that each byte represents one character.
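To make this concrete, here is a small sketch (in Python, with made-up variable names) of what method 1 forces the application to do:

# Method 1: strings are just bytes; the application itself must track encodings.
latin1_name = "gnügg".encode("iso-8859-1")   # b'gn\xfcgg' - we must remember: Latin-1
utf8_text = "gnügg".encode("utf-8")          # b'gn\xc3\xbcgg' - we must remember: UTF-8

# Byte-oriented routines assume one byte per character and miscount:
print(len(utf8_text))                        # 6 bytes, but only 5 characters

# Concatenating without manual conversion silently mixes encodings:
broken = latin1_name + utf8_text             # garbage; no error is raised

# The conversion the application is responsible for doing by hand:
fixed = latin1_name.decode("iso-8859-1").encode("utf-8") + utf8_text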

Of course, if you are one of those lucky English UTF-8 users, getting data in ASCII and English text in UTF-8, you can easily “switch” your application to UTF-8 while still treating strings as bytes because, well, they are. For all intents and purposes, your UTF-8 is just ASCII called UTF-8.

This is what the author of the linked post wanted.

2) use an internal Unicode representation

This is what Python 3 does and what I feel to be a very elegant solution if it works for you: A String is just a collection of Unicode code points. Strings don’t worry about encoding. String operations don’t worry about it. Only I/O worries about encoding. So whenever you get data from the outside, you need to know what encoding it is in and then you decode it to convert it to a string. Conversely, whenever you want to actually output one of these strings, you need to know in what encoding you need the data and then encode that sequence of Unicode code points to any of these encodings.
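In Python 3, that boundary looks roughly like this sketch (the file names and encodings are just placeholders):

# Method 2: bytes at the I/O boundary, Unicode code points inside.
with open("input.txt", "rb") as f:
    raw = f.read()                      # bytes from the outside world
text = raw.decode("iso-8859-1")         # explicit decode: you must know the source encoding

result = text.upper()                   # string operations never see encodings

with open("output.txt", "wb") as f:
    f.write(result.encode("utf-8"))     # explicit encode on the way out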

You will never be able to convert a bunch of bytes into a string or vice versa without going through some explicit encoding/decoding.
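Python 3 enforces this strictly; mixing the two types is an error (the exact message varies a bit between versions):

>>> "abc" + b"def"
TypeError: can only concatenate str (not "bytes") to str
>>> "abc" + b"def".decode("utf-8")     # explicit decoding is the only way
'abcdef'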

This of course has some overhead: you always have to do the conversion, and operations on that internal sequence of Unicode code points might be slower than the simple array-of-bytes approach, especially if the internal representation uses some kind of variable-length encoding (which it probably does, to save memory).

Interestingly, whenever you receive data in an encoding that cannot be represented with Unicode code points and need to send that data back out in the same encoding, you are screwed.

This is a deficiency in the Unicode standard. Unicode was specifically made so that it could represent every encoding, but it turns out that it can’t correctly represent some Japanese encodings.

3) store an encoding with each string and expose both the string’s contents and the encoding to your users

This is what Ruby 1.9 does. It combines methods 1 and 2: it allows you to choose whatever internal encoding you need, it allows you to convert from one encoding to another, and it removes the need to externally keep track of every string’s encoding because it does that for you. It also makes sure that you don’t intermix encodings, but I’m getting ahead of myself.

You can still use the language’s string library functions because they are aware of the encoding and usually do the right thing (minus, of course, bugs).
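To make the bookkeeping concrete, here is a toy model of method 3 in Python; the class and its details are made up for illustration and have nothing to do with Ruby’s actual implementation:

# A toy sketch of method 3: every string carries its own encoding tag.
class TaggedString:
    def __init__(self, data: bytes, encoding: str):
        self.data = data          # raw bytes, kept in their original encoding
        self.encoding = encoding  # the bookkeeping the language does for you

    def __len__(self):
        # operations consult the tag: length in characters, not bytes
        return len(self.data.decode(self.encoding))

    def __add__(self, other):
        if self.encoding != other.encoding:
            # roughly the behavior the rant complains about: refuse to mix
            raise ValueError(f"incompatible encodings: {self.encoding} vs {other.encoding}")
        return TaggedString(self.data + other.data, self.encoding)

latin1 = TaggedString("gnügg".encode("iso-8859-1"), "iso-8859-1")
utf8 = TaggedString("gnügg".encode("utf-8"), "utf-8")
print(len(latin1), len(utf8))   # 5 5 - both count characters, not bytes
latin1 + utf8                   # raises ValueError instead of corrupting data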

As this method is independent of the (broken?) Unicode standard, you never get into the situation where merely reading data in some encoding makes you unable to write the same data back out in that encoding: you would just create a string tagged with the problematic encoding and do your stuff on that.

Nothing prevents the author of the linked post from using Ruby 1.9’s facilities to do exactly what Python 3 does (again ignoring the Unicode issue) by internally keeping all strings in, say, UTF-16 (you can’t keep strings in “Unicode” – Unicode is not an encoding – but that’s for another post). You would transcode all incoming and outgoing data to and from that encoding, and do all string operations on that application-internal representation.

A language throwing an exception when you concatenate a Latin-1 string with a UTF-8 string is a good thing! You see: once such a concatenation has happened by accident, it’s really hard to detect and fix.

At least it’s fixable, though, because not every Latin-1 string is also a valid UTF-8 string. But if you happen to concatenate, say, Latin-1 and Latin-8 by accident, then you are really screwed: there’s no way to find out where the Latin-1 ends and the Latin-8 begins, as every valid Latin-1 string is also a valid Latin-8 string. Both are arrays of bytes with values between 0 and 255 (minus some holes).
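A quick demonstration of why the one mix-up is detectable and the other isn’t (again in Python, purely for illustration):

# An accidental Latin-1 + UTF-8 mix is at least detectable:
mixed = "gn".encode("utf-8") + "ü".encode("iso-8859-1") + "gg".encode("utf-8")
mixed.decode("utf-8")              # raises UnicodeDecodeError: 0xfc is invalid UTF-8

# A Latin-1 + Latin-8 mix, by contrast, decodes without error under either
# encoding, so nothing can ever tell you where one ends and the other begins:
b"\xfc\xa1".decode("iso-8859-1")   # decodes "successfully"
b"\xfc\xa1".decode("iso-8859-14")  # also decodes, to different characters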

In today’s small world, you want that exception to be thrown.

In conclusion, what I find really amazing about this complicated problem of character encoding is that nobody feels it’s complicated, because it usually just works – especially method 1 described above, which has been in constant use for years and is very convenient to work with.

Also, it still works.

Until your application leaves your country and gets used in countries where people don’t speak ASCII (or Latin1). Then all these interesting problems arise.

Until then, you are annoyed by every method I described except method 1.

Then you will understand what a great service Python 3 has done you, and you’ll switch to Python 3, which has very clear rules and seems to work for you.

And then you’ll have to deal with the Japanese encoding problem: you’ll have to use raw bytes all over the place and stop using strings altogether, because just reading the input data destroys it.

And then you might finally see the light and begin to care for the seemingly complicated method 3.


(Unicode-)String handling done right

Today, I found myself reading the chapter about strings on diveintopython3.org.

Now, I’m no Python programmer by any means. Sure. I know my share of Python and I really like many of the concepts behind the language. I have even written some smaller scripts in Python, but it’s not my day-to-day language.

That chapter about string handling really really impressed me though.

In my opinion, handling Unicode strings the way Python 3 does is exactly how it should be done in every development environment: keep strings and collections of bytes completely separate, and provide explicit conversion functions to go from one to the other.

And hide the actual implementation from the user of the language! A string is a collection of characters. I don’t have to care how these characters are stored in memory or how they are accessed. When I need that information, I will have to convert that string to a collection of bytes, explicitly specifying the encoding to use for that conversion.
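A short illustration of that separation in Python 3:

s = "naïve"                    # a str: a sequence of characters
print(len(s))                  # 5 - characters, regardless of internal storage

print(s.encode("utf-8"))       # b'na\xc3\xafve' - explicit conversion, 6 bytes
print(s.encode("utf-16"))      # entirely different bytes, same characters
print(s.encode("iso-8859-1"))  # b'na\xefve' - yet another byte representation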

This is exactly how it should work, but in every other environment I know of, implementation details leak into the language and muddle this up, making it a real pain to deal with multibyte character sets.

Features like this are what convince me to look into new stuff. Maybe it IS time to do more Python after all.

Failing silently is bad

Today, I experienced the perfect example of why I prefer PostgreSQL (congratulations on a successful 8.3 release today, guys!) over MySQL.

Let me first give you some code, before we discuss it (assume that the data which gets placed in the database is – wrongly so – in ISO-8859-1):

This is what PostgreSQL does:

bench ~ > createdb -Upilif -E utf-8 pilif
CREATE DATABASE
bench ~ > psql -Upilif
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

pilif=> create table test (blah varchar(20) not null default '');
CREATE TABLE
pilif=> insert into test values ('gnügg');
ERROR:  invalid byte sequence for encoding "UTF8": 0xfc676727293b
pilif=>

and this is what MySQL does:

bench ~ > mysql test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 97
Server version: 5.0.44-log Gentoo Linux mysql-5.0.44-r2

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table test( blah varchar(20) not null default '')
    -> charset=utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into test values ('gnügg');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from test;
+------+
| blah |
+------+
| gn   |
+------+
1 row in set (0.00 sec)

mysql>

Obviously it is wrong to try to place latin1-encoded data in a utf-8 formatted data store: while every valid utf-8 byte sequence is a valid latin1 byte sequence (latin1 does not restrict the validity of bytes, though some positions may be undefined), the reverse certainly is not true. The character ü from my example is 0xfc in latin1 and U+00FC in Unicode, which must be encoded as 0xc3 0xbc in utf-8. 0xfc alone is not a valid utf-8 byte sequence.
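You can verify these byte sequences with a few lines of Python:

print("ü".encode("iso-8859-1"))  # b'\xfc' - one byte in latin1
print("ü".encode("utf-8"))       # b'\xc3\xbc' - two bytes in utf-8

# And 0xfc on its own is rejected, just as PostgreSQL rejects it above:
b"gn\xfcgg".decode("utf-8")      # raises UnicodeDecodeError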

So if you pass this invalid sequence to any entity accepting a utf-8 encoded byte stream, it will not be clear what to do with that data. It’s not utf-8, that’s for sure. But if no character set is specified along with the stream, it’s impossible to guess what the byte sequence should be translated into.

So PostgreSQL sees the error and bails out (this happens when both the server and the client are set to utf-8 encoding and the data arrives in a non-utf-8 format; otherwise it knows how to convert the data, since conversion from any character set to utf-8 is always possible). MySQL, on the other hand, decides to fail silently and tries to fix up the invalid input.

Now, while I could maybe live with the default of assuming latin1 encoding, silently truncating the data at the first invalid byte, without any warning whatsoever, leads to undetected loss of data!

What if I’m not just entering one word? What if it’s a blog entry like this one? What if the entry is made by a non-tech-savvy user? Remember: this mistake is easy to make: wrong Content-Type headers, old browsers, broken browsers… it’s very easy to end up with Latin1 when you want utf-8.

While I agree that sanitization must be done in the application tier (preferably in the model), it’s unacceptable for a database to store data different from what it was told to store without warning the user in any way. This easily leads to data loss or data corruption.
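Here is a sketch of the kind of check the application tier could do before handing data to the database (the function name and fallback policy are made up for illustration):

def ensure_utf8(raw: bytes) -> str:
    """Validate incoming bytes as utf-8 instead of blindly trusting the client."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Policy decision: reject, log, or assume latin1 - but never
        # silently drop data the way MySQL does above.
        return raw.decode("iso-8859-1")

print(ensure_utf8("gnügg".encode("utf-8")))       # gnügg
print(ensure_utf8("gnügg".encode("iso-8859-1")))  # gnügg, via the fallback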

There are many more little things like this where MySQL silently fails while PostgreSQL (and any other database) correctly bails out. As a novice, this can feel tedious. It can feel like PostgreSQL is pedantic and like you are faster with MySQL. But let’s be honest: what do you prefer? An error message, or lost data with no way of knowing that it’s lost?

This, by the way, is the outcome of a lengthy debugging session on a Typo3 installation, which is also, though not ultimately, to blame here. In a perfect world, MySQL would bail out, but Typo3 would also do any of the following:

  • Not specify charset=utf8 when creating the table unless specifically asked to.
  • Send a charset=utf-8 HTTP header, knowing that the database has been created to contain utf-8
  • Sanitize user input before handing it over to the MySQL backend – something that is obviously broken in this instance.

Now back to debugging real software on real databases *grin*