One annoyance with that approach is that a single non-Latin-1 code point is enough to double the size of the String; if it were a UTF-8 String, it would only add a couple of extra bytes.
I think both of those things are true. I'm guessing there are currently only about 1.1M code points defined, and these fit in 4 bytes. However, there are higher, currently-unallocated code points which could occupy the remaining 2 bytes that can be used with UTF-8 (the 5- and 6-byte sequences).
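Quick sanity check in Python (just to show the 4-byte ceiling; illustrative, not tied to any particular String implementation):

    # The highest valid code point, U+10FFFF, still fits in four UTF-8 bytes.
    print(len(chr(0x10FFFF).encode("utf-8")))  # 4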
You're right! My bad - the first 128 characters are encoded the same across ASCII, UTF-8, and Latin-1, but the second half of Latin-1 (0x80-0xFF) differs from UTF-8. So even just having to support those first 256 code points, we jump into multi-byte UTF-8 territory, meaning more complexity than Latin-1.
In UTF-8, these characters are all encoded as two bytes. So this encoding makes the byte count twice as long; it's no more efficient (in bytes) than hex encoding.
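A minimal Python illustration of that doubling for the upper half of Latin-1:

    # U+00E9 is one byte in Latin-1 but two bytes in UTF-8.
    s = "é"
    print(s.encode("latin-1").hex())  # e9    (1 byte)
    print(s.encode("utf-8").hex())    # c3a9  (2 bytes)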
This explanation has little to do with the why. Latin-1 guarantees each character is encoded in a single 8-bit byte, whereas UTF-8 is a variable-width encoding. The point of specifying an encoding is so it is known how to decode it, and a string passed as Latin-1 comes with guarantees about character positions and so on without any parsing.
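To make the positional guarantee concrete, a small Python sketch (toy example, not any particular language's String internals):

    s = "naïve"
    latin1 = s.encode("latin-1")   # 5 bytes, one per character
    utf8 = s.encode("utf-8")       # 6 bytes, the 'ï' takes two

    # Fixed width: byte i *is* character i, no parsing needed.
    print(chr(latin1[2]))          # ï

    # Variable width: byte 2 is only the lead byte of 'ï', so byte offsets
    # and character positions no longer line up without decoding.
    print(hex(utf8[2]))            # 0xc3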
> In non-Latin text, if most characters are 2 bytes but a large minority are 1 byte, the branch prediction in charge of guessing between the different codepoint representation lengths expects 2 bytes and fails very often
You wouldn't want to process a single code point (or unit) at a time anyways, but 16, 32 or 64 code units (or bytes) at once.
That UTF-8 strlen I wrote had no mispredicts, because it was vectorized.
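The counting rule itself is branch-free: a code point starts at every byte that is not a continuation byte, so you just mask and count. A scalar Python sketch of that idea (the real vectorized version does the same classification over 16, 32 or 64 bytes per iteration with SIMD):

    def utf8_codepoint_count(data: bytes) -> int:
        # Continuation bytes look like 10xxxxxx; every other byte starts
        # a new code point, so counting non-continuation bytes counts
        # code points with no per-length branching.
        return sum((b & 0xC0) != 0x80 for b in data)

    print(utf8_codepoint_count("naïve 文字".encode("utf-8")))  # 8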
Indexing is slow, but the difference compared to UTF-16 is not significant.
I guess locale-based comparisons or case-insensitive operations could be slow, but then again, they'll need a slow array lookup anyways.
> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs
As other top-level comments and the article have mentioned, different systems still use different internal representations, but all modern ones commit to being able to represent Unicode code points, with no automatic normalization. To that end, I suppose that for reasoning about the consistency of data across different systems, it is better to count code points than the actual stored size, which is left to the implementation.
A possibly better alternative would be to use lengths in UTF-8 bytes, but that might seem arbitrary to some. Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.
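That lower-bound property is easy to check (Python sketch; lengths are counted in code units, i.e. bytes for UTF-8 and 16-bit units for UTF-16):

    s = "naïve 文字列"
    codepoints = len(s)                            # 9
    utf8_units = len(s.encode("utf-8"))            # 16
    utf16_units = len(s.encode("utf-16-le")) // 2  # 9

    # The code point count never exceeds the code unit count in any
    # standard encoding, so it is a safe lower bound.
    assert codepoints <= utf8_units
    assert codepoints <= utf16_units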
I think the majority of the time you want the number of characters (code points) and not the byte count. In fact I have never wanted the byte count. If I did, I would expect something like byteLen and Len, not the other way around. You should be optimizing the common case, not the exception. Obviously I'm not a language designer, so perhaps I'm talking out of my ass, but I've heard this complaint A LOT.
When was the last time you iterated over a string of Unicode code points and said, "You know what would be handy right now? If these code points were split up into arbitrary and unusable bytes of memory."
UTF-8 was originally designed to handle code points up to 31 bits (using sequences of up to 6 octets). It wasn't until later that the code point range was restricted so that 4 octets would be sufficient.
>>But UTF-8 has a dark side, a single character can take up anywhere between one to six bytes to represent in binary.
>What? No! UTF-8 takes, at most, 4 bytes per code point.
I thought each half of a UTF-16 surrogate pair used 3 bytes in UTF-8, but it turns out that this is an incompatible modification of UTF-8 called CESU-8. http://en.wikipedia.org/wiki/CESU-8
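For the curious, a Python sketch of the difference (Python has no built-in CESU-8 codec, so the surrogate halves are encoded by hand via 'surrogatepass'):

    import struct

    s = "\U0001F600"  # a code point outside the BMP

    # Standard UTF-8: one 4-byte sequence.
    print(len(s.encode("utf-8")))  # 4

    # CESU-8 encodes the two UTF-16 surrogates separately, 3 bytes each.
    hi, lo = struct.unpack(">2H", s.encode("utf-16-be"))
    cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (hi, lo))
    print(len(cesu8))              # 6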
Even this argument isn't great, though, because a lot of the time it doesn't matter. Like, for HTML, the majority of the time the UTF-8 version will be shorter regardless because of all of the single-byte characters used in the markup.
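A toy illustration (hypothetical snippet; the ratio obviously depends on the markup-to-text mix):

    html = '<p class="greeting">Héllo, wörld, schöne Grüße!</p>'
    print(len(html.encode("utf-8")))      # 56  (the markup is all 1-byte ASCII)
    print(len(html.encode("utf-16-le")))  # 102 (every character takes 2 bytes)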
I'm not surprised it doesn't make much of a difference with HTML since the markup contains so much ASCII. But 25 vs 18 kB is almost 40%. That might not be insignificant depending on how much text you're storing.
But it's a nitpick really, I just thought he should have noted some of the disadvantages of UTF-8 as well.