One annoyance with that approach is that a single non-Latin-1 code point is enough to double the size of the String; if it were a UTF-8 String, it would only add a couple of extra bytes.
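
For illustration, a rough Python sketch of that trade-off (the 1000-character string and the Latin-1-or-UTF-16 internal representation are assumptions, not taken from the comment):

    # Hypothetical 1000-character ASCII string plus one non-Latin-1 code point.
    text = "a" * 1000
    assert len(text.encode("latin-1")) == 1000       # 1 byte per character
    assert len(text.encode("utf-8")) == 1000         # identical for ASCII

    text2 = text + "\u20ac"                          # euro sign, not in Latin-1
    print(len(text2.encode("utf-16-le")))            # 2002: the whole string widens to 2 bytes/char
    print(len(text2.encode("utf-8")))                # 1003: only the euro sign costs extra bytes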


> huge data wastage when certain languages are encoded

Is there a language that consistently uses codepoints with more than 2 bytes?

It bothers me a bit that UTF-8 is not an infinitely extendable encoding. But that isn't an important objection either, because the code point space is finite but huge.


I think both of those things are true. I'm guessing there are currently only about 1.1M possible code points, and these fit in 4 bytes. However, there are higher, currently-unallocated values which could occupy the remaining 2 bytes that can be used with UTF-8.

Isn't that the case with UTF-8 as well? A single Latin character is 1 byte, characters from other alphabets are 2, 3, or 4 bytes.
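
A quick Python check of per-character UTF-8 sizes (the sample characters are arbitrary):

    for ch in "a", "é", "अ", "😀":   # ASCII, Latin, Devanagari, emoji
        print(hex(ord(ch)), len(ch.encode("utf-8")))
    # 0x61 1, 0xe9 2, 0x905 3, 0x1f600 4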

in my opinion utf8 should have been a bigger variable length encoding, today it is:

0xxxxxxx

110xxxxx 10xxxxxx

1110xxxx 10xxxxxx 10xxxxxx

and

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

the only reason not to push those last bits and add

111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

and maybe even

11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

is UTF-32; they should have dropped it and solved the code point problem this way.
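
For what it's worth, a small Python sketch of that scheme, generalizing the standard bit patterns through the 5- and 6-byte forms of the original pre-2003 UTF-8 design (the function name is made up, and the proposed 7- and 8-byte forms are omitted):

    def encode_extended_utf8(cp: int) -> bytes:
        # 1-byte form: plain ASCII.
        if cp < 0x80:
            return bytes([cp])
        # (exclusive code point limit, leading-byte marker, continuation byte count)
        for limit, lead, cont in ((0x800, 0xC0, 1), (0x10000, 0xE0, 2),
                                  (0x200000, 0xF0, 3), (0x4000000, 0xF8, 4),
                                  (0x80000000, 0xFC, 5)):
            if cp < limit:
                head = lead | (cp >> (6 * cont))
                tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(cont - 1, -1, -1)]
                return bytes([head] + tail)
        raise ValueError("code point out of range for this sketch")

    # Values inside today's Unicode range come out identical to real UTF-8:
    assert encode_extended_utf8(0xE9) == "é".encode("utf-8")       # C3 A9
    # ...while values above U+10FFFF get the 5- and 6-byte sequences the standard now forbids:
    print(encode_extended_utf8(0x7FFFFFFF).hex())                  # fdbfbfbfbfbf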


You're right! My bad - the first 128 characters are encoded the same across ASCII, UTF-8, and Latin-1, but the second half of Latin-1 differs from UTF-8. So even just supporting those first 256 code points, we jump into multi-byte UTF-8 territory, meaning more complexity than Latin-1.
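
A quick Python illustration of that split (the sample character is arbitrary):

    ch = "é"                           # U+00E9, in the upper half of Latin-1
    print(ch.encode("latin-1"))        # b'\xe9'      -- 1 byte
    print(ch.encode("utf-8"))          # b'\xc3\xa9'  -- 2 bytes
    # The first 128 code points (ASCII) are identical in all three encodings:
    assert "A".encode("latin-1") == "A".encode("utf-8") == b"A"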

In UTF-8, these characters are all encoded as two bytes, so this encoding makes the byte count twice as long; it's no more efficient (in bytes) than hex encoding.

because it'd have an extreme effect on performance?

UTF-8 is great in terms of size, but terrible in terms of speed.


This explanation has little to do with why. Latin-1 guarantees each character is encoded in a single byte, whereas UTF-8 is a variable-width encoding. The point of specifying an encoding is so it is known how to decode it, and a string passed as Latin-1 comes with guarantees about character positions and so on without parsing.
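
A small Python sketch of the position guarantee (the sample string is arbitrary):

    latin1 = "naïve café".encode("latin-1")
    utf8   = "naïve café".encode("utf-8")

    # With Latin-1, character i is always byte i -- no parsing needed:
    assert chr(latin1[2]) == "ï"
    # With UTF-8, code points occupy 1-4 bytes, so you must decode (or scan)
    # before you know where the i-th character starts:
    assert utf8.decode("utf-8")[2] == "ï"
    assert len(latin1) == 10 and len(utf8) == 12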

> In non-Latin text, if most characters are 2 bytes but a large minority are 1 byte, the branch prediction in charge of guessing between the different codepoint representation lengths expects 2 bytes and fails very often

You wouldn't want to process a single code point (or unit) at a time anyways, but 16, 32 or 64 code units (or bytes) at once.

That UTF-8 strlen I wrote had no mispredicts, because it was vectorized.
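
For example, a scalar Python sketch of the general branch-free counting idea (not the vectorized implementation being described; a real version would process 16, 32, or 64 bytes per step):

    def utf8_codepoint_count(data: bytes) -> int:
        # Every code point contributes exactly one byte that is NOT a
        # 0b10xxxxxx continuation byte, so counting those gives the length
        # without branching on each code point's sequence length.
        return sum((b & 0xC0) != 0x80 for b in data)

    assert utf8_codepoint_count("héllo wörld".encode("utf-8")) == 11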

Indexing is slow, but the difference to UTF-16 is not significant.

I guess locale based comparisons or case insensitive operations could be slow, but then again, they'll need a slow array lookup anyways.

Which string operation(s) are you talking about?


> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs

As other top-level comments and the article have mentioned, different systems still use different internal representations, but all modern ones commit to being able to represent Unicode code points, with no automatic normalization. To that end, I suppose that for reasoning about the consistency of data across different systems, it is better to count code points than the actual size, which is left to the implementation.

A possibly better alternative would be to use lengths in UTF-8 bytes, but that might seem arbitrary to some. Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.
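
A quick Python illustration of that lower bound (the sample string is arbitrary):

    s = "a😀"                               # one ASCII character plus one emoji (U+1F600)
    print(len(s))                           # 2 code points
    print(len(s.encode("utf-8")))           # 5 bytes in UTF-8
    print(len(s.encode("utf-16-le")) // 2)  # 3 UTF-16 code units (the emoji is a surrogate pair)
    # The code point count (2) is a lower bound on the length in any of these encodings.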


I think the majority of the time you want the number of characters and not the byte count. In fact, I have never wanted the byte count. If I did, I would expect something like byteLen and Len, not the other way around. You should be optimizing the common case, not the exception. Obviously I'm not a language designer, so perhaps I'm talking out of my ass, but I've heard this complaint A LOT.

+1 agreed. Unicode (UTF-8) kills the old stream of bytes paradigm.

1 to 4 bytes.

How does utf8 make that difficult to answer?

When was the last time you iterated over a string of Unicode code points and said, "You know what would be handy right now? If these code points were split up into arbitrary, unusable bytes of memory."


UTF-8 was originally designed to handle code points up to a full 31 bits. It wasn't until later that the code point range was restricted (to U+10FFFF) so that 4 octets would be sufficient.

>>But UTF-8 has a dark side, a single character can take up anywhere between one to six bytes to represent in binary.

>What? No! UTF-8 takes, at most, 4 bytes per code point.

I thought each half of a UTF-16 surrogate pair was encoded as 3 bytes in UTF-8 (6 bytes total), but it turns out that this is an incompatible modification of UTF-8 called CESU-8. http://en.wikipedia.org/wiki/CESU-8
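
A Python sketch of the difference, building the CESU-8 bytes by hand since Python has no built-in CESU-8 codec (the code is illustrative, not from the article):

    ch = "😀"                               # U+1F600, outside the BMP
    print(ch.encode("utf-8").hex())         # f09f9880 -- 4 bytes in standard UTF-8

    # CESU-8 encodes the two UTF-16 surrogate halves separately, 3 bytes each:
    v = ord(ch) - 0x10000
    hi, lo = 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)
    cesu8 = (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")
    print(cesu8.hex())                      # eda0bdedb880 -- 6 bytes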


Even this argument isn't great, though, because a lot of the time it doesn't matter. Like, for HTML, the majority of the time the UTF-8 version will be shorter regardless because of all of the single-byte characters used in the markup.

I'm not surprised it doesn't make much of a difference with HTML since the markup contains so much ASCII. But 25 vs 18 kB is almost 40%. That might not be insignificant depending on how much text you're storing.

But it's a nitpick really; I just thought he should have noted some of the disadvantages of UTF-8 as well.


UTF-8 is 1-4 bytes per codepoint, dude, not 1-2.

280 * 4 = 1120, not 560.


But not every 8-bit byte string is valid UTF-8, so that could still cause a world of pain.
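
For instance, in Python (the byte values are arbitrary):

    raw = bytes([0xC3, 0x28])     # 0xC3 opens a 2-byte sequence, but 0x28 is not a continuation byte
    print(raw.decode("latin-1"))  # works: every byte value is a valid Latin-1 character
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)                  # invalid continuation byte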