
We can make it more fair by pricing UTF-8 bytes.

你好吗 = 9 bytes

how are you = 11 bytes
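For concreteness, a rough Python 3 sketch of what pricing by UTF-8 bytes looks like (the Chinese phrase is just an illustrative 9-byte greeting):

    # Price by UTF-8 bytes rather than by code points.
    for text in ("how are you", "你好吗"):
        n_chars = len(text)                      # code points
        n_bytes = len(text.encode("utf-8"))      # what you'd actually bill
        print(f"{text!r}: {n_chars} code points, {n_bytes} UTF-8 bytes")

    # 'how are you' -> 11 code points, 11 bytes
    # '你好吗'      -> 3 code points, 9 bytes (3 bytes per CJK character)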




Bytes are understandable but make no sense from a business point of view. If you submit the same simple query with UTF-8 and UTF-32, the latter will cost 4x as much.
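A quick sanity check of that 4x figure (the query string is made up):

    # Same ASCII-only query, three encodings. Python's "utf-32" codec would
    # prepend a 4-byte BOM, so the little-endian variants keep the math clean.
    query = "SELECT * FROM users WHERE id = 42"
    print(len(query.encode("utf-8")))      # 33 bytes
    print(len(query.encode("utf-16-le")))  # 66 bytes
    print(len(query.encode("utf-32-le")))  # 132 bytes: exactly 4x the UTF-8 size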

And I find it fair that American characters require 2 bytes in Java, like everybody else's, not 1 as in UTF-8! :)

because it'd have an extreme effect on performance?

UTF-8 is great in terms of size, but terrible in terms of speed.


UTF-8 is 1-4 bytes per codepoint dude, not 1-2.

280 * 4 = 1120, not 560.


I think the majority of the time you want the number of characters in a UTF-8 string, not the byte count. In fact, I have never wanted the byte count. If I did, I would expect something like byteLen and Len, not the other way around. You should be optimizing for the common case, not the exception. Obviously I'm not a language designer, so perhaps I'm talking out of my ass, but I've heard this complaint A LOT.
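Python 3, for example, made exactly that call: len() counts code points, and the byte count is the explicit extra step (small illustrative sketch):

    s = "naïve café"
    print(len(s))                   # 10 code points
    print(len(s.encode("utf-8")))   # 12 bytes: ï and é take 2 bytes each

(Whether "characters" should really mean code points or grapheme clusters is a separate argument.)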

Also: UTF-8 instead of UTF-16 can make a surprising difference on memory use, especially if you interact with services in UTF-8 and so spend a lot of memory in Java copying back and forth.
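A rough illustration of the size side of that, with a made-up, mostly-ASCII payload (Java specifics like compact strings aside):

    # The UTF-16 form a pre-Java-9 String holds in memory is close to double the
    # UTF-8 wire form, before counting the extra copies made while transcoding.
    payload = '{"user": "müller", "message": "hello world"}' * 1000
    print(len(payload.encode("utf-8")))       # 45000 bytes
    print(len(payload.encode("utf-16-le")))   # 88000 bytes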

ASCII and UTF-8 are too US-centric. That's why adoption in places like China is so low.

Also, if we're going to have a variable-length encoding anyway, why can't we do it properly and improve the size for the same computational cost?


> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

For tons of questions, both can be correct.
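One concrete case (my example, not the article's): splitting on an ASCII delimiter gives the same answer either way, because UTF-8 never reuses ASCII byte values inside a multi-byte sequence.

    line = "名前,age,ciudad"
    as_text  = line.split(",")
    as_bytes = [f.decode("utf-8") for f in line.encode("utf-8").split(b",")]
    print(as_text == as_bytes)   # True: the byte split can't cut a character in half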


1 to 4 bytes.

How does utf8 make that difficult to answer?

When was the last time you iterated over a string of Unicode code points and said: you know what would be handy right now? If these code points were split up into arbitrary, unusable bytes of memory.


> In theory yes, in practice no.

That's like "in theory we need 4 bytes to represent Unicode, but in practice 3 bytes is fine" (glances at universally-maligned utf8mb3)


Dear databases, please don't get hung up on string lengths when dealing with UTF-8.

If I ask for a UTF-8 string with a max-length of 100, please don't apply the worst-case scenario and allocate space for 100 emojis. Please give me a box of 100 bytes and allow me to write any UTF-8 string that can fit into 100 bytes in there.

100 ASCII characters. 25 four-byte emoji. Any mixture of the two.

If I ask for UTF-8, it'll be because I'd like to take advantage of UTF-8 and I accept the costs. If that means I can't quickly jump to the character at index 84 in a string, no problem, I've accepted the trade-off.
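Something like this, in Python terms (the helper names are hypothetical, not any database's actual API):

    def fits_in_bytes(s: str, max_bytes: int = 100) -> bool:
        """True if the UTF-8 encoding of s fits in max_bytes."""
        return len(s.encode("utf-8")) <= max_bytes

    def truncate_to_bytes(s: str, max_bytes: int = 100) -> str:
        """Cut s so its UTF-8 form is at most max_bytes, never mid-code-point."""
        encoded = s.encode("utf-8")[:max_bytes]
        # Slicing may leave a partial sequence at the very end; drop it.
        return encoded.decode("utf-8", errors="ignore")

    print(fits_in_bytes("a" * 100))   # True: 100 ASCII chars, 100 bytes
    print(fits_in_bytes("🦀" * 25))   # True: 25 four-byte emoji, 100 bytes
    print(fits_in_bytes("🦀" * 26))   # False: 104 bytes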


One annoyance with that approach is that a single non-Latin-1 code point is enough to double the size of the String; if it were a UTF-8 String, it would only add a couple of extra bytes.

Even this argument isn't great, though, because a lot of the time it doesn't matter. Like, for HTML, the majority of the time the UTF-8 version will be shorter regardless because of all of the single-byte characters used in the markup.
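The same effect is easy to see in Python, whose str uses a comparable Latin-1 / UCS-2 / UCS-4 compaction (illustrative only; the exact getsizeof overheads vary by version):

    import sys

    ascii_only = "x" * 1000
    with_euro  = ascii_only + "€"              # one non-Latin-1 code point (U+20AC)

    print(sys.getsizeof(ascii_only))           # ~1 KB: one byte per code point
    print(sys.getsizeof(with_euro))            # ~2 KB: the whole string got wider
    print(len(ascii_only.encode("utf-8")))     # 1000
    print(len(with_euro.encode("utf-8")))      # 1003: UTF-8 only pays for the €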

I got the impression that was a performance trade-off. UTF-8 decoding/encoding isn't free.

But not every 8-bit byte string is valid UTF-8 so that could still cause a world of pain.
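In Python terms (made-up bytes), "just treat the bytes as UTF-8" forces an explicit error policy somewhere:

    raw = b"valid ascii \xff\xfe then more"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)   # 'utf-8' codec can't decode byte 0xff in position 12: ...

    # If you must press on, you have to pick a lossy policy explicitly:
    print(raw.decode("utf-8", errors="replace"))   # valid ascii �� then more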

> It is a pain in the ass to have a variable number of bytes per char.

Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.


"But UTF-8 has a dark side, a single character can take up anywhere between one to six bytes to represent in binary."

What? No! UTF-8 takes at most 4 bytes per code point. The old 5- and 6-byte forms were dropped by RFC 3629, because Unicode itself stops at U+10FFFF.
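Quick check in Python:

    # The largest code point Unicode defines, U+10FFFF, still fits in 4 bytes.
    print(len("\U0010FFFF".encode("utf-8")))    # 4
    print("\U0010FFFF".encode("utf-8").hex())   # f48fbfbf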

"But UTF-8 isn't very efficient at storing Asian symbols, taking a whole three bytes. The eastern masses revolted at the prospect of having to buy bigger hard drives and made their own encodings."

Many Asian users object to UTF-8/Unicode because of Han unification, and because many characters supported in other character sets are not present in Unicode. The size of the binary encoding has little to do with it -- and for what it's worth, most common East Asian characters take 3 bytes in UTF-8 versus 2 in UTF-16 or the legacy double-byte encodings, so the difference is one byte per character, not a new hard drive.

"American programmers: In your day to day grind, it's superfluous to put a 'u' in front of every single string."

American programmers who aren't morons: Use 'u', or the first time somebody runs an accented character through your code it'll come out looking like line noise.
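For anyone who hasn't seen that failure mode, here's what the "line noise" looks like, sketched in Python 3 syntax:

    # UTF-8 bytes read back under the wrong assumption (Latin-1 here) -- the
    # classic Python 2 byte-string mistake.
    original = "café résumé"
    mangled  = original.encode("utf-8").decode("latin-1")
    print(mangled)   # cafÃ© rÃ©sumÃ©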


Such an API could easily indicate bytes in a UTF-8 string instead.

> We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.

So what do you suggest? UTF-16 and UTF-32 encourage this even more.
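For example (Python used just to show the counts):

    # UTF-16 invites the conflation with a nastier failure mode: code that treats
    # one 16-bit unit as one character works until an astral-plane character
    # shows up, and that unit count is exactly what Java's length() and
    # JavaScript's .length report.
    s = "🙂"                                   # U+1F642, outside the BMP
    print(len(s))                              # 1 code point
    print(len(s.encode("utf-16-le")) // 2)     # 2 UTF-16 code units (surrogate pair)
    print(len(s.encode("utf-8")))              # 4 UTF-8 bytes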
