Bytes are understandable but make no sense from a business point of view. If you submit the same simple query with UTF-8 and UTF-32, the latter will cost 4x as much.
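Roughly what that 4x looks like in practice, sketched in Python with a made-up query:

    query = "SELECT name FROM users WHERE id = 42"   # made-up, plain-ASCII query
    utf8 = query.encode("utf-8")
    utf32 = query.encode("utf-32-be")   # explicit endianness, so no 4-byte BOM
    print(len(utf8), len(utf32))        # 36 vs 144 bytes: exactly 4x for ASCII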
I think the majority of the time you want the number of UTF-8 characters and not the byte count. In fact, I have never wanted the byte count. If I did, I would expect something like byteLen and Len, not the other way around. You should be optimizing the common case, not the exception. Obviously I'm not a language designer, so perhaps I'm talking out of my ass, but I've heard this complaint A LOT.
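For what it's worth, here's how the two counts differ in Python, where len() counts code points and you have to encode to get bytes:

    s = "naïve"
    print(len(s))                    # 5 code points
    print(len(s.encode("utf-8")))    # 6 bytes, because 'ï' encodes to two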
Also: UTF-8 instead of UTF-16 can make a surprising difference on memory use, especially if you interact with services in UTF-8 and so spend a lot of memory in Java copying back and forth.
>It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.
When was the last time you iterated over a string of Unicode code points and said to yourself: you know what would be handy right now? If these code points were split up into arbitrary and unusable bytes of memory.
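To make that concrete, here's what the two kinds of iteration actually hand you (Python, just to illustrate):

    s = "héllo"
    print(list(s))                   # ['h', 'é', 'l', 'l', 'o']  -- code points
    print(list(s.encode("utf-8")))   # [104, 195, 169, 108, 108, 111]  -- raw bytes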
Dear databases, please don't get hung up on string lengths when dealing with UTF-8.
If I ask for a UTF-8 string with a max length of 100, please don't apply the worst-case scenario and allocate space for 100 emojis. Please give me a box of 100 bytes and allow me to write any UTF-8 string that fits into those 100 bytes.
100 ASCII characters. 20 emojis. Any mixture of the two.
If I ask for UTF-8, it'll be because I'd like to take advantage of UTF-8 and I accept the costs. If that means I can't quickly jump to the character at index 84 in a string, no problem, I've accepted the trade-off.
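A minimal sketch of that "box of 100 bytes" check, done in application code; the helper name is made up:

    def fits_in_box(s, max_bytes=100):
        """True if the UTF-8 encoding of s fits in max_bytes."""
        return len(s.encode("utf-8")) <= max_bytes

    print(fits_in_box("a" * 100))            # True: 100 ASCII characters
    print(fits_in_box("\U0001F600" * 20))    # True: 20 four-byte emoji, 80 bytes
    print(fits_in_box("\U0001F600" * 26))    # False: 104 bytes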
One annoyance with that approach is that a single non-Latin-1 code point is enough to double the size of the String; if it were a UTF-8 String, it would only add a couple of extra bytes.
Even this argument isn't great, though, because a lot of the time it doesn't matter. Like, for HTML, the majority of the time the UTF-8 version will be shorter regardless because of all of the single-byte characters used in the markup.
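A rough illustration of both effects, using a made-up snippet where the '€' (U+20AC) is the only code point outside Latin-1:

    html = '<p class="price">100 €</p>'
    print(len(html) * 2)                # 52 bytes if every char is stored as 2 bytes (UTF-16)
    print(len(html.encode("utf-8")))    # 28 bytes: the markup stays 1 byte/char, '€' takes 3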
> It is a pain in the ass to have a variable number of bytes per char.
Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.
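To put rough numbers on "twice or four times larger" (the sample string is made up):

    s = "Grüße from 東京"                 # mixed Latin/CJK sample
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(enc, len(s.encode(enc)))
    # 19, 26, 52 bytes: UTF-32 is 2x UTF-16 and roughly 2.7x UTF-8 here;
    # for pure ASCII text it would be the full 4x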
"But UTF-8 has a dark side, a single character can take up anywhere between one to six bytes to represent in binary."
What? No! UTF-8 takes, at most, 4 bytes per code point.
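The five- and six-byte sequences were part of the original design, which covered code points up to U+7FFFFFFF; RFC 3629 restricted UTF-8 to U+10FFFF, so four bytes is the ceiling. A quick check, one character per length class:

    for ch in ["A", "é", "€", "\U0001F600"]:   # U+0041, U+00E9, U+20AC, U+1F600
        print(hex(ord(ch)), len(ch.encode("utf-8")))
    # 0x41: 1 byte, 0xe9: 2 bytes, 0x20ac: 3 bytes, 0x1f600: 4 bytes -- never more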
"But UTF-8 isn't very efficient at storing Asian symbols, taking a whole three bytes. The eastern masses revolted at the prospect of having to buy bigger hard drives and made their own encodings."
Many Asian users object to UTF-8/Unicode because of Han Unification, and because many characters supported in other character sets are not present in Unicode. Size of the binary encoding has little to do with it -- common CJK characters take 3 bytes in UTF-8 versus 2 in UTF-16, and the rarer ones outside the BMP take 4 bytes in either.
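Concretely, with a couple of sample characters:

    for ch in ["水", "漢", "𠜎"]:   # U+6C34, U+6F22, and U+2070E (CJK Extension B)
        print(hex(ord(ch)), len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
    # 0x6c34: 3 bytes UTF-8, 2 bytes UTF-16
    # 0x6f22: 3 bytes UTF-8, 2 bytes UTF-16
    # 0x2070e: 4 bytes UTF-8, 4 bytes UTF-16 (surrogate pair)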
"American programmers: In your day to day grind, it's superfluous to put a 'u' in front of every single string."
American programmers who aren't morons: Use 'u', or the first time somebody tries to run an accent through your code, it'll come out looking like line noise.
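The "line noise" is ordinary mojibake: UTF-8 bytes read back under the wrong encoding. A small Python sketch of what the accent turns into:

    accented = "café"
    mangled = accented.encode("utf-8").decode("latin-1")
    print(mangled)   # cafÃ©  -- the two UTF-8 bytes of 'é' come back as 'Ã' and '©'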
你好吗 = 9 bytes
how are you = 11 bytes