
We can make it more fair by pricing UTF-8 bytes.

你好吗 = 9 bytes

how are you = 11 bytes
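For concreteness, a rough Python 3 sketch of what pricing by UTF-8 bytes looks like (the Chinese phrase is just an illustrative 9-byte greeting):

    # Price by UTF-8 bytes rather than by code points.
    for text in ("how are you", "你好吗"):
        n_chars = len(text)                      # code points
        n_bytes = len(text.encode("utf-8"))      # what you'd actually bill
        print(f"{text!r}: {n_chars} code points, {n_bytes} UTF-8 bytes")

    # 'how are you' -> 11 code points, 11 bytes
    # '你好吗'      -> 3 code points, 9 bytes (3 bytes per CJK character)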




Bytes are understandable but make no sense from a business point of view. If you submit the same simple query with UTF-8 and UTF-32, the latter will cost 4x as much.
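A quick sanity check of that 4x figure (the query string is made up):

    # Same ASCII-only query, three encodings. Python's "utf-32" codec would
    # prepend a 4-byte BOM, so the little-endian variants keep the math clean.
    query = "SELECT * FROM users WHERE id = 42"
    print(len(query.encode("utf-8")))      # 33 bytes
    print(len(query.encode("utf-16-le")))  # 66 bytes
    print(len(query.encode("utf-32-le")))  # 132 bytes: exactly 4x the UTF-8 size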

And I find it fair that American characters require 2 bytes in Java, like everybody else's, not 1 as in UTF-8! :)

because it'd have an extreme effect on performance?

UTF-8 is great in terms of size, but terrible in terms of speed.


UTF-8 is 1-4 bytes per codepoint dude, not 1-2.

280 * 4 = 1120, not 560.


I think the majority of the time you want the number of characters in a UTF-8 string, not the byte count. In fact, I have never wanted the byte count. If I did, I would expect something like byteLen and Len, not the other way around. You should be optimizing for the common case, not the exception. Obviously I'm not a language designer, so perhaps I'm talking out of my ass, but I've heard this complaint A LOT.
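Python 3, for example, made exactly that call: len() counts code points, and the byte count is the explicit extra step (small illustrative sketch):

    s = "naïve café"
    print(len(s))                   # 10 code points
    print(len(s.encode("utf-8")))   # 12 bytes: ï and é take 2 bytes each

(Whether "characters" should really mean code points or grapheme clusters is a separate argument.)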

Also: UTF-8 instead of UTF-16 can make a surprising difference on memory use, especially if you interact with services in UTF-8 and so spend a lot of memory in Java copying back and forth.
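A rough illustration of the size side of that, with a made-up, mostly-ASCII payload (Java specifics like compact strings aside):

    # The UTF-16 form a pre-Java-9 String holds in memory is close to double the
    # UTF-8 wire form, before counting the extra copies made while transcoding.
    payload = '{"user": "müller", "message": "hello world"}' * 1000
    print(len(payload.encode("utf-8")))       # 45000 bytes
    print(len(payload.encode("utf-16-le")))   # 88000 bytes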

ASCII and UTF-8 are too US-centric. That's why adoption in places like China is so low.

Also, if we're going to have a variable-length encoding anyway, why can't we do it properly and improve the size for the same computational cost?


> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

For tons of questions, both can be correct.
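One concrete case (my example, not the article's): splitting on an ASCII delimiter gives the same answer either way, because UTF-8 never reuses ASCII byte values inside a multi-byte sequence.

    line = "名前,age,ciudad"
    as_text  = line.split(",")
    as_bytes = [f.decode("utf-8") for f in line.encode("utf-8").split(b",")]
    print(as_text == as_bytes)   # True: the byte split can't cut a character in half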


1 to 4 bytes.

How does utf8 make that difficult to answer?

When was the last time you iterated over a string of Unicode code points and said: you know what would be handy right now? If these code points were split up into arbitrary, unusable bytes of memory.


> In theory yes, in practice no.

That's like "in theory we need 4 bytes to represent Unicode, but in practice 3 bytes is fine" (glances at universally-maligned utf8mb3)


Dear databases, please don't get hung up on string lengths when dealing with UTF-8.

If I ask for a UTF-8 string with a max-length of 100, please don't apply the worst-case scenario and allocate space for 100 emojis. Please give me a box of 100 bytes and allow me to write any UTF-8 string that can fit into 100 bytes in there.

100 ASCII characters. 25 four-byte emoji. Any mixture of the two.

If I ask for UTF-8, it'll be because I'd like to take advantage of UTF-8 and I accept the costs. If that means I can't quickly jump to the character at index 84 in a string, no problem, I've accepted the trade-off.
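Something like this, in Python terms (the helper names are hypothetical, not any database's actual API):

    def fits_in_bytes(s: str, max_bytes: int = 100) -> bool:
        """True if the UTF-8 encoding of s fits in max_bytes."""
        return len(s.encode("utf-8")) <= max_bytes

    def truncate_to_bytes(s: str, max_bytes: int = 100) -> str:
        """Cut s so its UTF-8 form is at most max_bytes, never mid-code-point."""
        encoded = s.encode("utf-8")[:max_bytes]
        # Slicing may leave a partial sequence at the very end; drop it.
        return encoded.decode("utf-8", errors="ignore")

    print(fits_in_bytes("a" * 100))   # True: 100 ASCII chars, 100 bytes
    print(fits_in_bytes("🦀" * 25))   # True: 25 four-byte emoji, 100 bytes
    print(fits_in_bytes("🦀" * 26))   # False: 104 bytes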


One annoyance with that approach is that a single non-Latin-1 code point is enough to double the size of the String; if it were a UTF-8 String, it would only add a couple of extra bytes.

Even this argument isn't great, though, because a lot of the time it doesn't matter. Like, for HTML, the majority of the time the UTF-8 version will be shorter regardless because of all of the single-byte characters used in the markup.
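The same effect is easy to see in Python, whose str uses a comparable Latin-1 / UCS-2 / UCS-4 compaction (illustrative only; the exact getsizeof overheads vary by version):

    import sys

    ascii_only = "x" * 1000
    with_euro  = ascii_only + "€"              # one non-Latin-1 code point (U+20AC)

    print(sys.getsizeof(ascii_only))           # ~1 KB: one byte per code point
    print(sys.getsizeof(with_euro))            # ~2 KB: the whole string got wider
    print(len(ascii_only.encode("utf-8")))     # 1000
    print(len(with_euro.encode("utf-8")))      # 1003: UTF-8 only pays for the €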

I got the impression that was a performance trade-off. UTF-8 decoding/encoding isn't free.

But not every 8-bit byte string is valid UTF-8 so that could still cause a world of pain.
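In Python terms (made-up bytes), "just treat the bytes as UTF-8" forces an explicit error policy somewhere:

    raw = b"valid ascii \xff\xfe then more"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)   # 'utf-8' codec can't decode byte 0xff in position 12: ...

    # If you must press on, you have to pick a lossy policy explicitly:
    print(raw.decode("utf-8", errors="replace"))   # valid ascii �� then more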

> It is a pain in the ass to have a variable number of bytes per char.

Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.


"But UTF-8 has a dark side, a single character can take up anywhere between one to six bytes to represent in binary."

What? No! UTF-8 takes at most 4 bytes per code point. The old 5- and 6-byte forms were dropped by RFC 3629, because Unicode itself stops at U+10FFFF.
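Quick check in Python:

    # The largest code point Unicode defines, U+10FFFF, still fits in 4 bytes.
    print(len("\U0010FFFF".encode("utf-8")))    # 4
    print("\U0010FFFF".encode("utf-8").hex())   # f48fbfbf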

"But UTF-8 isn't very efficient at storing Asian symbols, taking a whole three bytes. The eastern masses revolted at the prospect of having to buy bigger hard drives and made their own encodings."

Many Asian users object to UTF-8/Unicode because of Han unification, and because many characters supported in other character sets are not present in Unicode. The size of the binary encoding has little to do with it -- and for what it's worth, most common East Asian characters take 3 bytes in UTF-8 versus 2 in UTF-16 or the legacy double-byte encodings, so the difference is one byte per character, not a new hard drive.

"American programmers: In your day to day grind, it's superfluous to put a 'u' in front of every single string."

American programmers who aren't morons: Use 'u', or the first time somebody runs an accented character through your code it'll come out looking like line noise.
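For anyone who hasn't seen that failure mode, here's what the "line noise" looks like, sketched in Python 3 syntax:

    # UTF-8 bytes read back under the wrong assumption (Latin-1 here) -- the
    # classic Python 2 byte-string mistake.
    original = "café résumé"
    mangled  = original.encode("utf-8").decode("latin-1")
    print(mangled)   # cafÃ© rÃ©sumÃ©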


Such an API could easily indicate bytes in a UTF-8 string instead.

> We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.

So what do you suggest? UTF-16 and UTF-32 encourage this even more.
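For example (Python used just to show the counts):

    # UTF-16 invites the conflation with a nastier failure mode: code that treats
    # one 16-bit unit as one character works until an astral-plane character
    # shows up, and that unit count is exactly what Java's length() and
    # JavaScript's .length report.
    s = "🙂"                                   # U+1F642, outside the BMP
    print(len(s))                              # 1 code point
    print(len(s.encode("utf-16-le")) // 2)     # 2 UTF-16 code units (surrogate pair)
    print(len(s.encode("utf-8")))              # 4 UTF-8 bytes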
