Hacker Read

marcodiego 2023-04-06 08:27:12 | [–] update item (on: C Strings and my slow descent to madness )

> If we try to print out some Japanese characters… [] The output isn’t what we expect.

Yes it is. And I bet on a modern Windows version it is too. The terminal has been (probably intentionally) neglected by Microsoft for a long time, but as far as I know this has mostly been fixed on modern Windows versions.

EDIT: Author admits it later in the text "will be fixed in Windows 11 and Windows Server 2022"

Also it says "strlen("????")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s.". It says nothing about "number of characters".
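A minimal sketch of the bytes-versus-characters distinction. The sample string is an assumption (the one quoted from the article did not survive), and `strlen_like` is a hypothetical helper that only mimics what C's strlen() would count for UTF-8 data.

```python
def strlen_like(text: str) -> int:
    """Count the bytes before the first NUL in the UTF-8 form, as strlen() would."""
    encoded = text.encode("utf-8") + b"\x00"  # C strings are NUL-terminated
    return encoded.index(b"\x00")

# Assumed sample: four hiragana characters (ko, n, ni, chi)
sample = "\u3053\u3093\u306b\u3061"

byte_count = strlen_like(sample)  # bytes, which is what strlen() reports
char_count = len(sample)          # code points, which is what a reader would count
print(byte_count, char_count)     # prints: 12 4
```

Each of these characters takes three bytes in UTF-8, so a byte count of 12 for four characters is exactly what the man page promises.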

nordsieck | karma 6553 | avg karma 3.64 | 2023-04-06 10:07:50 | [–] similar comments (on: C Strings and my slow descent to madness )

> Also it says "strlen("????")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s.". It says nothing about "number of characters".

Yeah - when dealing with Unicode, you have to be very clear about whether you're dealing with bytes, runes or glyphs.

josephg | karma 16849 | avg karma 4.67 | 2024-04-30 13:12:57 | [–] similar comments (on: You can't just assume UTF-8 )

> So probably better to let the application provide its own implementation.

I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:

- Length in bytes of the UTF-8 encoded form. E.g. useful for HTTP’s Content-Length field.

- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.

- Number of grapheme clusters in the text when displayed.

These should all be reasonably easy to query. But they’re all different functions. They just so happen to return the same result on (most) ascii text. (I’m not sure how many grapheme clusters \0 or a bell is).

JavaScript’s string.length is particularly useless because it isn’t even any of the above. It returns the number of bytes needed to encode the string as UTF-16, divided by 2. I’ve never wanted to know that. It’s a totally useless measure. Deceptively useless, because it’s right there and it works fine so long as your strings only ever contain ASCII. Last I checked, C# and Java strings have the same bug.
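The different "lengths" can be sketched in Python as follows. All four function names are hypothetical, and the grapheme count is only a rough approximation that skips combining marks; proper grapheme segmentation follows Unicode's UAX #29 rules and needs more than the standard library offers.

```python
import unicodedata

def utf8_byte_length(text: str) -> int:
    """Bytes in the UTF-8 encoding (what an HTTP Content-Length would carry)."""
    return len(text.encode("utf-8"))

def code_point_length(text: str) -> int:
    """Unicode code points; Python's len() on str already counts these."""
    return len(text)

def utf16_unit_length(text: str) -> int:
    """UTF-16 code units, i.e. the UTF-16 byte count divided by 2."""
    return len(text.encode("utf-16-le")) // 2

def approx_grapheme_length(text: str) -> int:
    """Rough stand-in for grapheme clusters: skip combining marks.

    Assumption: this is NOT full UAX #29 segmentation, so it miscounts
    ZWJ emoji sequences, flags, Hangul jamo and similar cases.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# "e" + COMBINING ACUTE ACCENT + an emoji outside the Basic Multilingual Plane
sample = "e\u0301\U0001F600"

lengths = {
    "utf8_bytes": utf8_byte_length(sample),
    "code_points": code_point_length(sample),
    "utf16_units": utf16_unit_length(sample),
    "approx_graphemes": approx_grapheme_length(sample),
}
print(lengths)
```

For this sample the four measures are 7, 3, 4 and 2 respectively, while for plain ASCII text they all agree.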

nwellnhof | karma 1060 | avg karma 3.15 | 2022-11-07 15:01:13 | [–] similar comments (on: C isn't a programming language anymore )

Well, different architectures have different word sizes. What should strlen return? A 64-bit value on every platform?

anon1385 | karma 10161 | avg karma 7.48 | 2015-12-18 10:13:45 | [–] similar comments (on: Why Python 3 Exists )

>There are plenty of uses for getting the length of a string where you will not be running into combining characters, clusters or any other things which would require special handling.

How do you know that? Any user supplied text may contain those things.

>remember that on the internet, all the data you receive is initially in string form

It's not. It's bytes.

>very often we want to receive data representing fixed-length values such as phone numbers or credit card numbers

This is a good example of why strings shouldn't have a length property - people don't understand what length actually does in most languages. If you think that a len() function can verify that the contents of a text field is in the format of a CC number then you don't understand the len function. It can't do what you are asking. See the examples given in the GP (len("\u0301") for example).

The fact that the people who advocate for length functions want to use them for things that they are broken for is exactly why length functions shouldn't exist.
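A small Python sketch of why a bare length check does not validate a format. The 16-digit rule and the helper name `looks_like_16_digits` are assumptions made for illustration, not a real card-number validator.

```python
def looks_like_16_digits(text: str) -> bool:
    """Check the content explicitly instead of trusting the length alone.

    isascii() is needed because str.isdigit() also accepts non-ASCII digits.
    """
    return len(text) == 16 and text.isascii() and text.isdigit()

lone_mark = "\u0301"                    # a combining acute accent on its own
padded = "1" * 15 + "\u0301"            # 15 digits plus a combining mark
arabic_digits = "\u0661" * 16           # sixteen ARABIC-INDIC DIGIT ONE characters

print(len(lone_mark))                   # prints: 1
print(len(padded), looks_like_16_digits(padded))                # prints: 16 False
print(len(arabic_digits), looks_like_16_digits(arabic_digits))  # prints: 16 False
print(looks_like_16_digits("4" * 16))   # prints: True
```

Both rejected strings have length 16, so the length check contributes nothing that the content check does not already cover.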

thaumasiotes | karma 22020 | avg karma 1.35 | 2022-12-03 09:02:15 | [–] similar comments (on: Show HN: A nice C string API )

> The advantage of relying on a terminator symbol is that the string size can be any length, whereas storing the size at the start forces the string to not exceed a certain size.

In the same way that since we identify unicode code points with a 16-bit value, it's impossible to include U+1D460 in a string?

In the same way that since Matroska files encode the length of their segments, there's a hard upper limit on the length of a segment?

Of course none of those things is actually true. Storing the string size has no implications for how long the string can be. It requires an amount of space, to store the string size, that is logarithmic in the length of the string, and completely insignificant.
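Both points can be checked with a short Python sketch: a code point above U+FFFF is stored in UTF-16 as two 16-bit units, and the space needed for a length field grows only with the logarithm of the length. The helper name `prefix_bytes_needed` is hypothetical.

```python
import struct

# U+1D460 does not fit in 16 bits, yet UTF-16 encodes it as a surrogate pair.
ch = "\U0001D460"
high, low = struct.unpack(">2H", ch.encode("utf-16-be"))
print(hex(high), hex(low))  # prints: 0xd835 0xdc60

def prefix_bytes_needed(length: int) -> int:
    """Whole bytes needed to store `length` as an unsigned integer."""
    return max(1, (length.bit_length() + 7) // 8)

for size in (255, 256, 2**32 - 1, 2**32):
    print(size, prefix_bytes_needed(size))
```

A string of 2**32 bytes (4 GiB) needs only a 5-byte length field, which is negligible next to the payload.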

jstimpfle | karma 3971 | avg karma 1.31 | 2021-12-10 14:41:16 | [–] similar comments (on: Ask HN: What Happened to Borland? )

Scanning for the string length (e.g. strlen()) is asymptotically worse than reading a fixed-size integer, so obviously don't do that unless it's a good memory/speed tradeoff (i.e. when you know the string is at most, say, 16 bytes long).

Overall, it seems you didn't read my comment either. Or was I _that_ unclear?

Joeri | karma 11462 | avg karma 3.73 | 2013-01-25 10:10:40+00:00 | [–] similar comments (on: "I couldn't really learn Erlang, 'cos it didn't exist, so I invented it" )

He's right though, PHP has no reliable way to obtain the length of a string in characters, unless you keep track of which character set a string is in and carefully manipulate the mbstring functions.

Doing multibyte string handling properly in PHP is way harder than it should have been.

jolux | karma 6163 | avg karma 2.58 | 2021-01-04 00:17:25 | [–] similar comments (on: Rust is now overall faster than C in benchmarks )

The comment was illustrating why not including the length is a problem: it led to a community norm that is bad for performance (stdlib functions not taking string length).

Joeri | karma 11462 | avg karma 3.73 | 2013-01-25 09:59:37+00:00 | [–] similar comments (on: "I couldn't really learn Erlang, 'cos it didn't exist, so I invented it" )

Except strlen does not give you the length of a string, but the size of a string (in bytes), unless of course mbstring's func_overload directive is enabled.

emmelaich | karma 6824 | avg karma 1.64 | 2018-11-03 23:51:35+00:00 | [–] similar comments (on: How to implement strings )

> The only advantage of this representation is space efficiency.

Not even close to true. Simplicity and future-proofing are two more, at least.

Reference to PDP-11 assembler is a furphy.

You only have two ways to store anything at the lowest level: 1. length + data or 2. data with sentinel value.

If K&R had chosen option 1 we would have versions of strings with 8-bit lengths, 16-bit lengths, ... all the while incurring a base load of inefficiency within any program. (In fact I'd warrant programs would use data+sentinel internally anyway.)

There are now many many "safe string" libraries for C. Use them if you like. The fact there are so many and get so little use tells you something.

Is length+data safer? It's easy to lie on the wire, so I don't think so. Or not by much.

BTW, one way to provide a future-proof (length, data) format is to encode the length with a UTF-style variable-length encoding, so the length field would have enormous range.
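One way to sketch a variable-length length field in Python. This uses a LEB128-style base-128 varint, which is similar in spirit to UTF-8 but is not the UTF-8 scheme itself; the function names and the wire layout are assumptions made for illustration.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, position just past the varint)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def pack_string(payload: bytes) -> bytes:
    """Prefix the payload with its varint-encoded length."""
    return encode_varint(len(payload)) + payload

def unpack_string(buf: bytes, pos: int = 0):
    """Return (payload, position just past it)."""
    length, pos = decode_varint(buf, pos)
    return buf[pos:pos + length], pos + length

packed = pack_string(b"hi") + pack_string(b"x" * 300)
first, offset = unpack_string(packed)
second, offset = unpack_string(packed, offset)
print(first, len(second), encode_varint(300))  # prints: b'hi' 300 b'\xac\x02'
```

Short strings pay one byte of overhead, and the prefix grows by one byte for every 7 additional bits of length, so no fixed width has to be chosen up front.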

leetcrew | karma 7947 | avg karma 2.01 | 2019-01-22 13:22:11+00:00 | [–] similar comments (on: Inside the C Standard Library )

> Realistically I rarely have strings that are 700 MB large

maybe not that long, but it's not that hard to lose a '\0' when serializing complicated data structures to disk. when you read the file back in, suddenly one of your structs contains an arbitrarily long string. i've seen several 32+MB strings get created this way.
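A Python sketch of that failure mode, assuming a hypothetical on-disk layout of back-to-back NUL-terminated fields. `read_c_strings` is an illustrative helper; unlike real C code it stops at the end of the buffer instead of reading past it.

```python
def read_c_strings(blob: bytes) -> list:
    """Split a buffer into NUL-terminated strings, scanning like a C reader."""
    strings = []
    pos = 0
    while pos < len(blob):
        end = blob.find(b"\x00", pos)
        if end == -1:
            end = len(blob)  # no terminator found: the string runs to the end
        strings.append(blob[pos:end])
        pos = end + 1
    return strings

intact = b"alice\x00paris\x00"
damaged = b"aliceparis\x00"  # the first terminator was lost when writing

print(read_c_strings(intact))   # prints: [b'alice', b'paris']
print(read_c_strings(damaged))  # prints: [b'aliceparis']
```

With one terminator missing, the reader cannot tell where the first field ended, so the next field (and in a real file, everything up to the next zero byte) is absorbed into it.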

zik | karma 3219 | avg karma 5.86 | 2014-09-25 00:44:34+00:00 | [–] similar comments (on: Adding strlcpy() to glibc )

The author of the article got this wrong. The return value is the number of characters it tried to write so it's easy enough to compare with the buffer size to see if truncation occurred.
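Assuming the comment refers to strlcpy() (the subject of the linked thread), its documented behaviour is to return the length of the source string, so truncation shows up as a return value that is at least the buffer size. The following is a Python model of those semantics, not the C function itself; `strlcpy_model` is a hypothetical name.

```python
def strlcpy_model(dst: bytearray, src: bytes) -> int:
    """Model of BSD strlcpy with size == len(dst).

    Copies at most len(dst) - 1 bytes, always NUL-terminates when the
    buffer is non-empty, and returns the length of src.
    """
    size = len(dst)
    if size > 0:
        n = min(len(src), size - 1)
        dst[:n] = src[:n]
        dst[n] = 0
    return len(src)

buf = bytearray(8)
ret = strlcpy_model(buf, b"hello, world")
truncated = ret >= len(buf)
print(ret, truncated, bytes(buf))  # prints: 12 True b'hello, \x00'
```

The caller only has to compare the return value with the buffer size to learn that the copy did not fit.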

coldtea | karma 86593 | avg karma 2.38 | 2019-04-25 14:02:04+00:00 | [–] similar comments (on: Some Were Meant for C (2017) [pdf] )

Is this a joke?

Length not known, so prone to overflows at any time, atrocious standard library, ... (and let's not even go into the Unicode situation).

StavrosK | karma 1 | avg karma 0.0 | 2015-06-01 13:20:44+00:00 | [–] similar comments (on: Strncpy() is not a “safer” strcpy() )

> That second paragraph means that if the string pointed to by s2 is shorter than n characters, it doesn't just copy n characters and add a terminating null character, which is what you'd expect.

But what I would expect is that it would copy len(s2) characters, not n.
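The behaviour under discussion can be modelled in Python. This is a sketch of the semantics the C standard gives strncpy() (always write exactly n bytes, padding with NULs when the source is shorter, and writing no terminator when it is not), with `strncpy_model` as a hypothetical name.

```python
def strncpy_model(dst: bytearray, src: bytes, n: int) -> None:
    """Model of C strncpy: write exactly n bytes into dst.

    Copies up to n bytes of src, then pads the remainder with NULs.
    If src has n or more bytes, no NUL terminator is written.
    """
    if n > len(dst):
        raise ValueError("n exceeds the destination (a buffer overflow in C)")
    copied = min(len(src), n)
    dst[:copied] = src[:copied]
    dst[copied:n] = bytes(n - copied)  # NUL padding; empty when src fills n

padded = bytearray(b"XXXXXXXX")
strncpy_model(padded, b"hi", 6)
print(bytes(padded))        # prints: b'hi\x00\x00\x00\x00XX'

unterminated = bytearray(b"XXXX")
strncpy_model(unterminated, b"hello", 4)
print(bytes(unterminated))  # prints: b'hell'
```

The first call writes four padding bytes rather than stopping after one terminator, and the second leaves the destination without any terminator at all.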

WalterBright | karma 71923 | avg karma 2.96 | 2022-08-09 08:54:21 | [–] similar comments (on: The case against a C alternative )

> that has an acceptable tradeoff for performance vs space and simplicity for where they are used

Is it? I've been programming strings for 45 years now. Including on 8 and 10 bit machines. All that space efficiency goes out the window when one wants a subset of a string that isn't a common tail.

The simplicity goes out the window as soon as you want a substring that isn't a common tail. Now you have memory allocation to deal with.

The performance goes out the window because now the entire string contents has to be loaded into the cache to determine its length.

> length-prefixed

Are worse. Which is why I didn't mention them.

> Sane programs use store length

Meaning they become length-delineated programs, except it's done manually, tediously, and in an error-prone way.

Whenever I review C code, the first thing I look at are the strlen/strncpy/str** sequences. They've almost always got a bug in them, an off-by-one error.

kreco | karma 98 | avg karma 2.33 | 2019-09-13 07:59:17+00:00 | [–] similar comments (on: Efficient string copying and concatenation in C )

> For one thing how big (sizeof) should the length prefix be? 16-bits? 32-bits? 64-bits?

Exactly like what is returned by strlen... ?

garmaine | karma 2787 | avg karma 0.77 | 2021-11-22 00:14:17 | [–] similar comments (on: TIL the assumption that string length does not change when upper-cased is false )

> Except almost everyone always means #2.

I don't think this is true. Certainly there is no string-length library I'm aware of that handles it that way. The usual default these days (correct or not) is #4 -- length is the number of unicode code points.

Arnavion | karma 7715 | avg karma 3.11 | 2023-06-02 00:47:10 | [–] similar comments (on: It’s not wrong that "???????".length == 7 (2019) )

>then multiply your output by 4.

That is not how UTF-32 works.

>but the base internal .length function should just output bytes.

Do you think the length of an `int64_t[3]` array should be 3 or 24?
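Both replies can be illustrated in Python, with the standard `array` module standing in for a C `int64_t[3]`. The sample text is an assumption chosen to mix 1-, 2- and 4-byte UTF-8 characters.

```python
from array import array

# "q" is a signed 64-bit integer on mainstream platforms
values = array("q", [1, 2, 3])
element_count = len(values)                    # length in elements
byte_size = element_count * values.itemsize    # storage size in bytes
print(element_count, byte_size)                # prints: 3 24

# "a", LATIN SMALL LETTER E WITH ACUTE, and an emoji
text = "a\u00e9\U0001F600"
utf8_size = len(text.encode("utf-8"))          # 1 + 2 + 4 bytes
utf32_size = len(text.encode("utf-32-le"))     # 4 bytes per code point, no BOM
print(len(text), utf8_size, utf32_size)        # prints: 3 7 12
```

The UTF-32 size is four times the number of code points, not four times the UTF-8 byte count, and the array's length is its element count rather than its size in bytes.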

ollien | karma 1099 | avg karma 3.79 | 2022-01-19 20:26:16 | [–] similar comments (on: Are We Really Engineers? )

> A software engineer should fully understand a function that calculates the length of a string...

Once you get into language semantics of iterating over characters vs bytes vs graphemes, it's not always so trivial.
