> If we try to print out some Japanese characters… [] The output isn’t what we expect.
Yes it is. And I bet on a modern Windows version it is too. The terminal has been (probably intentionally) neglected by Microsoft for a long time, but as far as I know this has mostly been fixed on modern Windows versions.
EDIT: The author admits it later in the text: "will be fixed in Windows 11 and Windows Server 2022"
Also it says "strlen("????")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s.". It says nothing about "number of characters".
> Also it says "strlen("????")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s.". It says nothing about "number of characters".
Yeah - when dealing with Unicode, you have to be very clear about whether you're dealing with bytes, runes or glyphs.
> So probably better to let the application provide its own implementation.
I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:
- Length in bytes of the UTF-8 encoded form. E.g. useful for HTTP’s Content-Length field.
- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.
- Number of grapheme clusters in the text when displayed.
These should all be reasonably easy to query. But they’re all different functions. They just so happen to return the same result on (most) ASCII text. (I’m not sure how many grapheme clusters \0 or a bell are).
Javascript’s string.length is particularly useless because it isn’t even any of the above methods. It returns the number of bytes needed to encode the string as UTF-16, divided by 2. I’ve never wanted to know that. It’s a totally useless measure. Deceptively useless, because it’s right there and it works fine so long as your strings only ever contain ASCII. Last I checked, C# and Java strings have the same bug.
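To make the difference concrete, here is a rough Python sketch of the measures listed above, plus the UTF-16 code unit count. The function name `lengths` is made up for illustration, and the grapheme count is only an approximation that folds combining marks into the preceding character. Real grapheme segmentation follows the Unicode text segmentation rules (UAX #29) and needs a dedicated library; this sketch will miscount things like ZWJ emoji sequences.

```python
import unicodedata

def lengths(text: str) -> dict:
    """Return several different 'lengths' of the same text (illustrative helper)."""
    return {
        # Bytes of the UTF-8 encoded form (what you'd put in Content-Length).
        "utf8_bytes": len(text.encode("utf-8")),
        # Python's len() on str counts Unicode code points.
        "code_points": len(text),
        # UTF-16 code units: bytes of the UTF-16 form divided by 2.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Approximation only: code points that are not combining marks.
        "approx_graphemes": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }

print(lengths("abc"))          # all four agree on plain ASCII
print(lengths("e\u0301"))      # 'e' followed by a combining acute accent
print(lengths("\U0001F600"))   # a single emoji outside the 16-bit range
```

On ASCII input all four numbers agree, which is exactly why the differences tend to go unnoticed until non-ASCII text shows up.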
>There are plenty of uses for getting the length of a string where you will not be running into combining characters, clusters or any other things which would require special handling.
How do you know that? Any user supplied text may contain those things.
>remember that on the internet, all the data you receive is initially in string form
It's not. It's bytes.
>very often we want to receive data representing fixed-length values such as phone numbers or credit card numbers
This is a good example of why strings shouldn't have a length property - people don't understand what length actually does in most languages. If you think that a len() function can verify that the contents of a text field is in the format of a CC number then you don't understand the len function. It can't do what you are asking. See the examples given in the GP (len("\u0301") for example).
The fact that the people who advocate for length functions want to use them for things that they are broken for is exactly why length functions shouldn't exist.
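A small Python sketch of why a bare length check proves very little about the contents of a field. The helper `looks_like_card_number` is hypothetical, and "exactly 16 ASCII digits" is an assumption made for the example, not a real card validation rule.

```python
def looks_like_card_number(field: str) -> bool:
    """Hypothetical format check: exactly 16 ASCII digits and nothing else."""
    return len(field) == 16 and all(ch in "0123456789" for ch in field)

combining_marks = "\u0301" * 16   # sixteen lone combining accents
fullwidth_ones = "\uff11" * 16    # sixteen fullwidth digit ones

# Both pass a bare length check...
assert len(combining_marks) == 16
assert len(fullwidth_ones) == 16
# ...and the fullwidth digits even pass str.isdigit(), which is not ASCII-only.
assert fullwidth_ones.isdigit()

# Only an explicit check of the contents rejects them.
assert not looks_like_card_number(combining_marks)
assert not looks_like_card_number(fullwidth_ones)
assert looks_like_card_number("0123456789012345")
```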
> The advantage of relying on a terminator symbol is that the string size can be any length where as storing the size at the start forces the string to not exceed certain size.
In the same way that since we identify Unicode code points with a 16-bit value, it's impossible to include U+1D460 in a string?
In the same way that since Matroska files encode the length of their segments, there's a hard upper limit on the length of a segment?
Of course none of those things is actually true. Storing the string size has no implications for how long the string can be. It requires an amount of space, to store the string size, that is logarithmic in the length of the string, and completely insignificant.
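Both points can be checked quickly in Python. This is a sketch using only the standard codecs: UTF-16 represents U+1D460 as a pair of 16-bit code units, and the space needed for a stored length grows with the number of bits in that length. The helper name `length_field_bytes` is made up for illustration.

```python
ch = "\U0001D460"

# UTF-16 uses two 16-bit code units (a surrogate pair) for this code point.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])

def length_field_bytes(n: int) -> int:
    """Bytes needed to store the length n as a plain unsigned integer."""
    return max(1, (n.bit_length() + 7) // 8)

# A 700 MB string needs only a handful of bytes to record its length.
print(length_field_bytes(700 * 1024 * 1024))
```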
Scanning for the string length (e.g. strlen()) is asymptotically worse than reading a fixed-size integer, so obviously don't do that unless it's a good memory/speed tradeoff (i.e. when you know the string is at most, say, 16 bytes long).
Overall, it seems you didn't read my comment either. Or was I _that_ unclear?
He's right though, PHP has no reliable way to obtain the length of a string in characters, unless you keep track of which character set a string is in and carefully manipulate the mbstring functions.
Doing multibyte string handling properly in PHP is way harder than it should have been.
The comment was illustrating why not including the length is a problem: it led to a community norm that is bad for performance (stdlib functions not taking string length).
Except strlen does not give you the length of a string, but the size of a string (in bytes), unless of course mbstring's func_overload directive is enabled.
> The only advantage of this representation is space efficiency.
Not even close to true. Simplicity and future-proofing are at least two more.
Reference to PDP-11 assembler is a furphy.
You only have two ways to store anything at the lowest level: 1. length + data or 2. data with sentinel value.
If K&R had chosen option 1 we would have versions of strings with 8-bit lengths, 16-bit lengths, ... all the while incurring a base load of inefficiency within any program. (In fact I'd warrant programs would use data+sentinel internally anyway.)
There are now many many "safe string" libraries for C.
Use them if you like. The fact there are so many and get so little use tells you something.
Is length+data safer? It's easy to lie on the wire, so I don't think so. Or if it is, not by much.
BTW, one way to provide a future-proof (length, data) format is to encode the length with a UTF-8-style variable-length encoding. So the length field would have enormous range.
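One way to read that suggestion is a variable-length integer where each byte carries 7 bits of the length plus a continuation flag. The sketch below uses an LEB128-style layout, which is swapped in here for illustration and is not necessarily the scheme the commenter had in mind. All function names are hypothetical.

```python
from typing import Tuple

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def encode_prefixed(data: bytes) -> bytes:
    """Prefix data with its varint-encoded length."""
    return encode_varint(len(data)) + data

def decode_prefixed(buf: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Read one length-prefixed string; return (data, next position)."""
    n, start = decode_varint(buf, pos)
    end = start + n
    if end > len(buf):
        # The declared length can lie, so it still has to be checked.
        raise ValueError("declared length exceeds available data")
    return buf[start:end], end
```

Short strings pay one byte of overhead, and the prefix only grows as the length does. The bounds check in `decode_prefixed` is there because a length read off the wire cannot be trusted any more than a terminator can.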
> Realistically I rarely have strings that are 700 MB large
maybe not that long, but it's not that hard to lose a '\0' when serializing complicated data structures to disk. when you read the file back in, suddenly one of your structs contains an arbitrarily long string. i've seen several 32+MB strings get created this way.
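That failure mode is easy to reproduce in miniature. The Python sketch below models NUL-terminated records in a byte buffer; `read_cstring` is a hypothetical helper and the record contents are made up.

```python
def read_cstring(buf: bytes, pos: int = 0):
    """Read bytes from pos up to the next NUL; return (string, next position)."""
    end = buf.index(b"\0", pos)  # raises ValueError if no terminator remains
    return buf[pos:end], end + 1

good = b"alice\0bob\0carol\0"
# Same data, but the terminator after "alice" was lost during serialization.
damaged = b"alice" + b"bob\0carol\0"

first_good, _ = read_cstring(good)
first_damaged, _ = read_cstring(damaged)
print(first_good)     # the intended first record
print(first_damaged)  # the first record has swallowed its neighbour
```

With real data the reader just keeps going until it happens to hit a zero byte, which is how a short field turns into a string of many megabytes.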
The author of the article got this wrong. The return value is the number of characters it tried to write so it's easy enough to compare with the buffer size to see if truncation occurred.
> That second paragraph means that if the string pointed to by s2 is shorter than n characters, it doesn't just copy n characters and add a terminating null character, which is what you'd expect.
But what I would expect is that it would copy len(s2) characters, not n.
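For reference, C's strncpy always writes exactly n bytes: it copies at most n bytes of the source, pads with NULs if the source is shorter, and adds no terminator at all if the source is n bytes or longer. The Python function below is only a model of what ends up in the destination buffer (`strncpy_model` is a made-up name), not a binding to the C function.

```python
def strncpy_model(src: bytes, n: int) -> bytes:
    """Model the n bytes that C's strncpy writes into the destination."""
    s = src.split(b"\0", 1)[0]       # the C string ends at its first NUL
    return s[:n].ljust(n, b"\0")     # truncate to n, then pad with NULs up to n

print(strncpy_model(b"hi", 5))           # short source: padded with NULs
print(strncpy_model(b"hello world", 5))  # long source: no terminator at all
```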
> that has an acceptable tradeoff for performance vs space and simplicity for where they are used
Is it? I've been programming strings for 45 years now. Including on 8 and 10 bit machines. All that space efficiency goes out the window when one wants a subset of a string that isn't a common tail.
The simplicity goes out the window as soon as you want a substring that isn't a common tail. Now you have memory allocation to deal with.
The performance goes out the window because now the entire string contents has to be loaded into the cache to determine its length.
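The substring point can be sketched with Python's `memoryview`, standing in here for a (pointer, length) pair. This is an analogy, not how any particular C string library works, and the buffer contents are made up.

```python
buf = bytearray(b"hello world\0")

# With (pointer, length), a middle substring is just an offset and a length.
middle = memoryview(buf)[3:8]
assert bytes(middle) == b"lo wo"

# It shares storage with the original: no allocation, no copy.
buf[4] = ord("0")
assert bytes(middle) == b"l0 wo"

# With NUL termination, only a tail can be shared this way. Any other
# substring needs its own copy so that a terminator can be placed after it.
nul_terminated_copy = bytes(buf[3:8]) + b"\0"
assert nul_terminated_copy == b"l0 wo\0"
```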
> length-prefixed
Are worse. Which is why I didn't mention them.
> Sane programs use store length
Meaning they become length-delineated programs, except it's done manually, tediously, and in an error-prone way.
Whenever I review C code, the first thing I look at are the strlen/strncpy/str** sequences. They've almost always got a bug in them, usually an off-by-one error.
I don't think this is true. Certainly there is no string-length library I'm aware of that handles it that way. The usual default these days (correct or not) is #4 -- length is the number of unicode code points.