C Strings and my slow descent to madness (www.deusinmachina.net)
145 points by Decabytes | 2023-04-06 | 329 comments




In well-written C, you don't work with strings the way you do in other HLLs. For example, extracting and copying substrings is unnecessary unless you want to modify the parent string. Otherwise, a substring is represented by a pointer and a size_t length, and can easily be printed that way via the "%.*s" printf specifier:

    const char *s = "Hello World!";
    const char *world = s + 6;
    size_t world_len = 5;
    printf("%.*s\n", world_len, world);

In other HLLs it is easy to have subviews into other strings. C makes it needlessly hard by requiring null termination in half the APIs.


I love how in every C code snippet on every comment on this thread, somebody got something wrong. I take it as a sign that it's probably best to avoid C as much as possible.

Hey, at least it's not multithreaded code!

Any C compiler that isn't a trivial toy implementation will warn about that, so it's hardly a C gotcha.

clang and VC++ for x64 warn about this out of the box, but gcc seems to need -Wall.

Maybe my least favorite "feature" of C. I can manage most aspects of zero-terminated strings well enough, but when I have to specify the length of them, is it an 'int', 'size_t', 'ssize_t', or something else? (Answer: All of the above!)

It's unfortunate the author put the arrays-are-pointers thing so early in the doc, as that's a very beginner-to-C mixup and really nothing at all to do with strings. Otherwise, yep. It's pretty bad. C is a great language, but its string handling is definitely garbage. You get used to it pretty quick, and it's not hard to write a handful of sane wrappers or a simple string library for your own use, but the standard library's terrible string functions are an unending source of bugs.

I don't see any mention or insinuations of arrays-are-pointers anywhere in the article. Am I missing something?

This bit:

    But you might be asking. “Why can’t I just assign the source variable directly to the destination variable?”

    int main() {
      char source[] = "Hello, world!";
      char* destination = source;
    
      strcpy(destination, source); // Copy the source string to the destination string
    
      printf("Source: %s\n", source);
      printf("Destination: %s\n", destination);
    
      return 0;
    }
    You can. It’s just that destination now becomes a char* and exists as a pointer to the source character array. If that isn’t what you want then this will almost certainly cause issues.

This is almost a cliche among many C language lawyers and/or Stack Overflow answer-rich people and I know you mean well, but: arrays are not pointers.

In some contexts, the name of an array decays to a pointer to its first element. That is a better way of putting it, and it's a (much) weaker statement.

Edit: if they were the same, this code:

    int foo[] = {1, 2, 3};
    int *bar = foo;
    printf("%zu and %zu\n", sizeof foo, sizeof bar);
Would print the same value twice, but it doesn't. On Ideone [1] I got 12 and 8.

[1]: https://ideone.com/CP7WTu


This also makes a big difference once we start talking about pointers to arrays.

    int a[] = {1, 2, 3};
    int (*p1)[3] = &a; // ok
    int (*p2)[3] = &a[0]; // not ok
    int *p3 = &a; // not ok
(It should be noted that these will compile with warnings in C due to implicit conversions via void*, but you're still risking UB if you actually use the resulting value. They are all errors in C++ because it doesn't have implicit conversion from void*.)

Nothing wrong with your third line. Did you mean something else?

I forgot the &; comment updated now, and I added another example.

Got it, thanks!

> If we try to print out some Japanese characters… [] The output isn’t what we expect.

Yes it is. And I bet on a modern windows version it is too. The terminal has been (probably intentionally) neglected by ms for a long time, but as far as I know this has mostly been fixed on modern windows versions.

EDIT: Author admits it later in the text "will be fixed in Windows 11 and Windows Server 2022"

Also it says "strlen("????")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s.". It says nothing about "number of characters".


It makes sense to point it out even if it's fixed in win11, lots of people (myself included) are still on 10.

> The terminal has been (probably intentionally) neglected by ms for a long time,

I don't think it is an intentional lack of care, just a lack of care. Internally MS devs affected by the appalling state of the console just did what the rest of us did and installed an alternative.

> but as far as I know this has mostly been fixed on modern windows versions.

Ish. The default console for powershell is better, but a lot of improvements you might be thinking are in there are in fact only in Windows Terminal (https://en.wikipedia.org/wiki/Windows_Terminal) which is not currently included by default.


> you might be thinking are in there are in fact only in Windows Terminal

A lot of those changes are in ConsoleHost, so Windows 10 and 11 get those improvements (like VT100 sequences) in cmd.exe as well


> Also it says "strlen("????")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s.". It says nothing about "number of characters".

Yeah - when dealing with Unicode, you have to be very clear about whether you're dealing with bytes, runes or glyphs.


Runes are not a Unicode concept - that’s a Golangism. Basically a code point.

Also in terms of Unicode, graphemes are even more relevant to the programming side than glyphs - unless you’re writing a renderer.


> And I bet on a modern windows version it is too.

It's still broken unfortunately, you need to switch the console to a special UTF-8 codepage in your own code:

    SetConsoleOutputCP(CP_UTF8);
...and before exit restore it to the original code page.
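
A minimal sketch of the save-and-restore pattern (using only the documented console API; as noted below, stdio output can still misbehave if a multibyte sequence is split across writes):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        UINT original = GetConsoleOutputCP(); // remember the current code page
        SetConsoleOutputCP(CP_UTF8);          // switch the console to UTF-8

        puts("h\xC3\xA9llo");                 // UTF-8 bytes, written in one go

        SetConsoleOutputCP(original);         // restore before exiting
        return 0;
    }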

... except that that is also subtly broken.

It works if you write multiple UTF-8 code-units in one go, but breaks if you send them in several writes (and by that, I mean direct writes to the HANDLE). It also breaks if you try to use the ANSI API (with the A suffix), as it internally tries to convert the bytes from codepage-random to UTF-8.

You run into both issues if you try to use the MS implementation of stdio (printf and friends).

And we didn't even discuss command line argument passing yet :-)

I had a lot of fun with this (more explanation in the issue comments): https://github.com/AgentD/squashfs-tools-ng/issues/96#issuec...

I tried to test it with the only other two languages I know besides English: German and Mandarin. Specifically also because the latter requires multi-byte characters to work. Getting Chinese text I/O to work at all in a Windows DOS box, on an existing German Windows 7 installation, was an adventure on its own and ended up breaking things in different ways than German text.

Turns out, trying to write language agnostic command line applications on Windows is a PITA.


Windows is truly the gift that keeps on giving :D

Honestly if you are expecting a sane and modern text console on Windows you're just begging for disappointment. I note that even the author briefly tries it on Bash and finds it to be undramatic.

With the woes of string.h being known, why not just use an alternative like https://github.com/antirez/sds ?

I’ve also been having a blast with C because writing C feels like being a god! But the biggest thing that I like about C is that the world is sort of written in it!

Just yesterday I needed to parse a JSON… found a bunch of libraries that do that and just picked the one whose API I liked.


>>I’ve also been having a blast with C because writing C feels like being a god

Not trying to be a troll but as someone who has also written a lot of C in the past why do you feel like this?


It's not doing as many things behind your back as the dynamic languages and C++ do. More things are your responsibility.

It’s the access and control that it gives me!

Whereas I’d pick Go when I was doing some concurrency, I can now explore a bunch of concurrency libraries, including some implementations that look a lot like Go channels. Want to watch a file for changes? I can do that all the way from talking to the kernel to picking a multi-platform library. I guess I haven’t really found anything that I can’t do in C, and if I’m lazy, I can just embed a scripting language to handle things at a higher level. Macros are also very powerful! I’ve been writing code that writes code for me, exporting the thing to a .h file and importing it using #include.

pkg-config --list-all has become my friend and I keep discovering that the world is written in C and the access to libraries is huge!

Also, idiomatic C is whatever you make it. There are a bunch of ways to skin a cat. Want a different platform? Build tool? Compiler? Debugger? Wanna write your own debugger? C is chill with that.

It’s also such a simple language that without much effort you can know everything about it (I don’t care that much about anything over C99). I don’t know the whole ecosystem, or standard libraries, or data structures and algorithms and whatnot, but the language itself is quite trivial.

With that said, I’m not using C in project teams. In that setting some strong conventions would likely be necessary or even better, something enforced by tools or the compiler (like Go), but yeah, I’ve been quite enjoying working with C and being kind of annoyed at other langs that I need to use for work because they keep doing all this stuff behind my back that is supposed to help me, but instead is a pain trying to debug and understand what is actually happening


Thank You! That was a nice explanation.

Sorry to be "that person", but have you tried Rust yet? It checks a lot of your boxes:

- Access and control, nothing "behind your back"

- Low or high level, as you prefer

- Swap in different implementations (custom allocator, different async runtime, etc)

- Really powerful macros

- Strong conventions, safe by default (but you can break them, go into the weeds if needed)

Downsides compared to your list:

- More complex than C or Go (though less than C++)

- Only one production compiler, and everyone assumes `cargo` build system (though both are very good)

- Library ecosystem not quite as extensive (though there is a lot of good stuff on crates.io, and you can always write bindings to C)

The little things that seal the deal:

- Enums (tagged unions without the danger or boilerplate)

- Zero-cost closures

- Incremental compilation

- "If it compiles, it probably works"

- Standardized documentation via `rustdoc`

- Module system


hehe! It's ok to be "that person" :)

Yes, I've tried Rust and have even shipped some projects with it! I think the thing that didn't work for me was the complexity. It felt like I had to keep a lot of things in mind to be effective (traits can be a bit obscure, i.e. magic IMHO); also lifetimes and Option made the code complex by having either a bunch of match or .unwrap all over the place.

With that said tho, Rust would be one of my top picks for a professional setting or a codebase that I share with a team, because of the really good defaults that it has! I read once somebody comparing the Rust compiler with a bunch of tiny unit tests that the developer doesn't have to write, and I agree with that!

With C tho, for personal stuff I can do things that I'd never do professionally (i.e. use the OS as my GC, because when the process dies the OS is gonna "free" my memory allocations anyway! I know, terrible, but I'm having fun! ¯\_(ツ)_/¯)


I came from HLL like C# and I've come to love C for the same reasons. The only language I can fit in my head in its entirety and no one telling me this is the "right way" to write code, don't use this or that feature, etc. When I'm writing C, I am free. I absolutely love this freedom!

Exactly!

Yes, I do think that you cannot fairly dismiss C because of strcpy_l only being available on Windows when it's quite possible to implement it yourself or use a library like the one you mention.

I encourage people to use sds if that's the best option.

However, I don't think that's the best option if you can roll your own.

I personally think that there should be two different types of strings: static and dynamic ones. The static ones should not be able to be changed, but the dynamic ones can serve as a "string builder" type of sorts.

Second, I don't see sds's first advantage (in the README) as an advantage. Sure, you may have to explicitly pass the buffer in to C functions, but that tells you that you're calling a function that takes a char array rather than your string. It makes it more explicit.

Third, if you use my method of splitting static strings from dynamic strings, then sds's second advantage doesn't apply because the pointer will never change.

But the disadvantages of sds still apply, and both disadvantages are big since they easily lead to bugs. Hence, I think sds is not the best option if you can make your own.

Oh, another advantage of the static/dynamic split: I can implement small string semantics. For small enough strings, I use a union to put the array into the same bytes as the pointer, so on 64-bit machines, I can have 8-byte strings (including nul) before needing an actual allocation.
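
A minimal sketch of that union trick (names are hypothetical; assumes 64-bit pointers, so seven chars plus the nul fit inline):

    #include <stddef.h>

    struct my_str {
        size_t len;          // byte length, excluding the nul
        unsigned char small; // 1 if the bytes live inline, 0 if heap-allocated
        union {
            char buf[8];     // inline storage: up to 7 chars + nul, no allocation
            char *ptr;       // heap pointer for longer strings
        } u;
    };

    // Accessor that hides which representation is in use.
    static const char *my_str_data(const struct my_str *s)
    {
        return s->small ? s->u.buf : s->u.ptr;
    }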


> With the woes of string.h being known, why not just use an alternative like https://github.com/antirez/sds ?

That library really doesn't address any of the Unicode issues.


Was going to say the same thing.

If you want Unicode in C a wrapper library is pretty much a given.

When I was adding Unicode support to the small scheme interpreter I like playing with I found a super simple string library, a bunch of generated code (who doesn’t love 1000 line switch statements) for dealing with utf-8 code points and bob’s your uncle. Could have probably found a library that did it all but the goal was learning and yak shaving.

Haven’t ever messed with the wide strings, seems like more of a hassle than they’re worth.


> I’ve also been having a blast with C because writing C feels like being a god!

That's funny, because I look at it in the opposite way: it makes me feel like a super-fallible human because it's so easy for me to break things in horrible ways, something a hypothetical god would not do.

Something like Rust &str or String would make me feel more like a god, as I can do whatever I want (more or less) without worrying about safety.


A hypothetical god would not destroy the world, but absolutely could.

I think that's the idea - they're not reveling in the fact that they can do anything anybody could reasonably want to do, they're reveling in the fact that they can also do everything else, too.


Yes, this is something to get used to. The BSDs created strlcpy(3) and wcslcpy(3)

https://man.openbsd.org/strlcpy.3

https://man.openbsd.org/wcslcpy.3

which to me help with some of these issues. Too bad other operating systems do not have them. On Linux there is libbsd to get them, but I would like to see them added to the C standard library.

Instead the c23 standard is messing with realloc(3) which could break some old programs. I have not looked at that in detail yet, so maybe it is a non-issue :)


I don't know of a compiler that forces you to use the newest version of the standard, which is why I've always kind of thought "don't break old code" was treated too much like dogma. So from that perspective, a non issue.

However, there is a problem that has nothing to do with old code: they increased the number of situations that constitute undefined behavior, with no public discussion and no justification. It's frankly dangerous behavior.


Ushering out strlcpy() https://lwn.net/Articles/905777/

These functions are in the current POSIX draft. Though not yet published, it's quite unlikely they will be removed (someone actually filed an issue against POSIX to try to get strlcpy removed, basically on the grounds of "it's not perfect so it should be removed", and the issue got rejected on the basis that there's no consensus for removal, and it seems unlikely this will change), and as a result the functions are getting added to glibc: https://sourceware.org/pipermail/libc-alpha/2023-April/14696...

strlcpy is nice due to the guaranteed NUL termination.

strlcpy is not so nice due to the strange (IMO) return value of the number of characters in the source string. Which could be the number of characters copied or much, much larger than the number of characters copied. snprintf does the same thing.

So using strlcpy is safe (by C's low bar) but using the return value may be highly unsafe.
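
For what it's worth, the intended use of that return value is the truncation check; a sketch (strlcpy is in string.h on the BSDs, or via libbsd on Linux):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *src = "a string that is longer than the buffer";
        char dst[16]; // a real array, so sizeof yields the capacity
        if (strlcpy(dst, src, sizeof dst) >= sizeof dst)
            fprintf(stderr, "truncated to: %s\n", dst); // dst is still nul-terminated
        return 0;
    }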


The thing that annoys me the most about strlcpy is that it is supposed to be safer, but what happens in the case where the source string is not properly NULL terminated? You might think that it will stop at the character limit you specified, but that's not what it does. It just blows on past the end of the buffer looking for a \0 until it either finds one or causes a segmentation violation.

IMHO I would like it much more if the return values were:

  0: string copied
  1: string partially copied but truncated
  -1: Error, errno set.  This can occur when src or dst are NULL.

Linux's strscpy addresses these issues.

Which is great if you're a kernel developer, a bit of a moot point for application developers.

> what happens in the case where the source string is not properly NULL terminated

That’s not a string.


None of strlcpy, strncpy, and strcpy will know that you have not provided a string. They will assume the source pointer is a string and as such, will read (and write, in the case of strcpy) bytes until they find that NUL.

This is the upside of strlcpy. Whatever is in your output buffer is guaranteed to be a NUL terminated and have your desired length. strncpy does not make that guarantee. strcpy will give you something with a NUL terminator but it could be well past the end of the output buffer. Hello, CVE.

The more I write here, the more I realize how silly it would be to write anything dealing with human-readable text in C in 2023. I had been working on a C webserver a while back but I think I'm going to purge that from my local git server and start over with something else.


Passing something that is not a string to strcpy or strlcpy is undefined behavior. They operate on strings, which are null-terminated by definition. On the flip side, strncpy operates on character arrays, which do not have to be null-terminated. (This is also why the output buffer is not always null-terminated: it's not meant to represent a string, despite the highly confusing str- prefix.)

Look at the declarations:

  char *strcpy(char *dest, const char *src);
  char *strncpy(char *dest, const char *src, size_t n);
  size_t strlcpy(char *dest, const char *src, size_t size);
From that, how do you know that strncpy expects and produces a character array, strcpy expects and produces a string, and strlcpy takes either a character array or string and produces a string?

Your descriptions of the string/character copy functions are factual and accurate. But correct use depends on programmer understanding. You do not get any runtime guarantee. And IME, when you are under a deadline, 13 function calls and 3 message queues deep in some ancient codebase, while trying to get a non-trivial feature working, the distinction between a character array and a string is easily forgotten.

Anyways, my assertions are:

1) Using C for strings is a poor choice.

2) If your compiler vendor or cranky boss forces use of C, use strlcpy to handle string copying.


I don't disagree, although I would suggest using something other than strlcpy because it also may not do what you want.

Yup, there is also Linux's strscpy which doesn't require reading memory from the source string beyond the specified "count" bytes and the return value is idiot proof.

Well-written C tends to minimise string usage in general, preferring to convert to another format as soon as possible. Allocating, copying, and passing around strings in large quantities is not a good idea for efficiency, but of course some people coming from other HLLs seem to try to do it anyway, which causes many other problems.

THIS.

And programming, engineering, and life in general have so, SO many other situations where "X is not very good at doing Y". Yet (my experience) guys seem extremely resistant to the common-sense strategy of "then try to minimize how much Y you do with X".


to the point I often wonder if strings should exist.. buffers -> symbols | structs.

> preferring to convert to another format as soon as possible.

Like what?


A structure of binary fields.

I've been wondering lately why many people write c in c++ rather than just c. I think this might be the reason.

People write C in C++ because they don't actually know C++ and think it's "basically C with classes and strings".

There are legitimate reasons why someone would rather write C, but "I don't understand RAII" is not one of them.


C with basic templates is normally what I want. Occasionally other C++-isms drift in, but normally to the harm of the code quality.

C++ doesn’t require you to commit to all of its features and/or paradigms. Using it as you see fit is valid. Just don’t advertise yourself as a C++ programmer to the job market, as it’s not what most people expect.

There’s nothing wrong with “C with classes and strings” idea by itself, if that is your choice or a consciously sufficient level of competence.


It doesn't require you to commit to all of its features, that's certainly correct. But it does require you to commit to its principles; if you're needlessly passing naked pointers around, you're really writing C code with a C++ compiler.

I don’t see how it requires you to commit to any principles, if you can avoid those you don’t need and still successfully compile. That’s called “suggests” or “allows”, not “requires”. Yes, some people are writing C code with classes and strings in C++. That’s why we call this mode “C with classes and strings”. I believe that you are attached to these principles (see their benefits), and that is fine. But not everybody likes full-on C++.

For a long time on Windows the official MS recommendation was to use the C++ compiler for C projects.

Which was exacerbated by an obsolete C compiler that only supported C89.


the usual accusation is that many people write c++ code as if it were c. also, the code in the 2nd ed of K&R was all compiled with stroustrup's c++ compiler, as there wasn't a c compiler that could handle it.

I stopped when I read strcmp returns 0 if two strings are equal and 1 if they aren't.

A much better description of strcmp's behaviour: https://en.cppreference.com/w/c/string/byte/strcmp

It's actually 0 if equal, positive if greater, negative if less than.

  > The strcmp() and strncmp() functions return an integer greater than, equal to, or less than 0, according to whether the string s1 is greater than, equal to, or less than the string s2.  The comparison is done using unsigned characters, so that ‘\200’ is greater than ‘\0’.

cmp stands for compare, so the behaviour (returns <0, 0, or >0) is completely reasonable. With three possible outcomes, the function is suitable to be used for sorting.
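
That three-way result is exactly the contract qsort expects from its comparator; a minimal sketch sorting an array of C strings:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // qsort passes pointers to the elements (here char *), so dereference once.
    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char **)a, *(const char **)b);
    }

    int main(void)
    {
        const char *words[] = { "pear", "apple", "orange" };
        qsort(words, 3, sizeof words[0], cmp_str);
        for (int i = 0; i < 3; i++)
            printf("%s\n", words[i]); // apple, orange, pear
        return 0;
    }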

It's sad how often you find the truth at the very bottom of a HN thread these days

Okay, I agree that by default, C strings are bad.

But it doesn't have to stay that way. Someone else in the comments mentioned antirez's sds library for dynamic strings. This works, but you could also easily roll your own. All you need is an init function, and perhaps an assert or other check at the end of it that the string has a nul terminator.

At that point, type checking will let you blindly pass those strings (or their char arrays) to any of those C functions without worry.

Edit: I'll also add that I think a string library should have a difference between static strings and string builders (dynamic strings). It makes everything easier.


> by default, C strings are bad.

C strings aren't bad. They can't be, because they don't exist. C doesn't have strings. And that is the issue.

As you say, things get a lot better when you actually introduce strings as a concrete concept rather than a set of loose conventions.


There is no "loose convention". A C string is a null terminated string of non-null bytes. That's the definition. Working with them in memory-unconstrained environments is unnecessarily hard.

There is indeed just convention. The language defines string constants similar to what you say[1] (an array of characters, terminated by a null character), but in the language itself there's no way to declare that a function takes a string rather than a pointer to a character. Alternatively if you work with a fixed-sized character array, there's nothing separating it from "just" an array of characters that are not null terminated.

So that strcmp expects a string rather than a pointer to a character is just convention. In languages which actually has strings as a concrete concept, like say Pascal and derivatives, you can actually differentiate between those two cases.

[1]: https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#...


That means that the strings aren't properly reflected in the type system. But the existence of string literals with a very definite in-memory layout means that it's not just a convention even so.

But you can't use those string literals in any way without relying on convention.

That was my point. Although you can, actually - since literals themselves are array-typed, you can sizeof them to get the character count without relying on null termination. It's even possible to get a non-null-terminated literal if the target array type is not large enough to fit null, e.g.:

   char s[3] = "foo"; // not null-terminated!

> character count

Byte count.


If you want to be pedantic, "an object declared as type char is large enough to store any member of the basic execution character set". It doesn't actually have to be a byte.

In practice, in C context, character == char == byte. Other concepts have to use different names to avoid confusion with the language spec.


In C a 'char' is a 'byte', if we're going to be extra pedantic.

> C doesn't have strings.

C has string literals though, and those bake a specific string representation into the language (of course libraries can use their own string representation, but those then need at least some conversion function from string literals).


String literals in C are simply a pointer to an array of bytes (generally speaking), and that's how they should be treated. Considering them as strings in the HLL sense is the entire issue here.

I don't think it's useful to pretend C doesn't have strings when it has string literals.

WUFFS doesn't have strings. That's what a language which doesn't have strings looks like, you can't write "Hello, world" in WUFFS because it involves Strings, which WUFFS doesn't have, and I/O, which WUFFS also doesn't have.

A pretence that C doesn't have strings because it lacks a concrete string type in the language itself also seems like you'd be claiming C++ doesn't have strings, Zig doesn't have strings, and Rust came pretty close to not having strings (for a while it was mooted to make Rust's str just a slice [u8] but today Rust does bless str as a distinct type even though e.g. &str and &[u8] aren't very different)


C++ before C++11 didn't. I've fixed errors in projects which were due to string literals not being std::string. After C++11 things are more murky due to user literals[1]. I'd lean towards saying the language still doesn't, but yeah, murky.

Zig and Rust I don't know enough about.

And I'm not pretending. C has string literals which are of a non-distinct type. You can't distinguish between a string literal and an array of characters. This is the crucial bit.

The result is that the standard library, and lots of other code, relies on convention alone to pass strings around. This has been and continues to be the source for countless serious bugs. The kind of bugs which are a total non-issue in languages which has strings.

[1]: https://en.cppreference.com/w/cpp/language/user_literal


There's also RapidString, though the original author seems to have disappeared off of the internet. Would be curious to see benchmarks against sds.

The best way to write C is to treat strings as memory locations with characters and nothing more. Every such memory location has an allocated size, either statically at compile-time (with literals and arrays), or dynamically with malloc and friends. Treat string operations as mere memory operations and don't imagine them to be something else. The str* functions from the standard library are just convenience helpers which are not adding "string" functionality whatsoever.
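
For instance, under that mindset a string copy is just a memory operation with an explicit size (a trivial sketch):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char src[] = "Hello";   // sizeof src == 6: five chars plus the nul
        char dst[sizeof src];
        memcpy(dst, src, sizeof src); // the "string copy" is a plain memory copy
        printf("%s\n", dst);
        return 0;
    }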

Maybe you could summarize this by saying that C strings are "strings of bytes", not "strings of characters".

It's better to say that, in C, characters are bytes. That's why coming at this from the modern perspective of "characters are the things on my screen" is always going to confuse you - C doesn't have bytes, only characters. C programmers (should) understand this the same way Lisp programmers understand that "CAR" and "CDR" refer to "first" and "rest" or Forth programmers understand that "the stack" is the data stack and has no relation to the call stack.

So a "char" is a byte, but a "char" is perhaps not a "character", except as a funny coincidence of jargon.

C strings are bad for sure. Consider them raw assembly. Instead of using them directly, get a decent string library ASAP and use it exclusively.

They are just arrays, like everything else in the language. If you don't want to manage plain arrays, better look for a different language.

You are free to make good use of strings as arrays. I've written tons of code including firmware for MCU so I think I'll keep to my own practices.

they are not "just" arrays, they are arrays with a null-terminated expectation. Most of the time. The lack of consistency and difficulty communicating the expectations is what is hair-pulling in C, and that's on top of the difficulty of communicating the difference between an array and a pointer in C.

I don't use null terminated strings. ptr+len struct everywhere. And when I need to call an API, like fopen, I make a temporary copy of that string + the null termination, do my work and then free it.

You can printf non-null terminated strings too. Check printf("%.*s", length, strptr).
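
A sketch of that make-a-temporary-copy pattern for nul-terminated APIs like fopen (the slice type and names are hypothetical):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct str { const char *ptr; size_t len; }; // non-owning, not nul-terminated

    FILE *open_slice(struct str path, const char *mode)
    {
        char *tmp = malloc(path.len + 1); // temporary nul-terminated copy
        if (!tmp) return NULL;
        memcpy(tmp, path.ptr, path.len);
        tmp[path.len] = '\0';
        FILE *f = fopen(tmp, mode);
        free(tmp);
        return f;
    }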


> You can printf non-null terminated strings too. Check printf("%.*s", length, strptr).

I haven't checked yet, but I'm about 90% confident that's UB. Is printf() guaranteed not to read to the end of the string when you give it a length?


Yes, it is. From the C11 spec:

> Characters from the array are written up to (but not including) the terminating null character. If the precision is specified, no more than that many bytes are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character.


Thanks! I guess I didn't realize expressio unius est exclusio alterius applies in the C standard :D

Won't read a single byte beyond the length you give it.

And this is standard practice in libraries for printing strings of predetermined length.


A long time ago my solution was ptr+len but I allocated 1 more byte so that if a string had to be given to libc, I could terminate it at that time. No need for a copy then.

BASIC strings in Windows -- you store the length in the 4 bytes before the pointer to the string and put a null terminator at the end.

https://learn.microsoft.com/en-us/previous-versions/windows/...
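
For the curious, the OLE automation helpers expose that layout without manual pointer arithmetic (a sketch; BSTRs hold wide characters, and this links against oleaut32):

    #include <windows.h>
    #include <oleauto.h>
    #include <stdio.h>

    int main(void)
    {
        BSTR s = SysAllocString(L"Hello");         // length prefix + chars + terminator
        printf("%u\n", (unsigned)SysStringLen(s)); // reads the length prefix: 5
        SysFreeString(s);
        return 0;
    }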


If you are using C and do some non-trivial work with strings you should either use a good library to handle strings or build your own.

It is not that difficult in practice.

The old C std lib is, in my opinion, outdated, obsolete and a very bad fit for complex string handling, especially on the memory management side.

In my own framework, the string management module is using a dedicated memory allocator and a "high level" string API with full UTF8 support from the start.

As a general rule, I think that the C std lib is the weakest part of the C language and it should only be used as a fallback.


Fair, though there's always the crossover point where you need to interact with the OS, 3rd party libraries, protocols, etc. It's not difficult to miss a spot where your utf8 string gets mangled, truncated, etc.

Yes, do not trust the OS, use the minimal API surface, and build your own toolkit, you won't have to do it often.

Is your framework open sourced?

Not 100% decided yet, but it is very probable that I will open source it.

> especially on the memory management side.

Libc string functions don't manage memory. They can be used no matter where your strings are stored. It is more of a choice between generality vs convenience in common cases.


A lot of them require the string to be null terminated rather than taking a length.

I think this is the main culprit of the libc string functions: you have to provide buffers to store results, and the responsibility of managing those individually can be annoying and bug prone, resulting in vulnerabilities.

Passing an allocator (like Zig) or a container (like in my framework) to anything that needs to allocate some memory to store a result is both explicit, low overhead and quite convenient in practice.


Note that they're annoying, bug prone, and vulnerable in 3rd-party memory managed libraries, too, but you can just say it's someone else's problem then.

You cannot e.g. store a string as a slice of another string (unless the slice reaches the end).

> the C std lib is the weakest part of the C language and it should only be used as a fallback.

I've been musing for a while now: what would it look like if we were to discard the C library and design a new one, leaving the language itself intact?


I think it could be very nice.

C is not perfect; there are some parts of the syntax that I strongly dislike, like casts or function pointer declarations...

But it is overall a good enough syntax, much simpler than C++.


Amending the syntax is fun but rapidly becomes a slippery slope; soon enough you find yourself designing a new successor language, as has been done many times before. Simply scrapping the mostly-unhelpful C stdlib and inventing new, modern abstractions for allocation, IO, text, threading, etc seems like a more tractable problem.

I fully agree.

It has the same fundamental problem, though: you have to rewrite most existing code, which hinders adoption. In this case, it might actually hinder it more than also improving the language itself, since people would be more willing to take that leap if there are more benefits to be had from it.

Stdlib is probably the most successful library in history; not sure how it is “unhelpful”.

There are several libraries or projects where people have done exactly that.

You often end up with some kind of structure, or variations of structures, for strings:

    struct string {
      size_t length;
      char data[];     // flexible array member: the bytes live inline after the header
    };

    struct string {
      size_t length;   // bytes in use
      size_t alloc;    // bytes allocated
      char *data;      // separately allocated buffer
    };
Those are just examples. The tricky part is figuring out the different ownership use cases you want to solve. Because C gives you so much freedom and very little in the standard library, you end up with a lot of variations. You might use reference-counted strings, owned buffers, or string slices, etc. You might want certain types to be distinguished at compile-time and other types to be distinguished at run-time.

An example can be found in the Git source code.

https://github.com/git/git/blob/master/strbuf.h

The history of changes to this file is interesting as well. This is a relatively nice general-purpose string type—you can easily append to it or truncate it.


IMHO that does not solve the main problem, that is individual lifetime management.

I've seen many libs using this style of strings, not convinced by the practicality.


What is "individual lifetime management"?

It sounds like you’re rephrasing part of my comment back to me, or maybe I’m misinterpreting what you’re saying.

If you’re not convinced of the practicality, it sounds like you are simply not convinced of the practicality of doing string processing in C at all, which is a fair view point. String processing in C is somewhat a minefield. Libraries like Git’s strbuf are very effective relative to other solutions in C, but lack safety relative to other languages.


No, I simply am using a different approach, still in C, where strings are simple char*, null-terminated, nothing hidden with magic fields above the base address of the string.

The trick is to pass an allocator (or container) to string handling functions.

If/when I want to get rid of all the garbage I reset the container/allocator.


Yeah, you should have just said that in the first place.

I’ve seen similar approaches, e.g. with APR pools, and if your application can work within those restrictions, it’s very convenient.


You can backport Rust standard library to C using https://github.com/eqrion/cbindgen .

The old MacOS (pre-X) did just that. Strings were all "Pascal strings", ie. with the first byte containing the length of the actual string.

Building blocks for memory were also very different from stdlib, notably the use of Handles, which were pointers of pointers, so that the OS could move a block of data around to defragment the heap behind your back without breaking the memory addressing.


Pascal strings are also kind of bad though. All sub-string operations need allocation, or have to be defined with intermediate results which aren't "really" strings, so in that sense it's not an improvement on Zero-terminated strings. Equality tests are cheaper which is nice, since strings of different lengths compare unequal immediately, but most things aren't really improved.

C++ string_view is closer to the Right Thing™ - a slice, but C++ doesn't (yet) define anywhere what the encoding is, so... that's not what it could be. Rust's str is a slice and it's defined as UTF-8 encoded.


D's strings were defined to be UTF-8 back in 2000. wstring is UTF-16, and dstring is UTF-32.

Back then it wasn't clear which encoding method would turn out to be dominant, so we did all three. (Java was built on UTF-16.)

As it eventually became clear, UTF-8 is da winnah, and the other formats are sideshows. Windows, which uses UTF-16, is handled by converting UTF-8 to -16 just before calling a Windows function, and converting anything coming back to UTF-8.

D doesn't distinguish between a string and a string view.


What’s the ownership story for string views?

They don't own anything. It's just a pointer and length. They don't allocate/deallocate.

I mean clearly something needs to own the buffer for a new string.

Sure, but that's not the string_view's problem, you can't just make string_views, the string you want to borrow a view into needs to exist first.

Imagine you go to a library and insist on borrowing "My Cousin Rachel", but they don't have it. "Oh I don't care whether you have the book, I just want to borrow it" is clearly nonsense. If they don't have it, you can't borrow it.


Walter is talking about D, and he said this:

> D doesn't distinguish between a string and a string view.

In C++ std::string owns the buffer and std::string_view borrows it. If there is no difference between the two in D, then how is this difference bridged?


You can use automatic memory management and not worry about it. Or you can use D's prototype ownership/borrowing system. Or you can encapsulate them in something that manages the memory. Or you can do ownership/borrowing by convention (it's not hard to do).

Automatic memory management makes copies?

No. Another word for automatic memory management is garbage collection.

I guess I should rephrase. Let's say I have a string, which owns its buffer. What happens in D if I take a substring of it? Does a copy of that section occur to form a new string?

A lot of people don't know about this but Microsoft is taking steps to move everything over to utf-8.

They added a setting in Windows 10 to switch the code page over to utf-8 and then in Windows 11 they made it on by default. Individual applications can turn it on for themselves so they don't need to rely on the system setting being checked.

With that you can, in theory, just use the -A variants of the winapi with utf-8 strings. I haven't tried it out yet as we still support prior Windows releases but it's nice that Microsoft has found a way out from the utf-16 mess.


The A-variants had problems years ago, which is why D abandoned them in favor of the W versions.

I don't mind seeing UTF-16 fade away. We've been considering scaling back the D support for UTF-16/32 in the runtime library, in favor of just using converters as necessary. We recommend using UTF-8 as much as practical.


> with the first byte containing the length of the actual string

And the wheels fall off with the first string longer than 255 characters.


Which is why Free Pascal strings are so awesome. I've personally stuffed a billion bytes into one without issues. They are automatically reference counted, and as close to magic as you can get. You can return one from a function without issue.

However, Free Pascal has the worst documentation of any major project I've ever encountered (The exact opposite of Turbo Pascal), so I can't link to a good reference. Their Wiki is a black hole of nuance and sucks all useful stuff off the internet.


You can fix this issue by using a variable width integer encoding for the size.

It might fix that particular issue, but you still have the same problem that NUL terminated strings have: it's not possible to cheaply create views/slices of a string using the same type.

I remember that era well! During the first few years I used C, I never touched its standard library at all, using the Mac Toolbox instead. This was a common practice, which later carried over into C++.

The creator of this library (antirez) is a regular here on hn.

I believe this is used by Redis.

https://github.com/antirez/sds


Glib answer (but also relevant because mentioned in the article, too): it would look a lot like how a lot of people write C++.

>Glib answer

A Freudian slip, methinks.


How so? First definition I find of glib is "(of words or the person speaking them) fluent and voluble but insincere and shallow", which is mostly what I meant about that answer. There was some sincerity in my answer, but certainly somewhere in the border space of irony and sarcasm, which many people do take as insincerity.


The problem is that it's just too tempting to write something like

    my_function(my_var, 3.6, "bzarflo", my_other_var, false);
The string handling functions are part of the story, but the null-terminated char * is produced when the compiler reaches a string literal, and writing code without being allowed to just use string literals when it's convenient tends to feel like coding with oven mitts on.

It's entirely possible to write a wrapper function with a short name to convert string literals to actual string objects.

    my_function(my_var, 3.6, $("bzarflo"), my_other_var, false);
It isn't that much more of a mouthful, and as long as 'my_function' knows to free it, you're A-OK! The only trouble is '$()' isn't legal in standard C, so a real solution would have to be something like 'str()'.
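
A sketch of what such a helper could look like (the string type and the convention that the callee frees are hypothetical):

    #include <stdlib.h>
    #include <string.h>

    struct string { size_t len; char *data; }; // hypothetical string object

    // Wrap a literal (or any C string) in a heap-allocated string object.
    struct string *str(const char *lit)
    {
        struct string *s = malloc(sizeof *s);
        if (!s) return NULL;
        s->len = strlen(lit);
        s->data = malloc(s->len + 1);
        if (!s->data) { free(s); return NULL; }
        memcpy(s->data, lit, s->len + 1);
        return s;
    }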

This caught my eye the other day & looks quite promising, though I haven't spent much time looking at it so I can't comment on its memory safety:

https://github.com/tylov/STC


I'm in the WG14 and my opinion is that there isn't one good way to do strings; it all depends on what you value (performance / memory use) and the usage pattern. C in general only deals with data types, not their semantic meaning. (IOW, we say what a float is, not what it is used for.) The two main deviations from that are text and time, and both of them are causing us a lot of issues. My opinion is that writing your own text code is the best solution and the most "C" solution. The one proposal I have heard that I like is for C to get versions of functions that use strings that take an array and a length, so as to not force the convention of null termination in order to use things like fopen.

Has WG14 considered adding slices to C? [1] Introducing slices would naturally give way to a better string library.

[1] https://www.digitalmars.com/articles/C-biggest-mistake.html


It's been 50 years, so pretty much everything has been considered. In my opinion the mistake was not having arrays decay into pointers, but rather arrays should have been pointers in the first place. An array should be seen as a number of values with a pointer pointing at the first one. I think adding a third version of the same functionality would just complicate things further. (&p[42] is a "slice" of an array.) Another thing I do not like about slices that store lengths is that they hide memory layout from the user, and that is not a very C thing to do.

If you think about arrays as pointers, you will get a lot of things wrong, e.g.

    float m[10][10];

is not an array of pointers, but a two-dimensional array with a contiguous 2D memory layout.


You are right, sizeof is the other big difference. I think these differences are small enough that it was a mistake to separate the two. The similarities and differences do make them confusing.

How would you express a 2D memory layout with only pointers?

An array of pointers to arrays? Basically, a `T**`. C#'s "jagged" arrays are like this, and to get a "true" 2D array, you use different syntax (a comma in the indexer):

    int[][] jagged; // an array of `int[]` (i.e. each element is a reference to an `int[]`)
    int[,] multidimensional; // a "true" 2D array laid out in memory sequentially

    // allocate the jagged array; each `int[]` will be null until allocated separately
    jagged = new int[10][];
    Debug.Assert(jagged.All(elem => elem == null));
    for (int i = 0; i < 10; i++)
        jagged[i] = new int[10]; // allocate the inner arrays
    Debug.Assert(jagged[0][0] == 0);

    // allocate the multidimensional array; each `int` will be `default`, which is 0
    // element [i,j] will be at offset `10*i + j`
    multidimensional = new int[10, 10];
    Debug.Assert(multidimensional[0, 0] == 0);

Yes, this is what people with pre-C99 compilers that do not support variably modified types sometimes do. It is horrible (although there are some use cases).

I plan to bring such a proposal forward for the next version. Note that C already has everything to do this without much overhead, e.g. in C23 you can write:

  int N = 10;
  char buf[N] = { };
  auto x = &buf;
and 'x' has a slice type that automatically remembers the size. This works today with GCC / clang (with extensions or C2X language mode: https://godbolt.org/z/cMbM57r46 ).

We simply can not name it without referring to N and we can also not use it in structs (ouch).


You know what i think about auto :-)

How is this not a quality of implementation issue? Any implementation is free to track all sizes as much as they want with the current standard.

Either an implementation is forced to issue an error at run time if there is an out-of-bounds read/write, in which case it's a very different language than C, or the as-if rule lets any implementation ignore the feature.


Tracking sizes for purposes of bounds checking is QoI and I think this is perfectly fine. But here we can also recover the size with sizeof, so it is also required for compliance:

https://godbolt.org/z/qh7P93Tcd

And I agree that this is a misuse of auto. I only used it here to show that the type we miss already exists inside the C compiler; we can currently name it only by constructing it again:

char (*buf)[N] = ...

but we could simply allow

char (*buf)[:] =

and be done (as suggested by Dennis Ritchie: https://www.bell-labs.com/usr/dmr/www/vararray.pdf)


GLib is an alternative to the standard library.

> The old C std lib is, in my opinion, outdated, obsolete

...and has been since most of us ever used C.

I think one of the major failings of C was the lack of a good standard library that updated with the times.

Actually, I believe a rich standard toolbox was one of the best features of python, and helped with its success.


Which is why most applications reach back into POSIX when available, not that it fixes the security issues with the standard library.

Which parts of posix? What can I #include in a posix environment to get better string handling in C?

(Or maybe I’m misinterpreting your comment?)


You are misinterpreting my comment.

First part of my comment relates to C library in general, second part of my comment refers to strings and arrays, even if not explicitly.


"If you are using C and do some non-trivial work with strings you should either use a good library to handle strings or build your own."

   unsigned int str_len(const char *s)
   {
     register const char *t;
   
     t = s;
     for (;;) {
       if (!*t) return t - s; ++t;
       if (!*t) return t - s; ++t;
       if (!*t) return t - s; ++t;
       if (!*t) return t - s; ++t;
     }
   }
I still use this instead of stdlib strlen. Of course I also use software everyday that I know uses stdlib strlen. For most C programs dealing with strings I just use flex and yyleng, which in turn uses the stdlib strlen. Using flex for small jobs is overkill but it's quick and convenient. I am a hobbyist programmer; I write so-called "trivial" programs.

That said, this exact function is used in some "non-trivial" software written by someone else and that person is IMHO a better C programmer than any HN commenter I have seen, most of whom do not let the public see the code they write anyway. Go figure.

NB. I am not the author; this is in the public domain. The author is djb.


I think this naming style should be considered obsolete.

- This function will return the number of bytes, not of characters or codepoints.

- str and len are both abbreviations; we should use full words when possible.

- We can also be more explicit about what the function does: it does not simply return the string length, it counts characters (or bytes in this case).

Here is how I would name it:

    u32 CountBytesInString(char* string);
    u32 CountCharactersInString(char* string);

And on the implementation side, this work can be done with SIMD instructions, and be really freaking fast, but still, it should be explicit for the user that the work is O(n) complexity, not exactly free.


"- str and len are both abbreviations, we should use full words when possible -"

u32 and char are abbreviations.


Yes, this is true. And this is a tradeoff, I think that basic types are so widely used that we can use this style of abbreviation without much ambiguity.

strlen is also pretty unambiguous, but I still have to check what strstr means.


C isn't Java. Even Niklaus Wirth in Pascal, Oberon, and the like avoided naming their identifiers too long. 'GetStrSz()' is enough to achieve (most of) what you want, assuming certain naming conventions:

- Makes it clear that this returns the number of bytes, assuming a naming convention where `sz` refers to size (in bytes) and `ln` refers to length (in some other unit which would be specified in the type). Note that in C, 'characters' refers to bytes. It's a flaw in how C names its types, yes, but I wouldn't say it should be any different just because other languages do things differently.

- It doesn't use full words because I don't think it needs to. Abbreviations are OK as long as every (invested) party agrees that they're sane, and I think they're pretty sane.

- It makes it explicit that it is performing a calculation (hence, is O(n)) via 'get'.

I don't think all this is necessary, though - I actually think 'strln()' is enough. First, because characters means bytes, I can assume that this function is getting the number of characters (bytes) in a string. I wouldn't expect it to give me anything else! Second, in C, if strings were a struct of some sort, I'd expect to be able to get their length via 'str->ln', which would be O(1). The fact that the length is found through a function in the first place signals to me that it's doing something behind the scenes to figure that out. Remember - that's just my opinion, which I admit is extreme - but I think yours is just as extreme.


Naming is extremely important, and while strlen is a very basic and hardly ambiguous example, consistency is key and I believe that good naming rules should be applied globally or at least at the framework level.

I think that full words and verbs are easier to read and avoid ambiguity.

I guess this is a matter of style and preference.

This anecdote reminds me of the Mutazt type, something I found in a new codebase I was asked to debug. I had to dig for almost an hour to find exactly what this type was.

Turns out it was a char*, a C string. Buried under 4-5 levels of abstractions.

Mutazt = Mutable ASCII Zero Terminated.


Out of curiosity, why do you use this? I expect the builtin strlen() to be even more optimized than this. There's a lot more you can do than a simple loop unrolling.

Especially when there's a decent chance the compiler would replace the loop with a call to strlen.

Don't do this. libc string functions are usually done with hand tweaked assembly or as a builtin by the compiler. They will be faster than this.

This sort of practice is very outdated. The last time it made sense for performance reasons was probably the early 90s or earlier.

Additionally, strlen is not one of the C string functions you want to replace due to defects, the same way you would want to do with crusty old strcpy. If you're working with C strings there is nothing wrong with strlen. (Just don't call it redundantly in a loop body ...)


I've written my own strlen equivalent and benchmarked them against default on different compilers, processors and environments, and they almost always are faster or the same speed.

Default libs are sometimes very optimized but very often they are not, unfortunately.

If you care about performance, you should not rely blindly on the defaults.

A long time ago I wondered about the performance of memcpy on the Nintendo DS, for sure they would have provided a hand optimized version? And yes, it was handcrafted ARM assembly code, but my own version turned out to be twice as fast.

They simply forgot to use a simple prefetching trick in their implem.


Nintendo DS has probably had a lot less scrutiny than a major libc or recent GCC or clang [though you can probably target its ARM processor with that]. Also, for an older embedded platform they may choose to do optimization for code size rather than cycles or clock time.

I'm going to have to doubt the start of your comment. Having seen a lot of libc implementations I think you are better off not wasting time optimizing strlen. Also memcpy, probably memcpy moreso. Most memcpy()s I've seen in the current century are using SIMD instructions and the like. And compilers don't even bother emitting a call to libc for it anymore, they do it as a builtin.


On the contrary, I expected the Nintendo DS SDK to be well optimized, performance of memcpy can be critical on such a constrained hardware. And it was optimized, just not with the best tricks.

I got the prefetching trick from Intel source code, except that I replaced the PLD instruction by a simple dummy load.

And about strlen, you'd be surprised; some implems are very good, and some are not, depending on the compiler and the library. I've run benchmarks, I was surprised too.

To be honest, I don't really need super fast strlen, but I was curious and also learning to write fast SIMD code, basic string handling is a nice exercise.


I think this expectation doesn't vibe with my understanding of how people used to think about embedded or consoles. You shipped them and they were done. The games industry was also often trying to ship quickly. Small teams too. Latest tweaks to memcpy or fine tuning or revisiting the finer points of an already adequate SDK is low priority.

By contrast, many more people are updating optimizations to GCC or clang for arm, more frequently and over a longer timeframe.


You're probably right about consoles, and I was surprised to be wrong, but I checked, just to be sure.

GCC and Clang are very nice compilers, but they are a different thing than the std lib. glibc, musl, the Windows C Runtime, iOS, Android, all have different implementations, sometimes outdated.


Gcc and clang are relevant here because memcpy is a builtin. It will not call libc for this.

Same is true of MSVC.

It's been that way on most modern compilers for about 20 years.

That's why I'm saying rolling your own may be futile. Compilers, not just libc, have paid a lot of attention to getting those things fast.


What is true for memcpy (especially on small buffers) is not systematically true for string functions.

But testing always comes to the rescue, and these days we have Godbolt.


Any data to back up that FUD?

Of course I have data; do you think I am pulling benchmarks out of a hat?

But I have not published those benchmarks, if that is what you're asking. The Nintendo thing I am afraid I cannot easily reproduce, as I no longer have that devkit on hand.

As for the strlen benchmark, that is something I did a few years ago and would be easy to run again, but I am not sure it is worth the effort just to convince a random dude on the internet...


> do you think I am pulling benchmarks out of a hat?

Yes. Share the data or you're spreading FUD.


You're being silly. Nothing they said was FUD, it was just a (not particularly controversial) anecdote

Here is the musl stdlib strlen. I do not use glibc.

https://git.musl-libc.org/cgit/musl/plain/src/string/strlen....

    #include <string.h>
    #include <stdint.h>
    #include <limits.h>

    #define ALIGN (sizeof(size_t))
    #define ONES ((size_t)-1/UCHAR_MAX)
    #define HIGHS (ONES * (UCHAR_MAX/2+1))
    #define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

    size_t strlen(const char *s)
    {
        const char *a = s;
    #ifdef __GNUC__
        typedef size_t __attribute__((__may_alias__)) word;
        const word *w;
        for (; (uintptr_t)s % ALIGN; s++) if (!*s) return s-a;
        for (w = (const void *)s; !HASZERO(*w); w++);
        s = (const void *)w;
    #endif
        for (; *s; s++);
        return s-a;
    }


glibc strlen:

https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=strin...

   #include <libc-pointer-arith.h>
   #include <string-fzb.h>
   #include <string-fzc.h>
   #include <string-fzi.h>
   #include <string-shift.h>
   #include <string.h>
   
   #ifdef STRLEN
   # define __strlen STRLEN
   #endif
   
   /* Return the length of the null-terminated string STR.  Scan for
      the null terminator quickly by testing four bytes at a time.  */
   size_t
   __strlen (const char *str)
   {
     /* Align pointer to sizeof op_t.  */
     const uintptr_t s_int = (uintptr_t) str;
     const op_t *word_ptr = (const op_t*) PTR_ALIGN_DOWN (str, sizeof (op_t));
   
     op_t word = *word_ptr;
     find_t mask = shift_find (find_zero_all (word), s_int);
     if (mask != 0)
       return index_first (mask);
   
     do
       word = *++word_ptr;
     while (! has_zero (word));
   
     return ((const char *) word_ptr) + index_first_zero (word) - str;
   }
   #ifndef STRLEN
   weak_alias (__strlen, strlen)
   libc_hidden_builtin_def (strlen)
   #endif
   
NetBSD common strlen:

https://ftp.netbsd.org/pub/NetBSD/NetBSD-current/src/common/...

    size_t
    strlen(const char *str)
    {
        const char *s;

        for (s = str; *s; ++s)
            continue;
        return (s - str);
    }
Apple strlen:

https://opensource.apple.com/source/Libc/Libc-1244.50.9/stri...

Apple strlen comes from FreeBSD. Until 2009, FreeBSD used an unoptimised strlen.

https://svnweb.FreeBSD.org/base/head/lib/libc/string/strlen....

   size_t
   strlen(str)
           const char *str;
   {
           const char *s;
   
           for (s = str; *s; ++s);
           return(s - str);
   }

FreeBSD eventually copied^1 NetBSD's x86_64 strlen.

1. "modeled after", "inspired by", etc.

https://svnweb.FreeBSD.org/base?view=revision&revision=18770...

   #include <sys/cdefs.h>
   __FBSDID("$FreeBSD$");
   
   #include <sys/limits.h>
   #include <sys/types.h>
   #include <string.h>
   
   /*
    * Portable strlen() for 32-bit and 64-bit systems.
    *
    * Rationale: it is generally much more efficient to do word length
    * operations and avoid branches on modern computer systems, as
    * compared to byte-length operations with a lot of branches.
    *
    * The expression:
    *
    *      ((x - 0x01....01) & ~x & 0x80....80)
    *
    * would evaluate to a non-zero value iff any of the bytes in the
    * original word is zero.  However, we can further reduce ~1/3 of
    * time if we consider that strlen() usually operate on 7-bit ASCII
    * by employing the following expression, which allows false positive
    * when high bit of 1 and use the tail case to catch these case:
    *
    *      ((x - 0x01....01) & 0x80....80)
    *
    * This is more than 5.2 times as compared to the raw implementation
    * on Intel T7300 under EM64T mode for strings longer than word length.
    */
   
   /* Magic numbers for the algorithm */
   #if LONG_BIT == 32
   static const unsigned long mask01 = 0x01010101;
   static const unsigned long mask80 = 0x80808080;
   #elif LONG_BIT == 64
   static const unsigned long mask01 = 0x0101010101010101;
   static const unsigned long mask80 = 0x8080808080808080;
   #else
   #error Unsupported word size
   #endif
   
   #define LONGPTR_MASK (sizeof(long) - 1)
   
   /*
    * Helper macro to return string length if we caught the zero
    * byte.
    */
   #define testbyte(x)                             \
           do {                                    \
                   if (p[x] == '\0')               \
                       return (p - str + x);       \
           } while (0)
   
   size_t
   strlen(const char *str)
   {
           const char *p;
           const unsigned long *lp;
   
           /* Skip the first few bytes until we have an aligned p */
           for (p = str; (uintptr_t)p & LONGPTR_MASK; p++)
               if (*p == '\0')
                   return (p - str);
   
           /* Scan the rest of the string using word sized operation */
           for (lp = (const unsigned long *)p; ; lp++)
               if ((*lp - mask01) & mask80) {
                   p = (const char *)(lp);
                   testbyte(0);
                   testbyte(1);
                   testbyte(2);
                   testbyte(3);
   #if (LONG_BIT >= 64)
                   testbyte(4);
                   testbyte(5);
                   testbyte(6);
                   testbyte(7);
   #endif
               }
   
           /* NOTREACHED */
           return 0;
   }

Something worth considering: Sometimes people may use older versions of compilers, e.g., older versions of GCC, for compiling older programs for older hardware. These GCC versions are smaller in size and written to run on less powerful hardware. For example, the gzip'd source tarball for GCC 2.95 from 2001 is 12M while the one for GCC 12.2 from 2022 is 143M, an 11x size increase.

Do you have any recommendations for good open source C standard library replacements instead of rolling your own string manipulation functions?

> use a good library

Such as?


> Our last function is strcmp. It looks at two strings and determines whether they are equal to each other or not. If they are it returns 0. If they aren’t it returns 1.

No it doesn’t.

    RETURN VALUES
         The strcmp() and strncmp() functions return an integer greater than, equal
         to, or less than 0, according as the string s1 is greater than, equal to,
         or less than the string s2.  The comparison is done using unsigned
         characters, so that ‘\200’ is greater than ‘\0’.

I've added a footnote to my incorrect explanation and credited you. I'm still a C noob so thank you for pointing this out!

Reminds me of the incorrect cast of memcmp() return value that resulted in this bug: https://bugs.mysql.com/bug.php?id=64884

I don't have macOS to prove it, but I believe `strcmp` on macOS returns either 0, 1 or -1


strcmp's return value is loosely defined because some implementations return the difference between the characters in the string to avoid some conditional checks or jumps. Something like:

  int d = (unsigned char)a[i] - (unsigned char)b[i];
  if (d != 0) return d;

It currently returns a character difference. Don’t rely on it, though!
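
A full loop in that style might look like this (an illustrative sketch, not any particular libc's code):

    #include <stddef.h>

    /* Returns the difference of the first mismatching bytes, compared as
       unsigned chars per the standard; 0 if the strings are equal. */
    int my_strcmp(const char *a, const char *b)
    {
        size_t i = 0;
        for (;;) {
            int d = (unsigned char)a[i] - (unsigned char)b[i];
            if (d != 0 || a[i] == '\0')
                return d;   /* mismatch found, or both strings ended */
            i++;
        }
    }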

Just a pedantic comment, but ありがとう is arigatou or roughly "thanks", not "hello". Hello would usually be こんにちは or, more confusingly, どうも

ありがとう

Sort of unfortunate, because there's really no good translation for "hello" into Japanese - you'd say おはよう in the morning, in the afternoon こんにちは and もしもし when answering the phone...

you'd say ohayogozaimasu in the morning, konnichiwa in the afternoon and konbanwa in the evening.

I don't know how to type japanese on my phone, but the first is literally "early" surrounded by honorifics. The characters for konnichiwa means "this/now", "day" and the "wa" at the end is an article making the previous phrase the subject of the sentence. Same with konbanwa, but for evening instead of day.

no idea on the etymology of moshimoshi for answering the phone, though.


According to an article I read, moshimoshi came from the telephone operators saying 申し申し (moushi moushi) to signify "I'm going to start speaking now". On a tangential note, 申し used to be the phrase when calling out to someone to ask something (similar to English "Excuse me").

The worst part of C strings is that they tend to show up in APIs (especially system calls). This makes interoperability with other languages harder than it should be.

This is why I hate them too. You can use a custom length + pointer type for representing strings in your own code, but interfacing with other libraries and the OS almost always requires having a null-terminated string. It forces you to make copies just to tack on the null terminator.

`strlcpy` is the function you probably want, but again it is not standard. https://lwn.net/Articles/507319/

I think the reason people don't want to standardise this kind of function is that it often gives wrong behaviour. For example, if you are trying to copy a string into a fixed buffer and it's too long, then it is often an error, or potentially even a security bug, to truncate it. So these functions generally do the 'wrong' thing even though they are 'safer'. If you are dealing with static buffers then I think you should be explicitly checking that the source fits in the target and then handling the error case. You could even have a function like `strlcpy` that does `strlen`, then checks if it fits, then does the copy or returns an error code (see the sketch below). Alternatively, if the string should always fit and you don't want to handle the error case, then the safe thing to do is check at runtime that it fits and abort the program if it doesn't.
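
That checked copy might look something like this (a sketch; the name and error convention are made up):

    #include <string.h>

    /* Hypothetical helper: refuse to truncate, report failure instead. */
    int str_copy_checked(char *dst, size_t dst_size, const char *src)
    {
        size_t len = strlen(src);
        if (len + 1 > dst_size)
            return -1;              /* does not fit; caller handles the error */
        memcpy(dst, src, len + 1);  /* +1 copies the null terminator too */
        return 0;
    }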


strlcpy is not needed; strcpy_s (not strncpy_s) is safe and is part of the C11 standard.

In fact, strlcpy is worse:

* strlcpy truncates the source string to fit in the destination (which is a security risk)

* strlcpy does not perform all the runtime checks that strcpy_s does

* strlcpy does not make failures obvious by setting the destination to a null string or calling a handler if the call fails.


> strcpy_s is part of the C11 standard

an optional part, which makes it pretty worthless, if it were not so already.


On systems that aren't memory constrained, we just shouldn't be using static buffers at all. Just always use something like asprintf() and free() the result when you're done. No, it's not in the C or POSIX standards, and that's a shame, but it's at least available on Linux and the BSDs.

I end up working on a lot of code that uses Glib, so I tend to use g_strdup_printf() a lot, which works the same as asprintf().

Ultimately the cost of allocations is usually not a big deal, and you gain a lot of safety. Sure, you then have to remember to free(), but I'll take a memory leak over a segfault (and its possible security consequences) any day.

And if allocation cost is a problem, you can always go back and optimize with static buffers later. That shouldn't be the default that people reach for, though.
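
For reference, usage looks like this (a minimal sketch; asprintf is a GNU/BSD extension, not ISO C, and make_greeting is a made-up name):

    #define _GNU_SOURCE     /* glibc needs this to declare asprintf */
    #include <stdio.h>

    char *make_greeting(const char *name)
    {
        char *s = NULL;
        if (asprintf(&s, "Hello, %s!", name) < 0)
            return NULL;    /* allocation failed */
        return s;           /* caller must free() the result */
    }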


No, it’s not. The return value it provides is generally unwanted.

For initial string input, i.e. from a network/file/terminal stream, using fgetc and/or fgets plus code to verify and sanitize makes the most sense IMO.

This does mean you have to write a lot of C code for what would be simple tasks in other languages, e.g. a correct file open, read-to-dynamically-allocated-memory, and file close with good error checking is a full page (at least) of dense code in C and just two lines in Python.

If you've done a good job sanitizing and verifying all the input to your program, only then does it become relatively safe to use the standard string functions, with caveats for multithreading.

Asking ChatGPT to compare and contrast fgetc and fgets is a good place to start, and then ask how to use fgets to handle errors during stream I/O, and what can go wrong with multithreading etc. Then take a look at the sqlite source code for in-house C-string handling, here's the take-away comment:

"Because there is no consistency, we will define our own."

https://github.com/sqlite/sqlite/blob/master/src/util.c


It's important to mention that strncpy (and also strncpy_s) is really not a strcpy replacement; it's not intended for the same usage. The name is a total misnomer. Do not use strncpy that way!

In any case, strcpy_s (which is a good replacement for strcpy) is part of the C11 standard. I'm confused how that isn't considered portable.


it's an _optional_ part of the standard, and so can't be relied on. also, the idea behind it is pretty poor.

Don't GCC, Clang and MSVC provide it? It may be optional in practice but if the major compilers support it, it's not really an issue.

The idea may not be perfect, but for C which is intended to be low overhead, strcpy_s is about as good as it gets. If you want something more user friendly, that is what C++ is for with std::string, or library implementations like Boost or QT string.


MSVC supports it, as it is an MS invention, designed by some intern, I guess. Other compilers may or may not, via switches/#defines. It is worthless in any case.

>as it is an MS invention

Source, and why does it matter who made it?

>designed by some intern

You don't know that, nor is that how standards work.

>other compilers may or may not

So you've said basically nothing.

>it is worthless in any case.

It offers a low-overhead, safer alternative to strcpy. Is it perfect? No. But it's one of the better C options for those limited to the standard library.


it was put forward to the standards committee by MS, who have a powerful voice there. no-one else wanted it, which is why it ended up in an annex of the standard. it is badly designed, and does nothing that you can't do yourself, and should be doing yourself, in any well-written code.

Yeah, I was a little confused by the strncpy "gotcha": "The answer is that the destination gets filled with all the characters of the source string with no room left for the null terminator."

Well, I mean, the docs specifically call that out: "If src is less than len characters long, the remainder of dst is filled with ‘\0’ characters. Otherwise, dst is not terminated."

So if you want null termination in all cases, you need to pass len-1, not len.


Programs which have to deal with C strings beyond the bare minimum that libc provides will generally have a set of routines for making it more ergonomic. e.g.:

https://github.com/git/git/blob/master/strbuf.h


This is from a C fan: If you are going to do any string heavy work, please use anything else than C (Python is pretty nice for this sort of stuff for instance).

And if you need to use C anyway, then please use anything else than the string functions from the standard library. The C stdlib is (mostly) a leftover from the K&R era when opinions about what makes a good API were very different from today, and C was a much 'harsher' language.

C is pretty nice for a lot of things, but working with strings definitely isn't one of them.


I would say, unless there's a performance reason not to, always use asprintf for every string operation.

For string heavy workload, C is ideal, provided that you don't use C string functions.

You can always allocate a very large buffer and do your string operations there, using memcpy and the assorted functions, which can be inlined on many architectures and be really fast.

Then you can dispose of the buffer really quickly with one call or reuse it for later operations by simply setting a few pointers to initial status...
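
Roughly this pattern, sketched with made-up names (a bump allocator over one big buffer, not a real library):

    #include <string.h>

    typedef struct { char *base; size_t used, cap; } arena_t;

    /* Append a copy of n bytes of s into the arena; returns its address. */
    char *arena_append(arena_t *a, const char *s, size_t n)
    {
        if (a->used + n + 1 > a->cap)
            return NULL;                 /* out of space */
        char *dst = a->base + a->used;
        memcpy(dst, s, n);               /* length is explicit, no strcpy */
        dst[n] = '\0';                   /* terminate for interop with C APIs */
        a->used += n + 1;
        return dst;
    }

Reuse is just `a->used = 0;`, and a single free(a->base) disposes of everything at once.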


If you have large data sets and need maximum performance I agree, but a lot of day to day string processing works on small data sets and isn't very performance-sensitive.

As a newcomer to C, why is it that the C standard library doesn't get updated? Newer languages seem to place a lot of emphasis on getting their standard libraries as useful as possible. It's odd to be told not to use the standard library functions but to write my own instead. I'm really doubting I can just sit down and hammer out string functions superior to string.h.

Just my guess: C doesn't depend as much on its standard library as other languages (which has its good and bad sides), and this also means that fixing the standard library isn't very high on the priority list of the C committee, because everybody knows that for any serious work the stdlib isn't suitable anyway.

The C stdlib is basically the lowest common denominator which enables writing very simple UNIX-style mostly-cross-platform command line tools, but not much else. For anything serious you either call OS API functions directly, or resort to specialized third-party libs.

Attempting to 'fix' the stdlib would first mean agreement on what such a stdlib should actually contain and look like, and this has a real risk of ending up in C++ Committee style busywork (e.g. lots of activity with little to show for it).


String handling in C has many gotchas indeed. Here are some of my notes on the subtleties:

https://www.pixelbeat.org/programming/gcc/string_buffers.htm...


> strcmp takes two strings and returns 0 when they are true.

ITYM “equal”, not “true”.



> So how can we handle this case safely? There are a few ways I can think of.

strdup:

> The strdup() function returns a pointer to a new string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3).
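
Minimal usage (strdup is POSIX, and was finally standardized in C23):

    #include <string.h>
    #include <stdlib.h>

    void demo(const char *src)
    {
        char *copy = strdup(src);   /* heap-allocated duplicate of src */
        if (copy == NULL)
            return;                 /* allocation failed */
        /* ... modify copy freely; src is untouched ... */
        free(copy);
    }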


wchar_t is a massive landmine that should never be used since its size varies by platform. The locale of the compiler has to match the end user for L prefixed strings to work correctly. Likewise char16_t and char32_t are just swimming against the easy path at this point. You're much better off sticking to UTF-8 and using the C11 u8 prefix on literals so you can use the regular string API and never have to worry about locale settings.
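
For example (assuming a UTF-8 terminal; note that in C11/C17 a u8 literal is a plain char array):

    #include <stdio.h>

    int main(void)
    {
        /* u8"" guarantees the bytes are UTF-8 in the binary,
           independent of the compiler's execution character set. */
        const char *s = u8"日本語";
        printf("%s\n", s);
        return 0;
    }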

This is great advice! I wasn't aware of this and I will keep that in mind. When I first came across Unicode literals I was unsure when exactly you would use them over wchar_t

I once got called in to fix an SS7 stack suffering from poor performance. Pretty well written, and not obvious at first sight why it was going slow. Most of it was low-level bit fiddling, and some small strncpy()s - generally about 8 chars or so.

Didn't take that long to profile (well, printf's, as no profiling was available) and figure out it was the strncpy's causing the problem, but why? Well, there was a handy 8 megabyte buffer used for working memory that the strings were being copied into for modification.

From the strncpy() man page:-

>If the length of src is less than n, strncpy() pads the remainder of dest with null bytes.

Ah. So every little strncpy was essentially copying the string then zeroing out 7,999,992 bytes. And there were lots of little strncpy's...


The appearance of strncpy() in any source code is an immediate panic attack for me. It should never be used, and if it is used, it should be removed.

Similar rule for sprintf(), all instances of which should be replaced by snprintf().


Unfortunately there often isn't a better replacement in your standard library (embedded systems are weird). I ended up using strncpy followed by automatically setting the last byte of the string to null.

strncpy() in particular is so bad that you're better off (for a rare exception to the rule) just writing your own, that does what most people think strncpy() does (or should do) rather than what it actually does.

Wrap memccpy.
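
e.g. something like this (a sketch; the name is made up, and memccpy is POSIX, standardized in C23):

    #include <string.h>

    /* Bounded copy that always null-terminates; truncates if src doesn't fit. */
    char *copy_str(char *dst, const char *src, size_t dst_size)
    {
        if (dst_size == 0)
            return NULL;
        if (memccpy(dst, src, '\0', dst_size) == NULL)
            dst[dst_size - 1] = '\0';   /* no '\0' in the first dst_size bytes */
        return dst;
    }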

snprintf has very similar performance pitfalls.

no, because the size argument is only an upper bound on how many bytes can be written into the destination.

   snprintf (huge_buf, huge_buffer_size, "%d", 1);
will write two bytes into huge_buf, regardless of huge_buffer_size (assuming it is 2 or larger).

It's in the other direction: snprintf(small_buf, small_size, "%s", huge_string) will need to iterate the whole string.

why? snprintf() will just write as many bytes from huge_string as necessary, up to the smaller of small_size and strlen (huge_string).

what makes you believe it will iterate the whole of huge_string?


Quoth the standard:

> The snprintf function returns the number of characters that would have been written had n been sufficiently large, not counting the terminating null character, or a negative value if an encoding error occurred.

In essence it needs to return strlen of huge_string even though very little of it was actually written.


Well that's pretty fucked up. I note that the GNU C library docs say this:

> Attention: In versions of the GNU C Library prior to 2.1 the return value is the number of characters stored, not including the terminating null; unless there was not enough space in s to store the result in which case -1 is returned. This was changed in order to comply with the ISO C99 standard.

ISO C99 needs a kick in the head. Yes, there is a use case for this return value (buffer wasn't large enough, reallocate and try it again). But wow, own goal team!
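
For reference, that use case is the measure-allocate-format idiom (join_path is a made-up example name):

    #include <stdio.h>
    #include <stdlib.h>

    char *join_path(const char *dir, const char *file)
    {
        int n = snprintf(NULL, 0, "%s/%s", dir, file);  /* measure only */
        if (n < 0)
            return NULL;                                /* encoding error */
        char *path = malloc((size_t)n + 1);
        if (path != NULL)
            snprintf(path, (size_t)n + 1, "%s/%s", dir, file);
        return path;
    }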

Thanks for this. I had no idea that C99 had defined this so stupidly. I do see that in 2004, the linux kernel added replacements (scnprintf() which behave as the pre-C99 versions of snprintf generally did). There's a good discussion of this here: https://lwn.net/Articles/69419/


Literally 25 years ago I was a beginner programmer and tried writing a .dll for Microsoft's Internet Information Server, which was relatively new at the time. (I hadn't so much as seen a Unix-based OS at the time, let alone understood CGI). C strings were mind boggling and frustrated me so much I simply gave up. Happily around the same time, MS introduced Active Server Pages and I was able to use that and never messed with C again. It's amazing the same issues still exist decades later.

That is the most mind-boggling part of this saga to me. People have been using C since the 1970s. It's now 2023, and there still isn't an obvious solution to this other than suggestions that every team should write their own string library from scratch.

And apparently it all started with some genius deciding that using a single 0-byte at the end is so deliciously efficient and therefore obviously the way to go. We can't waste 4 bytes for the string length, that's out of the question. I think only the Pascal solution of having a single byte for the string length is worse.


> At first this looks great, but there is a problem. What happens when the source string minus the null terminator is as long as the size of the destination string? The answer is that the destination gets filled with all the characters of the source string with no room left for the null terminator.

The 'n' in strncpy is mainly there to help you avoid overrunning the destination; it does not guarantee that whatever makes it in there is null-terminated.

This is why you should always explicitly set the last byte to zero after using strncpy (and never ever use strcpy).

  char dest[16];

  strncpy(dest, src, 15);
  dest[15]=0;

Related and good read about strcpy in the kernel: https://lwn.net/Articles/905777

> "But for real if anyone knows how to get this to work on Windows 10 let me know!"

Since the May 2019 update, Windows 10 has supported declaring the code page in a manifest file.

In Visual Studio, you must add "/utf-8" to the compiler command line, this makes it parse the source code as a UTF-8 file, and makes it output UTF-8 string literals.

To make console output work, call the Win32 function "SetConsoleOutputCP(65001);"

To get support for opening files with names that aren't in your system codepage:

* Create a manifest file as shown in https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

* Add this as an "Additional Manifest File" in Visual Studio project settings for the manifest tool

Additionally, there is an undocumented NTDLL function "RtlInitNlsTables" that sets the code page for the process. It is difficult to use without a lot of example code, but some app locale type tools (used to change locale for a process) make use of this function.


Programming in Go made me a better C programmer, because now I no longer use C strings, only a buffer/length/capacity struct.
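
Something along these lines (illustrative, not Go's actual layout):

    #include <stddef.h>

    typedef struct {
        char  *data;   /* not necessarily null-terminated */
        size_t len;    /* bytes in use */
        size_t cap;    /* bytes allocated */
    } str_buf;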

Meh. The wchar_t stuff is barely C's fault. You use wide (fixed-width) characters, then set the terminal encoding to UTF-8 (a variable-length encoding). What did you expect? It's a Windows issue. I can copy-paste all sorts of UTF-8 in "normal" string literals, printf and puts them, and it just works in my terminal.

RE counting characters: this is a whole can of worms. Do you want to count grapheme clusters? Code points? Anything other than just the number of bytes? Use a Unicode library.

The latter part of this article is a bit like those articles that make fun of javascript for having floating point numbers behave like, gasp, floating point numbers.


As a cellist, I was about to sympathize when I read the title.

"We're not in Kansas any more, Toto"

Or to paraphrase that "We're not in Python any more, and C is not Python".

You know what sends me insane? Indentation and lack of fixed types in Python. But I don't have problems with C strings. Because I have grown to love and know C's string foibles just like the author will certainly not be driven insane by 'Python's shortcomings according to me'.

The world is full of people who complain that something or other is different from what they know, so that 'other' is wrong. That's just being isolationist. Everything has its own advantages, its own disadvantages. Let's accept that and move on, instead of making mountains out of mole-hills.


> Indentation and lack of fixed types in Python.

Whenever I see someone complain about Python's indentation, my brain internally translates it to "I poorly format my code."

If your code is properly formatted, then Python's indentation is never a problem. I praise Python's indentation-as-syntax because it prevents issues like a dangling else or a forgotten brace while also making proper formatting a requirement for your program to run.


Let's be clear about this: python indentation-as-syntax makes one particular style of formatting a requirement. Not everybody likes it. You might say I poorly format my code in other languages, but most of the time for readability you are somewhat stuck with the formatting of that code base which turns out to be a non-problem thanks to modern editors (sarcasm) like vi that can do brace matching and are older than python.

Python's rigidity on formatting should solve that problem, but it really doesn't and over time it relaxed the rules somewhat which made it better.


> then Python's indentation is never a problem

I'm sure you have some solid 3rd party research to cite to back up that absolutist claim. Or have some explanation of how it was my workmate's fault, and not a git merge (which mixed up indenting between the two files), that introduced a bug that took hours to track down. Or why people like me loved the idea of meaningful indentation in Python, then grew to not love it any more after experience.

Or the bitch that it can be when you have to generate code and have to do more than just slap braces around it to delimit blocks.

All in all, another one of those pure-opinion HN posts that are becoming too common.


You've missed the GP point by so wide a margin you could have as well shot in the opposite direction. You're basically doing exactly what they asked you not to.

People are different, work in different ways, are productive with different techniques, have their own habits formed over decades that they don't feel the need to change. None of that means they're worse than you, who, of course, always properly formats your code.

It's perfectly possible, and not that rare, to be careful with indentation and not liking significant whitespace (esp. if the rules are not overly consistent, like in Python). It's also possible to feel like the need to manually adjust indents after moving the code around is a distracting chore that braces largely eliminate. It's possible to work exclusively within an IDE that will make "dangling else" problem impossible to happen, therefore it's possible not to see that as a problem. And so on.

Keep an open mind. Try to get accustomed to various style of working with code. Stop your brain from being discriminatory, and don't assume too much about what other people consider "good" or "poor". If in doubt, ask politely.

BTW: I'm using Python since 2.5.2 professionally. Just a quick disclaimer.


The biggest issue is that Python's indentation stuff doesn't work in a function parameter list, so you get stuck with single-line lambdas only, or defining a temporary named function before using it, which is unnatural.

That’s a fair point, but of the over twenty programming languages I’ve used in my career, only C uses null terminated strings. All the others store the length. There’s good reasons for that. I think C strings are objectively bad and error prone.

Indentation and lack of fixed types aren't responsible for over 50% of known security issues in software.

C's issue are not just harmless foibles. They cause real harm to the poor people actually using the software.


> Indentation and lack of fixed types aren't responsible for over 50% of known security issues in software.

there is no way of fixing this with a "string type". The errors come from IPC/Internet, and there will always be just a sequence of bytes, and some length, maybe, given by the user. Somewhere some code will have to trust this length, or compute a length.


There is no practical advantage to null terminated strings though.

It's not that they are "different", it's that they are extremely error prone and have poor performance for certain operations such as getting the length.

>But I don't have problems with C strings

Everyone thinks they are clever enough to use them and other parts of C without problems, and those people are the most dangerous.


You've grown to love the footguns and hundreds of thousands of security holes that null-terminated strings have introduced over the decades?

It's not so much a question of different is bad, it's that having one of the six positions for your car's stick shift be marked 'Self-destruct' is... Sub-optimal. I'm sure you're smart enough to operate that car safely, but the ditches seem to be filled with burnt-out husks.

Tab-based, versus curly-brace indentation, on the other hand, is a question of how you want the car painted. Purely personal taste.


> It's not so much a question of different is bad, it's that having one of the six positions for your car's stick shift be marked 'Self-destruct' is... Sub-optimal. I'm sure you're smart enough to operate that car safely, but the ditches seem to be filled with burnt-out husks.

Good analogy, but I think you're being too forgiving to C. It's more like having all 6 out of the 6 positions of your car's stick shift marked as self-destruct. If you don't want the car to self-destruct while changing gears, you need to tune your FM radio to 99.0 MHz and quickly set your turn signals to left, right and left again before shifting. And that only works safely when shifting into the 1st, 2nd and 4th gears, unless you modify your car engine so it can only drive on Microsoft roads.


On a Python car, the car won't even start if you don't have your cosmetics right, and they have to be of a specific color.

I'm pretty sure that my C program either won't compile, or won't do what I want if I treat curly braces and semicolons as you are treating python indentation.

They serve the exact same purpose, and are both necessary, but the choice of braces versus whitespace is purely aesthetic.


K&R contains this beautiful koan-like string copy code:

    while (*t++ = *s++)
        ;
Honestly the elegance of this thing was one of the hooks that made me fall in love with C. But this was from a now-forgotten age of innocence, as there are so many "nopes" around this line-and-a-half that one would, rightly, be tarred and feathered for ever putting it in a program today.

Could you explain why this line should be discouraged? I'm a beginner in C, so I really don't know. That's why I'm asking.

while (*t++ = *s++) ;

You're assigning a char to another, relying on the return value being 0 to detect end of string.

You're performing the copy while also increasing the pointers with ++ in the same expression.

You're using the cryptic ; empty statement to signify nop, thereby confusing newbies.

Etc


It's the equivalent of a strcpy(): there's no bounds checking.

Another reason, in addition to other replies and separate from safety concerns, is that strcpy, memcpy etc are nowadays usually implemented via more efficient compiler intrinsics rather than an explicit loop.

> By default, Windows PowerShell .lnk shortcut is hardcoded to use the "Consolas" font

Surely this is not the case for Japanese versions of Windows (or users with Japanese set as their display language?)


Yes, you can actually look at the full list from `HKLM\Software\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont` (taken from my copy of Windows 10, bracketed comments mine):

    0       Lucida Console
    00      Consolas
    932     *MS ゴシック [MS Gothic for Japanese]
    936     *新宋体 [Simsun for simplified Chinese]
    949     *굴림체 [Gulimche for Korean]
    950     *細明體 [Windows MingLiU for traditional Chinese]
Note that there are actually two global defaults which only are differentiated by leading zeros. This is intentional and can be used to enable additional fonts; it is a common tweak for Korean (and probably other CJK) users to add a preferred font with a name 0949 or 00949 etc.

Japanese versions use CP932 though, which according to your list would use a font that's not Consolas (MS Gothic in this case).

I should have said "no" in place of "yes" (being a non-native speaker, I completely overlooked "not" in your original reply), otherwise I believe I didn't contradict you.

How to make C safe: by putting it back in the box and putting the box back on the shelf and then closing the door to the garage. Then using a safe-by-design language.

It's strange that computer programmers think of themselves as being on the cutting edge of technology, but then we use a language that is over 50 years old. Of course there are going to be lots of problems with C strings since they were designed for a totally different world (no Unicode, no security issues, memory was precious, etc). The hardware is a million times as powerful but the software environment improves at a glacial pace.

Pop quiz, which of these is safe, given "char buf[80]" and arbitrary user input in argv[1]?

    gets(buf);
    scanf("%s", buf);
    strcpy(buf, argv[1]);
    scanf("%80s", buf);
    strncpy(buf, argv[1], 80);
    snprintf(buf, 80, argv[1]);
----

The delightful answer is none of them. The first three have no bounds checking at all, meaning that they will happily overflow the buffer to an arbitrary extent (gets, at least, will usually trigger a warning on modern compilers). The next two have off-by-one errors: scanf will write a NUL byte out of bounds (and that's exploitable! https://googleprojectzero.blogspot.com/2014/08/the-poisoned-...) while strncpy will fail to NUL-terminate the string. The last one uses the right buffer length, but treats user input as a format string and can leak memory contents or produce arbitrary memory corruption with the %n format specifier.

C string handling practically invites off-by-one errors and horrible security practices out-of-the-box.


Are you sure that strncpy does an out-of-bounds write here? I believe it doesn't, but it would give you an unterminated string in buf, which is also... less than ideal (if the input is 80 non-null characters or longer).

Ugh, I had that in my comment before I "refactored" it. Fixed now, thanks for pointing it out.

Yeah, and I wouldn't say it's definitely unsafe. You can memchr a '\0' out of it (or not) to determine if a null terminator got in there or not.

I mean, you could but a lot of people do not. Plus, strncpy is terribly inefficient if the source string is tiny, because it’ll fill the rest of the buffer with NULs.

One mans inefficiency is another mans resistance to timing attacks :P

(this is tongue-in-cheek, the function is still bad IMO because it almost never does what you need it to. If it guaranteed null termination it would be more useful)


This is a very thoughtful post. Real question: Is the solution to avoid C strings entirely? Use something like a Pascal string that includes the length?

Yes. For the love of god, a thousand times yes. Every other language, even some of the "C-compatible" ones uses explicit-length strings.

    snprintf(buf, 80, “%s”, argv[1]);
Should work.

As long as you replace the 'smart'-quotes with actual quotes.

-Emily


Bingo, you should never pass arbitrary strings where they could be used as format specifiers, it's like running arbitrary code. Some compilers even issue warnings when you pass non-literal format strings to the printf family.
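
The rule in one line; GCC and Clang can flag the dangerous form with -Wformat-security:

    #include <stdio.h>

    void echo(const char *user_input)
    {
        /* printf(user_input);   dangerous: input parsed as a format string */
        printf("%s", user_input);   /* safe: input treated purely as data */
    }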

> The delightful answer is none of them.

No. Sorry. This is bad programming. C'mon.

I started programming back in the 8080, 8085, 6502, etc. days. I had to program some prototype computers using a hex keypad while entering raw machine code (not even assembler). I still own a couple of these:

https://i.imgur.com/ZsIJj1p.png

In a couple of cases I had to take this approach to bootstrap Forth on a 6502, then write a full Forth code editor and finally write the robotics application from there.

Do not confuse bad programming or lack of knowledge with something attributable to a language, any language. A knowledgeable software developer, among other things, stays clear of these issues. This is also the value of experience and exposure to a wide range of technologies.

It's like blaming MicroPython for a machine getting destroyed because garbage collection interrupted a critical real-time process. There's nothing wrong with MicroPython in that regard; the programmer/designer of the embedded system simply lacked knowledge and understanding.

Part of the problem, as I see it, is that a good deal of modern university CS degrees don't even touch low level stuff. They start students on languages like Javascript and Python. These are fantastic, however, someone with deep-rooted experience in these languages who jumps into C is very likely to do some truly horrific things. The language isn't the problem, at all.

I mean, not to go too far, the Linux kernel is written in C. Right? It's about the person, not the language.


One could take a glance at these and easily believe that they do the right thing. I don't think anyone can be counted on never to miss such a small error from time to time.

Of course, and the more experience you have the less of this will happen.

I've been writing software in over a dozen languages for over 30 years. Generally speaking, when I write code, even complex code, in any language, it just works. Not because I am something special. I have done a lot of work across a wide range of application domains and have made my share of mistakes over the years.

Of course I make mistakes. Everyone does. Yet these mistakes have nothing to do with lack of domain knowledge. People who approach C without having a clue as to how memory, registers and the internals of a processor and memory system work are going to create bad code.

Blaming the language, the tools, is irrational. You can write perfectly good, safe and performant code in assembler. And boy, can assembler be a minefield in the hands of someone without experience!

Are we going to blame the processor microcode then? No, of course not.


So, there are no bad tools, no bad languages, ever? A chainsaw without hand guard is a fine tool, and the blame is on whoever got their own arm chopped?

I find that dubious, to say the least. Languages are not all equal, obviously. You can take an existing language and make it worse by removing some useful features or degrading existing ones; so why couldn't you make it better? In the case of C, its string handling has been proven time and time again to be a collection of footguns.


> A chainsaw without hand guard is a fine tool, and the blame is on whoever got their own arm chopped?

I think you are stretching it. Still, let's go with it.

I have done a ton of construction work in my life. From large projects at home ($200K-ish) to managing the build of a $12MM data center I designed. Because of this I have been around construction guys of all kinds and skill levels. And, of course, I have a lot of personal experience doing the work as well, from carpentry to just-about anything in a typical home or commercial project.

Anyhow, I always cringe when I see experienced construction guys work with modified tools that have had safeties removed to make the work go faster. One example of this was when I watched these guys cutting concrete blocks with a handheld grinder. They had removed the guard that typically covers half the blade. The entire blade was fully open and spinning at 10K+ RPM. When asked they said they'd been doing it this way for twenty years, it's faster, they can see the cut and control it far better. Still had all fingers.

Same is true of guys cutting framing lumber with circular saws or skilsaws while propping up the pieces with their bodies.

To me, someone with not even 10% of the experience they have, that was unthinkable. I would have lost fingers and limbs. I would have ended-up in the hospital almost instantly and possibly take others with me.

It's a relative term. Are the tools bad? Well, when experienced professionals can use them safely day in and day out (this is their job, they've been doing it this way every day for twenty years), can we really blame the tool if I grab it and proceed to remove a finger or three?

No. Of course not. I know the American system of liability doesn't work that way, but that would be and should be 100% my fault for not having the experience necessary to approach such a thing safely.

It's the same thing, it doesn't matter if we are talking about coding, CNC machining or downhill skiing. Newbies love to blame the skis for what they did wrong, or the $150K CNC machine for crashing the $10K spindle into the table. It's never their fault. Sure.


>Do not confuse bad programming or lack of knowledge with something attributable to a language, any language.

In C everyone starts out as a "bad programmer". In other languages people are merely inexperienced.

>however, someone with deep-rooted experience in these languages who jumps into C is very likely to do some truly horrific things. The language isn't the problem, at all.

You are contradicting yourself.


> In C everyone starts out as a "bad programmer"

In everything in life one starts out as a "bad <X>". I would be a bad free diver (I'd probably kill myself).

People inexperienced in <X> lack knowledge in <X>. That is not an insult. That's just reality. One can make some pretty serious mistakes as an inexperienced Python programmer (example: async/await) or downhill skier.

It's not the language, it's lack of knowledge and experience.

It's not the skis, it's lack of knowledge and experience.

Are we now in a culture where saying that someone is doing <X> badly because they don't have experience is an insult? OK, great. Let's blame everything else, except for lack of experience. C is the problem. Please don't use it.

It will be very interesting to watch as the "not my fault/blame everyone but me" crowd faces having to justify their lack of skills against what tools like ChatGPT will evolve into. Very interesting. I guess we'll blame LLMs for not knowing <X> well enough to be hired.


Funnily, GPT-4 seems like it generates pretty bad C code but pretty good Python code.

The C code will have silly things, like bad style or `double d = malloc(sizeof(double))` (instead of `double*`), which makes it evident that its training data was full of pretty bad C code. Which makes sense since most C code out there, like on StackOverflow, is bad. Same with Bash code.

The worse quality of code available in these langs suggests to me that these langs are inherently more difficult, which means people are more likely to be bad at them.

Whether they deserve blame for that, or whether it disqualifies them as legitimate technologies, is subjective. Objectively, though, you're accepting a higher rate of failure by using them over less difficult alternatives. If "good <X>" colloquially means "<X> with high likelihood of generating desired outcomes" and "bad <X>" means "<X> with high likelihood of generating undesired outcomes", I think it's fair to call both "bad langs". ;p


> Funnily, GPT-4 seems like it generates pretty bad C code but pretty good Python code.

Back when OpenAI had code-specific models based on GPT-n they were generally specifically advertised as best at Python; I suspect their coding-related training data and human feedback on coding tasks all favors Python by a significant amount (and I supect that that actually gets reinforced by positive feedback, since this makes it most likely that they get used with Python over time, too.)


:o Huh. I guess it's not a fair sampling of code out in the wild then.

> Funnily, GPT-4 seems like it generates pretty bad C code but pretty good Python code.

It's only a matter of time. And likely not a lot of time.

I asked ChatGPT to write a fast CRC-16 calculation algorithm in ARM assembler given a set of preferred registers and other constraints. I compared it to my own code, written a while back. Not too bad.

It wasn't clever about using assembler tricks experienced assembler coders understand, yet the code passed my test suite. My code was much faster because it was written with the benefit of experience that had me reaching for optimizations ChatGPT did not.

The interesting part was when I asked that it modify the code to work with a different buffer structure and be able to compute CRC-8, CRC-16 and CRC-32 with various modifiers.

It did it in just a few seconds. The code passed 100% of my tests. Not super fast or efficient, but it worked. I remember when I had to do that myself with my own code, it took over a day.

This is today, mid 2023. Give it a year or two (maybe less?) and it will be a tool to contend with. People who like to blame everything else rather than their lack of knowledge and experience will not do very well in that world.

Why would I pay someone to do <X> when they bring nothing special to the table?

Here's the huge paradigm shift (at least for me):

I could not care less what someone knows or does not know. I care about the range and breadth of their experience and how they approach learning that which they do not know.

Someone like that can use any available tool, including AI tools, to deliver value in almost any domain. Someone who blames others (tools, people, the system, whatever), cannot.

We might just be entering an era in which experience will be shown to have serious value.


> A knowledgeable software developer, among other things, stays clear of these issues.

This is the "no true scotsman" fallacy.

Languages can be designed so that less than perfectly knowledgeable programmers fall into the pit of success, or they can be designed so that they fall into the pit of failure.

For people making your argument, I like to provide this challenge: Go take a flight on a 737 MAX that hasn't had its MCAS fixed/disabled. That should be fine, right? After all, no "true" pilot ought to disregard one sentence on page 437 of the flight manual that they weren't even given during a 1 hour training video. A true professional pilot memorises the engineering blueprints, the source code of the avionics, and the wiring schematics, surely. So you have nothing to fear! The plane is "safe", and pilots can be trusted to be knowledgeable.

Go buy that ticket.


> This is the "no true scotsman" fallacy.

Sorry. Not even close. Source: I actually studied Philosophy/Logic at Uni. Good try though.

Also, your aircraft example is absolutely ridiculous.

This isn't an appeal to purity at all. This is about domain knowledge and experience.

A more appropriate example might be the contrast between someone who has only done 3D printing now deciding to design and make parts meant for CNC machining. The lack of expertise and understanding will result in some pretty serious problems.

Another example, this time about software development. I have over ten years of professional software development using Forth. Someone coming to Forth from, say, Python, is likely to make a mess until they understand how to approach problems in Forth. I also have about ten years of professional coding experience using APL. Same thing. Someone coming to APL from other languages is going to run into problems until they gain enough knowledge to write APL.


> I actually studied Phisolophy/Logic at Uni. Good try though.

Appeal to authority.

PS: I studied philosophy too.


Nope. Wrong again. You should probably take that class again.

I am telling you that I evaluated your claim, I didn’t google it or ask ChatGPT.


Argumentum ad nauseam.

Very funny. And also very wrong. Again.

God, this attitude reeeallllyy grinds my gears.

This is precisely why C has outstayed its welcome in so many areas of software development.

Every time some kid looking for a self-confidence boost buys into the idea that using a language with a minefield of archaically-named string manipulation functions somehow makes them a ‘real’, ‘smart’, developer, we are all left a little worse off.

No, it’s not the fault of the language’s design. It’s not even the fault of history - the fact that C was conceived at a time when security wasn’t what it is. It’s these damn kids that only know Python and JavaScript! Why can’t they be as smart as us C developers!

This is all completely ignoring the fact that in 2023 we have no shortage of string-manipulation-related vulnerabilities in widely popular and supposedly battle-tested C code. All some version of the typical list of completely justifiable human errors that anyone is bound to make writing C.

A language that is so popular but that so few people seem to be able to write secure code with, is not a very good language.

I’m immediately skeptical of anyone that’s not of the view that the single best thing we as an industry can do for security is to drastically reduce the amount of C code in circulation. It always comes down to “I’m set in my ways and I think I’m superhuman”.

My hope is that these modern, sensible systems programming languages successfully eat the world faster than the pool of C developers thins out, as people slowly retire, and more greenhorns clue into the fact that C is being used in more places than it ought to be.

Signed, someone that did learn C in school, and has written it professionally.


> God, this attitude reeeallllyy grinds my gears.

> ‘real’, ‘smart’, developer

It should not. And you are taking my comment completely out of context. 100% out of context. Violently out of context.

I have not even implied that this is about "real" or "smart" developers. C'mon!

This is about TWO things: Knowledge and experience. And that is IT. That's what I said.

So, pretty please, don't put words in my mouth and get all self-righteous about something you invented.

> Why can’t they be as smart as us C developers!

They can! All they have to do is learn and develop the experience base to use the tool correctly. Nobody is saying it can't be done. Again, don't put words where I did not use them.

Do you drive a car every day?

Yes?

Do you think you would do well if you got in the seat of a Formula 1 car?

Of course you would not. Because you lack knowledge and experience in the domain. You can learn. Of course you can learn. And that requires work and dedication.

Blaming the Formula 1 car for the lack of knowledge and experience of the driver is nothing less than ridiculous.


I did start with C at university, C and Scheme. We didn't go very in depth on the "defensive programming" part, especially with C. We talked a bit about safety when we were doing web and database stuff, but I think that's it. I am pretty lucky: one of my friends is in cybersecurity, another is very good with C, and there are lots of people online freely sharing their knowledge, so at least I kind of know what I don't know.

On the other hand, not everyone had that luck. I've seen a good number of people that are very good at what they do but lack more general culture. But it's hard to keep up with everything. Software is a huge world, I think already way too big for everyone to know everything. And it's not just software too, it's important to learn about business too, and maybe a bit of maths here and there isn't a bad idea, and there's also the hardware part, networking, and every day there is more and more and more.

What I mean by this is that I don't know how things were before, but today, for a lot of people that write code, it's not possible to know everything, have everything fit inside your head. In those cases, people usually start asking for more guardrails in their tools, because they're no longer manipulated only by experts. And sometimes the experts themselves ask for guardrails too. So some want tools to change, others don't want them to change, and both have a point.

On one hand, I understand that blaming the tool isn't a good attitude to have. On the other hand, my job consists in building tools for other professionals, and I feel like I have way higher standards for the tools that I produce compared to the tools that I use.


> On one hand, I understand that blaming the tool isn't a good attitude to have. On the other hand, my job consists in building tools for other professionals, and I feel like I have way higher standards for the tools that I produce compared to the tools that I use.

I think your view is of this is reasonably balanced. There is that element of someone without extensive experience not knowing what they don't know.

Well, can we blame them for that?

Thirty years ago, probably not. Today, I think the answer could be yes. A few days of time well spent web searching, reading and watching videos can bring someone from complete ignorance of a subject to having a very good starting point from which to grow. Today there's information on almost anything anyone might want to learn, free and widely available. What, generally speaking isn't widely present is the willingness and dedication to learn.

I have friends my age who stopped learning twenty years ago, maybe even sooner. They just don't care enough. Or maybe they thought they were safe and did not need to. In at least one case I know, that was a huge mistake. He started life as a field service engineer with great prospects. He never bothered to learn anything new. Today he sits in a trailer at an oil field 24/7 manually logging various pressures and temperatures multiple times a day.

I also blame the educational system for some of this. Maybe I was fortunate to have gone to school when I did. We started with assembler. Actually, machine language, raw 1's and 0's. By the time I learned C I had designed a few industrial control computers and fully coded them in assembler. The transition to C was very easy. And nobody had to tell me where the dangers were...because, coming from assembler, it was obvious.


That is a good point, we've never had so much information accessible. On the other hand, it's sometimes hard to know what actually matters. Maybe we (or I) don't know how to fully tap into its potential, but I still feel that I progress way faster when I can talk to someone that has experience. I can access their knowledge, but I can also start to see how they think, how they approach problems, how they solve them, what they value.

I've also heard that for them, it can be valuable to have someone with a fresh outlook on things. You notice things that people got used to, and most of the time things make perfect sense considering the situation, but sometimes there's an opportunity to improve things for the better.

You're right about learning, it's a lifelong process. I do think that doing this along other people helps. Sometimes working on something by yourself can be quite lonely, especially if the people around you are not that much into all of that. That kind of loops back into the discussion about tool. Blaming your environment is counterproductive, but it's still important to pick it carefully.


Dude, your last one is not even the correct way of calling snprintf, what the hell are you talking about?

I mean, none of these are correct. But they will all compile, often with no warnings at all. The format string doesn’t need to be a literal string: this is useful for situations like localization.

No surprise that this completely unjustified vitriolic reaction comes from someone arguing about C.

> snprintf(buf, 80, argv[1]);

yeaaa, this has much bigger problems than a null write...


So then what do you do?

> C string handling practically invites off-by-one errors and horrible security practices out-of-the-box.

At this point it is beating a dead horse - it is such a well-known fact how C strings work. And it's insanity that no one has proposed a new standard library with a better implementation for strings. Boo hoo, "old programs", blah blah...

And it's total insanity to blame a powerful language for allowing you to do almost anything in it. You don't even need to ask the committee for permission to roll your own _low level constructs_ - how insane is that? ;)

But keep spitting on what gives you freedom...


There's a framework for C now at https://vely.dev which may help with C strings safety and memory management, among other things.

I found Vely on dev.to and did a test project with it. Pretty neat and solid.

Handling strings in C was enough for me to choose C++…

Any good resource around with all the common C pitfalls and relative solutions?

Beej's guide to C programming is very helpful:

https://beej.us/guide/bgc/html/split/unicode-wide-characters...

