
> the C std lib is the weakest part of the C language and it should only be used as a fallback.

I've been musing for a while now: what would it look like if we were to discard the C library and design a new one, leaving the language itself intact?




I think it could be very nice.

C is not perfect; there are some parts of the syntax that I strongly dislike, like casting or function pointer declarations...

But it is overall a good enough syntax, much simpler than C++.


Amending the syntax is fun but rapidly becomes a slippery slope; soon enough you find yourself designing a new successor language, as has been done many times before. Simply scrapping the mostly-unhelpful C stdlib and inventing new, modern abstractions for allocation, IO, text, threading, etc seems like a more tractable problem.

I fully agree.

It has the same fundamental problem, though: you have to rewrite most existing code, which hinders adoption. In this case, it might actually hinder it more than also improving the language itself, since people would be more willing to take that leap if there are more benefits to be had from it.

Stdlib is probably the most successful library in history; not sure how it is “unhelpful”.

There are several libraries or projects where people have done exactly that.

You often end up with some kind of structure, or variations of structures, for strings:

    struct string {
      size_t length;
      char data[];
    };

    struct string {
      size_t length;
      size_t alloc;
      char *data;
    };
Those are just examples. The tricky part is figuring out the different ownership use cases you want to solve. Because C gives you so much freedom and very little in the standard library, you end up with a lot of variations. You might use reference-counted strings, owned buffers, or string slices, etc. You might want certain types to be distinguished at compile-time and other types to be distinguished at run-time.
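To make the ownership point concrete, here is a minimal sketch of the second layout used as an owned, growable buffer. The names (`strbuf_owned`, `owned_append`, `owned_free`) are made up for illustration and are not taken from any particular library; the doubling growth policy is likewise just an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Owned, growable buffer: the second layout above (hypothetical names). */
struct strbuf_owned {
    size_t length;  /* bytes in use, not counting the terminator */
    size_t alloc;   /* bytes allocated for data */
    char *data;     /* heap buffer, NUL-terminated once non-NULL */
};

/* Append n bytes from src. Returns 0 on success, -1 if allocation fails.
   Size overflow is not checked in this sketch. */
int owned_append(struct strbuf_owned *s, const char *src, size_t n)
{
    assert(s != NULL && (src != NULL || n == 0));
    size_t needed = s->length + n + 1;   /* +1 for the terminator */
    if (needed > s->alloc) {
        size_t cap = s->alloc ? s->alloc * 2 : 16;
        if (cap < needed)
            cap = needed;
        char *p = realloc(s->data, cap);
        if (p == NULL)
            return -1;                   /* old buffer is still valid */
        s->data = p;
        s->alloc = cap;
    }
    if (n > 0)
        memcpy(s->data + s->length, src, n);
    s->length += n;
    s->data[s->length] = '\0';
    return 0;
}

/* The owner, and only the owner, releases the buffer. */
void owned_free(struct strbuf_owned *s)
{
    free(s->data);
    s->data = NULL;
    s->length = 0;
    s->alloc = 0;
}
```

Starting from `struct strbuf_owned s = {0};` you can append repeatedly and call `owned_free(&s)` once at the end. Anything that merely borrows `s.data` has to stop using it before the next append, since `realloc` may move the buffer.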

An example can be found in the Git source code.

https://github.com/git/git/blob/master/strbuf.h

The history of changes to this file is interesting as well. This is a relatively nice general-purpose string type—you can easily append to it or truncate it.


IMHO that does not solve the main problem, that is individual lifetime management.

I've seen many libs using this style of strings, not convinced by the practicality.


What is "individual lifetime management"?

It sounds like you’re rephrasing part of my comment back to me, or maybe I’m misinterpreting what you’re saying.

If you’re not convinced of the practicality, it sounds like you are simply not convinced of the practicality of doing string processing in C at all, which is a fair viewpoint. String processing in C is somewhat of a minefield. Libraries like Git’s strbuf are very effective relative to other solutions in C, but lack safety relative to other languages.


No, I simply am using a different approach, still in C, where strings are simple char*, null-terminated, nothing hidden with magic fields above the base address of the string.

The trick is to pass an allocator (or container) to string handling functions.

If/when I want to get rid of all the garbage I reset the container/allocator.
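A rough sketch of that allocator-passing style, assuming a fixed-capacity bump allocator. `struct arena`, `arena_alloc`, `arena_concat` and `arena_reset` are hypothetical names for illustration, not any existing library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fixed-capacity bump allocator. It only hands out char storage here,
   so alignment is ignored; a real one would align and grow or chain blocks. */
struct arena {
    char *base;
    size_t cap;
    size_t used;
};

void *arena_alloc(struct arena *a, size_t n)
{
    assert(a != NULL && a->used <= a->cap);
    if (n > a->cap - a->used)
        return NULL;                 /* out of space */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* Concatenate two NUL-terminated strings into the arena.
   The result is a plain char *, with no hidden header. */
char *arena_concat(struct arena *a, const char *x, const char *y)
{
    size_t nx = strlen(x), ny = strlen(y);
    char *out = arena_alloc(a, nx + ny + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, x, nx);
    memcpy(out + nx, y, ny + 1);     /* also copies y's terminator */
    return out;
}

/* Drop every string made from this arena in one step. */
void arena_reset(struct arena *a)
{
    a->used = 0;
}
```

No string is ever freed individually: intermediate results pile up in the arena and all become invalid together at `arena_reset`.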


Yeah, you should have just said that in the first place.

I’ve seen similar approaches, e.g. with APR pools, and if your application can work within those restrictions, it’s very convenient.


You can backport Rust standard library to C using https://github.com/eqrion/cbindgen .

The old MacOS (pre-X) did just that. Strings were all "Pascal strings", i.e. with the first byte containing the length of the actual string.

Building blocks for memory were also very different from stdlib, notably the use of Handles, which were pointers to pointers, so that the OS could move a block of data around to defragment the heap behind your back without breaking the memory addressing.
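For anyone who hasn't seen the double indirection, a toy sketch of the idea. These are hypothetical names, not the Toolbox API, and there is a single master pointer slot for brevity where a real system would manage many.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A handle points at a master pointer, which points at the block. */
typedef char **Handle;

/* Single master pointer slot, for brevity (assumption of this sketch). */
char *master_slot;

Handle handle_new(size_t n)
{
    master_slot = malloc(n);
    return master_slot ? &master_slot : NULL;
}

/* Stand-in for heap compaction: move the block somewhere else and
   update only the master pointer. Existing Handle values stay valid. */
int handle_relocate(Handle h, size_t n)
{
    assert(h != NULL && *h != NULL);
    char *moved = malloc(n);
    if (moved == NULL)
        return -1;
    memcpy(moved, *h, n);
    free(*h);
    *h = moved;
    return 0;
}
```

Code that always goes through `*h` keeps working after the move. Code that cached `*h` in a local pointer across a relocation is left with a dangling pointer, which is the classic bug with this scheme.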


Pascal strings are also kind of bad though. All sub-string operations need allocation, or have to be defined with intermediate results which aren't "really" strings, so in that sense it's not an improvement on zero-terminated strings. Equality tests are cheaper, which is nice, since strings of different lengths compare unequal immediately, but most things aren't really improved.

C++ string_view is closer to the Right Thing™ - a slice, but C++ doesn't (yet) define anywhere what the encoding is, so... that's not what it could be. Rust's str is a slice and it's defined as UTF-8 encoded.


D's strings were defined to be UTF-8 back in 2000. wstring is UTF-16, and dstring is UTF-32.

Back then it wasn't clear which encoding method would turn out to be dominant, so we did all three. (Java was built on UTF-16.)

As it eventually became clear, UTF-8 is da winnah, and the other formats are sideshows. Windows, which uses UTF-16, is handled by converting UTF-8 to -16 just before calling a Windows function, and converting anything coming back to UTF-8.
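The convert-at-the-boundary step is small enough to sketch in portable C. On Windows itself you would normally call `MultiByteToWideChar` with `CP_UTF8` instead. The function name `utf8_to_utf16` is made up for this sketch, and it deliberately skips some validation (overlong forms and encoded surrogates are not rejected).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n bytes of UTF-8 into UTF-16 code units.
   Returns the number of units written, or (size_t)-1 on malformed or
   truncated input, or if out has room for fewer units than needed. */
size_t utf8_to_utf16(const unsigned char *in, size_t n,
                     uint16_t *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp;
        size_t extra;
        unsigned char b = in[i++];
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or bad lead byte */
        if (extra > n - i)
            return (size_t)-1;       /* sequence runs past the input */
        while (extra--) {
            if ((in[i] & 0xC0) != 0x80)
                return (size_t)-1;   /* not a continuation byte */
            cp = (cp << 6) | (in[i++] & 0x3F);
        }
        if (cp > 0x10FFFF)
            return (size_t)-1;
        if (cp < 0x10000) {
            if (cap - o < 1)
                return (size_t)-1;
            out[o++] = (uint16_t)cp;
        } else {                     /* needs a surrogate pair */
            if (cap - o < 2)
                return (size_t)-1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return o;
}
```

The rest of the program never sees UTF-16; only the thin wrapper around the OS call does.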

D doesn't distinguish between a string and a string view.


What’s the ownership story for string views?

They don't own anything. It's just a pointer and length. They don't allocate/deallocate.
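In C terms, a minimal sketch of such a view could look like this. `struct str_view` and the `view_*` helpers are hypothetical names, not the C++ or D API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A non-owning view: just a pointer and a length. */
struct str_view {
    const char *ptr;
    size_t len;
};

/* Borrow a whole NUL-terminated string. Nothing is copied. */
struct str_view view_of(const char *cstr)
{
    assert(cstr != NULL);
    return (struct str_view){ cstr, strlen(cstr) };
}

/* Sub-view of [start, start + len), clamped to the parent. No allocation. */
struct str_view view_sub(struct str_view v, size_t start, size_t len)
{
    if (start > v.len)
        start = v.len;
    if (len > v.len - start)
        len = v.len - start;
    return (struct str_view){ v.ptr + start, len };
}

/* Views of different lengths compare unequal without touching the bytes. */
int view_eq(struct str_view a, struct str_view b)
{
    return a.len == b.len && (a.len == 0 || memcmp(a.ptr, b.ptr, a.len) == 0);
}
```

There is no `view_free`: whoever owns the underlying buffer frees it, and every view into it must be dead by then.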

I mean clearly something needs to own the buffer for a new string.

Sure, but that's not the string_view's problem, you can't just make string_views, the string you want to borrow a view into needs to exist first.

Imagine you go to a library and insist on borrowing "My Cousin Rachel", but they don't have it. "Oh I don't care whether you have the book, I just want to borrow it" is clearly nonsense. If they don't have it, you can't borrow it.


Walter is talking about D, and he said this:

> D doesn't distinguish between a string and a string view.

In C++ std::string owns the buffer and std::string_view borrows it. If there is no difference between the two in D, then how is this difference bridged?


You can use automatic memory management and not worry about it. Or you can use D's prototype ownership/borrowing system. Or you can encapsulate them in something that manages the memory. Or you can do ownership/borrowing by convention (it's not hard to do).

Automatic memory management makes copies?

No. Another word for automatic memory management is garbage collection.

I guess I should rephrase. Let's say I have a string, which owns its buffer. What happens in D if I take a substring of it? Does a copy of that section occur to form a new string?

A lot of people don't know about this but Microsoft is taking steps to move everything over to utf-8.

They added a setting in Windows 10 to switch the code page over to utf-8 and then in Windows 11 they made it on by default. Individual applications can turn it on for themselves so they don't need to rely on the system setting being checked.

With that you can, in theory, just use the -A variants of the winapi with utf-8 strings. I haven't tried it out yet as we still support prior Windows releases but it's nice that Microsoft has found a way out from the utf-16 mess.


The A-variants had problems years ago, which is why D abandoned them in favor of the W versions.

I don't mind seeing UTF-16 fade away. We've been considering scaling back the D support for UTF-16/32 in the runtime library, in favor of just using converters as necessary. We recommend using UTF-8 as much as practical.


> with the first byte containing the length of the actual string

And the wheels fall off with the first string longer than 255 characters.


Which is why Free Pascal strings are so awesome. I've personally stuffed a billion bytes into one, without issues. They are automatically reference counted, and as close to magic as you can get. You can return one from a function without issue.

However, Free Pascal has the worst documentation of any major project I've ever encountered (The exact opposite of Turbo Pascal), so I can't link to a good reference. Their Wiki is a black hole of nuance and sucks all useful stuff off the internet.


You can fix this issue by using a variable width integer encoding for the size.
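A sketch of what that could look like, assuming a LEB128-style encoding (7 payload bits per byte, high bit set on every byte except the last). The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write len as a variable-width prefix. Returns the number of bytes
   written, which is at most 10 for a 64-bit value. */
size_t varint_put(unsigned char *out, uint64_t len)
{
    assert(out != NULL);
    size_t i = 0;
    while (len >= 0x80) {
        out[i++] = (unsigned char)((len & 0x7F) | 0x80);  /* more bytes follow */
        len >>= 7;
    }
    out[i++] = (unsigned char)len;                        /* final byte */
    return i;
}

/* Read a prefix from at most avail bytes. Stores the value in *len and
   returns the bytes consumed, or 0 if the prefix is truncated or too long. */
size_t varint_get(const unsigned char *in, size_t avail, uint64_t *len)
{
    assert(in != NULL && len != NULL);
    uint64_t v = 0;
    for (size_t i = 0; i < avail && i < 10; i++) {
        v |= (uint64_t)(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *len = v;
            return i + 1;
        }
    }
    return 0;
}
```

Strings up to 127 bytes still pay one byte of overhead, the same as a classic Pascal string, while a billion-byte string needs a five-byte prefix.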

It might fix that particular issue, but you still have the same problem that NUL terminated strings have: it's not possible to cheaply create views/slices of a string using the same type.

I remember that era well! During the first few years I used C, I never touched its standard library at all, using the Mac Toolbox instead. This was a common practice, which later carried over into C++.

The creator of this library (antirez) is a regular here on hn.

I believe this is used by Redis.

https://github.com/antirez/sds


Glib answer (but also relevant because mentioned in the article, too): it would look a lot like how a lot of people write C++.

>Glib answer

A Freudian slip, methinks.


How so? First definition I find of glib is "(of words or the person speaking them) fluent and voluble but insincere and shallow", which is mostly what I meant about that answer. There was some sincerity in my answer, but certainly somewhere in the border space of irony and sarcasm, which many people do take as insincerity.


The problem is that it's just too tempting to write something like

    my_function(my_var, 3.6, "bzarflo", my_other_var, false);
The string handling functions are part of the story, but the null-terminated char * is produced when the compiler reaches a string literal, and writing code without being allowed to just use string literals when it's convenient tends to feel like coding with oven mitts on.

It's entirely possible to write a wrapper function with a short name to convert string literals to actual string objects.

    my_function(my_var, 3.6, $("bzarflo"), my_other_var, false);
Isn't that much more of a mouthful, and as long as 'my_function' knows to free it, then you're A-OK! The only trouble is '$()' isn't legal in standard C, so a real solution would have to be something like 'str()'.
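One way the 'str()' version might look, as a sketch. It reuses the length-prefixed `struct string` layout from earlier in the thread; `str_new` and `consume` are hypothetical helpers, and `consume` stands in for whatever callee takes ownership.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct string {
    size_t length;
    char data[];     /* flexible array member, NUL-terminated here */
};

/* Heap-allocate a string object holding a copy of n bytes from src. */
struct string *str_new(const char *src, size_t n)
{
    assert(src != NULL);
    struct string *s = malloc(sizeof *s + n + 1);
    if (s == NULL)
        return NULL;
    s->length = n;
    memcpy(s->data, src, n);
    s->data[n] = '\0';
    return s;
}

/* Literals only: "" lit "" fails to compile for anything that is not a
   string literal, and sizeof gives the length at compile time. */
#define str(lit) str_new("" lit "", sizeof(lit) - 1)

/* Example callee that takes ownership and frees its argument. */
size_t consume(struct string *s)
{
    size_t n = s ? s->length : 0;
    free(s);
    return n;
}
```

A call then reads `consume(str("bzarflo"))`. The cost is a heap allocation per literal, which is the price of letting the callee free every string the same way.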

This caught my eye the other day & looks quite promising, though I haven't spent much time looking at it so I can't comment on its memory safety:

https://github.com/tylov/STC


I'm in WG14, and my opinion is that there isn't one good way to do strings; it all depends on what you value (performance or memory use) and on the usage pattern. C in general only deals with data types, not their semantic meaning (in other words, we say what a float is, not what it is used for). The two main deviations from that are text and time, and both of them are causing us a lot of issues. My opinion is that writing your own text code is the best solution and the most "C" solution. The one proposal I have heard that I like is for C to get versions of the functions that use strings that take an array and a length, so as not to force the convention of null termination in order to use things like fopen.
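Until something like that exists, a length-taking variant can only be emulated on top of the existing function. A sketch with a made-up name, `fopen_n`; this is not a proposed or standard API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-taking fopen: path is len bytes long and does
   not need to be NUL-terminated. Returns NULL on any failure. */
FILE *fopen_n(const char *path, size_t len, const char *mode)
{
    assert(path != NULL && mode != NULL);
    if (memchr(path, '\0', len) != NULL)
        return NULL;             /* an embedded NUL would silently truncate */
    char *tmp = malloc(len + 1);
    if (tmp == NULL)
        return NULL;
    memcpy(tmp, path, len);
    tmp[len] = '\0';             /* the termination fopen insists on */
    FILE *f = fopen(tmp, mode);
    free(tmp);
    return f;
}
```

The copy and the allocation are exactly the overhead a native array-plus-length interface would remove; the wrapper just shows what a slice-based caller has to do today.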

Has WG14 considered adding slices to C? [1] Introducing slices would naturally give way to a better string library.

[1] https://www.digitalmars.com/articles/C-biggest-mistake.html


It's been 50 years, so pretty much everything has been considered. In my opinion the mistake was not that arrays decay into pointers, but that arrays were not pointers in the first place. An array should be seen as a number of values with a pointer pointing at the first one. I think adding a third version of the same functionality would just complicate things further (&p[42] is a "slice" of an array). Another thing I do not like about slices that store lengths is that they hide memory layout from the user, and that is not a very C thing to do.

If you think about arrays as pointers, you will get a lot of things wrong, e.g.

float m[10][10];

is not an array of pointers, but a two-dimensional array with a contiguous 2D memory layout.
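This is easy to check. A small sketch, with a made-up helper name, that walks such an array and compares every element's address against the flat row-major offset:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every m[i][j] sits at flat offset 10*i + j (in floats)
   from the start of the array, i.e. the rows are laid out back to back.
   Note the parameter type: a pointer to a row of 10 floats, not float **. */
int rows_are_contiguous(float (*m)[10], size_t rows)
{
    assert(m != NULL);
    char *base = (char *)m;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < 10; j++)
            if ((char *)&m[i][j] != base + (10 * i + j) * sizeof(float))
                return 0;
    return 1;
}
```

For `float m[10][10]`, `sizeof m` is `100 * sizeof(float)` and no pointers are stored anywhere, whereas `float *rows[10]` is ten pointers whose targets can live anywhere.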


You are right, sizeof is the other big difference. I think these differences are small enough that it was a mistake to separate the two. The similarities / differences do make them confusing.

How would you express a 2D memory layout with only pointers?

An array of pointers to arrays? Basically, a `T**`. C#'s "jagged" arrays are like this, and to get a "true" 2D array, you use different syntax (a comma in the indexer):

    // needs: using System.Diagnostics; using System.Linq;
    int[][] jagged; // an array of `int[]` (i.e. each element is a reference to an `int[]`)
    int[,] multidimensional; // a "true" 2D array laid out in memory sequentially

    // allocate the jagged array; each `int[]` will be null until allocated separately
    jagged = new int[10][];
    Debug.Assert(jagged.All(elem => elem == null));
    for (int i = 0; i < 10; i++)
        jagged[i] = new int[10]; // allocate the inner arrays
    Debug.Assert(jagged[3][4] == 0); // any in-range i, j; elements default to 0

    // allocate the multidimensional array; each `int` will be `default` which is 0
    // element [i,j] will be at offset `10*i + j`
    multidimensional = new int[10, 10];
    Debug.Assert(multidimensional[3, 4] == 0);

Yes, this is what people with pre-C99 compilers, which do not support variably modified types, sometimes do. It is horrible (although there are some use cases).

I plan to bring such a proposal forward for the next version. Note that C already has everything to do this without much overhead, e.g. in C23 you can write:

  int N = 10;
  char buf[N] = { };
  auto x = &buf;
and 'x' has a slice type that automatically remembers the size. This works today with GCC / clang (with extensions or C2X language mode: https://godbolt.org/z/cMbM57r46 ).

We simply cannot name it without referring to N, and we also cannot use it in structs (ouch).
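Without auto, the same pointer-to-array type can be spelled out as long as the bound is in scope, for example as a function parameter. A sketch (the function name `fill` is made up) that relies on variably modified types, which GCC and clang support:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* x points to an array of n chars. The bound is part of the type,
   so sizeof *x is evaluated at run time and yields n. */
size_t fill(size_t n, char (*x)[n], char c)
{
    assert(x != NULL);
    memset(*x, c, sizeof *x);   /* no separate length argument needed here */
    return sizeof *x;
}
```

A caller with `char buf[n];` passes `&buf`. The catch is the one already mentioned: the type has to be rebuilt from `n` at every place it is named, and it cannot be stored in a struct.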


You know what I think about auto :-)

How is this not a quality of implementation issue? Any implementation is free to track all sizes as much as they want with the current standard.

Either an implementation is forced to issue an error at run time on an out-of-bounds read/write, in which case it's a very different language from C, or it's a feature that the as-if rule lets any implementation ignore.


Tracking sizes for purposes of bounds checking is QoI and I think this is perfectly fine. But here we can also recover the size with sizeof, so it is also required for compliance:

https://godbolt.org/z/qh7P93Tcd

And I agree that this is a misuse of auto. I only used it here to show that the type we are missing already exists inside the C compiler; we can name it only by constructing it again:

char (*buf)[N] = ...

but we could simply allow

char (*buf)[:] =

and be done (as suggested by Dennis Ritchie: https://www.bell-labs.com/usr/dmr/www/vararray.pdf)


GLib is an alternative to the standard library.
