
I forget the exact reasoning now, but I remember it being about 10x slower than memcpy or strncpy. I think the main reason was the need to parse the format string.



Parsing is probably slower than memcpy

Ah, I thought you meant the speed of parsing the payload didn't matter (scanf performance was also discussed). Yes, parsing the format string in printf or scanf isn't a big deal.

I was under the impression that C strings were inherently very fast because of how incredibly cache-friendly they are. You should not ever have to count all the bytes (memoize that!) but if you do, this is the best format there is for it.

Yes, it seems I lied when I said sprintf is surprisingly fast. For most practical purposes it's fine, but if you need to do string manipulation in a hurry, strcpy/strlcpy are going to be faster.

See https://gist.github.com/nneonneo/f9dc5c1d1d342d25eae47d56dba... for a test harness. On my Mac, with a 16-byte string, 10 million loops:

    memcpy:     77406 us
    strlcpy:   101956 us
    strncpy:   177247 us
    sprintf:   730819 us
    snprintf:  699844 us
memcpy smokes sprintf - around 10x faster. The vast majority of sprintf's time is spent in setup; for large strings, the gap closes substantially.
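
For anyone who doesn't want to open the gist, the harness is essentially a timing loop like the sketch below (not the actual gist code; the empty asm statement is a GCC/Clang-style barrier so the copy isn't optimized away):

    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    int main(void) {
        const char src[] = "0123456789abcde";   /* 16 bytes including the NUL */
        char dst[64];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < 10000000; i++) {
            memcpy(dst, src, sizeof src);
            __asm__ volatile("" : : "r"(dst) : "memory");  /* keep the copy */
        }
        gettimeofday(&t1, NULL);

        long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("memcpy: %ld us\n", us);
        return 0;
    }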

Reason: performance.

If you just do string manipulation, memcpy is faster.

If you need to convert a data type, like int to string, then use snprintf.
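
Roughly, the split looks like this (a minimal sketch; the buffer sizes and values are just illustrative):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char dst[32];
        const char *src = "hello";

        /* Pure string copy: the length is known, so memcpy is enough. */
        memcpy(dst, src, strlen(src) + 1);

        /* Conversion (int -> string) is where snprintf is the right tool. */
        int n = 42;
        snprintf(dst, sizeof dst, "%d", n);
        printf("%s\n", dst);
        return 0;
    }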


With that enabled it's comically slow, but even without the stdio syncing (and even operating on stringstreams) it's still much slower. For simple one-off formatting it's hard to notice, but once you write a parser for some engineering file formats (most of which are still ASCII) and have to read a few million floats here or there it becomes a really stark difference. For parsing floats specifically there are some implementations around which are optimized more, though some of the "fast and clever" ones are less accurate.

Because it's an abstraction on top of an abstraction on top of an abstraction. You need to leak parts of the lower layers to higher levels in order to get the speed back (and allow zero-copy). There are probably some non-leaky optimizations that can be done as well, although that's been going on for decades.

Another biggie is floating point. Converting binary floating point to decimal floating point and vice versa is VERY VERY VERY SLOW, difficult to get right, and its results are difficult to predict. Unfortunately, IEEE 754 decimal float just isn't getting any adoption, so we're stuck doing these costly conversions every time we deal with text formats or big float implementations.
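
To make that concrete, here's a minimal sketch of the round trip in C: printing a double with %.17g (enough significant digits for an exact round trip) and reading it back with strtod. Both directions go through non-trivial conversion code in the C library.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        double x = 0.1;                  /* not exactly representable in binary */
        char buf[64];

        snprintf(buf, sizeof buf, "%.17g", x);  /* binary -> decimal text */
        double y = strtod(buf, NULL);           /* decimal text -> binary */

        printf("text: %s, exact round trip: %s\n", buf, x == y ? "yes" : "no");
        return 0;
    }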

Actually, scanf and printf are just in general slow because of their general nature. They need to handle a TON of options and get really bloated and complicated as a result.


You are missing the point. The format string is tens of bytes, parsing it is fast no matter what compared to reading large amounts of data.

> I was mostly replying to your assertion that "that's not even remotely how strcmp is implemented", which, for most definition of "even remotely", is false.

eglibc's SSE2 implementation of strcmp is just over 5k of machine code, whereas the simple implementation compiles to 56 bytes on x86-64. That was my definition of "not even remotely." I did not mean to imply that it was a fundamentally different algorithm, only that it was a far more sophisticated and optimized implementation of the same algorithm. My apologies if this was unclear or appeared overstated.
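
For comparison, the "simple implementation" is essentially the textbook byte-by-byte loop below (a sketch; the optimized libc versions implement the same algorithm, just vectorized and unrolled):

    int strcmp_simple(const char *a, const char *b) {
        while (*a != '\0' && *a == *b) {
            a++;
            b++;
        }
        return (unsigned char)*a - (unsigned char)*b;
    }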

> That's a tautology at best, and meaningless at worst.

By "these abstraction possibilities" I meant the ones you mentioned, which is true.

> And similarly, any Python application that requires (insert some uncommon requirement ..) can do what C can with the same kind of help that strcmp() gets - by delegating to the layer that does it best.

That's great and I fully support that. What I am arguing against is high-level language cheerleading that discounts the importance of C (or assembly for that matter). Since you mention PyPy, I have to say that their PR is some of the worst in this regard; some of their "faster than C" claims are actively misleading, like this one that benchmarks some generated string formatting code against sprintf() (which re-parses the format string every time): http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-a...


It always surprises me how slow streams are. fscanf should be relatively slow because it has to parse the format string at runtime. So the new C++ std::format should be (and I believe is) much faster.

It's 2x slower per iteration, but much faster to invoke, and friendlier to the microarchitecture. It's not as simple as you're making it out to be.

Also, using strlen() on a 4k string at all borders on malpractice.


> They were the fastest way to do string operations on those early CPUs, but I don't think this the case on modern CPUs.

Modern CPUs still have a fast path for string operations, and I recall hearing that they even had some improvements not too long ago (in Sandy Bridge or other recent arch).

CPUs may be smart enough to detect a memcpy done with a loop, but REPxx is the preferred way - even on modern CPUs.
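
For what it's worth, a REP MOVSB copy is a one-liner in GCC/Clang-style inline assembly on x86-64. This is just a sketch (copy_rep_movsb is a hypothetical helper, not a drop-in memcpy replacement), and how fast it runs depends on the microarchitecture, e.g. whether it has Enhanced REP MOVSB:

    #include <stddef.h>

    static void copy_rep_movsb(void *dst, const void *src, size_t n) {
        /* RDI = dst, RSI = src, RCX = n; rep movsb copies n bytes. */
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }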


I've added a couple of examples trying to find a more readable version, which actually isn't trivial. Sorry for the 'post-edit' :)

As for performance: I don't think such details matter much. First, compilers are pretty good at turning "readable but inefficient" code into the same optimal output (aka "zero-cost abstraction").

And a really performance-oriented strcpy() wouldn't simply copy byte by byte anyway, but try to move data in bigger chunks (like 64-bit or even SIMD registers). Whether this is then actually faster also depends on the CPU though.
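
For illustration, the byte-by-byte version is the classic loop below; wide implementations instead copy a word at a time and use a bit trick like HAS_ZERO_BYTE to spot the terminating NUL inside a 64-bit word. This is a sketch of the idea, not any particular libc's code:

    #include <stdint.h>

    /* The "readable" baseline: copy one byte at a time until the NUL. */
    char *strcpy_naive(char *dst, const char *src) {
        char *d = dst;
        while ((*d++ = *src++) != '\0')
            ;
        return dst;
    }

    /* Nonzero iff some byte of v is zero; lets a copy loop move 8 bytes
       per iteration and only slow down near the terminator. */
    #define HAS_ZERO_BYTE(v) \
        (((v) - 0x0101010101010101ULL) & ~(v) & 0x8080808080808080ULL)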


> You're skipping over half the point of the article by using a (fixed) source string, the size of which is known in advance.

You are right. I should have focused more on that instead; it seems more relevant to why the author of the article is suggesting memccpy. I am curious as to whether or not it really is the case that memccpy is "optimally efficient" in practice over the alternatives. Would you like to prove or disprove that statement yourself? I modified the code a bit; it uses strlen to calculate the length of the string passed, and the string is argv[1]. memccpy is still just as slow. Is this a more acceptable approach to you? In this case we know neither the string nor its length in advance. strcpy still outperforms memccpy. Is this sufficient to disprove the claim that memccpy is "optimally" efficient compared to the alternatives? The other criterion was being widely adopted, in which case, well, strcpy also looks good. Moreover, as dwheeler pointed out, memccpy is a tad difficult to use in practice. I will give strlcpy a try, too, since I prefer that over strcpy. In any case, I am not convinced that these criteria hold true for memccpy over the alternatives.
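
For anyone following along, the memccpy idiom from the article is basically the pattern below (a sketch; copy_string is a hypothetical wrapper name, and it assumes n > 0):

    #include <string.h>

    /* Copy src into dst (capacity n), truncating and NUL-terminating if
       src does not fit. Returns the number of characters copied. */
    size_t copy_string(char *dst, const char *src, size_t n) {
        char *end = memccpy(dst, src, '\0', n);
        if (end == NULL) {              /* no NUL copied: src didn't fit */
            dst[n - 1] = '\0';
            return n - 1;
        }
        return (size_t)(end - dst) - 1; /* exclude the copied NUL */
    }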

> on local variables (buffers) without observing their results

What do you mean exactly? I did observe the results of the buffer. See the printf, or are you not referring to that?

> both your benchmark functions overflow their internal buffers.

Would you please elaborate on it, and its relevance? Are you referring to N being too high?


Why is this faster than the stdlib? What does it do to achieve better performance?

I agree, although the arguments in that article aren't the strongest. In particular:

The main reason given for 'String' being slow is that it's immutable and hence leads to many duplicate Strings being created. In fact, the main reason Strings are slow is that their linked-list structure has a lot of overhead, and may be scattered around in memory. For example, the String "abc" will be represented something like this:

    +---+---+   +---+---+   +---+---+
    | * | *---->| * | *---->| * | *---->[]
    +-|-+---+   +-|-+---+   +-|-+---+   
      |           |           |
      V           V           V
     'a'         'b'         'c'
Those arrows are pointers, which can point to locations arbitrarily far away. From this we can see that just storing a (fully evaluated) String is expensive, since each character requires a pair of pointers (in addition to the unique characters themselves). Processing the contents of this String is also slow, since we need to dereference two pointers for each character (one to get the character, one to get the tail of the String). Since the data may not be contiguous in memory, this can make poor use of the CPU caches, which might otherwise speed up these dereferences.

Compare this to a C-style string, which would look like this in memory:

    +---+---+---+---+
    | a | b | c | 0 |
    +---+---+---+---+
This "packed" representation requires no long-distance dereferencing, the memory is contiguous so it will make good use of the cache, and we can process the characters directly without having to chase pointers.

That article describes the ByteString type as a "list of Word8 objects", but that's a bit misleading. Firstly we often say "list" to mean a chain-of-pointers like the first diagram above, e.g. that's exactly what Haskell's built-in list type is, and Haskell's String is such a list. ByteStrings store their Word8 objects more like the second diagram, so it would be better to call them an "array of Word8 objects". Secondly, ByteStrings use a clever three-layered structure to speed up common operations:

- The raw Word8 characters are stored in arrays, like the C-style string above (but without the NUL byte at the end)

- These arrays are pointed to by values called "chunks". These are "fat pointers", i.e. they also store a length and offset value.

- A ByteString itself is a list of chunks (like the first String diagram, but instead of values like 'a', 'b' and 'c' the values are chunks).

This makes ByteString really fast, for three reasons:

- Some operations can be performed by just fiddling at the list-of-chunks level. For example, to append X and Y, the result is just a list of X's chunks followed by Y's chunks; no need to touch the underlying chunks or arrays.

- Some operations can be performed by just fiddling the length and/or offset of a chunk. For example, incrementing a chunk's offset will chop characters off the start (they're still stored in memory, but they will be ignored); decrementing a chunk's length will chop characters off the end. (A rough C sketch of this idea follows the list.)

- The same underlying Word8 arrays can be pointed to by many different chunks in many different ByteStrings; i.e. we rarely need to copy the actual characters around.
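
To make the "chunk" idea concrete in C-ish terms (a rough sketch with made-up names, not the actual ByteString internals): a chunk is just a fat pointer into a shared, immutable byte array, and dropping characters is pure index arithmetic, with no copying.

    #include <stddef.h>

    struct chunk {
        const unsigned char *base;  /* shared, immutable backing array */
        size_t offset;
        size_t length;
    };

    /* Drop k bytes from the front without copying: just bump the offset. */
    struct chunk chunk_drop(struct chunk c, size_t k) {
        if (k > c.length) k = c.length;
        c.offset += k;
        c.length -= k;
        return c;
    }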

As an example, let's say we read the ByteString "[\"foo\", \"bar\"]" (without the backslashes), we parse it as JSON to get a list of ByteStrings ["foo", "bar"], then we concatenate that list to get the single ByteString "foobar", here's how it might look in memory:

                         BS--+---+   BS--+---+
    "foobar":            | * | *---->| * | *---->[]
                         +-|-+---+   +-|-+---+
                           |           |
                      +----+           +----------------+
                      |                                 |
                      |  List+---+      List+---+       |
    ["foo", "bar"]:   |  | * | *------->| * | *---->[]  |
                      |  +-|-+---+      +-|-+---+       |
                      |    |              |             |
                      |    |              |             |
                      |    |              |             |
                      |    V              V             |
                      | BS--+---+       BS--+---+       |
    "foo" and "bar":  | | * | *---->[]  | * | *---->[]  |
                      | +-|-+---+       +-|-+---+       |
                      |   |               |             |
                      |   |               +---+         |
                      |   |                   |         |
                      V   V                   V         V
                     Chunk---+---+           Chunk---+---+
    (chunks)         | 3 | 2 | * |           | 3 | 9 | * |
                     +---+---+-|-+           +---+---+-|-+
                               |                       |
           BS--+---+           |                       |
    Input: | * | *---->[]      |                       |
           +-|-+---+           |                       |
             |                 |                       |
             V                 |                       |
             Chunk+---+---+    |                       |
    (chunk)  | 14 | 0 | * |    |                       |
             +----+---+-|-+    |                       |
                        |      |                       |
                        |      |                       |
                        |      |                       |
                        V      V                       V
            Array---+---+---+---+---+---+---+---+---+---+---+---+---+
    (array) | [ | " | f | o | o | " | , |   | " | b | a | r | " | ] |
            +---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    
(Note that the exact position where arrows "land" doesn't matter; they're pointing to the entire data structure they "hit")

Here we can see that there's only one copy of the underlying text; that the substrings 'foo' and 'bar' are simply chunks with length 3, offset by appropriate amounts; and that the resulting 'foobar' ByteString is just a list of these two chunks.

This approach of "store things once, then do all processing with indices" can be very fast (even faster than C in some cases https://chrisdone.com/posts/fast-haskell-c-parsing-xml ). Whilst we can obviously write this sort of algorithm in any language, re-using arrays and chunks like this relies on them being immutable, which is more idiomatic in Haskell. In particular:

- Languages which tend to use mutation aren't well suited to this, since mutating one value can have unforeseen consequences for those which are re-using the same components. Copying is safer and more predictable in the face of mutation.

- Languages which favour some built-in interface rather than high-level functions may need to copy values in order to comply with these interfaces. In particular, C's array syntax will work for arrays and chunks, but not for the overall ByteString structure (a list of chunks).

- If we want NUL-terminated arrays, we'll need to make copies when truncating strings, to avoid the NUL byte overwriting the truncated part.

- Re-using values can make it hard to keep track of which parts can be freed. Garbage collection (and other approaches, like linear types, borrow checking, etc.) can make this easier.

The difference between lazy and strict ByteStrings is just whether the overall list-of-chunks is lazy (chunks generated on-demand, useful for streaming) or strict (chunks are generated up-front, closer to using one big array, hence potentially faster and more predictable).

The Text type is essentially the same structure as a ByteString, but with the contents interpreted under a particular text encoding, e.g. UTF-8.

The article you link also talks about fusion, claiming that's why Text (and hence ByteString) is faster, or avoids intermediate allocations. Fusion is great, but care must be taken if we're going to rely on it; e.g. very minor changes to an expression (say, to make debugging easier) can stop things from fusing, greatly changing the compiled code's speed and memory usage.

On the subject of fusion, it's worth noting that Haskell's built-in list type is also subject to fusion, often resulting in zero allocation (e.g. generating a list of characters, then counting them, might compile down to a single loop with a single counter variable, and no Chars/Lists/Strings in sight!). Again, it can be tricky to ensure that this happens. One approach is to test for it with something like https://hackage.haskell.org/package/inspection-testing


The benchmark in the README says it is 4x slower than memcpy.

Modern strcmp and strlen both contain optimizations based on modern hardware, and I believe strlen can be faster than strcmp, but that's over-optimizing for performance.

I've only ever used the C implementation. It's slow.
