
snprintf has very similar performance pitfalls.



snprintf is the same trash, just slower. See e.g. http://blog.infosectcbr.com.au/2018/11/memory-bugs-in-multip... discussing the need for an improved scnprintf in the Linux kernel.

Why is snprintf slow? I am surprised that it would be slow, especially compared to functions like asprintf that allocate the buffer.

Yeah, but as soon as you use snprintf you're throwing performance to the wind. The idiomatic C++ solution should beat it hands down. (gulp)

If you really need a fast strcpy then probably not, but in most situations snprintf will do the job just fine. And will prevent heartache.

Yes, I ran the tests too and noticed that printing the double in snprintf() is the bottleneck. Any idea how we can do that faster while still getting the format we need in `nnn`?

Even then, printf and scanf are typically faster (and not even by a little bit, by a lot) than C++ iostreams formatted output, even though iostreams gets all the formatting information at compile-time, while printf has to parse the format string.

On the other hand, if people start to use snprintf in that particular form as a safe way of string copying, compilers could pattern-match this and substitute a direct implementation.


Reason: performance.

If you just do string manipulation, memcpy is faster.

If you need to convert data type, like int to string, then use snprintf.


printf can be slow, but its performance varies by implementation, and in many cases it has no meaningful impact.

Excellent response! I have a program that spends 10% of its time in vfprintf for string processing, and I really think the program should not be spending that much time (or any time) there. I looked at the libc6 vfprintf implementation and it's pretty esoteric-looking stuff. It might be worth my time to swap it out for your library or something similar.

The biggest performance trap they had was copying all their strings in a really hot loop to a vector of characters. I'm not sure what we could do to steer people away from that...

1. nothing generic is as fast as making your own custom solution

2. a lot of libc is lowest common denominator / tons of bloat. printf/sprintf probably does extra locale, multibyte charset, and thread locking shit you don't want.


That surprises me. If you had asked me, I'd have said that with normal optimizations switched on, both programs would boil down to exactly the same single function call that copies a byte array to the stdout stream. I would have to wonder whether there are optimisations someone has missed if there's even a tiny difference.

In the C world, I could imagine there's a loop at runtime scanning the printf string for % characters, but I would equally imagine that the compiler people have made a special case for printf: a call whose only argument is a string literal containing no % gets silently replaced with a puts call. (Which itself gets optimised to a byte-array copy.)


I just did a test using std:: and old-fashioned char[]. The fancy std:: is fifteen times slower than strcat(). In a loop with intensive string manipulation, this could mean the program gets back to you in 15 seconds instead of 1 second. You don't mind waiting?

    Here's the code (Visual Studio 2010):

    #include <stdio.h>
    #include <string.h>
    #include <string>
    #include <time.h>
    #include <windows.h>

    char buf[64], buf2[64];
    int i;
    clock_t start = clock();
    for(i = 0; i < 100000; i++)
    {    _snprintf(buf, sizeof(buf), "%d", i);
        _snprintf(buf2, sizeof(buf2), "%d", i * i);
    #if defined FAST
        strcat(buf, buf2);
    #else
        std::string s1 = buf;
        std::string s2 = buf2;
        std::string s3 = s1 + s2;
    #endif
    }
    _snprintf(buf, sizeof(buf), "%.4f\n", (float)(clock() - start) / (float)CLK_TCK);
    OutputDebugString(buf);

Unfortunately performance claims appear to be bogus.

1. ospan, which the performance claims seem to be based on, doesn't do any bounds checks, so you can easily get a buffer overflow.

2. fast_io generates a whopping 50kB of static data just to format an integer.

So if these benchmark results are correct (I was not able to verify because the author hasn't provided the benchmark source):

> format_int 7867424 ns 7866027 ns 89 items_per_second=127.129M/s

> fast_io_ospan_res 6871917 ns 6870708 ns 102 items_per_second=145.545M/s

fast_io gives a 15% perf improvement by replacing a safe format_int API from https://github.com/fmtlib/fmt with a similar but unsafe one + 50kB of extra data. Adding safety will likely bring perf down, which the last line seems to confirm:

> fast_io_concat 7967591 ns 7966162 ns 88 items_per_second=125.531M/s

This shows that fast_io is slightly slower than the equivalent {fmt} code. Again, this is from fast_io's own benchmark results, which I haven't been able to reproduce.

50kB may not seem like much but for comparison, after a recent binary size optimization, the whole {fmt} library is around 57kB when compiled with `-Os -flto`: http://www.zverovich.net/2020/05/21/reducing-library-size.ht...

The floating-point benchmark results are even less meaningful. They appear to be based on a benchmark that I wrote to test the worst case Grisu (https://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/p...) performance on unrealistic random data with maximum digit count. fast_io compares it to Ryu (https://dl.acm.org/doi/pdf/10.1145/3192366.3192369) where maximum digit count is actually the best case and the performance degrades as the number of digits goes down. A meaningful thing to do would be to use Milo Yip's benchmark instead: https://github.com/miloyip/dtoa-benchmark


Sure, but how often is string formatting the bottleneck on performance?

CSPRNGs are not much slower than the alternatives, and really, how often does the hot path for data creation fall on the processor rather than on memory access or I/O?

Nope, looks like a really bad choice for generic libraries.


Even the "optimized" C version is far from what an experienced C programmer would write if performance was paramount. General-purpose memory allocation, using hash tables with inherently bad spatial and temporal locality, using buffered I/O instead of mapping the file to memory.

FWIW, in a file with 1,000,000 lines, the best-of-3 time for "less filename > /dev/null" is:

  2.508u 0.152s 0:02.66 99.6%	0+0k 0+0io 0pf+0w
and the best-of-3 time for "less -N filename > /dev/null" is:

  2.568u 0.159s 0:02.73 99.2%	0+0k 0+0io 0pf+0w
That is, it doesn't seem like printing sequential line numbers is the limiting factor in performance.

This is with "less 458", "Copyright (C) 1984-2012". I downloaded and compiled stock 487 and the best-of-3 times went up to 0:02.94 for both cases.

Checking the source code, it does not appear to use knowledge of the previous output index in order to save time. The relevant code is:

  static int
  iprint_linenum(num)
        LINENUM num;
  {
        char buf[INT_STRLEN_BOUND(num)];

        linenumtoa(num, buf);
        putstr(buf);
        return ((int) strlen(buf));
  }
where

  #define TYPE_TO_A_FUNC(funcname, type) \
  void funcname(num, buf) \
          type num; \
          char *buf; \
  { \
          int neg = (num < 0); \
          char tbuf[INT_STRLEN_BOUND(num)+2]; \
          register char *s = tbuf + sizeof(tbuf); \
          if (neg) num = -num; \
          *--s = '\0'; \
          do { \
                  *--s = (num % 10) + '0'; \
          } while ((num /= 10) != 0); \
          if (neg) *--s = '-'; \
          strcpy(buf, s); \
  }
  
  TYPE_TO_A_FUNC(linenumtoa, LINENUM)

I forget the exact reasoning now, but I remember it being about 10x slower than memcpy or strncpy. I think the main reason was because of the need to parse the format string.
