Hacker News

Hash tables in C are not part of the standard. They are part of POSIX, so they are mandated only in UNIX-ish environments. The standards committee has nothing to do with that.



I was surprised recently, when looking at different hash tables implemented in C, to discover that the standard library includes its own hash table (hsearch). It is even part of POSIX. There is a reason you have never heard of it, or, if you have, have never used it: in true POSIX fashion it is close to useless. The interface makes you fix the table's capacity when you create it, there is no way to delete an entry or change a value you pull from the table, and it stores the table in a static area local to the implementation, so you can only use a single table in a program at a time. It doggedly commits every C stdlib sin possible.

C noob here. Why isn't a hash table implementation part of the C standard library? Is it because the stdlib has to stay as thin as possible, for performance or some other reason?

Maybe they should be. I know C is all about staying minimal, but native hash tables have proven themselves extremely useful in every other language that has them.

Ehh, hash tables were invented in the 50s and were and are used wherever they are useful. I’m pretty sure a running joke is that every decently sized C program contains a half dozen hash table implementations. They’re not recent.

Even in C it’s not that hard to make a basic hash table. It’s probably like 20 lines of code or something.

"No, because it's possible to implement efficient hash tables with C's primitives -- in fact that's how hash tables in essentially ALL languages ARE implemented."

Because each architecture has a C compiler that's been highly optimized. Popularity plus money invested; that's it. If you were right, we'd see optimizations coded in C even when alternative optimizing compilers were available. Interestingly enough, I got a one-word counter to that from the high-performance-computing field: FORTRAN. The Free Pascal people are doing fine on performance and low-level code as well, despite little to no investment in them.

Seems throwing money at a turd (e.g. FORTRAN, C) can get a lot of people's hands on it, despite some of us shuddering and saying "Get that pile of crap away from me!"


That is quibbling over terms. Many languages don't let you specify a custom hash function, so you can't supply one without writing a custom hash table. A fair benchmark would allow for this, since it is allowed in C.

Again, for the benchmarks game, C is not free to implement an optimized hash table for the problem.

Any of the programs are free to implement a "custom hash function" and many of them do.

hash table != hash function


Agreed this seems like major overkill, but then again so does writing your own hash table from scratch in C. :shrug:

No, a "custom implemented" hash table is not used for the benchmark in C.

A generic hash table from a third-party library is used -- something written to be generally useful, not compromised to look good on a toy benchmark.

https://github.com/attractivechaos/klib/


> It really says something about C that so many useful-but-small systems have a good chunk of code devoted to hash table implementations

For this speed-obsessed project, I wouldn't have it any other way:

  Benchmarking hash lookups in an integer-keyed hash table.

  Table size: 8, keys: 1-8 ====
  upb_inttable(seq): 410 M/s
  upb_inttable(rand): 334 M/s
  std::map<int32_t, int32_t>(seq): 149 M/s
  std::map<int32_t, int32_t>(rand): 170 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 69.9 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 68.0 M/s

  Table size: 64, keys: 1-64 ====
  upb_inttable(seq): 410 M/s
  upb_inttable(rand): 333 M/s
  std::map<int32_t, int32_t>(seq): 32.4 M/s
  std::map<int32_t, int32_t>(rand): 33.4 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 67.9 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 71.4 M/s

  Table size: 512, keys: 1-512 ====
  upb_inttable(seq): 407 M/s
  upb_inttable(rand): 330 M/s
  std::map<int32_t, int32_t>(seq): 20.8 M/s
  std::map<int32_t, int32_t>(rand): 17.2 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 70.8 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 64.6 M/s

  Table size: 64, keys: 1-32 and 10133-10164 ====
  upb_inttable(seq): 394 M/s
  upb_inttable(rand): 334 M/s
  std::map<int32_t, int32_t>(seq): 32.0 M/s
  std::map<int32_t, int32_t>(rand): 30.7 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 71.3 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 70.6 M/s

Spare yourself this light article on bad hash tables. The best hash table in C is currently github.com/LIMachi/swiss-table, but even that is missing the security discussion.

Because it's probably the most notable/unusual part of this hash function. Virtually all other hash functions for at least a decade have had their reference implementations written in C or C++.

I believe not including a hashtable in the standard was one of the problems (but I don't know much Scheme)

"Is that really the gatekeeper we want?" Yes, I think so? I mean implementing a hash table in C is really easy. What is there to "forget"? If you can't do it, either you don't know what a hash table is, or you don't know how to program in C.

    #include <stddef.h>  // size_t, NULL
    #include <stdlib.h>  // abort

    enum { table_size = 1024 };
    typedef struct item { int k, v, full; } item;
    static item table[table_size];

    void put(item kv)
    {
        size_t orig = kv.k % table_size;  // dumbest possible hash function
        size_t b = (orig + 1) % table_size;
        // use linear probing for collisions
        while (table[b].full && b != orig) b = (b + 1) % table_size;
        if (b == orig) abort();  // table full
        table[b] = kv;
        table[b].full = 1;
    }

    item *get(int k)
    {
        size_t orig = k % table_size;
        size_t b = (orig + 1) % table_size;
        while (table[b].full && table[b].k != k && b != orig) {
            b = (b + 1) % table_size;
        }
        return (table[b].full && table[b].k == k) ? &table[b] : NULL;
    }
I haven't tested that (just as I wouldn't in a whiteboard interview) so I don't know if it works. I did briefly skim the article we're talking about, but I concluded I didn't have anything to learn from it, so I didn't read it. I found a bunch of bugs in my implementation as I was writing it. And of course it's a pretty dumb hash table: linear probing is very suboptimal, it uses a statically-allocated non-resizable hash table, it's specialized for a certain type of keys, and so on.

But are you saying you can't even do that? Probably it would be bad to hire you for a C programming job, then, unless it was an entry-level position for you to learn C in.

?

Evidently the above comment took me 15 minutes to write. Oh, and now I see another bug or at least lacuna: if you try to update the value of an existing key, it doesn't fail, but it also doesn't update, instead adding a new entry that will never be read. And then I did write a minimal smoke test and run it, and unsurprisingly, it does work:

    #include <stdio.h>

    int main()
    {
      put((item) { 4, 7 });
      put((item) { 1028, 9 });
      put((item) { 3, 25 });
      printf("%d %d %d %p %p\n", get(4)->v, get(1028)->v, get(3)->v, get(5), get(3));
      return 0;
    }
Writing the test and the above additional commentary, and running the test, took another 8 minutes.

Oh, and there's another bug where it will loop infinitely when the hash table is full and the key is negative. (Actually it invokes UB but typically the result will just be an infinite loop.)

If it takes you 60 or 100 minutes to write this, or to write something better, maybe you still know C and know what hash tables are. But if you throw up your hands and say "I don't remember, it's been a long time since I was in school"? Either you don't know C, or you don't know what a hash table is, which probably means you've never written anything in C but toy programs. (This is a toy program, of course, but many toy programs in C don't need hash tables.)

The compilable and fixed toy program in question is at http://canonical.org/~kragen/sw/dev3/dumbhash.c if you want to probe it for bugs.

It occurs to me that you might have been talking about something that isn't a C programming job. For example, we might be talking about a job programming an HTML database front-end in Python using Django. And of course it would be silly to reject someone for a job like that because they didn't know C, unless they claimed to know C on their résumé. And it's totally reasonable for a Python programmer, or a Python/Perl/JS/bash programmer, to not understand hash tables.


Are you suggesting there aren't mature implementations of hash maps in C you can easily vendor?

Engineers avoiding hash tables: yes, that's one myth. Still waiting for the other four promised in the headline. In the age of JavaScript everywhere, I'd say hash tables have pretty much won.

I was curious about which corner of software technology the author was coming from where hash tables are routinely questioned (embedded C maybe?), but even the followup post does not quite explain it. ( https://hughewilliams.com/2013/06/03/reflecting-on-my-hash-t... , also references a previous hn discussion)


Maybe you do...

The hashes generally used for things like hash tables don't need to be cryptographically secure, and as such can be substantially simpler. Once you get into cryptographically secure hashes, things get complex enough that you end up with timing variations if you're not very careful, especially once table lookups and the like are involved (although that's more common with encryption algorithms than with straight hash algorithms).

And the compiler - with C and C++ at least - is free to optimize in ways that introduce timing variations. Which is what sparked this conversation in the first place. So even if your source code doesn't take data-dependent time, the compiler may make the compiled output take data-dependent time. And there's no way around that in pure C / C++.


People who write hash functions are both exceedingly anal about performance concerns and generally interested in having their work compile with any and all reasonable C compilers (yes, even those that aren't gcc) and still have at least good performance.

They also tend not to be designed in code: they are designed in math and then translated to code as the last step, preferably in a way that can remain static forever (they should not need maintenance by an engineer, even/especially to support random new C compiler builds).

(Some of my friends entered the recent SHA3 NIST hash competition and lasted long enough to have random other mathematicians scrutinize, improve, and finally defeat their math. http://en.wikipedia.org/wiki/Spectral_Hash)

