Hacker Read top | best | new | newcomments | leaders | about | bookmarklet login

Probably because the underlying implementation is a binary search tree which requires a comparison operator. Haskell likewise has Data.Set which requires an Ord instance. There's HashSet which requires a Hashable instance which can be automatically derived for any data type, so in principle you don't need ordering, but I would imagine in C++ you would have to provide your own hashing function from whatever data type you're trying to put in a hash set.


sort by: page size:

C++ 'sets' are ordered mutable binary search trees in most implementations.

Not the ties and hashtables of other languages.

With strict weak ordering, you really only have the ordering of to elements.

Ada took care of that polysemy between ordered and hash sets a long time ago.

But ya it took them to long to implement 'contains', but JavaScript requiring ordered sets while allowing multiple implementation details complicated it IMHO. Had they chosen a structure vs a time complexity it would have been easier to extend.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


> Why is the default set implementation ordered in the first place?

"Sorted" rather than ordered (an order can be arbitrary and I'd personally associate the word with insertion order).

> The formal data structure is unordered

The formal data structure doesn't have complexity bounds, the STL does.


Sorting or hashing are the main ways to:

- construct sets

- remove duplicates

- construct and query dictionaries (especially multi-maps, btrees, etc)

- group equivalent elements

- find the number of occurrences

- etc

Hashing solutions tend to be popular in modern language standard libraries, but not in C or C++ where in-place algorithms and simpler implementations are preferred. You could also make the argument that it's much harder to use hashes with generics as it's not always clear how to define good hashes for complex data types. Alternatively, sort requires only a < implementation.

> bucket sort which you could say runs in linear time.

This only works for integers in a small range. Also consider strings, doubles, or lexicographically ordered arrays.


> Sets and Maps are sorted (by insertion order)

That's not what a "sorted set" is. C++ and Rust provide sets (and maps) that sort based on an arbitrary comparison operator.

> ?

?

> but someone will likely be exhausted by that addition

I suppose. I guess it's one of those things that they really should have got right the first time. Like String.replaceAll(). I guess they will never add String.replaceAllSafe() - well just have to rely on linters to tell us that the API is terrible forever.


This has nothing to do with the language and is just a huge flaw in the design of the standard library: inability to supply a comparator function that is used for data structure types like sets, etc.

But it does have a method to sort containers in its standard library, `sorted`, that C++ doesn't have. And it's trivial to use it to sort lists, sets, and dicts...

I believe the issue is that C++ does not track what are, and are not pure functions, so the ordering optimisation is, in general, unsafe.

While Haskell is undoubtedly better suited to writing concise implementations of sorting algorithms, the C++ versions given could have been written in a much shorter and clearer fashion.

C++ also has the `sort()` function that allows you to sort any unsorted container. But that's not a replacement for a sorted container like `set` or `map` though. Because `set` or `map` allows you to insert elements at O(log n) runtime. If you have to sort every time you insert using the `sort()` or `sorted()` functions, the run time becomes O(n log n).

All this article has shown is that std::sort is much more optimal than std::unordered_map on the C++ standard library used.

Every std::unordered_map I've used is surprisingly slow. Normally hash-tables are slow because it's so easy to write a slow hash-table, but std::unordered_map is slow for other reasons that I have had explained to me and then quickly forgot :(.

I also find it strange that unordered_map is used rather than unordered_set. Not sure if it would make a performance difference, but if we are going for naive implementations, that should be at the top of the list.


I'd expect better write up from Daniel. std::unordered_set is implemented with chained buckets, which incurs extra memory references. Cursory google revealed that an open addressing hash set would be about 3x faster, which would handily beat the sort version.

All true. On the other hand, std::sort in C++ works on RandomAccessIterator, which means that it can sort vectors, C arrays, deques, strings, your custom container type, etc. The Haskell implementation can only sort lists. Furthermore, the Haskell type signature fails to express that the input and output have the same size. But std::sort necessarily preserves its size, because you pass it a range, and a range cannot access its underlying container to change its size (a limitation leading to bizarre APIs like std::unique).

I wonder why they don't compare against std::sort? That seems like most people's standard choice.

Unlike qsort's n^2 worst case, the conditions that degrade hash tables are very common: all you have to do is specify a crappy hash function for your load, or size the table poorly.

(I'll assume you're talking about the quicksort algorithm rather than the qsort() function because you're comparing it to hash table accesses in general rather than a specific implementation of them.)

But I'm not sure I agree. The worst case for quicksort is an already-sorted list. I run into this scenario all the time, though usually in a slightly different form. When I iterate through the elements of one of the tree-based containers std::set and std::map they emerge in sorted order. If, inside my loop, I then insert them into a similar container I end up with worst-case performance. The other day I replaced std::set with std::unordered_set and saw a dramatic increase in performance, although it may not have been entirely due to this effect.

On the other hand, a non-crappy hash function should be available to everyone at this point. Personally I like FNV-1a, but haven't found an issue with whatever my C++ standard library is supplying. I have spoken with other programmers though who didn't realize hash tables needed some empty space to perform well or had stumbled into a pathological case with their oversimplified hash function.

This is kind of a silly semantics argument... but, if you interview for a job and look like a deer in the headlights when I said hash tables aren't O(1), NO HIRE.

Gee Ptacek, what do you have against deer? I can just imagine some poor interviewee with a bright desk lamp shining into his eyes...


> Ironically, due to caches, sorting and then using algorithms that rely on order tend to be superior than most hashing implementations

This doesn't match my experience at all. C++ trees are not cache-friendly; they're pointer-chasing (and there's no arena implementation in the STL). Second, any sort of ordering structure (be it through a tree or through sorting + binary search) is notoriously prone to branch mispredictions. They look great in a microbenchmark if you search for the same element all the time and the trees are small, but in a practical application, it can be incredibly painful. Pretty much by design, every choice is 50–50 and unpredictable.

std::unordered_map from most STLs isn't a fantastic hash table, but in my experience it absolutely destroys std::map (and is also comfortably ahead of something like std::lower_bound on a sorted array). All of this is assuming large-ish N, of course; for small N (e.g. below 30, usually), the best by far is a simple unsorted array and just going through it linearly.


> That means there is a lot of overhead for doing something simple like an in-place qsort

Well, the problem not so much that there is a lot of overhead for doing an in-place qsort. The problem is that "in-place" anything is impossible in Haskell, it violates referential transparency.

Merge sort however needs no in-place update and can trivially (in Haskell world) be parallelized (merge sort is associative, you can use it as the reduce step of MapReduce). So you are correct in concluding that mergesort is used in Haskell.

You state "something simple like an in-place qsort", your statement about qsort being simple makes some assumptions about your programming languages world, which are not the same for all languages especially not Haskell. No whether all this is a pro and a con depends on what sort of coding you're doing of course.

I think that people call Haskell's code more elegant because they (like me) find C++ template and type annotations to be annoyingly verbose and messy for no good reason. For example for the two quicksort functions:

"template<typename RandomAccessIterator> void quicksort(RandomAccessIterator first, RandomAccessIterator last)"

vs

"quicksort :: (Ord a) => [a] -> [a]" (where "Ord a" means that a is an instance of the typeclass of ordered items).


> Even then, std::vector has set-like operations through binary_search or std::make_heap in C++, so it really isn't that hard using a sorted (or make_heap'd) std::vector in practice.

Except std::vector does not enforce the sorted or heap constraints when adding or removing elements so eventually someone working on your code will break them.


Indeed. std::unordered_* implementations are hogtied by the interface and complexity requirements in the C++ standard (it basically mandates chained buckets), and more generally the need to work with almost any hash function (linear probing would suck with crappy hash functions for example).

Jo Muñoz wrote a cool blog series on the current implementations in circulation, the flaws with the interface, and how his implementation (Boost multi_index) made further optimisations despite the constraints.

Don't miss the other two parts of the series - look under October on the right. He did extensive bookmarks in November as well.

http://bannalia.blogspot.co.uk/2013/10/implementation-of-c-u...


All the (C++) defaults are wrong, of course.

But in this case I believe the choice to have a default was itself the error. I have been persuaded that either you know which Ordering you need, and thus you should specify what you meant - or else you should not be using these APIs at all. If you aren't confident of the correct Ordering, you are not in fact confident your algorithm is correct, so don't do it.

next

Legal | privacy