
They are both technical, but 'order by bytes desc' has got to be more expressive than 'sort -nr'. It's almost natural human English, whereas the latter doesn't express anything.
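For reference, a minimal sketch of the `sort -nr` side of that comparison. The input lines (byte count first, then a path) are made up for illustration and are not from any real log.

```shell
# Hypothetical input: "<bytes> <path>", one record per line.
log='5120 /index.html
20480 /logo.png
64 /robots.txt'

# -n compares the leading number numerically, -r reverses the order,
# which is roughly what "order by bytes desc" says in SQL.
by_bytes_desc=$(printf '%s\n' "$log" | LC_ALL=C sort -nr)
printf '%s\n' "$by_bytes_desc"
```

If the byte count were not the first thing on the line, a key such as `-k2,2nr` would be needed, which is arguably where the SQL form starts to read better.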

That said, I don't know how much time it would genuinely save. As with most of these tools, you shouldn't be installing them on production servers, so you still have to know Bash anyway.




Yes, I know that this is the correct way of doing it in bash. I posted this because someone might test the speed of the two scripts and conclude "bash is faster" when they probably just measured the speed of "sort".

I would also add: if it’s a one time thing, I would just do it in visual studio code or any other editor that doesn’t die on a 1GB file. And I have 12+ years experience with bash and unix tools, so it’s not about a lack of knowledge or experience. There isn’t anything magical regarding “sort” versus another tool, if there isn’t a need for automation they are equivalent.

You can do that; it uses natural sort (not filesystem-dependent).

Or just `sort -u` (if you are using GNU sort, not sure about other implementations)

Another difference is that sort is optimized to handle large files [0]

[0] https://unix.stackexchange.com/questions/279096/scalability-...


This is very memory intensive. Which may not matter if the data volume is small enough. But it is also a bit hard to understand, at least not so obvious at first sight. For most use cases sort -u would be ideal and way simpler to understand, if you don't mind having an ordered file at output.
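The command being replied to is not quoted here. Assuming it was a hash-based one-liner along the lines of `awk '!seen[$0]++'` (an assumption, used only as a stand-in), the trade-off against `sort -u` might look like this on made-up input:

```shell
input='pear
apple
pear
fig
apple'

# Assumed stand-in for the unquoted command: print a line only the first
# time it is seen. Input order is preserved, but the `seen` array keeps
# every distinct line in memory.
first_seen=$(printf '%s\n' "$input" | awk '!seen[$0]++')

# sort -u: easier to read, but the output comes back ordered.
sorted_unique=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

printf 'awk:\n%s\n\nsort -u:\n%s\n' "$first_seen" "$sorted_unique"
```

Both remove the duplicates; they differ in whether the original order survives and in where the memory cost lands.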

I'd also consider things like piping to awk or finding the largest files in a set of directories. Human readable sorting is much more complex.

Both of those Perl solutions could be shorter (and probably quicker) in that you don't need the "sort -rn" as you can do that sort directly in Perl.

I use /bin/sort a lot, so I wrote this BASH script that wraps around `/bin/sort`, and uses a cache to speed up subsequent sorts.

That's BSD's sort, not GNU sort, isn't it?

This is far more useful for SQL users than chaining several sed/awk/sort commands in a pipe (although a bit against *nix principles).

I guess you wouldn't, you'd have a sort function afterwards.

...for Unix' insistence on composability, the shell tools are often unnecessarily monolithic, probably because that's the only sane way if the only type you have in interconnect is `string`.


In my experience there's a performance hit in filesystem-heavy work like opening and closing a lot of small files.

Still, it's more than offset by the convenience of performing bash operations on Windows; remembering `du -csh ./* | sort -h` (sort directories by size) is easier than whatever PowerShell would have me type.


I almost always prefer the GNU utils to the BSD ones. The GNU ones usually allow arguments in nicer orders. GNU `sort -h` can sort the human-sized output of GNU `du -h`. I've noticed tons of niceties from these sorts of tools on Ubuntu are completely missing from BSD based OSX.
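A small sketch of what `-h` buys you, using hand-written sizes in the style of `du -h` output rather than a real `du` run (the directory names are made up). It assumes a `sort` that supports `-h`, such as GNU sort.

```shell
# Hypothetical du -h style lines: "<human-readable size>  <name>".
sizes='1.5G  videos
12K   notes
300M  photos
4.0K  empty'

# -h understands the K/M/G suffixes, so 1.5G sorts after 300M.
by_human_size=$(printf '%s\n' "$sizes" | LC_ALL=C sort -h)

# -n only looks at the leading number, so 1.5G sorts first.
by_number=$(printf '%s\n' "$sizes" | LC_ALL=C sort -n)

printf 'sort -h:\n%s\n\nsort -n:\n%s\n' "$by_human_size" "$by_number"
```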

One man's bloat is another man's convenience.


I do a lot of bash scripting so I know `/usr/bin/sort` decently well. Using this technique I can use my bash scripting abilities to enhance my vim abilities.

You didn't need to; unix sort has been able to do this for way more than 10 years =).

I guess shell scripting and the standard commands are convenient interfaces to a lot more functionality. The interactive use (and pipes) tend to make the linear processing super terse and easy (but lots of other things become annoying).

In a type-safe language, it’s typically impossible to specify a function which can do as many things as (eg) sort can, and which doesn’t require specifying a huge amount of default arguments on every invocation. Therefore you end up with a limited function, a huge annoying function, or many small functions. In any case, changing from one kind of sort to another (eg how to sort things, what to sort, how to order them, whether to keep unique elements, whether to be stable, whether to take multiple inputs, whether to assume the inputs are sorted (so just merge), and so on) becomes difficult, whereas with a tweak-in-many-ways command like sort it is easy. In a language like CL with optional arguments and no types, this problem can be less hard, but whereas sort will likely work fine for a large dataset, your favourite library function may not. (In particular it almost definitely won’t use on-disk storage when needed or support parallelism. Partly, note that storing objects from your favourite language to disk is probably hard, but if you only work on lines of text it’s easy.)
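To make the "tweak-in-many-ways" point concrete, here is a rough sketch of one input being sorted three different ways by changing only flags. The records (name, team, age) are invented for illustration.

```shell
# Hypothetical comma-separated records: name,team,age
records='carol,ops,41
alice,dev,33
bob,dev,33
alice,dev,33'

# What to sort on and how: third field, numeric, descending.
by_age_desc=$(printf '%s\n' "$records" | LC_ALL=C sort -t, -k3,3nr)

# Whether to be stable: -s keeps input order among equal keys.
by_team_stable=$(printf '%s\n' "$records" | LC_ALL=C sort -s -t, -k2,2)

# Whether to keep unique elements: -u keeps one line per distinct key.
one_per_name=$(printf '%s\n' "$records" | LC_ALL=C sort -u -t, -k1,1)

printf '%s\n\n%s\n\n%s\n' "$by_age_desc" "$by_team_stable" "$one_per_name"
```

Each variant is a one-flag change at the call site, which is the convenience the comment is contrasting with a typed library function.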

The other big difference is the data types. The data type of shell programs is basically a sequence of lines, each of which is sometimes broken up into one or more records by some other separator. In many languages the sequence operations are all about getting the nth element, or extending them, or changing elements, or deleting them, or splitting the sequence up into other sequences. And these operations typically identify the elements by the position. In shell scripts, many of these operations are unavailable or not used. One basically only iterates forwards, processing each element, and one almost never thinks about the position of the element in the sequence. (c)split exists but isn’t commonly used.

Lots of work in a typical language is converting one data type to another, or extracting bits of it, or doing data type-specific things. In shell these don’t really exist (eg you don’t sort as dates, you write your dates like 2019-09-28 and sort them like strings), and so data is simple, and because extraction is flexible (typically a field number or byte positions) lots of the bureaucracy of changing data types is omitted.
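The date remark can be seen in a couple of lines. This is a minimal sketch with made-up dates and event names: because the format is year-month-day with zero padding, plain string order is also chronological order, and the field is picked by position rather than by type.

```shell
# Hypothetical records: "<event> <ISO date>"
events='release 2019-09-28
kickoff 2018-12-31
beta 2019-01-05'

# No date parsing: sort the second field as an ordinary string.
chronological=$(printf '%s\n' "$events" | LC_ALL=C sort -k2,2)
printf '%s\n' "$chronological"
```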

These things all often lead to shell scripts being unreliable. But also resilient, flexible and “sufficient”. By “sufficient” I mean that they can do things well enough that the cost for a long term, thorough, or “proper” solution isn’t justified.


Indeed, I don't see why people get so upset (or pedantic) about what are, effectively, NOPs in command-lines.

However, things change if you start adding certain options to sort:

  sort -m order.*

and

  cat order.* | sort -m

are definitely not the same thing (for most input files at least).
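A small sketch of why, using two throwaway files created in a temp directory (the `order.a` and `order.b` names and their contents are made up). With file arguments, `-m` sees two already-sorted inputs and interleaves them; through `cat`, it sees a single stream and has nothing to merge it with.

```shell
# Two hypothetical, individually sorted input files.
tmpdir=$(mktemp -d)
printf '1\n3\n5\n' > "$tmpdir/order.a"
printf '2\n4\n6\n' > "$tmpdir/order.b"

# Two inputs: -m interleaves them into one sorted stream.
merged_from_files=$(LC_ALL=C sort -m "$tmpdir"/order.*)

# One input (the concatenation): -m assumes it is already sorted,
# so the result is not guaranteed to be in order.
merged_from_pipe=$(cat "$tmpdir"/order.* | LC_ALL=C sort -m)

printf 'files:\n%s\n\npipe:\n%s\n' "$merged_from_files" "$merged_from_pipe"
rm -rf "$tmpdir"
```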

I doubt you'll find many Emacs users that would prefer "C-u M-| sort" over "M-x sort-lines".

Right, but sort is apparently clever enough to do a disk-backed mergesort on a file (according to timr's comment), but it doesn't get a chance to recognize that you're sorting a file if you just pipe in the data.