Faster and simpler with the command line: deep-comparing JSON files with jq (genius.engineering)
206 points by wwarnerandrew | 2018-12-06 07:28:42 | 93 comments




It's worth mentioning that there are much faster JSON parsing libraries than the default in Ruby stdlib. I still don't think Ruby is the best choice for doing raw JSON parsing. Last time I had to care about JSON speed we were transforming billions of events and the Ruby JSON lib was becoming a bottleneck

What do you think is the best way to do deep JSON comparisons? We work with 2GB JSONs all day, and it is super annoying how long they take to process.

Start with a compiled language, I guess? I don't operate on anywhere near that scale, but json-rust reaches 400 MB/s for me.

It doesn't parallelize, and you'd need memory enough for the entire structure, but of course Rust doesn't have GC overhead. You could trivially parse both files in parallel, at least.


Not parse them into a tree, to start with.

Use a streaming JSON parser, and compare them token by token unless/until they diverge, at which point you take whatever action is suitable to identify the delta.

Parsing it into a tree may be necessary if you want to do more complex comparisons (such as sorting child objects etc.), but even then, depending on your requirements, you may well be better off storing offsets into the file.

https://github.com/lloyd/yajl is an example of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON is simple enough that you could write one specifically to handle two streams.
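
A rough shell-level sketch of that token-by-token idea, using jq's --stream mode rather than a hand-rolled two-stream parser (filenames and the use of bash process substitution are assumptions; --stream emits [path, leaf] events instead of building the full tree):

    # one event per line per file, then compare the two event streams
    diff <(jq -c --stream . a.json) <(jq -c --stream . b.json)

This only works as a direct equality check if both files serialize objects and keys in the same order; otherwise you still need a normalization pass first.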


I believe this benchmark comparison could be useful for you, and you can expand it further with more tests. (Though I got downvoted for sharing a link.)

https://github.com/kostya/benchmarks/blob/master/README.md


That still parses into a tree.

nodejs comes to mind!

(1) Try a language with fast allocations (C, C++, Rust, maybe Go or Java) -- anything except Python or Ruby

or

(2) Try using streaming API (I don't know Ruby, but quick google found https://github.com/dgraham/json-stream ). Note that this method will require you to massively restructure your program -- you want to avoid having all of the data in memory at once.

The streaming API might work better with jq-based preprocessing -- for example, if you want to compare two unsorted sets, it may be faster to sort them using jq, then compare line-by-line using streaming API.
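
As a sketch of that idea, assuming each export is a single JSON array of objects (the filenames and structure are hypothetical):

    # normalize: one object per line with keys sorted, then sort the lines themselves
    jq -cS '.[]' old.json | sort > old.norm
    jq -cS '.[]' new.json | sort > new.norm
    # comm -3 prints only the lines unique to one side, i.e. the differing records
    comm -3 old.norm new.norm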


Python is fast at parsing JSON; Go had a hard time matching its parsing speed. Additionally, you have PyPy to help.


Python is fast at doing anything that doesn't involve running Python.

That's an important caveat. Python's C JSON parser library is super-fast, but if you want to use the data for anything but a simple equality check afterwards, it'll be slow as molasses.

Or you'll write a C extension for it...


Rust is absolutely wonderful for tasks like this: they don't hit any of the cases where Rust's ownership model can make things tricky, and the serde library makes deserializing JSON a piece of cake.

You end up with code which looks pretty similar to the equivalent JavaScript or Python code, but performs much faster (10x, 100x or even 1000x faster).


There's also pikkr (https://github.com/pikkr/pikkr) if you need really really fast JSON parsing.

> It's worth mentioning that there are much faster JSON parsing libraries than the default in Ruby stdlib.

I am on the edge of my seat now.

Would you mind listing which libraries are much (say, an order of magnitude) faster?


Don't think it's an order of magnitude faster, but oj is supposed to be the standard for Ruby.

https://github.com/ohler55/oj


I'm curious how well the Crystal language could handle that huge amount of JSON, since most of the Ruby code could be ported over to Crystal.

It has a JSON pull parser to minimize memory usage, which is useful for memory-constrained environments but comes at the expense of performance. If the work could be split up across forked Crystal processes, I believe it's feasible.


There are stream parsers for JSON for Ruby too, including bindings for C libraries like YAJL - using the default JSON parser is an awful choice for doing comparisons like that given the massive overhead of the amount of objects it'll be creating for no good reason.

Agreed; I believe the benchmark covering YAJL and jq in this repo is useful for you:

https://github.com/kostya/benchmarks/blob/master/README.md


I wonder if a C/C++ program would perform better?

jq is a C program.

In theory a truly purpose-built program could do better. In practice, the broad scope of jq lets you discover the operations you need and respond to changes in requirements without being locked into custom code, and any given programmer probably couldn't do the same job better.


jq is a C program, yes, but jq programs are interpreted. Because jq is a dynamically-typed language, it wouldn't be easy to compile it to object code that would run much faster than the bytecode-interpreted version (though it would still run faster).

As you say, jq's power is that it is an expressive language, and it's much, much easier to write jq programs that work than it is to write C/C++ programs that do the same or similar work.


The interpreted language is just the setup phase for a pipeline of compiled-in data transformations.

I don't understand this statement. Keep in mind I'm a jq maintainer.

Yes. If well-written anyways.

Another option is to use a tool like 'gron' to convert the JSON into a shell-friendly line-oriented format. This makes the rest straightforward.

https://github.com/tomnomnom/gron
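
For example, a quick structural diff with gron could look like this (filenames hypothetical):

    # flatten each document into one "path = value;" assignment per line, then diff
    gron a.json > a.flat
    gron b.json > b.flat
    diff a.flat b.flat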


jq can do something like that too! Check out its --stream option.

Oh, this sounds handy, though I can't imagine it'd be as performant for this particular case.

My problem with this article is the entire strategy for delivering data. A JSON file? That's probably the worst way I could think of: a CSV file would have been better. Part of the reason why databases exist is to handle exactly the problem the author is posing. There are already tools in place in SQL databases that track diffs for TB db's, and the authors could simply export patch files which would be far easier to analyze for inconsistency.

It really bothers me that the article is like "wow, check out this awesome utility that helps us with a huge problem" instead of really thinking about "how did we get this huge problem and is there already a solution?"

It just reeks of inexperience.


CSV is hardly an easy format to parse, or really to produce. There is no CSV standard, and I bet lyrics contain all kinds of weird characters that make choosing a separator hard.

Newline-delimited JSON is a fine interchange format for this, and the only advantage I can see for CSV is that you can load it into a database in one command. Which raises the question of why use a database at all for a simple one-off diff, when there are much more lightweight alternatives (a shell command).

So now you are converting your JSON to CSV to load it into a database to run a bunch of database diffs over it to then compare them in some way. Wouldn't that lead to the question "how did we get this huge problem and is there already a solution"?

Seems like you are the one over complicating things.

And I have to say, choosing CSV and then using a database for this task reeks of inexperience. KISS.


Oh, I see: you probably didn't know that CSV can also mean character-separated values, and can actually use non-printing ASCII characters as delimiters. You didn't think I actually meant commas, did you? I guess your experience with character-separated value files is very limited. But my point was to illustrate that if record fields are consistent you don't need a heavyweight solution like JSON... and you clearly missed my point.

Yes, you're totally correct. Using a heavyweight solution like JSON is beyond the pale, I should use a much more lightweight approach involving a database server.

Your tone is oddly superior in your reply, which is really at odds with the technical content of your messages.

> if record fields are consistent

This is all very confused. The issue is that the JSON fields were not consistent compared to the baseline. So now what? You suggest that instead of investing time making them consistent, we should just switch format entirely and then make them consistent? Or are you suggesting that somehow a line of CSV is easier to compare than a line of JSON? Or that I should now shove a bunch of non-printing ASCII characters in my message and that's now better?


Like a SQLite DB? Actually, why don't we just transfer stuff as SQLite DBs. Single file, built-in schema, you can index.

I mean, HDF is super-general and stuff, but it looks like SQLite would solve all the trouble with CSVs.


I'm currently implementing an "individual-scale data-warehouse" service (i.e. "Hadoop without the Hadoop part"), and I'm weighing the choice between "a tarball of CSVs", an SQLite file, and an Apache Avro file as input-side wire formats. (And now HDF5 as well; didn't know about that one.)

I'm still leaning toward "a tarball of CSVs", though:

1. it's very easy to allow different devs to write a bunch of single-purpose Extract tools, each in whatever language is best for the job (e.g. Python if it has to use an API where the only available API-client library impl is in Python) to scrape some particular dimension out of an external source. You can write out CSV data in pretty much any language—even a bash script! That's because, even if the language doesn't have a CSV library, a "trivial" CSV dump can be accomplished by just calling printf with a CSV template string. (Trivial = you know all your stringly-typed data is of constrained formats, such that it doesn't require any quoting. CSV files are trivial more often than you'd think!)

2. Presuming your CSV file is column-ordered to have any primary key(s) first, and that it has no embedded header line, you can "reduce" on the output of a bunch of Extract jobs (i.e. merge-sorting + deduping to produce one CSV dataset) by just feeding all the input files to sort(1) with the -u and -n switches passed. `sort -n -t ','` basically behaves as a very simple streaming CSV parser, while also being amazingly-well-optimized at chewing through on-disk files in parallel. sort(1) is to (local on-disk) CSV data as LevelDB's compaction algorithm is to key-value pair data: a solid primitive that scales with your dataset size.

3. Once you've got two sorted+deduped CSV "snapshot" files, you can create a differential snapshot from them just by calling:

    comm -1 -3 "$old_csv" "$new_csv" > diff.csv
And then, getting an SQL data-migration file out of it (at least for an SQL DB that has something like Postgres's COPY statement, which can read CSV directly) is as simple as:

    cat pre_ddl.sql copy_stmt_begin.sql header.csv diff.csv copy_stmt_end.sql post_ddl.sql > migration.sql
You can then throw that file right into, say, Google Cloud SQL's "import" command.

That being said, the other formats are nice for 1. keeping data like numbers in more compact binary forms, 2. being able to sort and de-dup the data slightly more cheaply, without having to parse anything at point-of-sort. (Though this matters less than you'd think; sort(1)'s minimal parser is very fast, and SQLite/Avro can't get any big access-time wins since the data is neither pre-sorted nor column-oriented.)

But in exchange for this, you lose the ability to cheaply merge working datasets together. You can't just concatenate them—you have to ask your storage-format library to serialize data from one of your data files, and then ask the library to parse and import said data into your other data file. Frequently, the overhead of a complete deep parse of the data is the thing we're trying to avoid the expense of in the first place! (Otherwise, why use an ETL pipeline at all, when you could just have your operational data sources do batched SQL inserts directly to your data warehouse?)


> Oh, I see: you probably didn't know that CSV formats is also means character separated values, and can actually use non-printing ASCII characters as delimiters.

That seems unnecessarily condescending. JSON can also mean Janky Serialized Object Notation, but that's not the common case.

> I guess your experience with character separated value files is very limited.

In practice, using something other than a comma is a good solution for some problems, but not others (eg transfer corruption or you know, the OP's use case).

> a heavyweight solution like JSON.

I've literally never heard that phrase, nor does it make much sense. At best it's 2 more characters for wrapping braces with existing quoted data/numbers and at worst you have to make up a new non-interchangeable format as you run into exceptions from the diff, which can affect past encodings. Sounds more involved than using JSON. shrug


Did you really manage to turn CSV files into a flamewar topic?

Please review https://news.ycombinator.com/newsguidelines.html and avoid turning nasty in arguments on Hacker News.


You're wrong - JSON has types - it's very, very useful just because of that, and if you pass newline-delimited JSON through gzip you basically remove all the size overhead of the redundant keys...

I think this guy did the right thing for what sounded like essentially a one-off job to test this new export tool. Why would you go to all the trouble to use a SQL database for a one-off thing that can be done using text processing or worst-case writing a small script?



It sounds like Apple requires the data in JSON format - they may not have a choice.

If you have json line formatted stuff (or csv) and an aws account, you can do some nice things with Athena and SQL. We have a few simple backoffice tools that I've implemented around simple sql queries on data dumped from various systems that we have in json format. Awesome, if you want to do some quick selects, joins, etc.

If you are going to process this amount of data, don't load it all into memory; process it line by line. Also do that concurrently if you have more than one CPU core available. I've done this with Ruby, Python, Java, and misc shell tools like jq. Use what you are comfortable with and what gets results quickly.

One neat trick with jq is to use it to convert json objects to csv and to then pipe that into csvkit for some quick and dirty sql querying. Generally gets tedious beyond a few hundred MB. I recommend switching to Athena or something similar if that becomes a regular thing for you.
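
A sketch of that jq-to-csvkit trick; the field names and filenames here are made up, not from the article:

    # emit a header plus one CSV row per object, then query it with csvkit's csvsql
    (echo 'id,title,artist'; jq -r '.[] | [.id, .title, .artist] | @csv' songs.json) > songs.csv
    csvsql --query 'SELECT artist, COUNT(*) AS n FROM songs GROUP BY artist' songs.csv

csvsql derives the table name ("songs") from the CSV filename, so the query can reference it directly.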


OP here - good point! We actually use Athena to query these exports in S3 to debug data drift of specific export objects over time. It's quite a useful tool; I was able to go from knowing basically nothing about Athena to querying gzipped newline-delimited JSON files in S3 using SQL in about an hour.

That's cool, you made a thing that verifies the two export jobs you wrote have the same data even though they have different output.

I can't help wondering, if you control the code that generates the JSON, why not output in a conservative, consistent format? I'm sure there are pros/cons, but this work would allow something like `diff` to work, and then you don't have to maintain a separate utility.


Good question! The analysis that I was doing was really a one-off for switching between these processes. We have unit tests and sanity checks to ensure consistency going forward, but as a final check before flipping the switch we wanted to be as confident as possible that we hadn't introduced any regressions across the full data-set.

The new export process is much more reliable and a _lot_ faster, but as a side effect of doing things in a different way it generated the export file in a different format. Given that the order of objects in an export file and the order of keys/etc in the JSON objects didn't matter for anything except comparing the two processes, I figured it was simpler to put the normalization logic in the one-off tool vs baking it into our export process. But certainly if we were maintaining both exports in an ongoing fashion and validating them against each other, it would make a lot more sense to spend time making sure they generated objects and keys in the same order.


I'm surprised the layout of the JSON doesn't arrange music by artist; is there any reason why that wasn't done?

> My first thought was to write a ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was starting to get fairly complicated, didn't work correctly, and was too slow—my first cut took well over an hour. Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?

Key takeaway: next time, start with the second thought first and save yourself well over an hour!


If speed is essential, why not use protobuf/flatbuffer or one of their variants?

I'm using jsonassert [1], a Java based JSON unit testing library, for something very similar.

Not sure how it'd handle comparing 5GB files though.

1. http://jsonassert.skyscreamer.org/


I'm not sure why you are comparing the data to the old export instead of against a source of truth... for example, what is in the upstream data source. Also, why not verify using unit tests? Who is to say that the original export is valid and not the second export?

In theory, I agree! I hope the new codebase has a set of tests to validate just that.

But, in practice, you have a downstream consumer of this data format (Apple in this case..).. Validating the old and new formats are functionally identical is just as important as validating the new format matches the upstream source of truth :)


Just a heads up to anyone using jq - I've previously spent a couple of hours debugging a problem because jq uses float64 to store integers (which might lead to rounding-errors/overflows). For example:

  echo 1152921504606846976 | jq                                                            
  1152921504606847000

Yikes, that's nasty.

It is. But it's a problem of JSON itself, not just jq.

JSON != JavaScript

  echo 1152921504606846976 | python -c 'import sys, json; print(json.load(sys.stdin))'
  1152921504606846976


Just because python has bigints and its stdlib json module supports bigints in json doesn't mean that is an interoperable thing to do.

Python's json package != JSON

JSON: https://tools.ietf.org/html/rfc8259#page-8


The link says that it's up to the implementation, which means it's valid for Python's JSON implementation to support larger numbers.

It's less "interoperable" but not strictly invalid, by my read.


What the GP means is that JSON doesn't require an implementation to decode JSON integers as arbitrary-precision integers, to be "conformant JSON."

Therefore, you can't assume that if you pass some JSON through an arbitrary pipeline of JSON-manipulating tools, written in various languages, that your integer values will be passed through losslessly.

Therefore, you just shouldn't use JSON integer values when you know that the values can potentially be large. This is why e.g. Ethereum's JSON-RPC API uses hex-escaped strings (e.g. "0x0") for representing its "quantity" type.
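
A tiny illustration of that encode-as-string convention (the value is made up): the quantity rides through any JSON tooling untouched because it is just a string, and the consumer converts it explicitly.

    # the large value is preserved exactly, since jq never parses it as a number
    echo '{"quantity":"0xde0b6b3a7640000"}' | jq -r '.quantity'
    # decode on the consuming side (bash printf accepts 64-bit hex)
    printf '%d\n' 0xde0b6b3a7640000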


It looks like JSON doesn't specifically define how numbers should be stored. It just recommends expecting precision up to double-precision limits.

Still interesting to know it's not just a jq quirk.


That's Python not adhering to the JSON standard as defined in the RFC.

The problem is made worse on the receiving end (the browser). I've run into this issue when serialization libraries in Java send a 64-bit long value as a sequence of digits: things over ~50 bits get silently truncated, you eventually find out about it, and then you switch to quoted strings.

TIL: JSON has no specified number implementation: http://www.ecma-international.org/publications/files/ECMA-ST...

> JSON is agnostic about the semantics of numbers ... JSON instead offers only the representation of numbers that humans use: a sequence of digits.

So... anything is valid, per the spec.


This is an artifact of JavaScript, which even as of ES6 uses IEEE 754 double-precision floats for all numeric values. jq likely uses the same implementation internally for compatibility reasons and to avoid surprises of a different kind.

See https://www.ecma-international.org/ecma-262/6.0/#sec-ecmascr...


I think an 'error out if overflow/truncation'-mode available as a command line flag could be useful if you just don't have any JS involved in the JSON pipeline.

BigInt in top browsers now, not under a flag. Just sayin'!

https://brendaneich.com/wp-content/uploads/2017/12/dotJS-201... et seq.



Twitter had to switch their tweet id representation in the API to handle this - it was numeric and switched to strings.

Note: the links currently redirect to https://imgur.com/32R3qLv (image of a testicle and derogatory comment on HN)

Copy-pasting the link bypasses this. It's using the HN referrer (I think?) to redirect to imgur.


I had a similar problem diffing large API responses a few months ago and implemented an automation friendly JSON schema tool. It's a great way to make a summary of the data, especially when looking for forgotten fields for example.

https://github.com/g-harel/ence


I really enjoyed this article, and I think it shows how successfully jq fits into the Unix culture of sed/awk/grep/etc. It seems so rare to find new CLUI tools that feel as "classical" as jq. It has helped me do one-off tasks like this several times, but I've really only scratched the surface. Often with newer tools I'm reluctant to invest in going deeper into really learning the features, but with jq I have a lot of confidence that it would pay off for years to come. I don't see any books yet, but this is about the phase where I'd normally buy an O'Reilly volume (hint hint).

Btw I'm surprised you needed -M, since I thought jq would suppress colors if it saw it wasn't writing to a tty.


Chapter 5 of Data Science at the Command Line (O'Reilly, 2014) mentions `jq` briefly: https://www.datascienceatthecommandline.com/chapter-5-scrubb...

I had the opposite thought when I used it for the first time.

Even when reading the article I thought about it :)


> What’s the best way to compare these two 5GB files?

A much simpler way to do this is simply to hash the files, for example using sha256sum, which AFAIK ships with just about every Linux distro. Then just compare the hashes.
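
That is, something like (filenames hypothetical):

    # matching digests mean the files are byte-for-byte identical
    sha256sum export_a.json export_b.json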


Having the same content is not the same as being identical verbatim.

I wanted to like jq, but honestly, I can't figure out its crazy syntax.

If you're into JavaScript (or LiveScript) or functional programming, you might find ramda-cli[1] more palatable. Disclaimer: I've created it.

[1]: https://github.com/raine/ramda-cli


It's a bit weird, but it's perfect for writing quick shell one-liners once you get used to it. No regular scripting language can match it for that.

awk, sed?

For extracting a value out of json where the keys can arbitrarily re-order and you have nested maps containing the same key names? No thanks.
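
For instance, something like this addresses the field by its path, so key order and surrounding nesting don't matter (the path and filename are hypothetical):

    jq -r '.album.tracks[0].title' release.json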

I was providing them as examples of things "perfect for writing quick shell one liners", not good JSON parsers.

The braces make that more difficult; awk and sed can work with YAML more easily, but awk works best on column-oriented data, and YAML and JSON are more row-oriented.

I love jq.

I replaced a bunch of bespoke ETL code with shell scripts. grep, sed, jq, xsv, psql, etc. Fast, efficient, iterative, inspectable, portable.

Alas, most everyone else insists on python, nodejs, ruby, AWS Lambda, jenkins goo, misc mayfly tech stacks. So my "use the most simple tool that works" advocacy has never gained traction.


+1 for a great tool.

This is somewhat related to a hack I threw together recently: https://github.com/ecordell/jf

It attempts to address a similar problem (comparing json or subsets of json), but I wanted the structure of what was being compared to be more readable (compared to jq), so I went with graphql syntax. Doubt it would do great on larger datasets though.


As part of an automated Jira upgrade script (well, Makefile) we needed to export changes to the listener port/scheme configuration, which is unfortunately stored "in the code" so to speak (in WEB-INF/web.xml) and which Atlassian doesn't deign to normalize (indentation is all over the place, as is formatting, character encoding, and more), and they mangle it differently somehow with each point release. So the Makefile calls xmllint to normalize formatting and whitespace of both the untouched source files from the old and new release as well as the locally modified (deployed) configuration, then calls diff/patch accordingly (in a three-way).
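
Roughly the shape of it, with diff3 standing in for the diff/patch step they describe (paths and filenames are illustrative, not the actual Makefile):

    # normalize indentation/whitespace of all three web.xml variants before merging
    xmllint --format old-release/WEB-INF/web.xml > old.web.xml
    xmllint --format new-release/WEB-INF/web.xml > new.web.xml
    xmllint --format deployed/WEB-INF/web.xml > local.web.xml
    # three-way merge: keep local changes while applying the old -> new upstream changes
    diff3 -m local.web.xml old.web.xml new.web.xml > merged.web.xml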

I have found jq immensely useful to process ugly large responses from REST APIs in enterprise systems. It's like an awk for JSON... And I've been awk fan for 30 years for any text processing.
