Faster and simpler with the command line: deep-comparing JSON files with jq (genius.engineering)
206 points by wwarnerandrew | 2018-12-06 07:28:42 | 93 comments




It's worth mentioning that there are much faster JSON parsing libraries than the default in Ruby stdlib. I still don't think Ruby is the best choice for doing raw JSON parsing. Last time I had to care about JSON speed we were transforming billions of events and the Ruby JSON lib was becoming a bottleneck

What do you think is the best way to do deep JSON comparisons? We work with 2GB JSONs all day, and it is super annoying how long they take to process.

Start with a compiled language, I guess? I don't operate on anywhere near that scale, but json-rust reaches 400 MB/s for me.

It doesn't parallelize, and you'd need memory enough for the entire structure, but of course Rust doesn't have GC overhead. You could trivially parse both files in parallel, at least.


Not parse them into a tree, to start with.

Use a streaming JSON parser, and compare them token by token unless/until they diverge, at which point you take whatever action is suitable to identify the delta.

Parsing it into a tree may be necessary if you want to do more complex comparisons (such as sorting child objects etc.), but even then, depending on your requirements, you may well be better off storing offsets into the file.

https://github.com/lloyd/yajl is an example of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON is simple enough that you could write one specifically to handle two streams.
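
A rough shell-level sketch of that token-by-token idea, using jq's --stream mode rather than a hand-rolled two-stream parser (filenames and the use of bash process substitution are assumptions; --stream emits [path, leaf] events instead of building the full tree):

    # one event per line per file, then compare the two event streams
    diff <(jq -c --stream . a.json) <(jq -c --stream . b.json)

This only works as a direct equality check if both files serialize objects and keys in the same order; otherwise you still need a normalization pass first.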


I believe this benchmark comparison could be useful for you, and you can expand it further with more tests. (Though I got downvoted for sharing a link.)

https://github.com/kostya/benchmarks/blob/master/README.md


That still parses into a tree.

nodejs comes to mind!

(1) Try a language with fast allocations (C, C++, Rust, maybe Go or Java) -- anything except Python or Ruby

or

(2) Try using streaming API (I don't know Ruby, but quick google found https://github.com/dgraham/json-stream ). Note that this method will require you to massively restructure your program -- you want to avoid having all of the data in memory at once.

The streaming API might work better with jq-based preprocessing -- for example, if you want to compare two unsorted sets, it may be faster to sort them using jq, then compare line-by-line using streaming API.
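
As a sketch of that idea, assuming each export is a single JSON array of objects (the filenames and structure are hypothetical):

    # normalize: one object per line with keys sorted, then sort the lines themselves
    jq -cS '.[]' old.json | sort > old.norm
    jq -cS '.[]' new.json | sort > new.norm
    # comm -3 prints only the lines unique to one side, i.e. the differing records
    comm -3 old.norm new.norm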


Python is fast at parsing JSON; Go had a hard time matching its parsing speed. Additionally, you have PyPy to help.


Python is fast at doing anything that doesn't involve running Python.

That's an important caveat. Python's C JSON parser library is super-fast, but if you want to use the data for anything but a simple equality check afterwards, it'll be slow as molasses.

Or you'll write a C extension for it...


Rust is absolutely wonderful for tasks like this: they don't hit any of the cases where Rust's ownership model can make things tricky, and the serde library makes deserializing JSON a piece of cake.

You end up with code which looks pretty similar to the equivalent JavaScript or Python code, but performs much faster (10x, 100x or even 1000x faster).


There's also pikkr (https://github.com/pikkr/pikkr) if you need really really fast JSON parsing.

> It's worth mentioning that there are much faster JSON parsing libraries than the default in Ruby stdlib.

I am on the edge of my seat now.

Would you mind listing which libraries are much (say, an order of magnitude) faster?


Don't think it's an order of magnitude faster, but oj is supposed to be the standard for Ruby.

https://github.com/ohler55/oj


I'm curious how well the Crystal language could handle that huge amount of JSON, since most of the Ruby code could be ported over to Crystal.

It has a JSON pull parser to minimize memory usage, which is useful for memory-constrained environments but comes at the expense of performance. If the work could be split up across forked Crystal processes, I believe it's feasible.


There are stream parsers for JSON for Ruby too, including bindings for C libraries like YAJL - using the default JSON parser is an awful choice for doing comparisons like that given the massive overhead of the amount of objects it'll be creating for no good reason.

Agreed; I believe the benchmark covering YAJL and jq in this repo is useful for you:

https://github.com/kostya/benchmarks/blob/master/README.md


I wonder if a C/C++ program would perform better?

jq is a C program.

In theory a truly purpose-built program could do better. In practice, the broad scope of jq lets you discover the operations you need and respond to changes in requirements without being locked into custom code, and any given programmer probably couldn't do the same job better.


jq is a C program, yes, but jq programs are interpreted. Because jq is a dynamically-typed language, it wouldn't be easy to compile it to object code that would run much faster than the bytecode-interpreted version (though it would still run faster).

As you say, jq's power is that it is an expressive language, and it's much, much easier to write jq programs that work than it is to write C/C++ programs that do the same or similar work.


The interpreted language is just the setup phase for a pipeline of compiled-in data transformations.

I don't understand this statement. Keep in mind I'm a jq maintainer.

Yes. If well-written anyways.

Another option is to use a tool like 'gron' to convert the JSON into a shell-friendly line-oriented format. This makes the rest straightforward.

https://github.com/tomnomnom/gron
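
For example, a quick structural diff with gron could look like this (filenames hypothetical):

    # flatten each document into one "path = value;" assignment per line, then diff
    gron a.json > a.flat
    gron b.json > b.flat
    diff a.flat b.flat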


jq can do something like that too! Check out its --stream option.

Oh, this sounds handy, though I can't imagine it'd be as performant for this particular case.

My problem with this article is the entire strategy for delivering data. A JSON file? That's probably the worst way I could think of: a CSV file would have been better. Part of the reason why databases exist is to handle exactly the problem the author is posing. There are already tools in place in SQL databases that track diffs for TB db's, and the authors could simply export patch files which would be far easier to analyze for inconsistency.

It really bothers me that the article is like "wow, check out this awesome utility that helps us with a huge problem" instead of really thinking about "how did we get this huge problem and is there already a solution?"

It just reeks of inexperience.


CSV is hardly an easy format to parse, or really to produce. There is no CSV standard, and I bet lyrics contain all kinds of weird characters that make choosing a separator hard.

Newline-delimited JSON is a fine interchange format for this, and the only advantage I can see for CSV is that you can load it into a database in one command. Which raises the question of why use a database at all for a simple one-off diff, when there are much more lightweight alternatives (a shell command).

So now you are converting your JSON to CSV to load it into a database to run a bunch of database diffs over it to then compare them in some way. Wouldn't that lead to the question "how did we get this huge problem and is there already a solution"?

Seems like you are the one over complicating things.

And I have to say, choosing CSV and then using a database for this task reeks of inexperience. KISS.


Oh, I see: you probably didn't know that CSV can also mean character-separated values, and can actually use non-printing ASCII characters as delimiters. You didn't think I actually meant commas, did you? I guess your experience with character-separated value files is very limited. But my point was to illustrate that if record fields are consistent you don't need a heavyweight solution like JSON... and you clearly missed my point.

Yes, you're totally correct. Using a heavyweight solution like JSON is beyond the pale, I should use a much more lightweight approach involving a database server.

Your tone is oddly superior in your reply, which is really at odds with the technical content of your messages.

> if record fields are consistent

This is all very confused. The issue is that the JSON fields were not consistent compared to the baseline. So now what? You suggest that instead of investing time making them consistent, we should just switch format entirely and then make them consistent? Or are you suggesting that somehow a line of CSV is easier to compare than a line of JSON? Or that I should now shove a bunch of non-printing ASCII characters in my message and that's now better?


Like a SQLite DB? Actually, why don't we just transfer stuff as SQLite DBs. Single file, built-in schema, you can index.

I mean, HDF is super-general and stuff, but it looks like SQLite would solve all the trouble with CSVs.


I'm currently implementing an "individual-scale data-warehouse" service (i.e. "Hadoop without the Hadoop part"), and I'm weighing the choice between "a tarball of CSVs", an SQLite file, and an Apache Avro file as input-side wire formats. (And now HDF5 as well; didn't know about that one.)

I'm still leaning toward "a tarball of CSVs", though:

1. it's very easy to allow different devs to write a bunch of single-purpose Extract tools, each in whatever language is best for the job (e.g. Python if it has to use an API where the only available API-client library impl is in Python) to scrape some particular dimension out of an external source. You can write out CSV data in pretty much any language—even a bash script! That's because, even if the language doesn't have a CSV library, a "trivial" CSV dump can be accomplished by just calling printf with a CSV template string. (Trivial = you know all your stringly-typed data is of constrained formats, such that it doesn't require any quoting. CSV files are trivial more often than you'd think!)

2. Presuming your CSV file is column-ordered to have any primary key(s) first, and that it has no embedded header line, you can "reduce" on the output of a bunch of Extract jobs (i.e. merge-sorting + deduping to produce one CSV dataset) by just feeding all the input files to sort(1) with the -u and -n switches passed. `sort -n -t ','` basically behaves as a very simple streaming CSV parser, while also being amazingly-well-optimized at chewing through on-disk files in parallel. sort(1) is to (local on-disk) CSV data as LevelDB's compaction algorithm is to key-value pair data: a solid primitive that scales with your dataset size.

3. Once you've got two sorted+deduped CSV "snapshot" files, you can create a differential snapshot from them just by calling:

    comm -1 -3 "$old_csv" "$new_csv" > diff.csv
And then, getting an SQL data-migration file out of it (at least for an SQL DB that has something like Postgres's COPY statement, which can read CSV directly) is as simple as:

    cat pre_ddl.sql copy_stmt_begin.sql header.csv diff.csv copy_stmt_end.sql post_ddl.sql > migration.sql
You can then throw that file right into, say, Google Cloud SQL's "import" command.

That being said, the other formats are nice for 1. keeping data like numbers in more compact binary forms, 2. being able to sort and de-dup the data slightly more cheaply, without having to parse anything at point-of-sort. (Though this matters less than you'd think; sort(1)'s minimal parser is very fast, and SQLite/Avro can't get any big access-time wins since the data is neither pre-sorted nor column-oriented.)

But in exchange for this, you lose the ability to cheaply merge working datasets together. You can't just concatenate them—you have to ask your storage-format library to serialize data from one of your data files, and then ask the library to parse and import said data into your other data file. Frequently, the overhead of a complete deep parse of the data is the thing we're trying to avoid the expense of in the first place! (Otherwise, why use an ETL pipeline at all, when you could just have your operational data sources do batched SQL inserts directly to your data warehouse?)


> Oh, I see: you probably didn't know that CSV formats is also means character separated values, and can actually use non-printing ASCII characters as delimiters.

That seems unnecessarily condescending. JSON can also mean Janky Serialized Object Notation, but that's not the common case.

> I guess your experience with character separated value files is very limited.

In practice, using something other than a comma is a good solution for some problems, but not others (eg transfer corruption or you know, the OP's use case).

> a heavyweight solution like JSON.

I've literally never heard that phrase, nor does it make much sense. At best it's 2 more characters for wrapping braces with existing quoted data/numbers and at worst you have to make up a new non-interchangeable format as you run into exceptions from the diff, which can affect past encodings. Sounds more involved than using JSON. shrug


Did you really manage to turn CSV files into a flamewar topic?

Please review https://news.ycombinator.com/newsguidelines.html and avoid turning nasty in arguments on Hacker News.


You're wrong - JSON has types - it's very, very useful just because of that, and if you pass newline-delimited JSON through gzip you basically remove all the size overhead of the redundant keys...

I think this guy did the right thing for what sounded like essentially a one-off job to test this new export tool. Why would you go to all the trouble to use a SQL database for a one-off thing that can be done using text processing or worst-case writing a small script?



It sounds like Apple requires the data in JSON format - they may not have a choice.

If you have json line formatted stuff (or csv) and an aws account, you can do some nice things with Athena and SQL. We have a few simple backoffice tools that I've implemented around simple sql queries on data dumped from various systems that we have in json format. Awesome, if you want to do some quick selects, joins, etc.

If you are going to process this amount of data, don't load it all into memory; process it line by line. Also do that concurrently if you have more than one CPU core available. I've done this with Ruby, Python, Java, and misc shell tools like jq. Use what you are comfortable with and what gets results quickly.

One neat trick with jq is to use it to convert json objects to csv and to then pipe that into csvkit for some quick and dirty sql querying. Generally gets tedious beyond a few hundred MB. I recommend switching to Athena or something similar if that becomes a regular thing for you.
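
A sketch of that jq-to-csvkit trick; the field names and filenames here are made up, not from the article:

    # emit a header plus one CSV row per object, then query it with csvkit's csvsql
    (echo 'id,title,artist'; jq -r '.[] | [.id, .title, .artist] | @csv' songs.json) > songs.csv
    csvsql --query 'SELECT artist, COUNT(*) AS n FROM songs GROUP BY artist' songs.csv

csvsql derives the table name ("songs") from the CSV filename, so the query can reference it directly.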


OP here - good point! We actually use Athena to query these exports in S3 to debug data drift of specific export objects over time. It's quite a useful tool; I was able to go from knowing basically nothing about Athena to querying gzipped newline-delimited JSON files in S3 using SQL in about an hour.

That's cool, you made a thing that verifies the two export jobs you wrote have the same data even though they have different output.

I can't help wondering, if you control the code that generates the JSON, why not output in a conservative, consistent format? I'm sure there are pros/cons, but this work would allow something like `diff` to work, and then you don't have to maintain a separate utility.


Good question! The analysis that I was doing was really a one-off for switching between these processes. We have unit tests and sanity checks to ensure consistency going forward, but as a final check before flipping the switch we wanted to be as confident as possible that we hadn't introduced any regressions across the full data-set.

The new export process is much more reliable and a _lot_ faster, but as a side effect of doing things in a different way it generated the export file in a different format. Given that the order of objects in an export file and the order of keys/etc in the JSON objects didn't matter for anything except comparing the two processes, I figured it was simpler to put the normalization logic in the one-off tool vs baking it into our export process. But certainly if we were maintaining both exports in an ongoing fashion and validating them against each other, it would make a lot more sense to spend time making sure they generated objects and keys in the same order.


I'm surprised the layout of the JSON doesn't arrange music by artist; is there any reason why that wasn't done?

> My first thought was to write a ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was starting to get fairly complicated, didn't work correctly, and was too slow—my first cut took well over an hour. Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?

Key takeaway: next time, start with the second thought first and save yourself well over an hour!


If speed is essential, why not use protobuf/flatbuffer or one of their variants?

I'm using jsonassert [1], a Java based JSON unit testing library, for something very similar.

Not sure how it'd handle comparing 5GB files though.

1. http://jsonassert.skyscreamer.org/


I'm not sure why you are comparing the data to the old export instead of against a source of truth... for example, what is in the upstream data source. Also, why not verify using unit tests? Who is to say that the original export is valid and not the second export?

In theory, I agree! I hope the new codebase has a set of tests to validate just that.

But, in practice, you have a downstream consumer of this data format (Apple in this case..).. Validating the old and new formats are functionally identical is just as important as validating the new format matches the upstream source of truth :)


Just a heads up to anyone using jq - I've previously spent a couple of hours debugging a problem because jq uses float64 to store integers (which might lead to rounding-errors/overflows). For example:

  echo 1152921504606846976 | jq                                                            
  1152921504606847000

Yikes, that's nasty.

It is. But it's a problem of JSON itself, not just jq.

JSON != JavaScript

  echo 1152921504606846976 | python -c 'import sys, json; print(json.load(sys.stdin))'
  1152921504606846976


Just because python has bigints and its stdlib json module supports bigints in json doesn't mean that is an interoperable thing to do.

Python's json package != JSON

JSON: https://tools.ietf.org/html/rfc8259#page-8


The link says that it's up to the implementation, which means it's valid for Python's JSON implementation to support larger numbers.

It's less "interoperable" but not strictly invalid, by my read.


What the GP means is that JSON doesn't require an implementation to decode JSON integers as arbitrary-precision integers, to be "conformant JSON."

Therefore, you can't assume that if you pass some JSON through an arbitrary pipeline of JSON-manipulating tools, written in various languages, that your integer values will be passed through losslessly.

Therefore, you just shouldn't use JSON integer values when you know that the values can potentially be large. This is why e.g. Ethereum's JSON-RPC API uses hex-escaped strings (e.g. "0x0") for representing its "quantity" type.
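
A tiny illustration of that encode-as-string convention (the value is made up): the quantity rides through any JSON tooling untouched because it is just a string, and the consumer converts it explicitly.

    # the large value is preserved exactly, since jq never parses it as a number
    echo '{"quantity":"0xde0b6b3a7640000"}' | jq -r '.quantity'
    # decode on the consuming side (bash printf accepts 64-bit hex)
    printf '%d\n' 0xde0b6b3a7640000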


It looks like JSON doesn't specifically define how numbers should be stored. It just recommends expecting precision up to double-precision limits.

Still interesting to know it's not just a jq quirk.


That's Python not adhering to the JSON standard as defined in the RFC.

The problem is made worse on the receiving end (the browser). I've run into this issue when serialization libraries in Java send a 64-bit long value as a sequence of digits: things over ~50 bits get silently truncated, you eventually find out about it, and then you switch to quoted strings.

TIL: JSON has no specified number implementation: http://www.ecma-international.org/publications/files/ECMA-ST...

> JSON is agnostic about the semantics of numbers ... JSON instead offers only the representation of numbers that humans use: a sequence of digits.

So... anything is valid, per the spec.


This is an artifact of JavaScript, which even as of ES6 uses IEEE 754 double-precision floats for all numeric values. jq likely uses the same implementation internally for compatibility reasons and to avoid surprises of a different kind.

See https://www.ecma-international.org/ecma-262/6.0/#sec-ecmascr...


I think an 'error out if overflow/truncation'-mode available as a command line flag could be useful if you just don't have any JS involved in the JSON pipeline.

BigInt in top browsers now, not under a flag. Just sayin'!

https://brendaneich.com/wp-content/uploads/2017/12/dotJS-201... et seq.



Twitter had to switch their tweet id representation in the API to handle this - it was numeric and switched to strings.

Note: the links currently redirect to https://imgur.com/32R3qLv (image of a testicle and derogatory comment on HN)

Copy-pasting the link bypasses this. It's using the HN referrer (I think?) to redirect to imgur.


I had a similar problem diffing large API responses a few months ago and implemented an automation friendly JSON schema tool. It's a great way to make a summary of the data, especially when looking for forgotten fields for example.

https://github.com/g-harel/ence


I really enjoyed this article, and I think it shows how successfully jq fits into the Unix culture of sed/awk/grep/etc. It seems so rare to find new CLUI tools that feel as "classical" as jq. It has helped me do one-off tasks like this several times, but I've really only scratched the surface. Often with newer tools I'm reluctant to invest in going deeper into really learning the features, but with jq I have a lot of confidence that it would pay off for years to come. I don't see any books yet, but this is about the phase where I'd normally buy an O'Reilly volume (hint hint).

Btw I'm surprised you needed -M, since I thought jq would suppress colors if it saw it wasn't writing to a tty.


Chapter 5 of Data Science at the Command Line (O'Reilly, 2014) mentions `jq` briefly: https://www.datascienceatthecommandline.com/chapter-5-scrubb...

I had the opposite thought when I used it for the first time.

Even when reading the article I thought about it :)


> What’s the best way to compare these two 5GB files?

A much simpler way to do this is simply to hash the files, for example using sha256sum, which AFAIK ships with just about every Linux distro. Then just compare the hashes.
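
That is, something like (filenames hypothetical):

    # matching digests mean the files are byte-for-byte identical
    sha256sum export_a.json export_b.json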


Having the same content is not the same as being identical verbatim.

I wanted to like jq, but honestly, I can't figure out its crazy syntax.

If you're into JavaScript (or LiveScript) or functional programming, you might find ramda-cli[1] more palatable. Disclaimer: I've created it.

[1]: https://github.com/raine/ramda-cli


It's a bit weird, but it's perfect for writing quick shell one-liners once you get used to it. No regular scripting language can match it for that.

awk, sed?

For extracting a value out of json where the keys can arbitrarily re-order and you have nested maps containing the same key names? No thanks.
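
For instance, something like this addresses the field by its path, so key order and surrounding nesting don't matter (the path and filename are hypothetical):

    jq -r '.album.tracks[0].title' release.json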

I was providing them as examples of things "perfect for writing quick shell one liners", not good JSON parsers.

The braces make that more difficult; awk and sed can work with YAML more easily, but awk works best on column-oriented data, and YAML and JSON are more row-oriented.

I love jq.

I replaced a bunch of bespoke ETL code with shell scripts. grep, sed, jq, xsv, psql, etc. Fast, efficient, iterative, inspectable, portable.

Alas, most everyone else insists on python, nodejs, ruby, AWS Lambda, jenkins goo, misc mayfly tech stacks. So my "use the most simple tool that works" advocacy has never gained traction.


+1 for a great tool.

This is somewhat related to a hack I threw together recently: https://github.com/ecordell/jf

It attempts to address a similar problem (comparing json or subsets of json), but I wanted the structure of what was being compared to be more readable (compared to jq), so I went with graphql syntax. Doubt it would do great on larger datasets though.


As part of an automated Jira upgrade script (well, Makefile) we needed to export changes to the listener port/scheme configuration, which is unfortunately stored "in the code" so to speak (in WEB-INF/web.xml) and which Atlassian doesn't deign to normalize (indentation is all over the place, as is formatting, character encoding, and more), and they mangle it differently somehow with each point release. So the Makefile calls xmllint to normalize formatting and whitespace of both the untouched source files from the old and new release as well as the locally modified (deployed) configuration, then calls diff/patch accordingly (in a three-way).
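
Roughly the shape of it, with diff3 standing in for the diff/patch step they describe (paths and filenames are illustrative, not the actual Makefile):

    # normalize indentation/whitespace of all three web.xml variants before merging
    xmllint --format old-release/WEB-INF/web.xml > old.web.xml
    xmllint --format new-release/WEB-INF/web.xml > new.web.xml
    xmllint --format deployed/WEB-INF/web.xml > local.web.xml
    # three-way merge: keep local changes while applying the old -> new upstream changes
    diff3 -m local.web.xml old.web.xml new.web.xml > merged.web.xml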

I have found jq immensely useful to process ugly large responses from REST APIs in enterprise systems. It's like an awk for JSON... And I've been awk fan for 30 years for any text processing.
