RX – a new random-access JSON alternative (github.com)

by creationix 108 comments 146 points

[−] btown 58d ago
This is really interesting. At first glance, I was tempted to say "why not just use sqlite with JSON fields as the transfer format?" But everything about that would be heavier-weight in every possible way - and if I'm reading things right, this handles nested data that might itself be massive. This is really elegant.

My one eyebrow raise is - is there no binary format specification? https://github.com/creationix/rx/blob/main/rx.ts#L1109 is pretty well commented, but you can't call it a JSON alternative without having some kind of equivalent to https://www.json.org/ in all its flowchart glory!

[−] creationix 58d ago
Thanks. I had this for older versions, but forgot to write it up again for the latest version.

One old version that is meant to be more human readable/writable is jsonito

https://github.com/creationix/jsonito

I'll add similar diagrams and docs for the format itself here.

[−] creationix 58d ago
Initial format docs are now here:

https://github.com/creationix/rx/blob/main/docs/rx-format.md

Railroad diagrams will come later when I have more time.

[−] btown 57d ago
Neat! In case you took me too literally: railroad diagrams are fun, but far from the only way to give spec level clarity, so don’t feel you need to overindex on my silly comment!

I am curious why it’s parsed right to left. Is this so that you could add new data to a top-level JSONL-esque list, solely by rewriting the end of the data structure, and not needing to change the beginning (or worst-case shift every single byte of data, if you need a longer count)?

It’s an interesting design tradeoff, because you can’t show a partial parse if you’re streaming the content naively beginning to end, which is a bit odd in a world where streams that begin to render token-by-token are all the rage.

But if you have an ability to do range queries, it’s quite effective, and it does allow for those incremental updates!
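
To make the "rewrite only the end" idea concrete, here is a minimal TypeScript sketch. It uses a toy layout that is purely an assumption for illustration (each record followed by a fixed-width length trailer), not the RX wire format, and `appendRecord`, `readTail`, and `TRAILER_WIDTH` are hypothetical names.

```typescript
// Toy layout (an assumption, NOT the RX wire format): each record is its text
// followed by a fixed-width, 4-digit decimal length trailer.
const TRAILER_WIDTH = 4;

function appendRecord(log: string, record: string): string {
  if (record.length > 9999) {
    throw new RangeError("record too long for the toy trailer");
  }
  // Only the end of the data grows; nothing already written is touched.
  const trailer = ("0000" + record.length).slice(-TRAILER_WIDTH);
  return log + record + trailer;
}

// Read the last `count` records by walking backward from the end.
// Only the tail of `log` is ever inspected, newest record first.
function readTail(log: string, count: number): string[] {
  const out: string[] = [];
  let end = log.length;
  while (end > 0 && out.length < count) {
    const bodyEnd = end - TRAILER_WIDTH;
    const bodyLength = parseInt(log.slice(bodyEnd, end), 10);
    const start = bodyEnd - bodyLength;
    out.push(log.slice(start, bodyEnd));
    end = start;
  }
  return out;
}
```

With a layout like this, a client that can issue range requests would only need the final bytes to see the newest records, while a naive front-to-back stream cannot interpret anything until the trailer arrives.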

[−] creationix 57d ago
The main reason for the reverse encoding is that it makes things easier on the writer. You simply do a depth-first traversal of the data graph and emit data on the way back up the stack. Zero buffering is needed since this naturally means you write contents before the length prefix.
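
A rough TypeScript sketch of that writer-side argument follows. The tag letters and the `Value` type are assumptions made up for this example and are not the RX format; `writeValue` is a hypothetical name. The point is only that each container's size is known by the time its tag is emitted, so nothing has to be held back.

```typescript
type Value = number | string | Value[];

// Toy post-order writer (an assumption, NOT the RX wire format).
// Every value is emitted as its body followed by a suffix tag, and the
// function returns how many characters it emitted.
function writeValue(value: Value, emit: (chunk: string) => void): number {
  if (typeof value === "number") {
    const chunk = `${value}n`;
    emit(chunk);
    return chunk.length;
  }
  if (typeof value === "string") {
    // Body first, then its length, then the tag.
    const chunk = `${value},${value.length}s`;
    emit(chunk);
    return chunk.length;
  }
  // Depth-first: children are emitted immediately, and their sizes are
  // summed on the way back up the stack.
  let bodyLength = 0;
  for (const child of value) {
    bodyLength += writeValue(child, emit);
  }
  const tag = `${bodyLength}a`;
  emit(tag);
  return bodyLength + tag.length;
}
```

A length-prefixed writer would instead have to serialize the children into a buffer first, measure it, and only then emit the prefix.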

But it does open up a future direction I want to make with mutable datasets using append-only persistent data structures. The chain primitive is currently only used for strings, but it will be used to do the equivalent of {...oldObj, ...newObj} as a single chain (pointerToOldObj, newObj).

With chains and pointers, you can write new versions of a dataset and reuse all the existing values that are unchanged. This, combined with random-access reads and fixed-block caching makes for a fairly complete MVCC database.
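
As an illustration of how such versioning could work, here is a small in-memory TypeScript sketch. It is an assumption-laden toy, not RX's chain encoding: `AppendOnlyStore`, `putObject`, `putOverlay`, and `StoreEntry` are hypothetical names, and pointers are simply array indexes.

```typescript
// Toy append-only store (an assumption, not RX's actual chain encoding).
type StoreEntry =
  | { kind: "object"; fields: Record<string, string> }
  | { kind: "chain"; older: number; newer: number }; // indexes of earlier entries

class AppendOnlyStore {
  private entries: StoreEntry[] = [];

  putObject(fields: Record<string, string>): number {
    this.entries.push({ kind: "object", fields });
    return this.entries.length - 1;
  }

  // Similar in spirit to {...oldObj, ...newObj}: the old object is referenced
  // by pointer, so nothing is copied or rewritten.
  putOverlay(older: number, patch: Record<string, string>): number {
    const newer = this.putObject(patch);
    this.entries.push({ kind: "chain", older, newer });
    return this.entries.length - 1;
  }

  get(pointer: number, key: string): string | undefined {
    const entry = this.entries[pointer];
    if (entry === undefined) return undefined;
    if (entry.kind === "object") {
      const own = Object.prototype.hasOwnProperty.call(entry.fields, key);
      return own ? entry.fields[key] : undefined;
    }
    // Newer values shadow older ones, so check the newer side first.
    const hit = this.get(entry.newer, key);
    return hit !== undefined ? hit : this.get(entry.older, key);
  }

  size(): number {
    return this.entries.length;
  }
}
```

Readers holding the old pointer keep seeing the old version, and readers holding the new pointer see the overlay, which is the property an MVCC-style design relies on.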

[−] creationix 57d ago
And don't worry about railroad diagrams. I already intended to create them, I've just been extra busy this week with other things.
[−] Levitating 58d ago
JSON is human-readable, why even compare it with this. Is any serialization format now just a "JSON alternative"?
[−] jy14898 58d ago
Came to the same conclusion the moment I had to hunt to see the outputs https://github.com/creationix/rx/tree/main/samples
[−] SV_BubbleTime 58d ago
I was instantly suspicious that a “new better format” for serialization didn’t open with the input/output. And this is why (fucking lol, gtfo):

    Q^mSat,3^b:d+s+E,4Fri,3^u:h+k+u,6Thu,3^P:j+
If you are effectively going binary, do it. CBOR or Protobuf or any dozen other binary serializations that would be far more efficient.

The author claims this is because of copy and pasting… cool, remind me what BASE64 is again?

[−] creationix 58d ago
- This encodes to ASCII text (unless your strings contain Unicode themselves).
- That means you can copy-paste it (good luck doing that with compressed JSON or CBOR or SQLite).
- There is a scale where JSON isn't human-readable anymore. I've seen files that are 100+MB of minified JSON all on a single very long line. No human is reading that without using some tooling.
[−] creationix 58d ago
Thanks for the feedback. I've improved the framing to make the purpose/value more clear. What do you think about "RX is a read-only embedded store for JSON-shaped data"?

https://www.npmjs.com/package/@creationix/rx

[−] Gormo 58d ago
It's also quite odd to create a serialization format optimized for random access.
[−] dietr1ch 58d ago
cat file.whatever | whatever2json | jq ?

(Or to avoid using cat to read, whatever2json file.whatever | jq)

[−] garrettjoecox 58d ago
Very cool stuff!

This did catch my eye, however: https://github.com/creationix/rx?tab=readme-ov-file#proxy-be...

While this is a neat feature, it means this is not in fact a drop-in replacement for JSON.parse, as you will be breaking any code that relies on that result being a mutable object.
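
A minimal TypeScript sketch of the kind of breakage meant here. `readOnlyView` is a hypothetical stand-in written for this example, not RX's actual Proxy, and it is shallow (nested objects are not wrapped).

```typescript
// Hypothetical stand-in for a read-only view; this is not RX's actual Proxy.
function readOnlyView<T extends object>(target: T): T {
  return new Proxy(target, {
    set() {
      throw new TypeError("read-only view");
    },
    deleteProperty() {
      throw new TypeError("read-only view");
    },
  });
}

// Code written against JSON.parse often mutates the result in place.
function addDefaults(config: { retries?: number }): void {
  if (config.retries === undefined) config.retries = 3;
}

const parsed: { retries?: number } = JSON.parse("{}");
addDefaults(parsed); // fine: JSON.parse returns a plain mutable object

const view = readOnlyView<{ retries?: number }>({});
let failure = "";
try {
  addDefaults(view); // the same call now throws
} catch (error) {
  failure = error instanceof Error ? error.message : String(error);
}
```

Callers that need mutation would have to copy the data into a plain object first, which gives up some of the lazy-access benefit.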

[−] dtech 58d ago
It's not quite clear to me why you'd use this over something more established such as protobuf, thrift, flatbuffers, cap n proto etc.
[−] barishnamazov 58d ago
You shouldn't be using JSON for things that'd have performance implications.
[−] Spivak 58d ago
I love these projects, I hope one of them someday emerges as the winner because (as it motivates all these libraries' authors) there's so much low hanging fruit and free wins changing the line format for JSON but keeping the "Good Parts" like the dead simple generic typing.

XML has EXI (Efficient XML Interchange) for precisely the reason of getting wins over the wire but keeping the nice human readable format at the ends.

[−] benatkin 58d ago
Interesting. I've heard about cursors in reference to a Rust library that was mentioned as being similar to protobuf and cap'n proto.

Does this duplicate the name of keys? Say if you have a thousand plain objects in an array, each with a "version" key, would the string "version" be duplicated a thousand times?

Another project a lot of people aren't aware of even though they've benefitted from it indirectly is the binary format for OpenStreetMap. It allows reading the data without loading a lot of it into memory, and is a lot faster than using sqlite would be.

Edit: the rust library I remember may have been https://rkyv.org/

[−] creationix 59d ago
A new random-access JSON alternative from the creator of nvm.sh, luvit.io, and js-git.
[−] 50lo 58d ago
The biggest challenge for formats like this is usually tooling. JSON won largely because: every language supports it, every tool understands it.

Even a technically superior format struggles without that ecosystem.

[−] DaleBiagio 58d ago
JSON's dominance is one of the most accidental success stories in computing.

Douglas Crockford didn't design it — he said he "discovered" it. It was already there in JavaScript's object literal syntax, which itself traces back to Brendan Eich's 10-day sprint in 1995.

A data format that conquered the internet was a side effect of a language built under absurd time pressure.

Every attempt to replace it has to overcome that kind of accidental ubiquity, which is much harder than overcoming a technical limitation.

[−] dietr1ch 58d ago
A tiny note on the speed comparison: The 23,000x faster single-key lookup seems a bit misleading to me.

Once you get the computational complexity advantage, you can make it as many times faster as you want. In these cases small instances matter to judge constants, and to the average (mean?) user, mean instance sizes.
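
To illustrate with a deliberately simplified cost model (an assumption for this example, not a benchmark of RX or JSON.parse): suppose a full parse touches all n entries while an indexed lookup touches about log2(n). The hypothetical `modelSpeedup` below is then just n / log2(n), which grows without bound as the input grows.

```typescript
// Simplified cost model (an assumption, not a measurement):
// full parse ~ n steps, indexed lookup ~ log2(n) steps.
function modelSpeedup(entryCount: number): number {
  return entryCount / Math.log2(entryCount);
}

// Print the modeled ratio for a few input sizes.
for (const entryCount of [1e2, 1e4, 1e6]) {
  console.log(`${entryCount} entries: ~${Math.round(modelSpeedup(entryCount))}x`);
}
```

Under this model any headline multiplier can be reached by picking a large enough input, which is why the instance size behind a quoted speedup matters.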

I'm not sure how to sell the advantage succinctly though. Maybe just focus on "real-world" scenarios, but there's no footnote with details on the comparison

[−] jbverschoor 58d ago
So this is two things? A BSON-like encoding + something similar to implementing random access / tree walker using streaming JSON?

Docs are super unclear.

[−] _flux 58d ago
It doesn't seem the actual serialization format is specified? Other than in the code that is.

Is it versioned? Or does it need to be..

[−] killbot5000 58d ago
The documentation references a “decode” function, and it’s imported in the example code, but it’s never called. I’m not sure what the API is after reading the examples.
[−] transfire 58d ago
I am a little confused. Is this still JSON? Is it “binary” JSON?
[−] pshirshov 58d ago