This is really interesting. At first glance, I was tempted to say "why not just use sqlite with JSON fields as the transfer format?" But everything about that would be heavier-weight in every possible way - and if I'm reading things right, this handles nested data that might itself be massive. This is really elegant.
Neat! In case you took me too literally: railroad diagrams are fun, but far from the only way to give spec level clarity, so don’t feel you need to overindex on my silly comment!
I am curious why it’s parsed right to left. Is this so that you could add new data to a top-level JSONL-esque list, solely by rewriting the end of the data structure, and not needing to change the beginning (or worst-case shift every single byte of data, if you need a longer count)?
It’s an interesting design tradeoff, because you can’t show a partial parse if you’re streaming the content naively beginning to end, which is a bit odd in a world where streams that begin to render token-by-token are all the rage.
But if you have an ability to do range queries, it’s quite effective, and it does allow for those incremental updates!
The main reason for the reverse encoding is that it makes things easier for the writer. You simply do a depth-first traversal of the data graph and emit data on the way back up the stack. Zero buffering is needed, since this naturally means you write contents before the length prefix.
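A rough sketch of the idea, with a made-up tag/length notation (not the actual RX wire format): emitting in post-order means every length is already known the moment it's written, and a reader scanning from the end always hits a header before the body it describes.

```ts
// Toy post-order encoder: contents are emitted first, the length prefix after,
// so nothing ever has to be buffered or back-patched.
type Value = string | Value[];

function encode(value: Value, out: string[] = []): string[] {
  if (typeof value === "string") {
    out.push(value);               // contents first...
    out.push(`s${value.length}`);  // ...then the length, seen first when read right-to-left
  } else {
    const start = out.join("").length;
    for (const child of value) encode(child, out); // depth-first into the data graph
    const bodyLen = out.join("").length - start;
    out.push(`a${bodyLen}`);       // header written on the way back up the stack
  }
  return out;
}

console.log(encode(["hi", ["x", "y"]]).join(""));
// "his2xs1ys1a6a12" in this toy notation: the outer array header comes last,
// which is exactly where a right-to-left reader starts.
```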
But it does open up a future direction I want to explore: mutable datasets using append-only persistent data structures. The chain primitive is currently only used for strings, but it will be used to do the equivalent of {...oldObj, ...newObj} as a single chain (pointerToOldObj, newObj).
With chains and pointers, you can write new versions of a dataset and reuse all the existing values that are unchanged. This, combined with random-access reads and fixed-block caching makes for a fairly complete MVCC database.
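To make the chain idea concrete, here's a toy model of the structural sharing (my own illustration, none of this is the real RX API): an update appends only the changed fields plus a pointer to the previous version, and lookups walk the chain newest-first.

```ts
// Toy stand-in for chain(pointerToOldObj, newObj); offsets into an append-only log
// play the role of pointers, and old versions stay readable forever.
type Node = { fields: Record<string, unknown>; prev?: number };

const log: Node[] = []; // stand-in for the append-only file

function write(fields: Record<string, unknown>, prev?: number): number {
  log.push({ fields, prev }); // append-only: existing data is never rewritten
  return log.length - 1;      // "pointer" = offset of the new node
}

function get(ptr: number, key: string): unknown {
  for (let p: number | undefined = ptr; p !== undefined; p = log[p].prev) {
    if (key in log[p].fields) return log[p].fields[key]; // newest link wins
  }
  return undefined;
}

const v1 = write({ name: "alice", role: "user" });
const v2 = write({ role: "admin" }, v1); // like {...oldObj, ...newObj}, reusing v1
console.log(get(v2, "name"), get(v2, "role")); // "alice" "admin"
console.log(get(v1, "role"));                  // "user": the old version is intact (MVCC)
```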
- this encodes to ASCII text (unless your strings contain unicode themselves)
- that means you can copy-paste it (good luck doing that with compressed JSON or CBOR or SQLite)
- there is a scale where JSON isn't human readable anymore. I've seen files that are 100+MB of minified JSON all on a single very long line. No human is reading that without using some tooling.
Thanks for the feedback. I've improved the framing to make the purpose/value more clear. What do you think about "RX is a read-only embedded store for JSON-shaped data"?
I love these projects, and I hope one of them someday emerges as the winner: as all these libraries' authors have noticed, there are so many low-hanging-fruit wins in changing the wire format for JSON while keeping the "Good Parts" like the dead-simple generic typing.
XML has EXI (Efficient XML Interchange) for precisely the reason of getting wins over the wire but keeping the nice human readable format at the ends.
Interesting. I've heard about cursors in reference to a Rust library that was mentioned as being similar to protobuf and cap'n proto.
Does this duplicate the name of keys? Say if you have a thousand plain objects in an array, each with a "version" key, would the string "version" be duplicated a thousand times?
Another project a lot of people aren't aware of even though they've benefitted from it indirectly is the binary format for OpenStreetMap. It allows reading the data without loading a lot of it into memory, and is a lot faster than using sqlite would be.
JSON's dominance is one of the most accidental success stories in computing.
Douglas Crockford didn't design it — he said he "discovered" it. It was already there in JavaScript's object literal syntax, which itself traces back to Brendan Eich's 10-day sprint in 1995.
A data format that conquered the internet was a side effect of a language built under absurd time pressure.
Every attempt to replace it has to overcome that kind of accidental ubiquity, which is much harder than overcoming a technical limitation.
A tiny note on the speed comparison: The 23,000x faster single-key lookup seems a bit misleading to me.
Once you have a computational-complexity advantage, you can make the multiplier as large as you want just by growing the input. In these cases, small instances matter for judging the constants, and what matters to the average user is the mean instance size.
I'm not sure how to sell the advantage succinctly, though. Maybe just focus on "real-world" scenarios, but there's no footnote with details on how the comparison was run.
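For a sense of scale (my own back-of-the-envelope numbers, not taken from the RX benchmark): if the JSON baseline is a full O(N) parse-and-scan and the indexed lookup is O(log N), the headline ratio mostly tells you how big N was.

```ts
// Ignoring constant factors entirely, the speedup of O(log N) over O(N)
// is unbounded as N grows, so a single big multiplier says more about the
// benchmark's input size than about the implementations.
function ratio(n: number): number {
  return n / Math.log2(n);
}
for (const n of [100, 10_000, 1_000_000]) {
  console.log(n, "~", Math.round(ratio(n)), "x");
}
// roughly 15x at N=100, ~750x at N=10,000, ~50,000x at N=1,000,000
```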
The documentation references a “decode” function, and it’s imported in the example code, but it’s never called. I’m not sure what the API is after reading the examples.
My one eyebrow raise is - is there no binary format specification? https://github.com/creationix/rx/blob/main/rx.ts#L1109 is pretty well commented, but you can't call it a JSON alternative without having some kind of equivalent to https://www.json.org/ in all its flowchart glory!
One old version that is meant to be more human readable/writable is jsonito
https://github.com/creationix/jsonito
I'll add similar diagrams and docs for the format itself here.
https://github.com/creationix/rx/blob/main/docs/rx-format.md
Railroad diagrams will come later when I have more time.
The author claims this is because of copy and pasting… cool, remind me what BASE64 is again?
https://www.npmjs.com/package/@creationix/rx
(Or to avoid using cat to read, whatever2json file.whatever | jq)
This did catch my eye, however: https://github.com/creationix/rx?tab=readme-ov-file#proxy-be...
While this is a neat feature, it means this is not in fact a drop-in replacement for JSON.parse, as you will be breaking any code that relies on the result being a mutable object.
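To illustrate the kind of code that would break (my own example, not from the RX docs), assuming writes to the proxy-backed result either throw or are silently dropped:

```ts
// Typical code after JSON.parse assumes the result is a plain, mutable object.
const config = JSON.parse('{"retries": 2}');
config.retries += 1; // fine on a plain object
delete config.debug; // also fine

// With a read-only, proxy-backed result the same writes would fail (throw in
// strict mode, or be ignored otherwise), so callers that mutate the parsed
// value would need an explicit copy step such as structuredClone(config).
```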
Edit: the rust library I remember may have been https://rkyv.org/
Even a technically superior format struggles without that ecosystem.
Docs are super unclear.
Is it versioned? Or does it need to be?