btown 1 days ago [-]
This is really interesting. At first glance, I was tempted to say "why not just use sqlite with JSON fields as the transfer format?" But everything about that would be heavier-weight in every possible way - and if I'm reading things right, this handles nested data that might itself be massive. This is really elegant.
I was instantly suspicious that a “new better format” for serialization didn’t open with the input/output. And this is why (fucking lol, gtfo):
Q^mSat,3^b:d+s+E,4Fri,3^u:h+k+u,6Thu,3^P:j+
If you are effectively going binary, do it. CBOR or Protobuf or any dozen other binary serializations that would be far more efficient.
The author claims this is because of copy and pasting… cool, remind me what BASE64 is again?
creationix 18 hours ago [-]
It is also a format that can be read as-is without any preprocessing. In some cases base64 can do that, and this format does make heavy use of base64 varints.
Sure, you can encode as JSON, then compress with gzip and then base64 encode. You'll probably end up with something smaller than rx and be extremely safe to copy-paste. But your consumers are going to consume orders of magnitude more CPU reading data from this document.
RX is usable as-is, is compressed, and is copy-pasteable. It's the unique combination of properties that makes it interesting.
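For illustration, here is one way "base64 varints" can work — a hypothetical little-endian scheme in TypeScript. The actual RX alphabet, digit order, and termination rule aren't specified in this thread, so treat this as a sketch of the idea, not the format:

```typescript
// Hypothetical base64 varint: each character carries 6 bits of the value,
// least-significant digit first. Not self-delimiting (a real varint needs
// a continuation bit or terminator); this just shows how an integer can
// live in copy-paste-safe ASCII.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

function encodeVarint(n: number): string {
  if (n === 0) return ALPHABET[0];
  let out = "";
  while (n > 0) {
    out += ALPHABET[n & 0x3f]; // low 6 bits -> one base64 character
    n = Math.floor(n / 64);
  }
  return out;
}

function decodeVarint(s: string): number {
  let n = 0;
  for (let i = s.length - 1; i >= 0; i--) {
    n = n * 64 + ALPHABET.indexOf(s[i]);
  }
  return n;
}
```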
SV_BubbleTime 13 hours ago [-]
>It is also a format that can be read as-is without any preprocessing.
>Q^mSat,3^b:d+s+E,4Fri,3^u:h+k+u,6Thu,3^P:j+
My man… no. I have no doubt you could kind of figure out what that sample is hot off the heels of writing this, and likely not in six months. And to consider that anyone else would fill their brain with the rules to decipher that, Nah 2.0.
creationix 10 hours ago [-]
I meant computers can read it without any preprocessing. It's random access. You don't need to parse it, you don't need to decompress it. You just start at the end and follow pointers till you get to the desired value.
Even a trivial doc like this is challenging for me to read as a human.
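The "start at the end and follow pointers" idea can be sketched with a toy structure — this is not the actual RX encoding (the cell layout and offsets are invented), just the access pattern:

```typescript
// Toy illustration of random access: the document is a flat array of
// cells; container cells hold offsets of their children, and the last
// cell is the root. A reader never scans — it jumps.
type Cell =
  | { kind: "str"; value: string }
  | { kind: "arr"; items: number[] }; // offsets into the cell array

const doc: Cell[] = [
  { kind: "str", value: "hello" }, // offset 0
  { kind: "str", value: "world" }, // offset 1
  { kind: "arr", items: [0, 1] },  // offset 2: root ["hello", "world"]
];

// Start at the end (the root) and follow offsets to the desired value;
// unrelated cells are never touched.
function get(path: number[]): Cell {
  let cell = doc[doc.length - 1];
  for (const index of path) {
    if (cell.kind !== "arr") throw new Error("not a container");
    cell = doc[cell.items[index]];
  }
  return cell;
}
```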
112233 6 hours ago [-]
But... what sort of storage device does not allow your computers to use all 256 byte values? Why is random access data stored on teletype?
hombre_fatal 7 hours ago [-]
Ick, why are you talking to another person like this?
> Nah.com, fam.
creationix 1 days ago [-]
- this encodes to ASCII text (unless your strings contain unicode themselves)
- that means you can copy-paste it (good luck doing that with compressed JSON or CBOR or SQLite)
- there is a scale where JSON isn't human readable anymore. I've seen files that are 100+MB of minified JSON all on a single very long line. No human is reading that without using some tooling.
bawolff 1 days ago [-]
That feels a bit like the worst of both worlds: none of the space savings/efficiency of binary, but also no human readability.
Being able to copy/paste a serialization format is not really a feature i think i would care about.
creationix 22 hours ago [-]
It's a gradient. I did design several binary formats first, but for my use cases, this is actually better. There is nuance to various use cases.
> None of the space savings/efficiency of binary
For string heavy datasets, it's nearly the same encoding size as binary. I get 18x smaller sizes compared to JSON for my production datasets. This was originally designed as a binary format years ago (https://github.com/creationix/nibs) and then later after several iterations, converted to text.
> Being able to copy/paste a serialization format is not really a feature i think i would care about
Imagine being paged at 3am because some cache on some remote server got poisoned with a bad value (unrelated to the format itself). You load the value in a dashboard, but it's encoded as CBOR or some binary format, so you have to download it in a binary-safe way and upload that binary file to some tooling, or install a CBOR reader in your CLI. But then you realize that you don't have exec access to the k8s pods for security reasons, though you do have access to a web-based terminal. To extract a binary value you would need to create a shell, hexdump the file, somehow copy-paste that huge hexdump from the web-based terminal to your local machine, un-hexdump it, and finally load it into some CBOR reader.
A text format, however, is as simple as copying the value from the dashboard and pasting it into some online tool like https://rx.run/ to view the contents.
mpeg 1 days ago [-]
if one of the advantages is making it copy-pastable then I would suggest the REXC viewer should give you the option to copy the REXC output, currently I have no way of knowing this by looking at your github or demo viewer
another thing, I put in a 400KB json and the REXC is 250KB, cool, but ideally the viewer should also tell me the compressed sizes, because that same json is 65kb after zstd, no idea how well your REXC will compress
edit: I think I figured out you can right click "copy as REXC" on the top object in the viewer to get an output, and compressed it, same document as my json compressed to 110kb, so this is not great... 2x the size of json after compression.
creationix 22 hours ago [-]
Thanks for testing it out! Yes, the website could use some love to make everything more discoverable.
The primary use case is not compression, it's just a nice side effect of the deduplication. This will never beat something like zstd, brotli, or even gzip.
My production use cases are unique in that I can't afford the CPU to decompress to JSON and then parse to native objects. But with this format, I can use the text as-is with zero preprocessing and as a bonus my datasets are 18x smaller.
creationix 18 hours ago [-]
> 2x the size of json after compression
Right and that makes sense. There is more information in here. The entire thing is length prefixed and even indexed for O(1) array lookups and O(log2 N) object lookups.
If you don't care about random access and you don't mind the overhead of decompression, don't use RX.
mpeg 18 hours ago [-]
I think this makes sense, when you explain it like that, it might be a matter of cleaning up the docs a bit so the "why" of RX is more clear (admittedly, a README is not always the best channel for this!)
creationix 16 hours ago [-]
I've rewritten the framing in the README to first explain when you should use RX and when you should not. Most uses of JSON should probably stay JSON.
Are there any examples? If it's ASCII I'd expect to see some of the actual data in the readme, not just API.
Unless, to read that correctly, it only has a text encoding as long as you can guarantee you don't have any unicode?
creationix 16 hours ago [-]
> it only has a text encoding as long as you can guarantee you don't have any unicode?
The format is technically a binary format in that length prefixes are counts of bytes. But in practice it is a textual format since you can almost always copy-paste RX values from logs to chat messages to web forms without breaking it.
Unicode doesn't break anything since strings are encoded as raw UTF-8 with byte-length prefixes. It supports Unicode perfectly.
If your data only contains 7-bit ASCII strings, the entire encoding is ASCII. If your data contains unicode, RX won't escape it, so the final encoding will contain unicode as UTF-8.
creationix 22 hours ago [-]
oh, sorry about that. I forgot to include the description of the format with examples.
The older, slightly outdated, design spec is in the older rex repo (this format was spun out of the rex project when I realized it's actually a good standalone format)
Very similar to bittorrent’s bencode. That has the benefit that it has a canonical encoding which this doesn’t (because of the different compression options). I wouldn’t be put off by how it looks as text.
creationix 10 hours ago [-]
Very true. I had forgotten about bencode, I should read up on that again.
It makes sense they need a canonical form because they want same values to have same content hashes.
kukkamario 1 days ago [-]
You don't want to copy-paste anything like that as text anyway. Just copy and paste files.
No human is reading much data regardless of the format.
What is the benefit over using for example BSON?
soco 1 days ago [-]
I have an idea, why don't we all go back using XML at this point, as any initial selling point / differentiator has been slowly eroded away?
creationix 16 hours ago [-]
Thanks for the feedback. I've improved the framing to make the purpose/value more clear. What do you think about "RX is a read-only embedded store for JSON-shaped data"?
It's also quite odd to create a serialization format optimized for random access.
creationix 22 hours ago [-]
Serialized just means encoded as a stream of bytes so that it can be transferred between systems. There are absolutely cases where you want to be able to query a value directly like a database instead of parsing the entire thing to memory before you can read it. Think of this as no-sql sqlite.
Gormo 8 hours ago [-]
> Serialized just means encoded as a stream of bytes so that it can be transferred between systems.
Yes, serially. Which means no random-access across the transfer channel.
j16sdiz 22 hours ago [-]
many serialization format are just a memory structure dump.
IshKebab 22 hours ago [-]
Not at all. What makes you say that?
dietr1ch 1 days ago [-]
cat file.whatever | whatever2json | jq ?
(Or to avoid using cat to read, whatever2json file.whatever | jq)
Gormo 23 hours ago [-]
That's not really random access, though. You're effectively just searching through the entire dataset for every targeted read you're after.
What might be interesting is to have a tool that processes full JSON data and creates a b-tree index on specified keys. Then you could run searches against the index that return byte offsets you can use for actual random access on the original JSON.
OTOH, this is basically just recreating a database, just using raw JSON as its storage format.
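The index-plus-offsets idea can be prototyped in a few lines: emit JSON while recording where each top-level value lands, then answer point lookups by parsing only a slice. Offsets here are string indices for simplicity; a real tool would track byte offsets in the encoded file.

```typescript
// Build JSON and a key -> [start, end) offset index in one pass.
function indexedStringify(obj: Record<string, unknown>) {
  const index = new Map<string, [number, number]>();
  let out = "{";
  const keys = Object.keys(obj);
  keys.forEach((key, i) => {
    out += JSON.stringify(key) + ":";
    const start = out.length;
    out += JSON.stringify(obj[key]);
    index.set(key, [start, out.length]);
    if (i < keys.length - 1) out += ",";
  });
  out += "}";
  return { json: out, index };
}

// Targeted read: parse one value's slice, never the whole document.
function lookup(json: string, index: Map<string, [number, number]>, key: string) {
  const span = index.get(key);
  if (!span) return undefined;
  return JSON.parse(json.slice(span[0], span[1]));
}
```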
creationix 18 hours ago [-]
> What might be interesting is to have a tool that processes full JSON data and creates a b-tree index on specified keys. Then you could run searches against the index that return byte offsets you can use for actual random access on the original JSON.
I did build that once. But keeping track of the index is a pain. Sometimes I was able to generate the index on-demand and cache it in some ephemeral storage, but overall it didn't work out so well.
This system with RX will work better because I get the indexes built-in to the data file and can always convert it back to JSON if needed.
dietr1ch 21 hours ago [-]
Well, JSON had no random access to begin with, so maybe that's on needing JSON.
Maybe a query over the random-access file then converted into JSON would work?
creationix 1 days ago [-]
Or in this case, just do `rx file.rx` It has jq like queries built in and supports inputs with either rx or json. Also if you prefer jq, you can do `rx file.rx | jq`
dietr1ch 20 hours ago [-]
wow, in that case using `jq` is just a presentation preference at the very last step, unless jq is more expressive (which might be the case given how long it has been around?).
creationix 18 hours ago [-]
right, the jq query language is much more complex and featureful than the simple selector syntax I added to the rx-cli. But more could be added later as needed or it could just stream JSON output. It would be pretty trivial to hook up a streaming JSON encoder to rx-cli which could then pipe to jq for low-latency lookups. The problem is jq would need to JSON parse all that data which will be expensive.
While this is a neat feature, it means it is not in fact a drop-in replacement for JSON.parse, as you will be breaking any code that relies on that result being a mutable object.
creationix 1 days ago [-]
True, the particular use case where this really shines is large datasets where typical usage is to read a tiny part of it. Also there is no reason you couldn't write an rx parser that creates normal mutable objects. It could even be a hybrid one that is lazy parsed till you want to turn it mutable and then does a normal parse to normal objects after that point.
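The hybrid lazy/mutable idea could look roughly like this. This is a sketch over JSON.parse (the parse is deferred until first access); an RX-backed version would decode individual fields from the RX buffer on demand instead:

```typescript
// A read-only lazy view whose backing parse happens at most once, on
// first property access, plus a materialize() that "thaws" the data
// into an ordinary mutable object.
function lazyView(raw: string): { view: any; materialize: () => any } {
  let parsed: any;
  const ensure = () => (parsed ??= JSON.parse(raw)); // deferred parse
  const view = new Proxy({}, {
    get: (_t, prop) => ensure()[prop],
    has: (_t, prop) => prop in ensure(),
    ownKeys: () => Reflect.ownKeys(ensure()),
    getOwnPropertyDescriptor: (_t, prop) =>
      Object.getOwnPropertyDescriptor(ensure(), prop),
  });
  // Full parse into normal mutable objects when the caller needs them.
  return { view, materialize: () => JSON.parse(raw) };
}
```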
dtech 1 days ago [-]
It's not quite clear to me why you'd use this over something more established such as protobuf, thrift, flatbuffers, cap n proto etc.
maxmcd 1 days ago [-]
Those care about quickly sending compact messages over the network, but most of them do not create a sparse in-memory representation that you can read on the fly. Especially in javascript.
This lib keeps the compact representation at runtime and lets you read it without putting all the entities on the heap.
Cool!
creationix 22 hours ago [-]
Exactly. Low heap allocations when reading values is one of the main driving factors in this design!
IshKebab 22 hours ago [-]
Amazon Ion has some support for this - items are length-prefixed so you can skip over them easily.
It falls down if you have e.g. an array of 1 million small items, because you still need to skip over 999999 items to get to the last one. It looks like RX adds some support for indexes to improve that.
I was in this situation where we needed to sparsely read huge JSON files. In the end we just switched to SQLite which handles all that perfectly. I'd probably still use it over RX, even though there's a somewhat awkward impedance mismatch between SQL and structs.
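The skip-vs-index tradeoff can be shown with a toy length-prefixed format. The `<len>:<payload>` framing is invented for illustration, and lengths count characters since the toy is ASCII-only:

```typescript
function encodeItems(items: string[]): string {
  return items.map((s) => `${s.length}:${s}`).join("");
}

// Linear: hop prefix-by-prefix to item n — the Ion-style skip that still
// touches every preceding prefix.
function nthBySkipping(data: string, n: number): string {
  let pos = 0;
  for (let i = 0; i < n; i++) {
    const colon = data.indexOf(":", pos);
    pos = colon + 1 + Number(data.slice(pos, colon)); // jump over payload
  }
  const colon = data.indexOf(":", pos);
  const len = Number(data.slice(pos, colon));
  return data.slice(colon + 1, colon + 1 + len);
}

// With a precomputed offset index, the same lookup becomes one slice.
function buildOffsets(data: string): number[] {
  const offsets: number[] = [];
  let pos = 0;
  while (pos < data.length) {
    offsets.push(pos);
    const colon = data.indexOf(":", pos);
    pos = colon + 1 + Number(data.slice(pos, colon));
  }
  return offsets;
}
```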
creationix 18 hours ago [-]
I did seriously consider SQLite, but my existing datasets don't map easily to relational database tables. This is essentially no-sql for sqlite.
konart 1 days ago [-]
What if you are reading from a service which already have an established API?
It's not like you can just tell them to move to protobuf.
SV_BubbleTime 21 hours ago [-]
What about CBOR that can retain JSON compatibility?
If you are working with an end you don’t control, this “newer better” format isn’t in your cards either.
creationix 16 hours ago [-]
How does CBOR retain JSON compatibility more than RX?
RX can represent any value JSON can represent. It doesn't even lose key order like some random-access formats do.
In fact, RX is closer to JSON than CBOR.
Take decimals as an example:
JSON numbers are arbitrary precision numbers written in decimal. This means it can technically represent any decimal number to full precision.
CBOR stores numbers as binary floats, which are approximations of decimal numbers. This is why they needed to add Decimal Fractions (Tag 4).
RX already stores numbers as a decimal base and a decimal power of 10, so out of the box it matches JSON.
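The decimal claim can be illustrated with a (coefficient, power-of-10) pair. The exact RX wire encoding isn't shown in this thread, and this sketch assumes non-negative values for simplicity:

```typescript
// 1.05 becomes { coef: 105n, exp: -2 }, which round-trips exactly,
// unlike an IEEE-754 double approximation.
type Dec = { coef: bigint; exp: number };

function parseDec(text: string): Dec {
  const dot = text.indexOf(".");
  if (dot < 0) return { coef: BigInt(text), exp: 0 };
  const digits = text.slice(0, dot) + text.slice(dot + 1);
  return { coef: BigInt(digits), exp: -(text.length - dot - 1) };
}

function formatDec({ coef, exp }: Dec): string {
  const s = coef.toString();
  if (exp === 0) return s;
  const pad = s.padStart(-exp + 1, "0"); // ensure a digit left of the dot
  return pad.slice(0, pad.length + exp) + "." + pad.slice(pad.length + exp);
}
```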
barishnamazov 1 days ago [-]
You shouldn't be using JSON for things that'd have performance implications.
creationix 1 days ago [-]
As with most things in engineering, it depends. There are real logistical costs to using binary formats. This format is almost as compact as a binary format while still retaining all the nice qualities of an ASCII-friendly encoding (you can embed it anywhere strings are allowed, including copy-paste workflows).
Think of it as a hybrid between JSON, SQLite, and generic compression. This format really excels for use cases where large read-only build artifacts are queried by worker nodes like an embedded database.
Asmod4n 1 days ago [-]
The cost of using a textual format is that floats become so slow to parse that they're over 14 times slower than parsing a normal integer, even with the fastest SIMD algorithms we have right now.
HelloNurse 1 days ago [-]
So it depends.
Float parsing performance is only a problem if you parse many floats, and lazy access might reduce work significantly (or add overhead: it depends).
creationix 22 hours ago [-]
Exactly. For my use cases, this format is amazing. I have very few floats, but lots and lots of objects, arrays, and strings with moderate levels of duplication and substring duplication. My data is produced in a build and then read in thousands or millions of tiny queries that look up a single value deep inside the structure.
rx works very well as a kind of embedded database like sqlite, but completely unstructured like JSON.
Also I'm working on an extension that makes it mutable using append-only persistent data structures with a fixed-block caching level that is actually a pretty good database.
creationix 22 hours ago [-]
if your data is lots and lots of arrays of floats, this is likely not the format for you. Use float arrays.
Also note it stores decimal in a very compact encoding (two varints for base and power of 10)
That said, while this is a text format, it is also technically binary safe and could be extended with a new type tag to contain binary data if desired.
meehai 1 days ago [-]
and with little data (i.e. <10Mb), this matters much less than accessibility and easy understanding of the data using a simple text editor or jq in the terminal + some filters.
creationix 9 hours ago [-]
Also good luck parsing 10 MiB of JSON in a loop that can't tolerate blocking the CPU for more than 10ms.
What's expensive is very relative to the use case.
xxs 1 days ago [-]
what do you mean by little data, most communication protocols are not one off
hrmtst93837 1 days ago [-]
That rule sounds clean until the DB dump, API trace, or language boundary lands in your lap. Binary formats are fine for tight inner loops, but once the data leaks into logs, tooling, support, or a second codebase, the bytes you saved tend to come back as time lost decoding some bespoke mess.
creationix 22 hours ago [-]
Yep. I did try binary formats first. I tried existing ones like CBOR, I tried making my own like Nibs. The text encoding is an operational concern, not a technical one.
This is the same reason I've been advocating for JSONL at work. It's not ideal technically, but it's a good balance of technically good enough while being also human friendly when things go wrong.
RX is one step towards less human friendly, but more machine friendly. I try to keep things balanced in my designs.
squirrellous 1 days ago [-]
I agree in principle. However JSON tooling has also got so good that other formats, when not optimized and held correctly, can be worse than JSON. For example IME stock protocol buffers can be worse than a well optimized JSON library (as much as it pains me to say this).
tabwidth 1 days ago [-]
Yeah the raw parse speed comparison is almost a red herring at this point. The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it. You're still loading the whole thing into memory, building the full object graph, and GC gets to clean all of it up after. That's the part where something like RX with selective access actually matters. Parse speed benchmarks don't capture that at all.
magicalhippo 23 hours ago [-]
> The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it.
There are SAX-like JSON libraries out there, and several of them work with a preallocated buffer or similar streaming interface, so you could stream the file and pick out the two fields as they come along.
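A miniature version of that streaming pick-out (handling only a flat pass over a top-level JSON object) shows the approach — and also why the reply below is right that it still walks every byte up to the fields it wants:

```typescript
// Pull named top-level fields out of a JSON object without materializing
// the rest of the object graph. Real SAX-style JSON libraries generalize
// this; this toy scans character by character.
function pickFields(json: string, wanted: string[]): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  let i = 0;
  const readString = (): string => {
    const start = i++; // i is at the opening quote
    while (json[i] !== '"') i += json[i] === "\\" ? 2 : 1;
    i++;
    return JSON.parse(json.slice(start, i));
  };
  const skipValue = (): [number, number] => {
    const start = i; // returns the [start, end) span of the value at i
    if (json[i] === '"') { readString(); return [start, i]; }
    if (json[i] === "{" || json[i] === "[") {
      let depth = 0;
      do {
        if (json[i] === '"') { readString(); continue; }
        if (json[i] === "{" || json[i] === "[") depth++;
        if (json[i] === "}" || json[i] === "]") depth--;
        i++;
      } while (depth > 0);
      return [start, i];
    }
    while (i < json.length && !",}".includes(json[i])) i++; // number/bool/null
    return [start, i];
  };
  i = json.indexOf("{") + 1;
  while (i < json.length) {
    while (" \n\t\r,".includes(json[i])) i++;
    if (json[i] === "}") break;
    const key = readString();
    while (json[i] !== ":") i++;
    i++;
    while (" \n\t\r".includes(json[i])) i++;
    const [s, e] = skipValue();
    if (wanted.includes(key)) out[key] = JSON.parse(json.slice(s, e));
  }
  return out;
}
```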
IshKebab 22 hours ago [-]
You still have to parse half the entire file on average. Much slower than formats that support skipping to the relevant information directly.
creationix 22 hours ago [-]
yep, this is exactly the kind of use case that caused me to design this format.
xxs 1 days ago [-]
As a parser: keep only indexes into the original file (input); don't copy strings or parse numbers at all (unless the strings fit in the index width, e.g. 32-bit).
That would make parsing faster, and there will be very little in terms of tree building (JSON can't really contain full-blown graphs), but it's rather complicated, and it will require hashing to allow navigation.
creationix 18 hours ago [-]
yep. I built custom JSON parsers as a first solution. The problem is you can't get away from scanning at least half the document bytes on average.
With RX and other truly random-access formats you could even optimize to the point of not even fetching the whole document. You could grab chunks from a remote server using HTTP range requests and cache locally in fixed-width blocks.
With JSON you must start at the front and read byte-by-byte till you find all the data you're looking for. Smart parsers can help a lot to reduce heap allocations, but you can't skip the state machine scan.
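The range-request idea mostly reduces to block-alignment math. A sketch, where the 4 KiB block size is an assumption and the actual fetch is left as a comment:

```typescript
// Map a byte span of a remote file to fixed-width cache blocks and the
// HTTP Range header that would fetch one missing block.
const BLOCK = 4096;

function blocksFor(start: number, end: number): number[] {
  // end is exclusive; return every block number the span touches
  const first = Math.floor(start / BLOCK);
  const last = Math.floor((end - 1) / BLOCK);
  return Array.from({ length: last - first + 1 }, (_, i) => first + i);
}

function rangeHeader(block: number): string {
  // usage: fetch(url, { headers: { Range: rangeHeader(b) } })
  return `bytes=${block * BLOCK}-${block * BLOCK + BLOCK - 1}`;
}
```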
Spivak 1 days ago [-]
Can you imagine if a service as chatty and performance sensitive as Discord used JSON for their entire API surface?
dietr1ch 20 hours ago [-]
A tiny note on the speed comparison: The 23,000x faster single-key lookup seems a bit misleading to me.
Once you get the computational-complexity advantage, you can make it as many times faster as you want. In these cases small instances matter for judging constants, and mean instance sizes matter to the average (mean?) user.
I'm not sure how to sell the advantage succinctly, though. Maybe just focus on "real-world" scenarios, but there's no footnote with details on the comparison.
creationix 17 hours ago [-]
That benchmark is a fair comparison for a real-world production workload and use case. Sadly I can't share the details. But suffice it to say that the dataset is a huge object with tens of thousands of paths as keys and moderately large objects as values (averaging around 3KB of JSON each), all with slightly different shapes. The use is reading just a few entries by path and then looking up some properties within those entries.
The benchmark measures (or is supposed to measure) end-to-end parse + lookup.
JSON: 92 MB
RX: 5.1 MB
Request-path lookup: ~47,000x faster
Time to decode a manifest and look up one URL path:
JSON: 69 ms
REXC: 0.003 ms
Heap allocations: 2.6 million vs. 1
JSON: 2,598,384
REXC: 1 (the returned string)
50lo 1 days ago [-]
The biggest challenge for formats like this is usually tooling. JSON won largely because every language supports it and every tool understands it.
Even a technically superior format struggles without that ecosystem.
latexr 1 days ago [-]
And that in turn affects tool adoption. I have dabbled in Lua for interacting with other software such as mpv, but never got much into the weeds with it because it lacks native JSON support, and I need to interact with JSON all the time.
creationix 15 hours ago [-]
yeah, LuaJIT is one of the use cases I had in mind working on this. JSON is pretty fast in modern JS engines, but in Lua land, JSON kinda sucks and doesn't really match the language without using virtual tables.
JSON has `null` values with string keys, but Lua doesn't have `null`. It has `nil`, but you can't have a key with a nil value. Setting nil deletes the key.
Lua tables are unordered. But JS and JSON are often ordered and order often matters.
RX, however, matches Lua/LuaJIT extremely well and should out-perform the JS Proxy-based decoder using metatables. Since it's using metatables anyway due to the lazy parsing, it's trivial to do things like preserve order when calling `pairs` and `ipairs`, and even include keys with associated null values.
You can round trip safely in Lua, which is not easy with most JSON implementations.
jbverschoor 1 days ago [-]
So this is two things? A BSON-like encoding + something similar to implementing random access / tree walker using streaming JSON?
Docs are super unclear.
_flux 1 days ago [-]
It doesn't seem the actual serialization format is specified? Other than in the code that is.
Is it versioned? Or does it need to be?
killbot5000 20 hours ago [-]
The documentation references a "decode" function, and it's imported in the example code, but it's never called. I'm not sure what the API is after reading the examples.
creationix 1 days ago [-]
A new random-access JSON alternative from the creator of nvm.sh, luvit.io, and js-git.
bsimpson 1 days ago [-]
It feels petty to show up with a naming nit, but the name is unfortunately/confusingly similar to the already well-known RxJS.
Why is it called RX?
creationix 18 hours ago [-]
I'm happy to hear suggestions. This format was actually the internal .rexc bytecode for Rex (routing expressions), but when I realized it was actually a pretty good standalone format, I renamed it `.rx` for short. I am aware of RxJS, but I think that `rx-format` is different enough and `.rx` file extensions are unique enough, it's not too confusing.
sick is binary, rx is textual (this matters for tooling)
sick has size limits (65534 max keys for example. I have real-world rx datasets reaching this size already)
rx uses arbitrary-precision variable-length b64 integers. There are no size limits anywhere inherent in the format, just in implementations.
sick does not preserve object key order
rx preserves object key order, but still implements O(log2 N) lookups for object keys.
etc.
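Preserving insertion order while still getting O(log2 N) lookups only needs a side index of entry positions sorted by key — a sketch of the idea, not the RX on-disk layout:

```typescript
// Entries stay in original order; the index holds positions sorted by
// key, so a lookup is a binary search over the index.
function buildSortedIndex(keys: string[]): number[] {
  return keys.map((_, i) => i).sort((a, b) => (keys[a] < keys[b] ? -1 : 1));
}

function lookupPos(keys: string[], index: number[], key: string): number {
  let lo = 0, hi = index.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const k = keys[index[mid]];
    if (k === key) return index[mid]; // position in original key order
    if (k < key) lo = mid + 1; else hi = mid - 1;
  }
  return -1;
}
```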
WatchDog 1 days ago [-]
Cool project.
The viewer is cool, took me a while to find the link to it though, maybe add a link in the readme next to the screenshot.
TKAB 21 hours ago [-]
could this be useful for embedding info in server-generated web pages that is then picked up by JavaScript? e.g. a tom-select country picker that gets its data from an embedded RX structure?
creationix 18 hours ago [-]
yes, this would work very well for any case where you have embedded databases of unstructured data that you want to query in a website or edge server
Spivak 1 days ago [-]
I love these projects, I hope one of them someday emerges as the winner because (as it motivates all these libraries' authors) there's so much low hanging fruit and free wins changing the line format for JSON but keeping the "Good Parts" like the dead simple generic typing.
XML has EXI (Efficient XML Interchange) for precisely the reason of getting wins over the wire but keeping the nice human readable format at the ends.
snthpy 1 days ago [-]
TIL.
EXI looks useful. Now I just wish there was a renderer in the pugjs format, as I find that terse format much more readable than verbose XML. I also find indentation-based syntax easier for visually parsing hierarchical structure.
transfire 1 days ago [-]
I am a little confused. Is this still JSON? Is it “binary“ JSON?
Human unreadable, ascii output. Line up and get yours today!
creationix 18 hours ago [-]
it's not really possible to stay human readable and get the compression levels and random access properties I was going for. But it is as human tooling friendly as possible given the constraints.
SV_BubbleTime 13 hours ago [-]
>it's not really possible
I find it obvious that your first attempt failed. Try again, you have not even remotely failed enough if you are making the argument that this is kinda readable. Yes, ascii words are easy to pick out, you didn’t do that, you did the part that makes it all harder.
benatkin 1 days ago [-]
Interesting. I've heard about cursors in reference to a Rust library that was mentioned as being similar to protobuf and cap'n proto.
Does this duplicate the name of keys? Say if you have a thousand plain objects in an array, each with a "version" key, would the string "version" be duplicated a thousand times?
Another project a lot of people aren't aware of even though they've benefitted from it indirectly is the binary format for OpenStreetMap. It allows reading the data without loading a lot of it into memory, and is a lot faster than using sqlite would be.
Yes, the format allows objects to be stored with a pointer to a shared schema (either an array of keys or another object that has the desired keys).
The current implementation is pretty close to ideal in deciding when to use this encoding.
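Shape sharing can be sketched like this — the heuristics RX actually uses to decide when to share a schema are not modeled here:

```typescript
// Objects with identical key sets point at one shared schema (the key
// list) and store only their values, so "version" is written once even
// if a thousand objects carry it.
function encodeWithSchemas(objects: Record<string, unknown>[]) {
  const schemas: string[][] = [];
  const schemaIds = new Map<string, number>();
  const rows = objects.map((obj) => {
    const keys = Object.keys(obj);
    const sig = JSON.stringify(keys); // key set identity
    let id = schemaIds.get(sig);
    if (id === undefined) {
      id = schemas.length;
      schemas.push(keys);
      schemaIds.set(sig, id);
    }
    return { schema: id, values: keys.map((k) => obj[k]) };
  });
  return { schemas, rows };
}
```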
gritzko 1 days ago [-]
I recently created my own low-overhead binary JSON cause I did not like Mongo's BSON (too hacky, not mergeable). It took me half a day maybe, including the spec, thanks Claude. First, implemented the critical feature I actually need, then made all the other decisions in the least-surprising way.
At this point, probably, we have to think how to classify all the "JSON alternatives" cause it gets difficult to remember them all.
The current format version is the exact same feature set as JSON. I even encode numbers as arbitrary precision decimals (which JSON also does). This is quite different from CBOR which stores floats in binary as powers of 2.
I could technically add binary to the format, but then it would lose the nice copy-paste property. But with the byte-aware length prefixes, it would just work otherwise.
SV_BubbleTime 21 hours ago [-]
You went from BSON to your own and skipped CBOR and Protobuf? … I wonder if you would have made different decisions without Claude vibing you in a direction?
NoSalt 22 hours ago [-]
Why do we need an "alternative" when JSON, itself, is so fantastic?
creationix 18 hours ago [-]
the project framing needs some help perhaps. JSON is really good at a lot of use cases that this will never replace. But there are cases where JSON is currently used where this is much better. In particular large unstructured datasets where you only need to read a tiny subset of the data in a single request.
My one eyebrow raise is - is there no binary format specification? https://github.com/creationix/rx/blob/main/rx.ts#L1109 is pretty well commented, but you can't call it a JSON alternative without having some kind of equivalent to https://www.json.org/ in all its flowchart glory!
One old version that is meant to be more human readable/writable is jsonito
https://github.com/creationix/jsonito
I'll add similar diagrams and docs for the format itself here.
https://github.com/creationix/rx/blob/main/docs/rx-format.md
Railroad diagrams will come later when I have more time.
The author claims this is because of copy and pasting… cool, remind me what BASE64 is again?
Sure, you can encode as JSON, then compress with gzip and then base64 encode. You'll probably end up with something smaller than rx and be extremely safe to copy-paste. But your consumers are going to consume orders of magnitude more CPU reading data from this document.
RX is usable as-is, is compressed, and is copy-pasteable. It's the unique combination of properties that makes it interesting.
>Q^mSat,3^b:d+s+E,4Fri,3^u:h+k+u,6Thu,3^P:j+
My man… no. I have no doubt you could kind of figure out what that sample is hot off the heels of writing this, and likely not in six months. And to consider that anyone else would fill their brain with the rules to decipher that, Nah 2.0.
Even a trivial doc like this is challenging for me to read as a human.
> Nah.com, fam.
Being able to copy/paste a serialization format is not really a feature i think i would care about.
> None of the space savings/efficiency of binary
For string heavy datasets, it's nearly the same encoding size as binary. I get 18x smaller sizes compared to JSON for my production datasets. This was originally designed as a binary format years ago (https://github.com/creationix/nibs) and then later after several iterations, converted to text.
> Being able to copy/paste a serialization format is not really a feature i think i would care about
Imagine being paged at 3am because some cache in some remote server got poisoned with a bad value (unrelated to the format itself). You load the value in dashboard, but it's encoded as CBOR or some binary format and so you have to download it in a binary safe way, upload that binary file to some tooling or install a cbor reader to your CLI. But then you realize that you don't have exec access to the k8s pods for security reasons, but do have access to a web-based terminal. Again, to extract a binary value you would need to create a shell, hexdump the file and somehow copy-paste that huge hexdump from the web-based terminal to your local machine, un-hex dump it, and finally load it into some CBOR reader.
A text format, however is as simple as copy-paste the value from the dashboard and paste into some online tool like https://rx.run/ to view the contents.
Another thing: I put in a 400KB JSON and the REXC is 250KB. Cool, but ideally the viewer should also tell me the compressed sizes, because that same JSON is 65KB after zstd. No idea how well your REXC will compress.
edit: I think I figured out that you can right-click "copy as REXC" on the top object in the viewer to get an output. Compressed, the same document that started as my JSON came to 110KB, so this is not great... 2x the size of JSON after compression.
The primary use case is not compression, it's just a nice side effect of the deduplication. This will never beat something like zstd, brotli, or even gzip.
My production use cases are unique in that I can't afford the CPU to decompress to JSON and then parse to native objects. But with this format, I can use the text as-is with zero preprocessing and as a bonus my datasets are 18x smaller.
Right, and that makes sense. There is more information in here: the entire thing is length-prefixed and even indexed for O(1) array lookups and O(log2 N) object lookups.
If you don't care about random access and you don't mind the overhead of decompression, don't use RX.
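The random-access lookup idea can be sketched with a toy format (purely illustrative: the function names and decimal length prefixes here are not RX's actual encoding, which uses base64 varints):

```javascript
// Toy illustration (not RX's real wire format): an array serialized with a
// separate offset table, so element N is reached by one table read plus one
// slice, O(1), instead of scanning every preceding element.

// Hypothetical encoder: concatenate length-prefixed items, record offsets.
function encode(items) {
  let body = "";
  const offsets = [];
  for (const item of items) {
    offsets.push(body.length);
    body += item.length + ":" + item; // decimal length prefix for readability
  }
  return { body, offsets }; // a real format would serialize the table too
}

// Random access: jump straight to the recorded offset, read the prefix, slice.
function get(doc, index) {
  const start = doc.offsets[index];
  const colon = doc.body.indexOf(":", start);
  const len = Number(doc.body.slice(start, colon));
  return doc.body.slice(colon + 1, colon + 1 + len);
}

const doc = encode(["Sat", "Fri", "Thu"]);
console.log(get(doc, 2)); // "Thu", reached without scanning "Sat" or "Fri"
```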
Let me know what you think
https://github.com/creationix/rx/blob/main/README.md#when-to...
Unless, if I'm reading that correctly, it only has a text encoding as long as you can guarantee you don't have any Unicode?
The format is technically a binary format in that length prefixes are counts of bytes. But in practice it is a textual format since you can almost always copy-paste RX values from logs to chat messages to web forms without breaking it.
Unicode doesn't break anything, since strings are encoded as raw Unicode with UTF-8 byte-length prefixes. It supports Unicode perfectly.
If your data only contains 7-bit ASCII strings, the entire encoding is ASCII. If your data contains unicode, RX won't escape it, so the final encoding will contain unicode as UTF-8.
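The byte-length-prefix point is easy to see in a couple of lines of Node (illustrative only: `prefix` is a made-up helper, and RX's real prefixes are base64 varints, not decimal):

```javascript
// A byte-length prefix counts UTF-8 bytes, not characters, so a reader can
// slice the exact span without scanning or escaping anything.
function prefix(str) {
  const bytes = Buffer.from(str, "utf8");
  return `${bytes.length}:${str}`;
}

console.log(prefix("Sat"));   // "3:Sat"   -- 3 chars, 3 bytes (pure ASCII)
console.log(prefix("naïve")); // "6:naïve" -- 5 chars, 6 bytes (ï is 2 bytes)
console.log(prefix("日本"));  // "6:日本"  -- 2 chars, 6 bytes
```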
I did add some small examples to the repo.
https://github.com/creationix/rx/blob/main/samples/quest-log...
The older, slightly outdated, design spec is in the older rex repo (this format was spun out of the rex project when I realized it's actually a good standalone format)
https://github.com/creationix/rex/blob/main/rexc-bytecode.md
Oof.
It makes sense that they need a canonical form, because they want the same values to have the same content hashes.
No human is reading much data regardless of the format.
What is the benefit over using for example BSON?
https://www.npmjs.com/package/@creationix/rx
Yes, serially. Which means no random-access across the transfer channel.
(Or to avoid using cat to read, whatever2json file.whatever | jq)
What might be interesting is to have a tool that processes full JSON data and creates a b-tree index on specified keys. Then you could run searches against the index that return byte offsets you can use for actual random access on the original JSON.
OTOH, this is basically just recreating a database, just using raw JSON as its storage format.
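A rough sketch of that index idea, assuming newline-delimited JSON so offset tracking stays trivial (`buildIndex` and `lookup` are hypothetical names, not an existing tool):

```javascript
// Scan once, record byte offsets keyed by a field, then serve point lookups
// by slicing the raw bytes. The original JSON is left untouched.
function buildIndex(buf, keyField) {
  const index = new Map(); // key value -> { offset, length } in bytes
  let offset = 0;
  for (const line of buf.toString("utf8").split("\n")) {
    if (line) {
      index.set(JSON.parse(line)[keyField], { offset, length: Buffer.byteLength(line) });
    }
    offset += Buffer.byteLength(line) + 1; // +1 for the newline
  }
  return index;
}

function lookup(buf, index, key) {
  const entry = index.get(key);
  // Random access: slice just this record's bytes and parse only it.
  return entry && JSON.parse(buf.slice(entry.offset, entry.offset + entry.length));
}

const data = Buffer.from('{"id":"a","v":1}\n{"id":"b","v":2}\n');
const idx = buildIndex(data, "id");
console.log(lookup(data, idx, "b")); // { id: 'b', v: 2 }
```

With a real file you would persist the index and use `fs.read` at the recorded offset instead of slicing an in-memory buffer.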
I did build that once. But keeping track of the index is a pain. Sometimes I was able to generate the index on-demand and cache it in some ephemeral storage, but overall it didn't work out so well.
This system with RX will work better because I get the indexes built-in to the data file and can always convert it back to JSON if needed.
Maybe a query over the random-access file then converted into JSON would work?
This did catch my eye, however: https://github.com/creationix/rx?tab=readme-ov-file#proxy-be...
While this is a neat feature, this means it is not in fact a drop-in replacement for JSON.parse, as you will be breaking any code that relies on that result being a mutable object.
This lib keeps the compact representation at runtime and lets you read it without putting all the entities on the heap.
Cool!
It falls down if you have e.g. an array of 1 million small items, because you still need to skip over 999999 items to get to the last one. It looks like RX adds some support for indexes to improve that.
I was in this situation where we needed to sparsely read huge JSON files. In the end we just switched to SQLite which handles all that perfectly. I'd probably still use it over RX, even though there's a somewhat awkward impedance mismatch between SQL and structs.
It's not like you can just tell them to move to protobuf.
If you are working with an end you don’t control, this “newer better” format isn’t in your cards either.
RX can represent any value JSON can represent. It doesn't even lose key order like some random-access formats do.
In fact, RX is closer to JSON than CBOR.
Take decimals as an example:
JSON numbers are arbitrary precision numbers written in decimal. This means it can technically represent any decimal number to full precision.
CBOR stores numbers as binary floats, which are approximations of decimal numbers. This is why they needed to add Decimal Fractions (Tag 4).
RX already stores numbers as a decimal base and a decimal power of 10. So out of the box, it matches JSON.
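The difference shows up in a few lines; `toDecimal` and `render` below are hypothetical helpers illustrating the (coefficient, power-of-10) idea, not RX's wire encoding:

```javascript
// Binary floats approximate decimals: 0.1 + 0.2 !== 0.3
console.log(0.1 + 0.2); // 0.30000000000000004

// A (coefficient, power-of-10) pair keeps the decimal exact, like JSON's
// textual numbers, and round-trips without loss.
function toDecimal(text) {
  const [int, frac = ""] = text.split(".");
  return { base: BigInt(int + frac), exp: -frac.length };
}
function render({ base, exp }) {
  const s = base.toString().padStart(-exp + 1, "0");
  return exp === 0 ? s : s.slice(0, exp) + "." + s.slice(exp);
}

console.log(render(toDecimal("0.30"))); // "0.30", trailing zero preserved
```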
Think of it as a hybrid between JSON, SQLite, and generic compression. This format really excels for use cases where large read-only build artifacts are queried by worker nodes like an embedded database.
rx works very well as a kind of embedded database like sqlite, but completely unstructured like JSON.
Also I'm working on an extension that makes it mutable using append-only persistent data structures with a fixed-block caching level that is actually a pretty good database.
Also note it stores decimal in a very compact encoding (two varints for base and power of 10)
That said, while this is a text format, it is also technically binary safe and could be extended with a new type tag to contain binary data if desired.
What's expensive is very relative to the use case.
This is the same reason I've been advocating for JSONL at work. It's not ideal technically, but it's a good balance of technically good enough while being also human friendly when things go wrong.
- https://vercel.com/blog/how-we-made-global-routing-faster-wi... - https://vercel.com/blog/scaling-redirects-to-infinity-on-ver...
RX is one step towards less human friendly, but more machine friendly. I try to keep things balanced in my designs.
There are SAX-like JSON libraries out there, and several of them work with a preallocated buffer or similar streaming interface, so you could stream the file and pick out the two fields as they come along.
That would make parsing faster, and there will be very little in terms of tree structure (JSON can't really contain full-blown graphs), but it's rather complicated, and it will require hashing to allow navigation, though.
With RX and other truly random-access formats you could even optimize to the point of not even fetching the whole document. You could grab chunks from a remote server using HTTP range requests and cache locally in fixed-width blocks.
With JSON you must start at the front and read byte-by-byte till you find all the data you're looking for. Smart parsers can help a lot to reduce heap allocations, but you can't skip the state machine scan.
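A sketch of that fixed-width-block caching idea, with the range fetcher injected so it runs offline (in real use it would issue an HTTP Range request); `makeReader` and the block size are assumptions for illustration, not RX's API:

```javascript
const BLOCK = 4096; // fixed block width; assumed, not prescribed by RX

function makeReader(fetchRange) {
  const cache = new Map(); // block number -> Buffer
  return function read(offset, length) {
    const chunks = [];
    // Fetch only the blocks covering [offset, offset + length); cached
    // blocks are reused, so repeated reads cost no extra fetches.
    for (let b = Math.floor(offset / BLOCK); b * BLOCK < offset + length; b++) {
      if (!cache.has(b)) cache.set(b, fetchRange(b * BLOCK, BLOCK));
      chunks.push(cache.get(b));
    }
    const start = offset % BLOCK;
    return Buffer.concat(chunks).slice(start, start + length);
  };
}

// Simulated "remote" file plus a counting fetcher (real use: Range header).
const remote = Buffer.from(Array.from({ length: 10000 }, (_, i) => i % 251));
let fetches = 0;
const read = makeReader((off, len) => { fetches++; return remote.slice(off, off + len); });

read(4090, 10); // spans blocks 0 and 1: two fetches
read(4090, 10); // served entirely from cache: still two fetches
```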
Once you have the computational-complexity advantage, you can make it as many times faster as you want. In these cases small instances matter for judging the constants, and mean instance sizes matter to the average user.
I'm not sure how to sell the advantage succinctly though. Maybe just focus on "real-world" scenarios, but there's no footnote with details on the comparison
The benchmark measures (or is supposed to measure) end-to-end parse + lookup.
File size: JSON 92 MB, RX 5.1 MB
Request-path lookup: ~47,000x faster
Time to decode a manifest and look up one URL path: JSON 69 ms, REXC 0.003 ms
Heap allocations: JSON 2,598,384 (2.6 million), REXC 1 (the returned string)
Even a technically superior format struggles without that ecosystem.
JSON has `null` values with string keys, but Lua doesn't have `null`. It has `nil`, but you can't have a key with a nil value: setting nil deletes the key.
Lua tables are unordered. But JS and JSON are often ordered and order often matters.
RX, however, matches Lua/LuaJIT extremely well and should outperform the JS Proxy-based decoder using metatables. Since it's using metatables anyway due to the lazy parsing, it's trivial to do things like preserve order when calling `pairs` and `ipairs`, and even include keys with associated null values.
You can round trip safely in Lua, which is not easy with most JSON implementations.
Docs are super unclear.
Is it versioned? Or does it need to be?
Why is it called RX?
sick is binary; rx is textual (this matters for tooling).
sick has size limits (65534 max keys, for example; I have real-world rx datasets reaching this size already); rx uses arbitrary-precision variable-length b64 integers. There are no size limits anywhere inherent in the format, just in implementations.
sick does not preserve object key order; rx preserves object key order, but still implements O(log2 N) lookups for object keys.
etc.
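For readers wondering what a "b64 varint" might look like, here is an illustrative version; the actual alphabet, digit order, and termination rules RX uses may differ:

```javascript
// Write an arbitrary-size non-negative integer as base-64 digits using a
// URL-safe alphabet. Stays copy-paste safe and has no upper bound.
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

function encodeVarint(n) { // n: BigInt >= 0
  let out = "";
  do {
    out = ALPHABET[Number(n % 64n)] + out; // most significant digit first
    n /= 64n;
  } while (n > 0n);
  return out;
}

function decodeVarint(s) {
  let n = 0n;
  for (const ch of s) n = n * 64n + BigInt(ALPHABET.indexOf(ch));
  return n;
}

console.log(encodeVarint(65534n).length); // 3 -- vs 5 decimal digits
console.log(decodeVarint(encodeVarint(2n ** 100n)) === 2n ** 100n); // true
```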
The viewer is cool, took me a while to find the link to it though, maybe add a link in the readme next to the screenshot.
XML has EXI (Efficient XML Interchange) for precisely the reason of getting wins over the wire but keeping the nice human readable format at the ends.
EXI looks useful. Now I just wish there was a renderer in the pugjs format, as I find that terse format much more readable than verbose XML. I also find indentation-based syntax easier to visually parse for hierarchical structure.
Sample output:
'fdiscovered,aextreme,7danger,6+1A+16;6level_range,b:QThe Heap ,d'th
Human unreadable, ascii output. Line up and get yours today!
It's obvious to me that your first attempt failed. Try again; you have not even remotely failed enough if you are arguing that this is kind of readable. Yes, ASCII words are easy to pick out, but you didn't do that; you did the part that makes it all harder.
Does this duplicate the name of keys? Say if you have a thousand plain objects in an array, each with a "version" key, would the string "version" be duplicated a thousand times?
Another project a lot of people aren't aware of even though they've benefitted from it indirectly is the binary format for OpenStreetMap. It allows reading the data without loading a lot of it into memory, and is a lot faster than using sqlite would be.
Edit: the rust library I remember may have been https://rkyv.org/
Yes, the format allows for objects to be stored with a pointer to a shared schema (either an array of keys or another object that has the desired keys)
The current implementation is pretty close to ideal at deciding when to use this encoding.
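The shared-schema idea can be sketched like this (field names and structure are made up for illustration; RX's wire layout differs):

```javascript
// When many objects share the same keys, store the key list once and let
// each row point at it, so "version" etc. are never duplicated per object.
function encodeWithSchemas(objects) {
  const schemas = [];
  const seen = new Map(); // "k1,k2" signature -> schema index
  const rows = objects.map(obj => {
    const keys = Object.keys(obj);
    const sig = keys.join(",");
    if (!seen.has(sig)) {
      seen.set(sig, schemas.length);
      schemas.push(keys);
    }
    return { schema: seen.get(sig), values: keys.map(k => obj[k]) };
  });
  return { schemas, rows };
}

const input = Array.from({ length: 1000 }, (_, i) => ({ version: i, name: "n" + i }));
const out = encodeWithSchemas(input);
console.log(out.schemas.length); // 1 -- keys stored once, not 1000 times
```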
At this point we probably have to think about how to classify all the "JSON alternatives", because it gets difficult to remember them all.
Is RX a subset, a superset or bijective to JSON?
https://github.com/gritzko/librdx/tree/master/json
I could technically add binary to the format, but then it would lose the nice copy-paste property. But with the byte-aware length prefixes, it would just work otherwise.
Maybe a better framing would be NoSQL SQLite?