Hacker News

Bad JSON Parsers (github.com)

211 points | posted by lovasoa 3 days ago | 188 comments
thegeomaster said 3 days ago:

The JSON spec [1] clearly states, in section 9:

  An implementation may set limits on the maximum depth of
  nesting.
So, this is kind of a pointless exercise. The author would do well to inform themselves better before slinging epithets such as "bad" for a library.

[1]: https://tools.ietf.org/html/rfc7159

ken said 3 days ago:

I would be perfectly OK with that answer if they documented that they use the C stack and therefore have a small finite depth limit, and/or threw some sort of TooMuchNesting exception before they hit that limit.

AFAICT, many/most of these libraries do neither of the above. There's nothing in the documentation to suggest they'll have trouble with deeply nested inputs, and their response to such a situation is to simply crash the process. I'd call that a bug.

I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons". If you've got limits, you've got to document them, or at least make them recoverable.

asveikau said 3 days ago:

It's hard for a library to portably know when the C stack will exhaust or give a concrete estimate that makes sense for documentation purposes.

Firstly, that concept is not exposed at all to standard-compliant C code. The standard is vague enough to allow for many possible variations of the program stack abstraction. So you could not do this in portable, standard C.

Second, the library author doesn't know ahead of time how much stack space you, the caller, have consumed before calling the library. Do you call it deep inside some ui framework with a big stack? With a huge i/o buffer on the stack? They don't know.

Translating all of this into "this is how deeply nested your json can be" is not a sane exercise; at best you can give a machine-specific guess that will vary a bit with your own application's stack needs.

But perhaps one can side-step all of that and get most of the way with an artificial, arbitrary-constant runtime depth check that is safely below likely actual stack limits.
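
For illustration, here is a minimal sketch of such an arbitrary-constant depth check, written in Python rather than C and handling only a toy grammar of nested arrays (the constant, the exception type, and the function names are all hypothetical):

    MAX_DEPTH = 512  # arbitrary constant, assumed to sit safely below real stack limits

    class TooMuchNesting(ValueError):
        """Recoverable error raised instead of letting the native stack blow up."""

    def parse_array(text, i=0, depth=0):
        """Parse a toy grammar of nested arrays, returning (value, next_index)."""
        if depth > MAX_DEPTH:
            raise TooMuchNesting(f"nesting deeper than {MAX_DEPTH} levels")
        assert text[i] == '['
        i += 1
        items = []
        while text[i] != ']':
            value, i = parse_array(text, i, depth + 1)
            items.append(value)
        return items, i + 1

    # parse_array('[[[]]]')              -> ([[[]]], 6)
    # parse_array('[' * 600 + ']' * 600) -> raises TooMuchNesting instead of crashing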

ken said 3 days ago:

Sure, that’s all true. So don’t use the C stack for parsing. It’s not difficult.

taneq said 3 days ago:

> I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons".

Welcome to C, enjoy your stay.

ken said 3 days ago:

Most of these libraries are for HLLs.

nabla9 said 3 days ago:

I just wrote a standard-conforming JSON parser:

    cat standard-still-bad.sh

    #!/bin/sh
    echo "ERROR: JSON parser implementation limit." 1>&2
    exit 1 # terminate and indicate error

--- https://tools.ietf.org/html/rfc7159#section-9

> An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.

otterley said a day ago:

"Conforming" is not identical to "useful." A car that doesn't start is still a car.

Svip said 3 days ago:

It's funny, the spec they reference[0] makes no mention of parsers or nesting. But RFC 8259[1] clearly does. Moreover, I notice they were both published in December 2017. Why two specs for the same standard?

[0] http://www.ecma-international.org/publications/files/ECMA-ST...

[1] https://tools.ietf.org/html/rfc8259

gsnedders said 3 days ago:

As is almost always the case when there are multiple specs purporting to describe the same thing: politics.

The Ecma spec is meant to be a spec defining a format and its semantics; the RFC places requirements on implementations of that format as well as describing its semantics.

snek said 3 days ago:

There was a rumor that the IETF was going to specify a bunch of new breaking changes in RFC 7159 (March 2014), and because of that (irrational) fear ECMA published the 1st edition of ECMA-404 very quickly in October of 2013.

RFC 8259 and the 2nd edition of ECMA-404 contain the same change, a single sentence, specifying that JSON must use UTF-8.

skybrian said 3 days ago:

Maybe the headline is bad, but making a table of implementation limits of various JSON parsers seems useful?

Maybe we focus on headlines too much?

webkike said 3 days ago:

I dunno, the author claims they're sorting the table from worst to best, instead of by minimum maximum depth to maximum maximum depth

lovasoa said 3 days ago:

Maybe I shouldn't have called this repository "bad json parsers". I just added the following to the README:

> The name of this repository is intentionally provocative. It's goal is to document limitations of existing json parsers, not to denigrate the work of the many contributors who worked on the various libraries presented here.

Igelau said 3 days ago:

The word you're looking for is clickbait ;)

"We Ranked The Top JSON Parsers By Call Stack Depth... Number 6 Will Shock You!"

bb88 said 3 days ago:

First, I appreciate the change to the readme. I think that goes a long way.

Second, "bad" here is highly subjective. I've never ever hit the recursive depth limit in python with JSON, and if I do, I'll probably think about doing it a different way, in a way that won't require python to do deep recursion.

elpocko said 3 days ago:

Its*. "It is goal" makes no sense.

lovasoa said 3 days ago:

fixed

paulddraper said 3 days ago:

Yeah, this is dumb.*

I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec) than not allowing thousands of nested arrays.

---

* Except that this is a potential DoS vector. From that angle, it has some interest.

Svip said 3 days ago:

> I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec)

The spec has something specifically to say about this[0]:

> Note that when such software is used, numbers that are integers and are in the range [-(2^53)+1, (2^53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

Basically, allow up to 2^53, and you ought to be fine.

[0] https://tools.ietf.org/html/rfc8259#section-6

paulddraper said 3 days ago:

Agreed, such a parse would be "to spec". It says "An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers."

On a practical level, parsers not handling 64-bit ints has bit me more than parsers not handling 1000+ levels of nesting.

axman6 said 3 days ago:

As a side note, the Haskell implementation also supports numbers of arbitrary size, again limited by memory, for both parsing and encoding.

andrepd said 3 days ago:

From TFA

>The two most recent JSON standards RFC 8259 and RFC 7159 both say "An implementation may set limits on the maximum depth of nesting". However, the ECMA-404 specification doesn't contain any limit on how deeply nested JSON structures can be.

ddlatham said 3 days ago:

To be fair, that text was added after the above comment was posted.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...

mantap said 3 days ago:

Not stating a limit doesn't mean there's NO limit. It means the limit is implementation defined. If the way you interpret a spec implies a requirement of infinite memory or infinite stack, then the way you interpret it is wrong.

EugeneOZ said 3 days ago:

Reading "MUST" when specification says "may" is also wrong.

vesinisa said 3 days ago:

Nevertheless, the parser failing at only 100 levels of nesting is shockingly bad.

femto113 said 3 days ago:

I work with complex JSON every day. Even painfully complex structures like Facebook's Ad Insights API or AWS Lambda Events only get around 5 levels deep, so 10 deep is already exotic, and I can't even conceive of a use case for 100. The method of failure is important too--would it really be better to just keep going forever, or die when maximum stack depth was reached, or the machine runs out of RAM (or swap space...)? Treating absurdly nested JSON the same way you would any invalid JSON input seems like a much friendlier and less dangerous approach overall.

nostrademons said 3 days ago:

Think of error cases and malicious input instead of well-written APIs.

When I was testing Gumbo [1] on all the HTML files in Google's index, I ran into one that had 48,000 levels of nested HTML tags. (Ironically, Gumbo itself handled this fine, but the test script I used to verify the output died.) I posted it to Memegen with a link to the original source URL, and then got called out for crashing Chrome. Apparently I wasn't the only one who didn't think about recursion limits in a parser. (Both bugs have since been fixed.)

What was the wayward HTML file? It was an XML file served with the wrong content type. The file contained 48,000 self-closing tags. When you parse an unknown element with a self-closing tag in HTML, it ignores the trailing />, treats it as an ordinary unknown element, and happily keeps generating a deeper and deeper DOM.

A stack overflow that results in a segfault is a pretty serious DOS vulnerability in a JSON parser. You probably could take down a good portion of the Internet by sending JSON requests to their AJAX endpoints that consist of 1M of {.

[1] https://github.com/google/gumbo-parser

markpapadakis said 3 days ago:

Was Gumbo used for parsing crawled pages for indexing reasons? How was it used internally? I presume they use some headless-Chrome-like technology for most of that nowadays.

nostrademons said 3 days ago:

No, the indexer used a custom HTML parser that was ancient (IIRC, when I was there it wasn't even fully HTML5-compliant - it doesn't have to be, because if you're tokenizing pages for indexing the only really important parts are text, anchors, meta tags, formatting, and headers, and as long as you basically get those right it's not important that you have the exact parse tree that a browser would.) It was very tightly optimized for speed, eg. it was SAX-style, with callbacks instead of heap allocations; the data that it returned to the caller was indexes into the original source data rather than case-normalizing and copying buffers; and the liberties it took with the HTML spec were largely so that it could minimize branching and state during the parse.

Headless Chrome existed within the indexing system while I was there but it was turned on for only a very small section of the index for CPU reasons. Their public statements since have indicated it's used for everything now.

Gumbo grew out of a templating language that I was working on with a larger team within Search Infrastructure and Maps. AFAIU the templating language is still being used, though I think Gumbo isn't being used for it anymore (its syntax diverged from pure HTML5 at some point). It was really intended for tools like refactoring & static analysis: that's why it's slower than the indexing HTML parser; why it was done in C rather than C++ (easier binding to scripting languages); why the API is intended to be round-trippable, with access to the original text that data structures refer to; and why it put a premium on generating accurate parses over speed.

axman6 said 3 days ago:

Stick any structure in DynamoDB and you’ve already halved your nesting depth due to its encoding a superset of JSON in JSON. Then start encoding more complex structures like the ones the others have mentioned and these limitations start to bite pretty quickly.

Bnshsysjab said 3 days ago:

Python handles function stack depth by limiting it to 1000 by default; I’ve run into that limitation with recursive functions a few times, but I can override it by setting a new maximum in code. Couldn’t something similar be done with JSON?
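
A rough sketch of that override in CPython (the depth and limit values are illustrative; whether raising the limit helps a given JSON parser depends on how that parser is implemented, and raising it too far can still crash the process with a real stack overflow):

    import sys

    def depth(value, d=0):
        """Recursively measure how deeply a list is nested."""
        return depth(value[0], d + 1) if isinstance(value, list) and value else d

    nested = []
    for _ in range(2000):
        nested = [nested]          # a list nested 2000 levels deep

    try:
        depth(nested)              # exceeds CPython's default limit of 1000 frames
    except RecursionError:
        print("hit the default recursion limit")

    sys.setrecursionlimit(5000)    # override the artificial limit in code
    print(depth(nested))           # prints 2000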

ken said 3 days ago:

> I can't even conceive of a use case for 100

A naive serialization of cons cells?

lovasoa said 3 days ago:

> I can't even conceive of a use case for 100

A serialized trie?

ryanianian said 3 days ago:

Lots of recursive data-structures (trees/graphs/*) lend themselves nicely to lots of nesting. Such documents aren't really intended to be read/written by humans.

cortesoft said 3 days ago:

If they aren't meant to be read/written by humans, JSON is probably a bad choice.

crazygringo said 3 days ago:

I couldn't disagree more.

If you need to serialize and deserialize data between tools written in different languages -- e.g. write data to a file and read it in another tool -- then what would be better than JSON?

Nothing about JSON is inherently good or bad for deeply hierarchical/recursive data.

shakna said 3 days ago:

Deeply hierarchical/recursive data is equally well served by XML and Protobuf.

However, JSON itself can be... painful for some subsets of data that you may wish to move between applications: binary data, well-typed data. JSON tends [0] to use floats, so you may need to adjust your expectations about how to store numbers if precision matters. You can work around a lot of these, such as through various encodings, but they are workarounds.

A lot of those workarounds depend on re-encoding, which unfortunately means using strings, which can run afoul of "An implementation may set limits on the length and character contents of strings." and silently drop important bytes, or simply cut off the end of the payload.

[0] The grammar is well-defined, the behaviour is not. "This specification allows implementations to set limits on the range and precision of numbers accepted." https://tools.ietf.org/html/rfc8259#page-7

Bnshsysjab said 3 days ago:

Wouldn’t XML increase filesize even with compression applied? Obviously there are better machine-readable formats out there, but XML is the middle ground between efficiency and support.

shakna said 3 days ago:

I would lean toward Protobuf wherever you can use it. It is designed to deal with issues like this. However, you can't always fulfill some of the requirements for it.

In those areas, XML strikes a practical balance. Especially if you can use a significant subset of XML. (If you don't need DOM, then use a parser that doesn't produce it, etc.)

cortesoft said 3 days ago:

I would say protobuffs

vesinisa said 3 days ago:

No, and if they cannot even be read by a machine, it's bad.

corprew said 3 days ago:

This works out well in practice -- the ruby parser that ships with the language (the 100 levels one) works for most purposes and will even go back to an all-ruby implementation if it can't load its C extension. It's a reasonable default parser for almost all cases.

However, if you need the all-singing, all-dancing, handles the most complex use cases super fast parser, which is Oj at the bottom (which handles infinite levels), then you have the option to install it and use it.

bryanrasmussen said 3 days ago:

"Bad" is a value statement; they didn't say "not to spec", "invalid", or similar.

thechao said 2 days ago:

I know this is too late to be useful, but it only takes 1 bit per nesting level to track the nesting depth in O(1) time, or just a 64-bit integer to track the nesting depth in O(n) time. I wonder if there is a sublinear algorithm? At best, to me, right now, it could be made to look like bitstream compression.

jrockway said 2 days ago:

I don't like it when programs use physical limits to trigger physical errors based on the input. "JSON nesting depth of 23456 levels exceeds maximum memory of 23456MB" is a very different thing from being kill -9'd by the OOM killer (after of course it gets your ssh session).

If you decide your limit in advance and bail out of parsing, that's one thing. If you just use up all the stack and cross your fingers, that's a pain.

saagarjha said 3 days ago:

I think you’re missing a link.

bspammer said 3 days ago:

The author has now added this as an aside: https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...

blablabla123 said 3 days ago:

It is pointless; also, nobody ever cares much how well a text editor can handle obscure text files (file too large, broken encoding, ...). What matters is how it handles well-formatted files; the others aren't worth opening in an editor anyway :)

mike_hock said 3 days ago:

The commenter would do well to RTFA before slinging complaints that have already been addressed in the article.

The article is cool-headed and factual, so don't get hung up on the repo name.

ddlatham said 3 days ago:

To be fair, the article was updated to address those complaints after the comment you're responding to was made.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...

gameswithgo said 3 days ago:

It is better to not set limits.

skybrian said 3 days ago:

There is always a limit. The question is whether you want to use up all available memory reading bad input, or reject it sooner. If you're worried about denial-of-service attacks, it makes sense to be explicit about how big an input you'll accept.

vesinisa said 3 days ago:

It would've been interesting to know which platform the JavaScript test was performed on (judging by the source code, likely Node.js). However, on my current machine Chrome chokes with a stack overflow above nesting level 125,778.

More amazingly, Firefox is currently happily parsing a JSON array with 50,000,000 levels of nesting at full CPU single core utilization and consuming about 12 GB of memory. This was after an array of depth 10,000,000 was successfully parsed after peaking around ~8 GB of RAM. So Firefox's parser seems to be only constrained by available memory!

Also - while this is running in the console the Firefox UI is completely usable and I can browse the web in another tab. Mozilla has come a long way!

E: The 50,000,000 nesting level array parsing eventually failed with "Error: out of memory".

anoncake said 3 days ago:

Great. Is anything else usable too?

vesinisa said 3 days ago:

Pretty much, but RAM utilization is of course closing in on 100%, so starting any new programs would fail at the OS level.

stickfigure said 3 days ago:

My takeaway from this is that you can submit a long string of brackets to REST apps built with Haskell and Ruby/Oj and crash the process and/or machine by consuming all available RAM. Whereas the smarter JSON processors near the top of the list will abort processing pathological inputs before serious harm occurs.

Wait, which JSON parsers are supposed to be "bad"?

Spivak said 3 days ago:

Just because a structure is deeply nested doesn’t mean that it consumes a lot of memory.

And just because your payload is massive and consumes a lot of memory doesn’t mean it’s deeply nested.

I doubt any of the parsers are resilient against storing large documents in memory. What makes them bad is that structures that would easily fit in memory are unparsable.

In most cases your web server imposes a limit on the size of the HTTP payload, so you wouldn’t normally be able to deliver a 128GiB JSON file anyway. It’s not like nested structures are zip-bombs.

stickfigure said 2 days ago:

My comment was mostly tongue-in-cheek but there's some truth to it. Since the receiver creates a navigable data structure with 64-bit pointers, every two bytes sent ("[]") becomes 20+ bytes in RAM (low estimate!). An order of magnitude magnification helps.

lovasoa said 3 days ago:

Well, most JSON parsers (haskell being a notable exception) use an amount of memory that is proportional to the size of the input, independently of whether they can parse deeply nested structures or not.

Saying that a parser that does not limit the nesting depth will crash your server does not make any sense. You could as well say that str_replace will crash your server.

vbezhenar said 3 days ago:

You can do exactly the same by submitting an endless string.

rurban said 3 days ago:

The problem is not DOS by memory exhaustion, the problem is stack-overflow. You can trivially control the return address of the decoder. On mostly public interfaces!

And I think I was the very first one whose JSON library knew about that, probed for the max depth (which is system dependent) and added hard and soft limits to it. It's not even mentioned in this post.

NobodyNada said 3 days ago:

> You can trivially control the return address of the decoder.

With a stack overflow? Stack overflows usually write to unmapped memory and cause a segmentation fault; how would you use that to control the return address? Or perhaps you were thinking about a stack buffer overflow instead?

https://en.wikipedia.org/wiki/Stack_overflow vs https://en.wikipedia.org/wiki/Stack_buffer_overflow

rurban said 2 days ago:

In this case it's the same. The max stack depth is known, and the overflow easily triggerable and controllable from outside. It cannot get much easier.

Wikipedia is a bit confused.

NobodyNada said 2 days ago:

It’s trivial to cause a stack overflow (as in allocating too much memory and running out of stack space), but how would you overwrite the return address? From what I can tell, the attacker can only cause the application to allocate lots of memory, not to overwrite memory.

rurban said 2 days ago:

Memory DOS will not happen with normal recursion, only a traditional stack overflow. And by adding some content after the fixed number of [ or {, this will be on the stack as an argument to the recursive function. There are so many possible exploits. See e.g. https://jon.oberheide.org/files/infiltrate12-thestackisback.... or search for "stack overflow recursion exploit"

abrodersen said 3 days ago:

"The tar archive spec does not specify an upper limit on archive size, but my kernel keeps saying 'no space left on device'. It's obviously a bad kernel."

lovasoa said 3 days ago:

Well, in your example, there is a physical limit on how much space is available. In the case of deeply nested json, we are talking about structures that perfectly fit into memory, and could be decoded in a fraction of a second if only a different algorithm were used.

LukeShu said 3 days ago:

On one hand, yes. On the other hand, the standard Ruby "json" gem can be made to overflow on just 202 bytes of valid JSON input.

jrochkind1 said 3 days ago:

I mean, it'd be a pathological 202 byte JSON, but yeah (I guess it could be a DOS attack of some kind, actually, hmm...)

Ruby is a good example, because the `oj` gem, on presumably the same system, is listed as "infinite." Obviously not truly "infinite", eventually it'll run out of RAM -- but this shows it is _not_ an issue of machine resources really.

As the OP notes, it's because if you implement using recursion, you'll run out of stack in most languages (now let's start talking about tail call optimization...), but this isn't the only implementation choice, you _can_ implement to not have this limitation, and probably should.

biggestdecision said 3 days ago:

Anywhere you accept JSON from untrusted users, you should be catching parse errors anyway.

nwellnhof said 3 days ago:

Tail call optimization wouldn't help with parsing JSON.

imtringued said 3 days ago:

Usually the solution is to create an array on the heap and manage the stack yourself.

loktarogar said 3 days ago:

Nope, it stops itself at 100 levels with a JSON::NestingError.

If you run it as JSON.parse(ARGF.read, max_nesting: false) you get 65492 instead of 101.

seppel said 3 days ago:

> the standard Ruby "json" gem can be made to overflow on just 202 bytes of valid JSON input.

It is not overflowing, but aborting with a proper error.

You can argue whether 100 is a reasonable default, but I think it is not too stupid to have a maximum depth here and bail out as soon as possible. Because what will happen if you accept arbitrary nesting? Then the next guy in the chain who actually works with the parsed json tree will also have to handle arbitrary nesting. And if you are not careful, you have some error deep in some quick-and-dirty user-written code (which might actually be exploitable, instead of just not accepting the json in the first place).

I would say you can think of the 100 as some kind of input sanitization.

mperham said 3 days ago:

    max_nesting: The maximum depth of nesting allowed in the parsed data structures.
    Disable depth checking with :max_nesting => false. It defaults to 100.
zamadatix said 3 days ago:

Referring to it in bytes purely serves to spread FUD. It's >100 levels of nesting which is a bit silly.

LukeShu said 3 days ago:

I did that intentionally, and not to spread FUD. I hear ">100 levels of nesting" and think "huge document", which as demonstrated here isn't necessarily true. So I said "202 bytes" to emphasize "not a huge document"; to emphasize that deep nesting doesn't really imply much about document size.

zamadatix said 3 days ago:

100+ levels of nesting implies nothing about the size of a JSON file, just that it's an extremely niche use case. OTOH almost all JSON files are 202+ bytes. Couple that with the title and it just leads to confusion.

leni536 said 3 days ago:

>niche use case

Like an attack.

oneeyedpigeon said 3 days ago:

Why 202 bytes? Wouldn’t 101 [ characters suffice?

LukeShu said 6 hours ago:

But then it wouldn't be valid JSON.

jandrese said 3 days ago:

The JSON::XS documentation has this to say about the $max_depth setting for the parser:

    Second, you need to avoid resource-starving attacks. That means you should
    limit the size of JSON texts you accept, or make sure then when your
    resources run out, that's just fine (e.g. by using a separate process that
    can crash safely). The size of a JSON text in octets or characters is
    usually a good indication of the size of the resources required to decode
    it into a Perl structure. While JSON::XS can check the size of the JSON
    text, it might be too late when you already have it in memory, so you
    might want to check the size before you accept the string.

    Third, JSON::XS recurses using the C stack when decoding objects and
    arrays. The C stack is a limited resource: for instance, on my amd64
    machine with 8MB of stack size I can decode around 180k nested arrays but
    only 14k nested JSON objects (due to perl itself recursing deeply on croak
    to free the temporary). If that is exceeded, the program crashes. To be
    conservative, the default nesting limit is set to 512. If your process has
    a smaller stack, you should adjust this setting accordingly with the
    "max_depth" method.
So accepting an unlimited depth structure by default is probably not a good idea. There's a flag you can set if you need to parse something like that, but the defaults are set to parse anything sensible and reject documents that may be malicious.
lovasoa said 3 days ago:

My takeaway would be "using the C stack to parse JSON is probably not a good idea", not "accepting an unlimited depth structure by default is probably not a good idea".

This limit is entirely artificial, and due only to the design of the parser.

As the documentation you quote says, you should limit the size of JSON texts you accept. If you don't, then you are at risk of a memory exhaustion attack. But this is completely unrelated to the maximum depth you accept. If your parser doesn't use the C stack to parse recursive structures, then deeply nested payloads will parse just fine, without using an abnormal amount of memory.

chrismorgan said 3 days ago:

I disagree with your conclusion. I would instead say that the combination of using the stack to parse JSON and allowing unlimited depth is the problem. In practice, almost no one ever deals with JSON structures even a hundred levels deep (I’d quite seriously guess it to be less than one in a million applications that desire it), so if you artificially cap it to protect against stack overflow you can then safely use the stack, thereby getting better performance by reducing the required allocations, while also protecting against one DoS attack vector.

lovasoa said 3 days ago:

You never "artificially cap it to protect against stack overflow": Most libraries that cap it artificially cap it to a fixed value, that may or may not protect against stack overflow depending on how large the stack already is when you call the parser.

chrismorgan said 3 days ago:

These caps we’re talking about are normally a small fraction of the total stack size, and few applications run anywhere near the total stack size. I acknowledge your point (and had it in mind when I wrote my first comment), but maintain my position that this capping is protecting against the stack overflow DoS attack: if you’re able to get a stack overflow with your capped JSON parser, you were probably already able to get a stack overflow, because it’s not adding overly much to the stack.

The problem that’s being protected against is user-controlled stack overflow—that’s what makes it a DoS attack.

jlokier said 3 days ago:

> As the documentation you quote says, you should limit the size of JSON texts you accept. [...] But this is completely unrelated to the maximum depth you accept.

I'm not sure that's right.

The documentation points out that a deep structure returned from the parser can crash the interpreter, because freeing data also uses recursion. ("due to perl itself recursing deeply on croak to free the temporary".)

Anyway, I found the 512 depth limit too much... When using coroutines in Perl, the stack is smaller and it crashes trying just to print a JSON object with depth 512. I hit the limit at about depth 150, with nothing else on the C stack, so in robust diagnostic code that uses the JSON formatter to log data structures, I limit the max depth to 20.

zenhack said 3 days ago:

I'd argue this is a pitfall that languages can and should solve. There's no reason why writing a naive recursive descent parser should cause this problem. The fix, using a separate stack data structure, is telling, since it's the same algorithm, with the same space complexity, you're just saying "my language's call stack isn't a usable implementation, so I need a different stack." Why not just fix the call stack?

In languages like rust/c/c++, this might be hard, since you can't grow the stack by moving it (you'd need to update pointers, which are too visible in those languages, and even identifying a pointer at runtime is undecidable), and segmented stacks cause the "hot-split" problem.

But most languages could just make this problem go away by growing the call stack when needed. A few of them (incl. Haskell) do. I'm following suit in a language I'm working on.

anaphor said 3 days ago:

Most purely functional languages (like Haskell) don't really have a "call stack", but the computation is carried out by graph reduction, so it can obviously detect cycles and do some optimizations there, depending on the type of recursion.

If you want the gory details https://www.microsoft.com/en-us/research/wp-content/uploads/...

zenhack said 3 days ago:

Yeah, I've read the STG paper, though its been a while. IIRC there isn't a call stack as we'd normally think of it, but there's still a stack of pending pattern matches, so you actually do hit the same issues. I vaguely recall making deep recursion "just work" in GHC was something that happened just in the past couple years.

jlokier said 3 days ago:

> In languages like rust/c/c++

In another HN thread there is much discussion about Rust using, in effect, segmented stacks to implement async-await.

Perhaps you could code a recursive descent parser in Rust using async functions, so that the Rust compiler converts it to a heap-allocated state machine for you.

bloody-crow said 3 days ago:

The repo is calling it bad, but in production I'd actually prefer a parser that throws an appropriate error based on a setting that can be configured with some reasonable defaults over something that just silently chugs along until it runs out of memory.

I can imagine it'd be a lot easier to overwhelm and DDoS a server that attempts to parse incoming JSON requests without any depth bounds too.

Spivak said 3 days ago:

Just because your data structure is deeply nested doesn’t mean it takes a lot of memory.

[[[[]]]] is still only 4 objects worth of memory on a stack.

lovasoa said 3 days ago:

Most JSON parsers use an amount of memory that is proportional to the size of the input, independently of whether they can parse deeply nested structures or not.

A parser that limits the maximum depth of the input can still be made to consume gigabytes of memory on an input of several gigabytes, and there is nothing wrong about that.

Size limits on a production server should be enforced before even starting to parse the JSON, but this has nothing to do with the issue highlighted here.

mojzu said 3 days ago:

I think the reason that the nesting level limit is 128 and the file size is 256 bytes for serde_json in this test is that the recursion_limit attribute is not configured.

https://doc.rust-lang.org/reference/attributes/limits.html

The default value of this attribute is 128 which would explain both results, I'm guessing with configuration you could significantly increase both values at the expense of compilation time.

edflsafoiewq said 3 days ago:

That's for the recursion limit on compile time stuff, like how many layers of macros it will expand.

serde does have this though: https://docs.rs/serde_json/1.0.41/serde_json/struct.Deserial...

dmit said 3 days ago:

Interestingly, I recently encountered an issue in the wild where these two recursion limits are both prominently involved.

Yew, the Wasm web framework, provides an html! macro that allows you to write templates that look like HTML within your Rust code. The macro is notorious for requiring the macro expansion `recursion_limit` to be raised even for fairly small templates. Now, if you happen to have a syntax error in a template and ask cargo to return the build errors as JSON, it will diligently report every macro expansion that led to the problem as a nested structure. Which is what the commonly used cargo-web tool does, so when it sees JSON with 128+ levels of macro expansion details describing a build error (again, not uncommon with Yew's html! macro), serde_json fails to parse it with `ErrorCode::RecursionLimitExceeded` and cargo-web panics.

https://github.com/yewstack/yew/issues/661#issuecomment-5467...

eptcyka said 3 days ago:

The recursion limit in the deserializer is a deliberate one https://github.com/serde-rs/json/pull/163

mojzu said 3 days ago:

For some reason I thought serde was using macros to perform deserialisation. Now I know that isn't the case and it's a coincidence that the default recursion limit and the serde default nesting limit are both 128.

lovasoa said 3 days ago:

Even if serde does indeed use macros, they run at compile time. The limit on macro recursion depth has an impact on what programs do compile, but it has no impact on the runtime behavior of these programs.

et2o said 3 days ago:

Is this relevant/are there situations where people nest JSON even 100x/1000x? I have not come across something like that.

bargle0 said 3 days ago:

The question isn't necessarily what might be found in nature, but what might be intentionally constructed in order to attack something. Without knowing more, the segfault in the C++ library is somewhat alarming.

kibwen said 3 days ago:

On the topic of malicious input, the serde_json example seems to be deliberately limiting recursion depth in order to head off DoS attacks:

"serde_json has a recursion limit to protect from malicious clients sending a deeply recursive structure and DOS-ing a server. However, it does have an optional feature unbounded_depth that disables this protection. It still uses the stack though, so it will still eventually blow the stack instead of using all available memory"

https://github.com/lovasoa/bad_json_parsers/issues/7

This begins to suggest that it may be downright desirable for servers to toss out deeply-nested JSON, regardless of the JSON specification.

shadowgovt said 3 days ago:

Google made a similar design decision regarding protobuffers (https://developers.google.com/protocol-buffers/docs/referenc...).

vbezhenar said 3 days ago:

Why not implement parsing without recursion? It doesn't look like a particularly hard task for me. Why restrict users with artificial limitations?

zamadatix said 3 days ago:

Why make something more complicated to add a capability users aren't asking for?

vbezhenar said 2 days ago:

Why do you decide for all users? I'm a user and I'd like to have a reliable parser able to parse anything given sufficient memory. That's the point of library functions: provide flexibility for any use case.

tracker1 said 3 days ago:

If you have a service that takes in JSON from unknown sources then it's a pretty easy DDoS vector for abusing memory limits. Send a few hundred k of nested {"a":[{"a":[...]}]} that's easy to generate from a couple different CnC clients and you can kill a server pretty readily.

Similar for password login systems that don't restrict bad attempts by IP/Username. If you know a user exists you can send many requests for username:passphrase, and the hashing time has a cost... good passphrase hashing has a pretty high cost usually 1/10 of a second on a single core, or more. If you aren't tracking attempts, then you can easily topple a server with a few hundred simultaneous requests usually.

Anything that is an unguarded memory or compute hog can be used as a pretty easy DDoS vector unless otherwise mitigated somehow. I usually restrict to 5-10k regardless, and chunk file uploads into separate requests. Restricting recursion for JSON predictably is a pretty great option as well.

loeg said 3 days ago:

Whether the stack is the implicit C stack or an explicit heap stack does not change anything about the argument you've made. I.e., your comment is not at all responsive to Grandparent's:

> Why not implement parsing without recursion?

Tomte said 3 days ago:

Because even if the stack doesn't run out, it still processes for a long time. Non-recursion doesn't help you there, you need to look at every byte, after all.

shadowgovt said 3 days ago:

And it's worth noting that many JSON parsers assume the problem domain is "shipping data over a network," where the input may be malicious. If the non-recursive parser has an N^2 piece of algorithm embedded in it, its ability to consume large data structures becomes a DOS liability.

marcosdumay said 3 days ago:

You'll need a stack anyway, because the structure is logically a tree. Better to use the call stack, that is already there and heavily optimized.

Spivak said 3 days ago:

Naive person here. Isn’t the call stack a lot heavier than pushing data structures on a std::stack?

MaulingMonkey said 3 days ago:

A recursive parsing call stack has the overhead of pushing/popping return addresses and maybe a couple other registers. Maybe a bit more if there are security protections like stack cookies that didn't get optimized away, or if shadow stacks are enabled. It's not free, but is pretty cheap. A non-recursive use of std::stack dodges some of that, but instead has the overhead of dynamic allocation, which I'm generally going to consider more expensive - but does have the advantage of being more of a once-off cost - so perhaps it'd be cheaper for long but shallow JSON objects? The cost of the call stack vs std::stack is probably dwarfed by the cost of allocating arrays/objects/strings and decoding in either case.

If you want to over-optimize for that specific case anyways, you can take a hybrid approach with smaller stacks kept on the stack in a single linear root blob, with larger stacks falling back onto the heap, using a non-recursive style.

However, if you're capping JSON depths as an anti-DoS measure anyways (your parser isn't the only thing possibly vulnerable to blowing the stack via recursion - anything dealing with the parsed result is also in danger if it can be recursive), the heap fallback may be entirely dead code anyways... so why bother coding it?

__david__ said 3 days ago:

It depends on how big the stack frame of the recursive function is vs how much data needs to be saved for each recursion level.

But… most likely yes.

zeroimpl said 3 days ago:

You'll need a data structure to store the results to return to the caller. This data structure by definition cannot be the call stack. So using a call stack is technically redundant.

lovasoa said 3 days ago:

Exactly. Another example is python's standard library json parser, which is documented to throw a JSONDecodeError on invalid input. But if the invalid json contains deeply nested structures, it raises a RecursionError, and this is not documented anywhere. It's easy to imagine how code that only catches JSONDecodeError while decoding untrusted inputs could be exploited by an attacker.
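
A minimal demonstration of that behaviour with the standard library json module (the exact depth at which it triggers depends on the interpreter's recursion limit and on which scanner, C or pure Python, is in use):

    import json

    evil = '[' * 100_000 + ']' * 100_000

    try:
        json.loads(evil)
    except json.JSONDecodeError:
        print("handled as documented")      # not reached for this input
    except RecursionError:
        print("undocumented RecursionError escapes the parser")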

Spivak said 3 days ago:

Ugh Python exceptions are always a source of anxiety for me. Without combing through the source and locking yourself to specific versions it’s nigh impossible to be sure you’ve caught everything without a “catch everything” block.

anoncake said 3 days ago:

Documented or not, you shouldn't rely on the lib only throwing JSONDecodeError in security-critical code.

Spivak said 3 days ago:

But what can you do instead? It’s a really hard problem to validate generic JSON without accidentally parsing it.

zeroimpl said 3 days ago:

Validation better be done by the parser. Otherwise, subtle differences between the validator and the parser can lead to security flaws. Your solution is to make sure you use a good parser.

thegeomaster said 3 days ago:

When the OS-provided stack is overrun, the program accesses a "guard page" (or several) which is explicitly non-readable by the process, which in turn triggers a segfault. It's not an exploitable issue, but in the interest of robustness, the library probably should've failed more gracefully instead of blowing through the stack.

saagarjha said 3 days ago:

Is it any more alarming than the other results?

fireflies_ said 3 days ago:

You probably wouldn't see it in "natural" use but it's a potential attack vector if you ever deserialize arbitrary JSON (e.g. JSON bodies in API requests). Of course you should have general limits that would catch these things anyway...

skykooler said 3 days ago:

General limits for what? If you can segfault a program with 28kb of JSON, that's far below the maximum many APIs will return, so you'd need to do some sort of pre-parsing of the JSON to determine the nesting level before parsing it...
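
Such a pre-scan amounts to a single linear pass with a counter. A hypothetical sketch, which skips over string literals so brackets inside strings don't count:

    def max_depth(text: str) -> int:
        """Return the maximum bracket nesting depth of a raw JSON document."""
        depth = best = 0
        in_string = escaped = False
        for ch in text:
            if in_string:
                if escaped:
                    escaped = False
                elif ch == '\\':
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in '[{':
                depth += 1
                best = max(best, depth)
            elif ch in ']}':
                depth -= 1
        return best

    # max_depth('{"a": [[1, 2, "]]"]]}') == 3  -- brackets inside strings are ignored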

vbezhenar said 3 days ago:

For example, Java won't segfault. It'll throw a StackOverflowError which will be caught and served with some standard error like HTTP 500. It shouldn't affect other queries.

tracker1 said 3 days ago:

Well, recursion tests would be more for input... I usually limit all POST data to 5-10k and don't like to go over that. I also tend to use chunked uploads, that are a little slower, but serve to limit the size and time of any inbound work.

shadowgovt said 3 days ago:

Procedurally-generated JSON can nest data as deep as it needs to to describe arbitrarily-complex data, but I'd usually consider it an indicator that a problem thought to be simple has overflowed the bounds of its initial assumptions when it happens.

vbezhenar said 3 days ago:

If you're representing tree data structure in a natural way, you can easily run into this limitation on a real world data.

adrianN said 3 days ago:

I have seen JSON output with thousands of levels of nesting. It was generated by a sort of compiler tool that compiled to JSON.

jchw said 3 days ago:

They’ve changed the title[1] to remove the phrasing “bad,” and I think the title should be updated here as well, even if the repo name still reflects the original title. It’s not reasonable to use this single criterion as a measure of “badness,” as there are many different variables. I would place security/memory safety 1000x higher than recursion limits. I have never hit JSON recursion limits in production.

[1]: https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...

k__ said 3 days ago:

I know, Halloween is over, but currently I'm working with an API where the returned JSON is created by manual string concatenation. Sometimes commas and quotation marks go missing.

kardos said 3 days ago:

Does the API specify under which conditions you can expect missing characters? There should be an undefined behaviour angle here somewhere

k__ said 3 days ago:

I wish.

The dev simply writes new string concatenation code every time a new endpoint is needed, so I can't even hope for some bugs being gone for good after he tested his implementation with a few endpoints.

cloudkj said 3 days ago:

Coincidentally, I just wrote a simple JSON parser the other day as a toy exercise. A simplified snippet of parsing code (handling only arrays) relevant to the discussion here would be something like:

  def parse(text):
      stack = []
      index = 0
      result = None
      while index < len(text):
          char = text[index]
          val = None
          if char == '[':
              stack.append([])
          elif char == ']':
              val = stack.pop()
          index += 1
          if val is not None:
              if stack:
                  stack[-1].append(val)
              else:
                  result = val
      return result
Using the test utilities in the repo indicates that the parsing logic can handle arbitrarily nested arrays (e.g. up to the 5,000,000 max in the test script), bound by the limits of the heap.

It seems like the main criticism here is against recursive implementations. Or am I missing something?

sansnomme said 3 days ago:

A better question to ask is: Is it a success of CS education that developers know when to make tradeoffs in software engineering and go for the most easily understandable code (the recursive implementation used in most pseudocode) since deeply recursive cases are rare, or is it a failure that software engineers cannot figure out how to convert a recursive algorithm into an iterative implementation for greater memory efficiency?

kadoban said 3 days ago:

It doesn't necessarily mean anything that deeply nested cases are rare. If you're making a general library, someone is going to throw user input at it, and some users will try to break your system. If it can avoid a crash in that case, it usually should.

twir said 3 days ago:

One of JSON's selling points is being human-readable. I would struggle to read any JSON document with a nesting level greater than 6 or so. Why would I ever want a nesting level > 100?

mhd said 3 days ago:

Human-readable as in "not binary". Other than that, it doesn't offer a lot that helps in that regard.

hombre_fatal said 3 days ago:

This doesn't really make sense to me. Just because JSON is human-readable doesn't mean that humans are meant to read any possible serialized data nor that it should always be an ergonomic read. Not everything you serialize is just a config file or hand-edited data. It may be arbitrarily dense with arbitrary data.

To me, this is kinda like seeing a JSON array of user IDs coming over the network and going "Huh? Why not usernames? I thought JSON was supposed to be human-readable!"

twir said 3 days ago:

I see your point; although if you look at the specs for a lot of easy-to-parse formats for computers, a stated design goal is also easy-for-humans (e.g. Markdown, YAML).

Large, complex object hierarchies with lots of nesting might make more sense represented in binary (e.g. Avro).

I realize I'm making a little bit of a McLuhan-esque argument in a largely computer science-oriented context, but I hope you can see what I'm getting at.

ken said 3 days ago:

Another example: source code is human readable (by definition), but that doesn't mean I can read any possible program. I struggle to read many programs with deep nesting, but that doesn't mean I'd find it acceptable for my compiler to crash in such situations.

anoncake said 3 days ago:

Maybe because you use a JSON library all the time anyway, so using it even when it's not ideal but good enough is the path of least resistance. Or you're a cargo cultist.

tracker1 said 3 days ago:

When dealing with interconnected systems it's a pretty decent format. That it is human readable also allows for relatively easy interaction in terms of the value without building one-off tooling.

When I'm dealing with non-sql data, one thing I've also done is fire the record into a service that compresses to .gz and a worker sends it to S3_PATH/data-type/recordid.json.gz as a recovery option. Not that it's the first or best recovery option, but when your core data is only a few hundred thousand records, potentially faster than a full DB restore. Not to mention, services being written in one language, and a recovery script in a scripting language.

It just depends on how you are doing things. Now, for configurations, I tend to prefer TOML or YAML.

philbo said 3 days ago:

Shameless self-promotion: I wrote a JSON parser for node.js that specifically solves this problem of parsing deeply-nested and/or huge JSON payloads.

https://www.npmjs.com/package/bfj

It's entirely asynchronous, yielding frequently to avoid monopolising the event loop. And it uses a fixed-length buffer so it never exhausts available memory. The tradeoff is that it's slow. But afaik it parses JSON payloads of any size and any depth.

jackcviers3 said 3 days ago:

Recursive parsers need not use the stack for recursion, they can use the heap instead through trampolining, and not have an issue until memory is fully consumed. That would impact performance somewhat - so users should also be allowed to choose when they need the trampolined version. They should still limit recursion depth and allow the user to set a value higher than the default.

I can't help but think that the lazily evaluated Haskell version score of infinity is probably somewhat misleading, as once you access something within a large enough tree you'll probably run out of memory pretty quickly.
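
For concreteness, a rough sketch of the trampolining idea in Python, using a generator-based trampoline (nothing here comes from a real parser library; the depth function merely stands in for a recursive parsing step):

    def trampoline(gen):
        """Drive 'recursive' generator calls with an explicit heap-allocated stack."""
        stack, result = [gen], None
        while stack:
            try:
                child = stack[-1].send(result)  # resume the frame on top of the stack
                result = None
                stack.append(child)             # a yielded generator is a nested 'call'
            except StopIteration as stop:
                result = stop.value             # 'return' the value to the caller frame
                stack.pop()
        return result

    def depth(value):
        """Recursive-looking nesting-depth computation; recursion goes through yield."""
        best = 0
        if isinstance(value, list):
            for item in value:
                best = max(best, 1 + (yield depth(item)))
        return best

    nested = []
    for _ in range(100_000):
        nested = [nested]
    print(trampoline(depth(nested)))  # 100000, with no call-stack growth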

saagarjha said 3 days ago:

It’d be nice if it was possible to annotate recursive functions as using the heap instead of the stack for recursion. I’ve seen too many implementations of people writing a poor man’s stack frame to store on the heap and it just totally destroys any sort of elegance the original code had.

SamBam said 3 days ago:

Hence "available RAM is the only limit" for Haskell.

twic said 3 days ago:

I thought i'd try some other Java parsers, but the Maven build doesn't work with current Maven, and it's not clear what you'd feed to test_parser.sh anyway, as there's no driver script for the Java case.

One of the things i'm passionate about is always setting up a project in such a way that someone new to it can download, build, and use it with an absolute minimum of fuss. That means making the most of the build tools, and scripting everything you can't do through the build tool. I don't always meet this standard myself, because i am a weak and wretched human being, but it's great when it's done.

Gradle is good here, because the wrapper means you always get a specific Gradle version, although it can't control the JDK version, which is increasingly painful in the post-8 age. Cargo is pretty good, because the user just needs to install rustup, then they get a precisely controlled Rust and Cargo version via the rust-toolchain file. I'd love it if there was a rustup-wrapper which i could check in, which would obviate the need for a global rustup installation. Ruby, Python, and Node do alright, but you have to ask the user to install the version manager of your choice. C is absolutely miserable.

Since this project has all of the above and more, making it as self-sufficient as i would like would be quite an effort!

lovasoa said 3 days ago:

I agree the project could use some better documentation. You can run the test with any program by running

    ./test_parser.sh my_program [arguments]
Where my_program is a command that reads json on its standard input, and returns a non-zero return code if the parsing failed.

megous said 3 days ago:

How are C build systems miserable? If you're building for a general Linux distro, you'll not have much control over the dependencies, but that's not a problem with C, but with how programs are distributed in the GNU/Linux distribution model.

twic said 3 days ago:

AFAIK, there is no tool that lets you specify a compiler version in your project, then automatically build with that compiler. The equivalent of rustup, rvm, nvm, etc.

There is also no standard package manager.

It is very hard to persuade GCC to use a specific set of libraries that you have obtained using a package manager, or vendored or whatever, rather than getting distracted by random libraries it happens to find in /lib etc. The root of the pain is that you really want to use the host system's platform libraries, the ones which implement the POSIX API, but once the compiler can see those, it tends to go off looking for other things in the same place. In my experience, at least!

These deficiencies are tied up with the historical interrelationship of C and the operating system, but that doesn't mean that they aren't real.

cellularmitosis said 3 days ago:

Related question: why is the stack limit so tiny in modern operating systems? Why do we allow programs free rein of the heap but not the stack?

Edit: I hadn’t considered that all of a processes’s threads share the same address space. In a 32-bit address space, a 4MB stack limits you to 1024 threads (with nothing left over for heap).

However, perhaps the question still stands: why use a small stack size on a 64-bit system?

jlokier said 3 days ago:

One of the reasons is that if you momentarily use a very large amount of stack to calculate something (say gigabytes), and then return, the large part of the stack which is no longer used stays allocated, full of unused data taking up RAM for the remaining life of the process, which might be a long time.

If a heap-allocating algorithm does the same thing, it's likely to free the memory when it's finished with it.

skrebbel said 3 days ago:

the sort order is incorrect. he says they sort from worst to best, but the worst one is the one that can segfault and the perfectly good ones are all the other ones that, if i read this correctly, just throw an exception.

ape4 said 3 days ago:

I expecting a report on parsers that crash for challenging syntax. eg: { hello: 'world\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' }

lovasoa said 3 days ago:

Independently of the backslashes, this is not valid JSON: keys have to be quoted, and JSON uses double quotes, not single quotes...

cerberusss said 3 days ago:

Hey look, a human-based JSON parser. I wonder what their maximum nesting level is...

dmit said 3 days ago:

Over-under on when the first "ML-based approach to parsing JSON" story hits HN: 1.5 years from now.

mynegation said 3 days ago:

Just a couple of weeks ago I had to debug a production problem where the Jackson parser failed on parsing prod-specific configuration (in JSON). Turns out the problem was a trailing comma in the array of values. The IntelliJ IDEA JSON syntax highlighter did not flag it, due to some obscure setting.

I also had problems with some JSON parsers allowing NaN and Infinity as double values and others failing on them.

rpmisms said 3 days ago:

For a moment, I was worried that someone had found the JSON parser I wrote in Python years ago. My career would be over.

ascotan said 3 days ago:

Regardless of the quality of this post, the author does highlight a security issue here. The RFC states that it's up to the parser to define the limits and different parsers may behave differently in the face of malicious inputs. That's why you should probably test them...

MadWombat said 3 days ago:

Nice effort, but it seems to miss things. RapidJSON for C++ is a very popular library. And Go has both the standard library JSON parser and a couple of other popular implementations. I would gladly add implementations for Go, but it is unclear how to do that.

lovasoa said 10 hours ago:

I added RapidJSON to the tests. It segfaults at 87,289 nested arrays.

moderation said 3 days ago:

Plans are to remove RapidJSON from Envoy proxy - https://github.com/envoyproxy/envoy/issues/4705

shadowgovt said 3 days ago:

Since the document is on GitHub, it's perhaps a safe assumption the author may be amenable to pull requests (and if they're not, you can always fork their repo and add additional information; they MIT licensed the whole thing).

lovasoa said 3 days ago:

Indeed, I'd love to get pull requests. It's mentioned in the "Remarks" at the end.

kardos said 3 days ago:

Right at the bottom:

> If you want to add a new language or a new library, feel free to open a pull request.

bjoli said 3 days ago:

Guile scheme expands the call stack until memory is exhausted and then raises an exception. That is pretty damn comfy, because it means that building lists with recursion is just as fast as using tail recursion and a reverse! at the end.

bchociej said 3 days ago:

Interesting results e.g. on Ruby. I did give the javascript test file a whirl and was able to run it just fine with a nesting level of 1,000,000, so I wonder if something else is happening there.

kbumsik said 3 days ago:

JSON nested more than 100??

Why call a JSON parser "bad" when the JSON object has already gone far beyond "bad"?

PLEASE don't call anything bad just because it doesn't fit in your own standard.

justinpombrio said 3 days ago:

Much more detailed comparison of JSON parsers: http://seriot.ch/parsing_json.php

cobbzilla said 3 days ago:

For the Java test, I’d love to see how Jackson compares to Gson.

lovasoa said 3 days ago:

Someone just contributed a pull request with results for Jackson in Java.

cobbzilla said 2 days ago:

And it’s already been merged (nice turnaround time!). It narrowly beats Gson.

tankfeeder said 3 days ago:

PicoLisp successfully parsed 20M nested brackets, the maximum I can do on an 8GB laptop.

Joyfield said 3 days ago:

"Ubuntu Linux 4.10.0-35-generic SMP x86_64 with 8Gb RAM" Gb != GB.

msoad said 3 days ago:

If you allow a JSON parser to use all of memory to go deep, you know what attackers will do... As other commenters mentioned, the JSON spec allows setting a limit on the depth.

zenhack said 3 days ago:

It isn't a matter of memory usage; the memory usage is still linear in the size of the input regardless of the depth. If an attacker can get you to oom by feeding you a deeply nested data structure, they can do the same with a shallow one just as easily:

"[0, 0, 0, 0, 0, <...gigabytes of zeros>, 0 0, 0, 0]"

So you need to limit the overall input size regardless of whether you already have a depth limit in place.
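
In other words, the size cap belongs in the application, in front of whatever parser is used. Something along these lines, with an illustrative limit and function name:

    import json

    MAX_BODY_BYTES = 1 << 20  # 1 MiB, an arbitrary application-level limit

    def parse_body(raw: bytes):
        """Reject oversized payloads before any parse tree gets allocated."""
        if len(raw) > MAX_BODY_BYTES:
            raise ValueError("payload too large")
        return json.loads(raw)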

lovasoa said 3 days ago:

The memory used by parsers that do not limit the maximum depth is still bounded by the size of the input.

An application that does not limit the maximum size of its input in bytes is vulnerable, one that does is not. This is completely independent of the JSON parser.

jcmontx said 3 days ago:

Has anyone run this for Newtonsoft.Json?

Roonerelli said 3 days ago:

there's a .net parser in the source code but not in the results

lovasoa said 3 days ago:

Would you be interested in making a PR adding the .NET project to travis and the results to the README ?

vbezhenar said 3 days ago:

Java Jackson is bad too.