Intuiting Pratt Parsing (louis.co.nz)

by signa11 48 comments 169 points

[−] logdahl 45d ago
Love Pratt parsing! Not a compiler guy, but I've spent way too many hours reflecting on parsing. I remember trying to get through the Dragon book so many times and reading all about formal grammars etc., until I landed on recursive descent parsing + Pratt for expressions. Super simple technique, and for me it is sufficient. I'm sure it doesn't cover all cases, but for toy languages it feels like we can usually do everything with 2-token lookahead.

Not to step on anyone's toes, I just don't feel that formal grammar theory is that important in practice. :^)
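
For readers who have not seen the combination, here is a minimal Python sketch of Pratt-style expression parsing. The binding-power table, the tuple-based tree, and the names `tokenize`, `Parser`, and `parse` are assumptions made for illustration; none of them come from the article or this thread.

```python
import re

# Hypothetical binding powers: (left, right). A larger number binds tighter.
# right > left gives left associativity; right < left gives right associativity.
INFIX_BP = {"+": (1, 2), "-": (1, 2), "*": (3, 4), "/": (3, 4), "^": (6, 5)}
NEG_BP = 5  # binding power of the operand of prefix minus


def tokenize(src):
    # Simplification: characters that are not digits or known symbols are dropped.
    return re.findall(r"\d+|[-+*/^()]", src)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def parse_expr(self, min_bp=0):
        # Prefix position: a number, a parenthesised expression, or unary minus.
        tok = self.next()
        if tok.isdecimal():
            left = int(tok)
        elif tok == "(":
            left = self.parse_expr(0)
            if self.next() != ")":
                raise SyntaxError("expected ')'")
        elif tok == "-":
            left = ("neg", self.parse_expr(NEG_BP))
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

        # Infix loop: keep extending `left` while the next operator binds
        # at least as tightly as the caller requires.
        while True:
            op = self.peek()
            if op not in INFIX_BP:
                break
            left_bp, right_bp = INFIX_BP[op]
            if left_bp < min_bp:
                break
            self.next()
            right = self.parse_expr(right_bp)
            left = (op, left, right)
        return left


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.parse_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With these assumed binding powers, `parse("1 + 2 * 3")` nests the multiplication under the addition, and `parse("2 ^ 3 ^ 2")` groups to the right.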

[−] eru 44d ago
The Dragon book is not very good, to be honest.

It was probably decent when all you had was something like Pascal and you wanted to write a C compiler.

Parsing and compiling and interpreting etc are all much more at home in functional languages. Much easier to understand there. And once you do, then you can translate back into imperative.

For parsing: by default you should be using parser combinators.

[−] pklausler 44d ago
Is there a production compiler out there that doesn't use recursive descent, preferably constructed from combinators? Table-driven parsers seem now to be a "tell" of an old compiler or a hobby project.
[−] ogogmad 44d ago
Some people appreciate that an LR/LALR parser generator can prove non-ambiguity and linear time parse-ability of a grammar. A couple of examples are the creator of the Oil shell, and one of the guys responsible for Rust.

It does make me wonder though about why grammars have to be so complicated that such high-powered tools are needed. Isn't the gist of LR/LALR that the states of an automaton that can parse CFGs can be serialised to strings, and the set of those strings forms a regular language? Once you have that, many desirable "infinitary" properties of a parsing automaton can be automatically checked in finite time. LR and LALR fall out of this, in some way.

[−] pklausler 44d ago
Production compilers must have robust error recovery and great error messages, and those are pretty straightforward in recursive descent, even if ad hoc.
[−] eru 44d ago
Oh, I was talking much more about how you can first learn how to write a compiler. I wasn't talking about how you write a production industry-strength compiler.

Btw, I mentioned parser combinators: those are basically just a front-end. Similar to regular expressions. The implementation can be all kinds of things, eg could be recursive descent or a table or backtracking or whatever. (Even finite automata, if your combinators are suitably restricted.)

[−] pklausler 44d ago
I used a small custom parser combinator library to parse Fortran from raw characters (since tokenization is so context-dependent), and it's worked well.
[−] dbcurtis 44d ago
The thing about LR parsers is that since it is parsing bottom-up, you have no idea what larger syntactic structure is being built, so error recovery is ugly, and giving the user a sensible error message is a fool’s errand.

In the end, all the hard work in a compiler is in the back-end optimization phases. Put your mental energy there.

[−] pklausler 43d ago
I have worked on compilers (mostly) for high-performance computing for over 40 years, writing every part of a production compiler twice or more. Optimization and code generation and register allocation/scheduling are definitely the most fun -- but the hardest work is in parsing and semantics, where "hardest" means it takes the most work to get things right for the language and to deal with user errors in the most graceful and informative manner. This is especially true for badly specified legacy languages like Fortran.
[−] joe_the_user 44d ago
The Dragon book wasn't good for me either, but I'd disagree about using parser combinators. The problem I see with the Dragon book is that it starts using concepts (phases of compilation) before it introduces and motivates them in the abstract. I can see how people who already know these concepts can look at the Dragon book and say "oh, that's a good treatment of this", so perhaps it's a good reference, but it's problematic for a class and terrible to pick up and try to read standalone (which I did back in Berkeley in the 80s).

As far as I can tell, parser combinators are just one way that promises to let you "write a compiler without understanding abstract languages", but all these methods actually wind up being libraries that are far more complicated than gp's "recursive descent + Pratt parsing", which is easy once you understand the idea of an abstract language.

[−] eru 44d ago
I wrote a bunch of parser combinator libraries myself, and used plenty more.

They are basically the same idea as regular expressions, but more flexible: you have a bunch of combining operations for your regular languages to build bigger regular languages. But that doesn't tell you how it's implemented in the backend. Could be recursive descent, could be automata, could be backtracking, could be anything.

If you want to write your first compiler, I'd even go so far as to suggest to use something with a Lisp syntax as your source language, explicitly so you can minimise the parsing aspect.

Parsing is a lot of fun by itself, but it doesn't actually have much to do with the core of what makes compilers interesting and challenging. It's almost an independent pursuit, and very useful outside of writing compilers, too.

> As far as I can tell, parser combinators are just one way that promises to let you "write a compiler without understanding abstract languages", but all these methods actually wind up being libraries that are far more complicated than gp's "recursive descent + Pratt parsing", which is easy once you understand the idea of an abstract language.

Where do you suspect the complexity here? You can write a toy library for parser combinators that's really simple, if your implementation language is at least as capable as Rust or even Python (with Haskell and OCaml being probably the easiest). If you are using an off-the-shelf, industrial-strength, production-grade parser combinator library: of course, that's complicated under the hood. That's the price the authors willingly pay for great error handling, performance, etc.

For most people writing their first compilers, they would be better off starting the project from the point where they already have some tree or DAG representation. Plenty of (more!) interesting challenges left. Going from a stream of bits to the parsed syntax tree is something they can learn about later (or not at all), without missing much.
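
To give a sense of how small such a toy library can be, here is a sketch in Python. Everything in it is an assumption made for illustration: the convention that a parser is a function `(text, pos) -> (value, new_pos)` or `None`, the combinator names `token`, `seq`, `alt`, `many`, and `fmap`, and the tiny Lisp-like grammar at the end. This particular implementation happens to use backtracking recursive descent; as noted above, the same front-end could sit on other back-ends.

```python
import re


def token(pattern, convert=lambda s: s):
    """Match a regex at the current position, skipping leading whitespace."""
    compiled = re.compile(r"\s*(" + pattern + ")")

    def parser(text, pos):
        m = compiled.match(text, pos)
        return (convert(m.group(1)), m.end()) if m else None
    return parser


def seq(*parsers):
    """Run parsers one after another; succeed with a list of their values."""
    def parser(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parser


def alt(*parsers):
    """Try each parser in order; the first success wins."""
    def parser(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parser


def many(p):
    """Apply p zero or more times, stopping if it fails or consumes nothing."""
    def parser(text, pos):
        values = []
        while True:
            result = p(text, pos)
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parser


def fmap(p, f):
    """Transform the value produced by p with f."""
    def parser(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        value, pos = result
        return f(value), pos
    return parser


# A tiny Lisp-like grammar built from the combinators above.
def sexpr(text, pos):
    # Forward reference so the grammar can be recursive.
    return expr(text, pos)


number = token(r"\d+", int)
symbol = token(r"[A-Za-z+*/-]+")
lst = fmap(seq(token(r"\("), many(sexpr), token(r"\)")), lambda parts: parts[1])
expr = alt(number, symbol, lst)
```

Calling `sexpr("(+ 1 (* 2 3))", 0)` returns the nested list together with the position where parsing stopped, and returns `None` for input with a missing closing parenthesis. There is no error reporting at all, which is exactly the part that production libraries spend their complexity on.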

[−] dbcurtis 44d ago
I was just going into the second quarter of compiler design when the Dragon book came out. My copy was still literally “hot off the press”, still warm from the ink baking ovens. It was worlds better than anything else available at the time.
[−] eru 44d ago
Oh, I don't doubt that in the bad old days the Dragon book was a step forward. It's just pretty bad compared to what you can get today.
[−] joe_the_user 44d ago
> Not to step on anyone's toes, I just don't feel that formal grammar theory is that important in practice. :^)

Well, it depends how formal you're talking about. I have to say that the standard you mention, recursive descent parsing + Pratt for expressions, actually requires you to understand what a formal language is: that it's a "thing" that can't (or shouldn't) be an object or a data structure but exists abstractly before any objects created by the program.

Moreover, the standard way of producing a recursive descent parser is to begin with your language in Chomsky normal form or some human-understandable format, then convert it to Greibach normal form; that specification converts readily to your series of recursive functions. So all of these language transforms are useful to know (though you can skip steps if you have a good intuition for your language).

[−] ogogmad 44d ago
Quick other one: To parse infix expressions, every time you see "x·y | (z | w)", find the operator of least binding power: In my example, I've given "|" less binding power than "·". Anyway, this visually breaks the expression into two halves: "x·y" and "(z | w)". Recursively parse those two subexpressions. Essentially, that's it.

The symbols "·" and "|" don't mean anything - I've chosen them to be visually intuitive: The "|" is supposed to look like a physical divider. Also, bracketed expressions "(...)" or "{...}" should be parsed first.

Wikipedia mentions that a variant of this got used in FORTRAN I. You could also speed up my naive O(n^2) approach by using Cartesian trees, which you can build using something suspiciously resembling precedence climbing.
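
A minimal Python sketch of the naive splitting approach described above, assuming single-character tokens, the two illustrative operators "·" and "|", and left associativity. The table `BINDING_POWER` and the function name `parse` are hypothetical.

```python
# Hypothetical binding powers: "|" binds less tightly than "·".
BINDING_POWER = {"|": 1, "·": 2}


def parse(tokens):
    """Parse a token list into nested tuples by splitting at the weakest operator."""
    if not tokens:
        raise SyntaxError("empty expression")

    depth = 0
    split_at = None
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                raise SyntaxError("unmatched ')'")
        elif depth == 0 and tok in BINDING_POWER:
            # "<=" keeps the rightmost weakest operator, which makes
            # chains like a|b|c group to the left.
            if split_at is None or BINDING_POWER[tok] <= BINDING_POWER[tokens[split_at]]:
                split_at = i
    if depth != 0:
        raise SyntaxError("unmatched '('")

    if split_at is not None:
        # The weakest operator divides the expression into two halves.
        left = parse(tokens[:split_at])
        right = parse(tokens[split_at + 1:])
        return (tokens[split_at], left, right)

    # No operator outside brackets: either a bracketed group or a single atom.
    if tokens[0] == "(" and tokens[-1] == ")":
        return parse(tokens[1:-1])
    if len(tokens) == 1:
        return tokens[0]
    raise SyntaxError(f"cannot parse {tokens!r}")
```

For example, `parse(list("x·y|(z|w)"))` splits at the outer "|" first and then recurses into each half. Every call rescans its whole slice, which is where the quadratic worst case comes from.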

[−] duped 44d ago
An even easier approach is to give all infix operators the same precedence and force the programmer to group subexpressions.
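
One reading of this, sketched in Python: every infix operator has the same precedence, expressions fold strictly left to right, and parentheses are the only way to group. The function name `parse_flat` and the operator set are assumptions for illustration.

```python
import re

OPERATORS = {"+", "-", "*", "/"}  # all treated with identical precedence


def parse_flat(src):
    tokens = re.findall(r"\d+|\S", src)
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        if tok.isdecimal():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def expression():
        nonlocal pos
        left = operand()
        # No precedence table: each operator simply extends the tree to the left.
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            op = tokens[pos]
            pos += 1
            left = (op, left, operand())
        return left

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

Under this scheme `1 + 2 * 3` groups as `(1 + 2) * 3`, so the programmer has to write `1 + (2 * 3)` to get the conventional meaning.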
[−] gignico 45d ago
Until you need to do more than all-or-nothing parsing :) see tree-sitter for example, or any other efficient LSP implementation of incremental parsing.
[−] marssaxman 44d ago
I am a compiler guy, and I completely agree. Parsing is not that hard and not that important. Recursive descent + Pratt expressions is almost always the practical choice.
[−] randomNumber7 44d ago
It's not for toy languages. Most big compilers use recursive descent parsing.
[−] signa11 44d ago

> Not to step on anyone's toes, I just don't feel that formal grammar theory is that important in practice. :^)

exactly this ! a thousand times this !

[−] deterministic 44d ago
Professional compiler writer here. All you really need to use is a recursive descent parser. Very easy to understand. Very easy to implement. While also being very powerful.
[−] randomNumber7 44d ago
I can recommend that anyone read Pratt's original paper. It's written in a very cool and badass style.

https://dl.acm.org/doi/epdf/10.1145/512927.512931

[−] IshKebab 44d ago
Also if you're looking into this area you'll find there is another algorithm called "Precedence climbing", which is really the same thing with some insignificant differences in how precedence is encoded.

There's also the "shunting yard" algorithm, which is basically the iterative version of these algorithms (instead of recursive). It is usually presented with insufficient error checking, so it allows invalid input, but there's actually no reason you have to do it like that.
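
As a sketch of that last point, here is an iterative shunting-yard variant in Python that tracks whether an operand or an operator is expected next and rejects input that violates it. The precedence table, the tuple-based tree, and the names `shunting_yard` and `apply_top` are assumptions for illustration; all operators are treated as left-associative.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}  # hypothetical table


def shunting_yard(src):
    tokens = re.findall(r"\d+|\S", src)
    operands, operators = [], []
    expect_operand = True  # the extra state that catches invalid input

    def apply_top():
        # Pop one operator and its two operands, push the combined node.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for tok in tokens:
        if tok.isdecimal():
            if not expect_operand:
                raise SyntaxError(f"unexpected operand {tok!r}")
            operands.append(int(tok))
            expect_operand = False
        elif tok == "(":
            if not expect_operand:
                raise SyntaxError("unexpected '('")
            operators.append(tok)
        elif tok == ")":
            if expect_operand:
                raise SyntaxError("unexpected ')'")
            while operators and operators[-1] != "(":
                apply_top()
            if not operators:
                raise SyntaxError("unmatched ')'")
            operators.pop()  # discard the matching "("
        elif tok in PRECEDENCE:
            if expect_operand:
                raise SyntaxError(f"unexpected operator {tok!r}")
            # Reduce everything on the stack that binds at least as tightly.
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[tok]):
                apply_top()
            operators.append(tok)
            expect_operand = True
        else:
            raise SyntaxError(f"unknown token {tok!r}")

    if expect_operand:
        raise SyntaxError("unexpected end of input")
    while operators:
        if operators[-1] == "(":
            raise SyntaxError("unmatched '('")
        apply_top()
    return operands[0]
```

The explicit operator stack plays the role that the call stack plays in the recursive versions.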

[−] caspianm 44d ago
"I’ve read many articles on the same topic but never found it presented this way" it reminds me a lot of a description I saw in a video with Jonathan Blow talking about precedence and parsing with Casey Muratori.

The video is 3 hours long though, and I'm not sure the text he shows is available.

At this point he's talking about left leaning vs right leaning trees, after having already talked about one of them: https://youtu.be/fIPO4G42wYE?t=2256&si=aanthLGe-q8ntZez

[−] antirez 44d ago
The latest implementation of Picol has a Tcl-alike [expr] implemented in 40 lines of code that uses Pratt-style parsing: https://github.com/antirez/picol/blob/main/picol.c#L490
[−] erichocean 44d ago
Also discussed in detail in Crafting Interpreters.[0]

[0] https://craftinginterpreters.com/

[−] svat 44d ago

> I’ve read many articles on the same topic but never found it presented this way - hopefully N + 1 is of help to someone.

Can confirm; yes it was helpful! I've never thought seriously about parsing and I've read occasionally (casually) about Pratt parsing, but this is the first time it seemed like an intuitive idea I'll remember.

(Then I confused myself by following some references and remembering the term "precedence climbing" and reading e.g. https://www.engr.mun.ca/~theo/Misc/pratt_parsing.htm by the person who coined that term, but never mind: the original post here has still given me an idea I think I'll remember.)

[−] Abbit 44d ago

> But of course, people (for the most part) don’t write programs as trees.

Such a beautiful reference to Lisp.

[−] priceishere 44d ago
An even simpler way, imo, is explicit functions instead of a precedence table; the code then has pretty much the same structure as the EBNF.

Need to parse * before +? Begin at add, have it call parse_mul for its left and right sides, and so on.

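  parse_mul() {
    left = parse_literal()
    while (is_mul_token()) { // assumed to consume the operator when it matches
      right = parse_literal()
      left = make_mul_node(left, right) // fold into left: left associative
    }
    return left
  }

  parse_add() {
    left = parse_mul()
    while (is_add_token()) { // assumed to consume the operator when it matches
      right = parse_mul()
      left = make_add_node(left, right) // fold into left: left associative
    }
    return left
  }

Then just add more functions as you climb up the precedence levels.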
[−] hyperhello 44d ago
You can either use the stack in an intuitive way, or you can change the tree directly in a somewhat less intuitive way without recursion. Essentially either DF or BF. I don’t see how it matters much anymore with stacks that grow automatically, but it’s good to understand.
[−] dpratt 44d ago
I will never forget the amused attention I got from the professor when this topic was covered during my undergrad. It's only happened once, sadly, but it is rivaled only by the time I was assisting a junior engineer with a related problem and was able to say "Oh, that's just a Pratt Parser. Let me show you."