Parsing with Rust - Part 3: completing the Tree-sitter grammar
The third post in a series on writing programming-language parsers in Rust. In Part 1, we covered general parsing concepts and looked at some of the Rust crates available for generating a parser. In Part 2, we started writing a Tree-sitter grammar for WDL and used it from Rust. In this post, we’ll implement some of the more interesting parts of the grammar and learn about the rest of the Tree-sitter DSL in the process. The complete WDL grammar is available in the GitHub repository.
Parsing expressions: precedence and associativity⌗
A Tree-sitter parser encounters a conflict when two or more rules match the current context. The parser has mechanisms for resolving conflicts, one of which must yield a single matching rule for parsing to be able to continue.
The most basic mechanism for resolving conflicts is to compare the precedences of the conflicting rules. Every rule in a Tree-sitter grammar has an integer-valued precedence; if exactly one of the conflicting rules has a higher precedence than the others, it is selected as the matching rule.
The most common scenario in which precedence is needed to resolve conflicts is in the parsing of expressions. A WDL expression is a literal value (e.g., number or string), an identifier (a reference to a declared variable), or an operation (e.g., addition or multiplication).
Parsing expressions can be challenging for two reasons:
- Expressions can be left-recursive. Any operand on the right-hand side of an expression may be an expression itself. Although Tree-sitter parsers can handle left recursion, we learned in Part 1 that some types of parsers cannot, which necessitates the topological ordering of nonterminals (we’ll see an example of this in a later post in this series).
- Operators have a definite ordering. An expression cannot be simply parsed left-to-right, because an operator of higher precedence may occur to the right of one with lower precedence.
For example, consider the expression `1 + 2 * 3`. The parser could generate either of two trees: `(1 + 2) * 3` or `1 + (2 * 3)`. We know that the second tree is correct, because multiplication has a higher precedence than addition, but we need a way to communicate this to the parser generator.
To implement operation ordering, Tree-sitter provides the `prec` function to change a rule’s default precedence of `0` to a positive or negative integer value. There are also a few variants of `prec` that support additional conflict resolution mechanisms:

- `prec.left`: Assigns a precedence, but also directs the parser to select the rule that results in the shortest match when there are multiple rules with equal precedence.
- `prec.right`: Assigns a precedence, but also directs the parser to select the rule that results in the longest match when there are multiple rules with equal precedence.
- `prec.dynamic`: Assigns a dynamic precedence, which is used during dynamic conflict resolution. We can write a complete WDL parser without needing dynamic conflict resolution, so we won’t cover it in this series, but you can read about it in the manual.
As an example, let’s write a grammar to parse integer addition and multiplication operations:
Expression grammar (JavaScript code)
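The original listing is not preserved in this copy; the following is a minimal sketch of such a grammar, using the rule names and precedence ordering described in the surrounding text:

```javascript
// grammar.js - a sketch of an integer addition/multiplication grammar.
// integer > multiplication > addition, as described in the text.
module.exports = grammar({
  name: "expressions",
  rules: {
    expression: ($) => choice($.addition, $.multiplication, $.integer),
    // prec.left breaks ties between rules of equal precedence by
    // preferring the shorter (left-associative) match.
    addition: ($) => prec.left(1, seq($.expression, "+", $.expression)),
    multiplication: ($) => prec.left(2, seq($.expression, "*", $.expression)),
    integer: ($) => prec(3, /[0-9]+/),
  },
});
```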
Our grammar expresses that:

- `addition`, `multiplication`, and `integer` are all expressions
- `integer` has higher precedence than `multiplication`, which has higher precedence than `addition`
- The operands of `addition` and `multiplication` can match expressions recursively with left associativity.
The use of `prec.left` in the `addition` and `multiplication` rules is important because it resolves ambiguity for expressions like `1 * 2 * 3`, which could be parsed as `1 * (2 * 3)` or `(1 * 2) * 3`. Precedence doesn’t help here because both operations are multiplication, and thus have the same precedence. Left associativity says to choose the second tree over the first.
Improving parse tree ergonomics⌗
We can make some small usability improvements to our WDL grammar that will make it easier to work with the syntax tree generated by the parser.
The first change we can make is to have our grammar better reflect the structure of WDL documents. For example, right now we have all of the top-level elements as children of `document`, but in reality the `version` statement is different from the others. We can think of `version` as a header - it must appear once as the first statement in the document - while the other nodes comprise the “body” of the document. To mirror this structure in the grammar, we can introduce a new `document_body` rule whose children are the rules for the four body elements.
Additionally, we can “hide” rules that do not map to concrete syntax elements in our grammar but exist to group together related rules (typically using a `choice`) and eliminate redundancy (Tree-sitter refers to these kinds of rules as “supertypes”). For example, there are many rules in the WDL grammar that match any expression, so it makes sense to have an `expression` rule rather than copy the same `choice` statement everywhere. But we can hide the expression rule so that it doesn’t create an extra level of depth in the parse tree. A hidden rule has a name that begins with an underscore (`_`). Hiding a rule does not hide its productions.
A third improvement we can make is to add a unique name to each node that we may want to access in the parse tree. This will make it possible to retrieve individual nodes by name rather than having to iterate over all the nodes to find the one we want. Tree-sitter refers to these as “field names”, and they are added with the `field` function in the grammar DSL.
Let’s introduce these changes to our grammar. In addition, we’ll add a few more rules to tie everything together:

- `workflow` can now have a name and a body.
- A workflow body has opening and closing braces (`{}`) and contains any number of statements. This is such a common pattern that we add another JavaScript function, `block`, to create a `seq("{", rule, "}")` given any `rule`.
- For now, a `workflow_body` can only contain `declaration`s.
- A `declaration` creates a variable with a type (represented by the hidden rule `_type`), a name, and a value assignment, which is `=` followed by an expression.
- For now, we only allow `Int` declarations.
- `workflow` and `declaration` names are `identifier`s. An `identifier` must start with a letter and may only contain letters, numbers, and underscores (`_`).
- An `identifier` may also be used in an expression to refer to a previously declared variable.
Also note the addition of two top-level fields:

- `word` defines a rule producing an `identifier`. This is necessary for correct parsing of keyword tokens, and is also a performance optimization. The description of why this is necessary is a bit technical - see the manual if you’re interested in the details.
- `supertypes` defines a list of nodes to treat as supertypes. This has no effect on the generated parser; it just adds some extra information to the `node-types.json` file.
A more ergonomic grammar (JavaScript code)
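The listing itself is missing from this copy; below is a hedged sketch of the shape described above. Names like `document_body`, `workflow_body`, `_type`, and the `block` helper come from the text, while the exact field names and the `version` rule details are assumptions:

```javascript
// grammar.js - a sketch of the more ergonomic WDL grammar.
// block() wraps a rule in braces, a pattern used throughout the grammar.
function block(rule) {
  return seq("{", rule, "}");
}

module.exports = grammar({
  name: "wdl",
  word: ($) => $.identifier,
  supertypes: ($) => [$._expression],
  rules: {
    document: ($) =>
      seq(field("version", $.version), field("body", $.document_body)),
    version: ($) => seq("version", /[0-9.]+/),
    // Other top-level body elements (task, import, etc.) are omitted here.
    document_body: ($) => repeat1($.workflow),
    workflow: ($) =>
      seq("workflow", field("name", $.identifier),
          field("body", $.workflow_body)),
    workflow_body: ($) => block(repeat($.declaration)),
    declaration: ($) =>
      seq(field("type", $._type), field("name", $.identifier), "=",
          field("expression", $._expression)),
    _type: ($) => "Int", // only Int declarations for now
    _expression: ($) => choice($.identifier, /[0-9]+/), // simplified
    identifier: ($) => /[A-Za-z][A-Za-z0-9_]*/,
  },
});
```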
These additions to our grammar enable us to write a more interesting test case:
resources/test/interesting.wdl (WDL)
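The test document is not included in this copy; given the rules above (a versioned document containing a named workflow with `Int` declarations), it was presumably along these lines:

```wdl
version 1.0

workflow test {
    Int i = 0
    Int j = 1 + 2 * 3
    Int k = j
}
```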
With these changes to the grammar, we can simplify our test case somewhat. Note that the performance is probably not any better than before - we avoid making one of the copies of the cursor, but instead we’re performing redundant navigation of the parse tree, because each call to `child_by_field_name` moves the cursor to the first child node, then to each successive sibling node until the node with the given field name is found, then back to the parent node.
Updated unit test (Rust code)
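The updated test is missing from this copy; here is a sketch of a field-name-based test using the `tree-sitter` crate. The field names `body` and `name` are assumptions carried over from the grammar discussion above:

```rust
use tree_sitter::Parser;

extern "C" {
    fn tree_sitter_wdl() -> tree_sitter::Language;
}

#[test]
fn test_parse_workflow() {
    let mut parser = Parser::new();
    parser
        .set_language(unsafe { tree_sitter_wdl() })
        .expect("error loading WDL grammar");
    let source = "version 1.0\nworkflow test {\n  Int i = 1 + 2 * 3\n}\n";
    let tree = parser.parse(source, None).expect("parse failed");
    let root = tree.root_node();

    // Retrieve nodes directly by field name instead of iterating
    // over all children with a cursor.
    let body = root.child_by_field_name("body").expect("no body");
    let workflow = body.named_child(0).expect("no workflow");
    let name = workflow.child_by_field_name("name").expect("no name");
    assert_eq!(&source[name.byte_range()], "test");
}
```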
Parsing string literals⌗
WDL has two types of strings:

- Literals: May contain any characters between pairs of quotes. Either single (`'`) or double (`"`) quotes may be used, but the starting and ending quote character must match. Some characters must be escaped. Literals may be evaluated statically (i.e., no runtime context is required).
- Interpolated: An interpolated string is just like a literal, except that it may also contain any number of expression placeholders. A placeholder is of the form `~{expression}`, where `expression` is any expression that evaluates to a value that can be represented as a string. The expression is evaluated at runtime, and the result is converted to a string and replaces the placeholder.
In the example below, the `"hello.wdl"` string used in the `import` statement is a literal. String literals are also used in the `meta` section, which contains arbitrary metadata about a workflow or task and does not allow expressions. The expression assigned to `greeting` is an interpolated string.
WDL strings example (WDL)
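The example itself is missing here; based on the description (a literal in an `import`, string literals in a `meta` section, and an interpolated `greeting`), it resembled something like the following hypothetical document:

```wdl
version 1.0

import "hello.wdl"

workflow greet {
    String name = "world"
    String greeting = "hello, ~{name}!"

    meta {
        description: "A workflow that greets someone"
    }
}
```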
Writing the grammar for string literals is much easier than for interpolated strings, so we’ll tackle that first. To write the rules for parsing string literals, we’ll need to introduce a few more functions provided by the grammar DSL:

- `optional`: Indicates that a rule/literal is not required - if it does not match the next token, the parser may skip it and proceed to the next production.
- `alias`: Changes the name of a node in the parse tree. In the grammar, all rules must have a unique name. However, two nodes generated by different rules may have the same semantic meaning - aliasing them to the same name can simplify parsing.
- `token`: Combines all the tokens that match the rule into a single token. For example, the `escape_sequence` rule below matches the escape sequence `"\000"` as two tokens, `\` and `000`, and then combines them into the single token `\000`.
In addition, we’ll need to use regular expression (regex) literals (rather than strings) to match the variable content allowed in WDL strings. A regex is written as a standard JavaScript regular expression. For example, `/[A-Z]+/` would match one or more upper-case characters from the Latin alphabet.
String literal grammar (JavaScript code)
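The listing is not preserved; here is a sketch of the string-literal rules (a fragment of the grammar’s `rules` object). The rule and alias names come from the discussion below, while the exact regexes are plausible approximations:

```javascript
// Fragment of the rules object - string literals only.
string_literal: ($) =>
  choice(
    seq("'", optional(alias($.string_literal_squote_parts, $.string_parts)), "'"),
    seq('"', optional(alias($.string_literal_dquote_parts, $.string_parts)), '"')
  ),
// Separate rules per quote style: each disallows its own unescaped quote.
string_literal_squote_parts: ($) =>
  repeat1(choice($.escape_sequence, alias(/[^'\\\n]+/, $.content))),
string_literal_dquote_parts: ($) =>
  repeat1(choice($.escape_sequence, alias(/[^"\\\n]+/, $.content))),
// token() merges the backslash and what follows into a single token.
escape_sequence: ($) =>
  token(seq(
    "\\",
    choice(
      /[nrt'"\\]/,         // special single characters
      /u[0-9A-Fa-f]{4}/,   // Unicode
      /x[0-9A-Fa-f]{2}/,   // hexadecimal
      /[0-7]{1,3}/         // octal
    )
  )),
```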
We implement the `string_literal` rule as a choice between a single-quoted or double-quoted string. A string may be empty, so we make the body of the string `optional`. We alias `string_literal_squote_parts` and `string_literal_dquote_parts` to the same node name (`string_parts`) because they are semantically identical, i.e., the type of quoting should not affect how we handle the results of parsing the string body.
The string body consists of one or more “parts”, where a part is either an escape sequence or a sequence of unescaped characters (“content”). It is necessary to use different rules for single- and double-quoted string bodies because an unescaped single quote is not allowed in the former, whereas an unescaped double quote is not allowed in the latter. We alias the two different regular expressions to the same node name because, again, they are semantically identical.
Remember that starting a regular expression character set with `^` indicates negation, which means that any character except those in the set will match. The use of `+` after the character set indicates a “greedy” match, i.e., the parser will match as many characters as possible until it encounters either the quote character matching the type of string literal or an escape character (`\`). Furthermore, unescaped newlines are not allowed in WDL strings.
Escape sequences include not only special single characters, but also Unicode, hexadecimal, and octal sequences.
String parsing example (diagram)
Parsing interpolated strings⌗
Interpolated strings are a bit more challenging to represent due to the possibility of nested expression placeholders. In the following example, `nested` is a valid WDL string expression:
WDL interpolated strings (WDL)
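The example is not preserved in this copy; a hypothetical `nested` declaration with a placeholder whose expression is itself a string containing a placeholder might look like:

```wdl
String name = "world"
String nested = "outer ~{"inner ~{name}"}"
```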
The grammar for an interpolated string (or “istring”) is similar to that for a string literal, with the addition of a `placeholder` rule that has higher precedence than the escape sequence and content rules (note that we also support the deprecated placeholder style `${expression}`).
Notice that the string content rules now use `choice` rather than just being a single regex. This is necessary because we want the parser to stop matching content when it encounters one of the placeholder starting characters (`~` and `$`) and give the `placeholder` rule a chance to match. If it doesn’t match, then we have variants in the content rules to match just the `~` and `$` characters by themselves.
Interpolated string grammar (JavaScript code)
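The listing is missing; here is a sketch of the interpolated-string additions (a fragment of the `rules` object, shown for the double-quoted variant only). The `placeholder` rule gets higher precedence so it wins over the bare `~`/`$` content variants:

```javascript
// Fragment of the rules object - interpolated ("istring") additions.
placeholder: ($) =>
  prec(1, seq(choice("~{", "${"), field("expression", $._expression), "}")),
istring_dquote_parts: ($) =>
  repeat1(
    choice(
      $.placeholder,
      $.escape_sequence,
      // Ordinary content stops before ~, $, \, the quote, or a newline...
      alias(/[^"\\\n~$]+/, $.content),
      // ...and a lone ~ or $ that doesn't start a placeholder is content.
      alias(/[~$]/, $.content)
    )
  ),
```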
Parsing commands with an external scanner⌗
The `command` section of a WDL task is a special type of interpolated string. The contents of the command section are interpreted as a Bash script, and so, unlike regular strings, unescaped newlines are allowed, and a line is allowed to end with a backslash (`\`) indicating a line continuation.
Two forms of the `command` section are supported:

- The preferred “HEREDOC” form (`command <<< ... >>>`), which only supports the `~{}` style of placeholder.
- A deprecated form using enclosing braces (`command { ... }`), which supports both `~{}` and `${}` placeholders.
Parsing the second form of the `command` section is straightforward. It just requires a minor modification of the content rule we used with interpolated strings.
JavaScript code
Most importantly, `}` must be escaped when it is not part of a placeholder; otherwise the end token could match prematurely. For example, consider the following invalid WDL task, which has an unescaped `}` in the command section.
WDL command missing an escape (WDL)
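The original example is missing; as a hypothetical stand-in, an `awk` one-liner is a classic way to hit this problem, because `awk` programs contain literal braces:

```wdl
version 1.0

task count_fields {
    command {
        # The first unescaped } below is consumed as the end of the
        # command section, leaving the rest of the task unparseable.
        cat input.txt | awk '{ print NF }'
    }
}
```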
Similarly, in the HEREDOC form of `command`, when there are three consecutive `>` characters, at least one of them must be escaped (e.g., `\>>>`), but one or two consecutive `>` do not need to be escaped. We might try to change the content regex to `/>{0,2}[^~>\\]+/`, but this would make it impossible to match a perfectly valid sequence of characters:
An unparseable HEREDOC command (WDL)
If instead we added a second regex rule `/>{1,2}/`, then an invalid string like `echo ">>>"` would match as the tokens (`echo "`, `>>`, `>`).
Another approach would be to use a lookahead assertion in our regex, e.g., `/>{1,2}(?=[^>])/`. However, when we try this, Tree-sitter gives us an error saying that regex lookahead assertions are not supported. This makes sense when we remember that a Tree-sitter parser only has a single token of lookahead, whereas a regex assertion can look ahead any number of characters.
Thus, the HEREDOC form of `command` presents a syntax that we cannot actually describe using the Tree-sitter DSL. So what do we do? Fortunately, the author of Tree-sitter anticipated this situation and provided an escape hatch for parsing non-LR(1) grammars: an external scanner. A scanner is a C/C++ file that implements Tree-sitter’s external scanner API, which consists of five functions that are all prefixed with `tree_sitter_<language>_external_scanner_`:

- `create`: Creates a new scanner instance.
- `serialize`: Writes the scanner’s current state (if any) to a byte buffer.
- `deserialize`: Initializes the scanner from a previously serialized state.
- `destroy`: Frees any memory used by the scanner instance.
- `scan`: Implements the custom parsing logic. Given 1) an instance of the scanner, 2) a `lexer` object that provides methods for accessing and manipulating the parser’s state, and 3) a list of rules that are considered valid in the current parsing context, this function consumes any number of characters and informs the parser which of the valid rules was matched.
The steps to implementing an external scanner are:

- Add an `externals` field to the grammar object that lists which grammar rules should be handled by the external scanner.
- Implement the scanner. The scanner should define an enumeration (typically called `TokenType`) with variants in the same order as the rules in the `externals` field. The list of valid rules that Tree-sitter passes to the scanner’s `scan` function is simply an array of integers that are the indexes of the valid rules in the `externals` field.
- Modify `binding.gyp` and `bindings/rust/build.rs` to compile the scanner as part of the build process.
Below is the complete grammar for parsing the `command` section. Notice that the `command_heredoc_content` rule appears in the list of choices in the `command_heredoc_parts` rule, but that rule is not defined in our grammar - instead, it appears in the `externals` list. This is the way we tell Tree-sitter that we want our external scanner to handle this rule.
In addition to `command_heredoc_content`, the `externals` array also includes an `error` rule. This rule does not appear in our grammar; it is simply a placeholder. When a Tree-sitter parser encounters an error, it calls the external scanner with every rule included in the list of valid rules, to give the external scanner a chance to recover from the error. So we know that if the `error` rule appears in the list of valid rules, then the parser must be in an error state, since that’s the only possible way the `error` rule could be considered valid.
command grammar (JavaScript code)
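The listing is missing from this copy; here is a sketch of the command-section grammar. The `command_heredoc_content` rule is deliberately left undefined and listed in `externals` so the external scanner handles it, and `error` is the error-recovery placeholder described above. The helper rules at the bottom are simplified stand-ins:

```javascript
// A sketch of the command-section grammar.
module.exports = grammar({
  name: "wdl_command",
  // Order matters: it must match the scanner's TokenType enumeration.
  externals: ($) => [$.command_heredoc_content, $.error],
  rules: {
    command: ($) => choice($.command_heredoc, $.command_braces),
    command_heredoc: ($) =>
      seq("command", "<<<", optional($.command_heredoc_parts), ">>>"),
    command_heredoc_parts: ($) =>
      repeat1(choice($.placeholder, $.command_heredoc_content)),
    command_braces: ($) =>
      seq("command", "{", optional($.command_braces_parts), "}"),
    command_braces_parts: ($) =>
      repeat1(choice($.placeholder, $.escape_sequence, $.content)),
    // Content stops before a possible placeholder start, an escape, or `}`.
    content: ($) => choice(/[^~$}\\]+/, /[~$]/),
    // Simplification: the real grammar restricts heredoc placeholders to ~{}.
    placeholder: ($) => seq(choice("~{", "${"), $._expression, "}"),
    escape_sequence: ($) => token(seq("\\", /./)),
    _expression: ($) => /[A-Za-z][A-Za-z0-9_]*/, // stand-in for the real rule
  },
});
```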
Now we can implement the scanner. Keep in mind that this is about the simplest possible case for a scanner - it only needs to handle a single rule and does not need to maintain any state between calls. If you’re interested in a more complex example, you can view the previous version of this scanner that I wrote to handle parsing of both interpolated strings and HEREDOC commands (before I realized that the former could be represented using just the grammar DSL).
External scanner (C++ code)
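The scanner source is not preserved here; below is a hedged sketch of what a stateless scanner for `command_heredoc_content` looks like. The `wdl` language name in the function prefix is assumed, and edge cases (e.g., four consecutive `>` characters) are simplified:

```cpp
// src/scanner.cc - a sketch of the heredoc-content external scanner.
#include <tree_sitter/parser.h>

// Variants must be in the same order as the grammar's `externals` list.
enum TokenType { COMMAND_HEREDOC_CONTENT, ERROR };

extern "C" {

// The scanner is stateless, so these four functions are trivial.
void *tree_sitter_wdl_external_scanner_create() { return nullptr; }
void tree_sitter_wdl_external_scanner_destroy(void *) {}
unsigned tree_sitter_wdl_external_scanner_serialize(void *, char *) { return 0; }
void tree_sitter_wdl_external_scanner_deserialize(void *, const char *, unsigned) {}

bool tree_sitter_wdl_external_scanner_scan(void *, TSLexer *lexer,
                                           const bool *valid_symbols) {
  // If `error` is valid, the parser is in error recovery; don't interfere.
  if (valid_symbols[ERROR] || !valid_symbols[COMMAND_HEREDOC_CONTENT]) {
    return false;
  }

  bool has_content = false;
  lexer->mark_end(lexer); // the token currently ends here
  while (lexer->lookahead != 0) {
    if (lexer->lookahead == '~') {
      lexer->advance(lexer, false);
      if (lexer->lookahead == '{') break; // stop before a `~{` placeholder
      has_content = true;
      lexer->mark_end(lexer); // a lone `~` is ordinary content
    } else if (lexer->lookahead == '>') {
      unsigned count = 0;
      while (lexer->lookahead == '>') {
        lexer->advance(lexer, false);
        count++;
      }
      if (count >= 3) break; // stop before the `>>>` terminator
      has_content = true;
      lexer->mark_end(lexer); // one or two `>` are ordinary content
    } else {
      if (lexer->lookahead == '\\') {
        lexer->advance(lexer, false); // consume the backslash of an escape
      }
      lexer->advance(lexer, false);
      has_content = true;
      lexer->mark_end(lexer);
    }
  }
  if (has_content) {
    lexer->result_symbol = COMMAND_HEREDOC_CONTENT;
    return true;
  }
  return false;
}

} // extern "C"
```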
To build our scanner as part of the build process, we need to update `bindings/rust/build.rs`. We also need to add `src/scanner.cc` to the `sources` list in `binding.gyp`.
Updated build.rs (Rust code)
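The listing is missing; a sketch of a `build.rs` that compiles both the generated C parser and the C++ scanner with the `cc` crate, assuming the standard Tree-sitter project layout:

```rust
// bindings/rust/build.rs
use std::path::PathBuf;

fn main() {
    let src_dir = PathBuf::from("src");

    // Compile the C parser produced by `tree-sitter generate`.
    let mut c_config = cc::Build::new();
    c_config.include(&src_dir);
    c_config.file(src_dir.join("parser.c"));
    c_config.compile("parser");

    // Compile the external scanner as C++.
    let mut cpp_config = cc::Build::new();
    cpp_config.cpp(true);
    cpp_config.include(&src_dir);
    cpp_config.file(src_dir.join("scanner.cc"));
    cpp_config.compile("scanner");
}
```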
Completing the WDL grammar⌗
We’ve now implemented most of the interesting pieces of our WDL grammar. There’s still a lot to do, but most of the remaining rules are pretty easily translated from the specification using the grammar DSL. In this section we’ll cover a few more interesting bits, after which you should have no trouble understanding the complete grammar.
Whitespace and comments⌗
Like most other programming languages, WDL allows arbitrary whitespace and comments to appear almost anywhere in the document. A WDL comment begins with a `#` character and may only appear at the end of a line (including on a line by itself); WDL does not have block comments.
Rather than needing to modify every rule to accommodate whitespace and comments, we can just define the top-level `extras` field to specify which tokens/rules are allowed to appear anywhere.
Grammar extras (JavaScript code)
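The listing is missing; a typical `extras` field (plus a `comment` rule) looks like this sketch:

```javascript
// Whitespace and comments may appear between any two tokens.
extras: ($) => [/\s/, $.comment],

// ...and in the rules object, a comment runs from `#` to end of line:
comment: ($) => token(seq("#", /.*/)),
```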
Defining tokens using constants⌗
My personal preference is to define constants for all the literal values in my source code. We can do this for tokens (literal strings) and precedences (literal integers) in our grammar.
Using constants for literals (JavaScript code)
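The listing is missing; a sketch of the pattern, with hypothetical constant names for tokens and precedences:

```javascript
// Named constants for literal tokens...
const ASSIGN = "=";
const LBRACE = "{";
const RBRACE = "}";

// ...and for precedences.
const PREC = {
  ADD: 1,
  MUL: 2,
  INT: 3,
};

// Used in rules, e.g.:
// addition: ($) => prec.left(PREC.ADD, seq($._expression, "+", $._expression)),
// declaration: ($) => seq($._type, $.identifier, ASSIGN, $._expression),
```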
Defining binary operators using `map` and `...`⌗
There are many types of binary expressions, and they all have the same structure. Rather than writing these all out, we can instead `map` over an array of operators and generate a rule for each one. Then we can use JavaScript’s spread syntax (`...`) to insert the rules in a `choice`:
Generating binary operator rules (JavaScript code)
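The listing is missing; here is a sketch of the technique. The operator table and precedence names are illustrative, not the exact WDL table:

```javascript
// Pairs of [operator token, precedence], mapped into rules below.
const BINARY_OPERATORS = [
  ["||", PREC.OR],
  ["&&", PREC.AND],
  ["==", PREC.EQUALITY],
  ["!=", PREC.EQUALITY],
  ["+", PREC.ADD],
  ["*", PREC.MUL],
];

// In the rules object: generate one left-associative rule per operator
// and spread them all into a single choice.
binary_operator: ($) =>
  choice(
    ...BINARY_OPERATORS.map(([op, precedence]) =>
      prec.left(
        precedence,
        seq(
          field("left", $._expression),
          field("operator", op),
          field("right", $._expression)
        )
      )
    )
  ),
```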
Testing the grammar⌗
A very nice feature of Tree-sitter is its built-in testing framework.
A test case has three parts:
- A name, which must be unique among all the test cases
- Source code to be parsed
- An S-expression representing the parse tree that is expected from parsing the source code
S-expressions should be familiar from Part 2, where we saw that they are output by Tree-sitter’s `parse` command. Importantly, S-expressions do not include terminal nodes. However, they can include field names as prefixes to their associated nodes.
Let’s take our WDL from earlier in this post and turn it into a test case. Tree-sitter expects test cases to be in the `test/corpus/` folder.
test/corpus/workflow.txt (Tree-sitter test)
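The test file is not reproduced here; a Tree-sitter corpus test puts the name between `===` fences, then the source, a `---` divider, and the expected S-expression. The node and field names below are assumptions based on the grammar discussion in this post:

```
==================
workflow with a declaration
==================

version 1.0

workflow test {
    Int i = 1 + 2 * 3
}

---

(document
  version: (version)
  body: (document_body
    (workflow
      name: (identifier)
      body: (workflow_body
        (declaration
          name: (identifier)
          expression: (addition
            (integer)
            (multiplication
              (integer)
              (integer))))))))
```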
Now we can run the `test` script that we defined in our `project.json`:
Running test cases (shell session)
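The session transcript is missing from this copy; running the tests looks like this, via the npm script or the Tree-sitter CLI directly:

```shell
$ npm run test
# or, equivalently:
$ tree-sitter test
```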
Conclusion⌗
In this third post on writing a parser in Rust, we’ve finished implementing a Tree-sitter parser for WDL, and we’ve tested it using both Tree-sitter’s built-in framework and unit tests written in Rust. In Part 4, we’ll begin writing another WDL grammar using Pest, a PEG parser generator.