Parsing with Rust - Part 2: writing an LR grammar with Tree-sitter
The second post in a series on writing programming-language parsers in Rust. In Part 1, we covered general parsing concepts and looked at some of the Rust crates available for generating a parser. In this post, we’ll walk through writing a Tree-sitter grammar for WDL - a domain-specific language for describing computational workflows - and using the parser in Rust.
As we briefly covered in Part 1, Tree-sitter is a widely used parser generator framework that generates a Generalized LR (GLR) parser with Rust bindings from a grammar specification.
Tree-sitter is implemented using an interesting mix of languages:
- The Tree-sitter command line interface (CLI) is written in Rust. It provides a
generatecommand, which calls
nodewith the grammar file, deserializes the generated JSON object, and uses it to generate the parser code in
- The generated
Ccode depends on the Tree-sitter runtime, a
libtree-sitter. The compiled parser can be called from a C/C++ program.
- The CLI also generates Rust and Node bindings to the parser that enable it to be used from those languages as well.
In the remainder of this post, we’ll walk through the steps of writing the grammar, generating the parser, and using it from a Rust program. We’ll be writing a grammar for Workflow Description Language (WDL), a domain-specific language for describing computational workflows. We’ll be focusing on just on WDL 1.1, the current version of the specification. WDL is useful as an example because it’s powerful enough to exercise some of the advanced features of parser generator frameworks yet simple enough that we can write a complete parser without getting bogged down with the ambiguity and context-specificity that are found in most general-purpose languages.
You can find all of the code and auxilliary files for the parser in the GitHub repository.
Getting set up⌗
The first step in developing our Tree-sitter grammar is to install the dependencies - node.js, the node package manager (
npm), and a C/C++ compiler - and setup a new project. These steps are covered in the Getting Started section of the Tree-sitter manual. It is recommended to install node.js using
The project we’ll be creating is called
tree-sitter-wdl, which follows the preferred convention of naming the parser repository
tree-sitter-<language>. After completing the setup process, we have a project folder with a
package.json file that describes our project and provides the configuration required by
npm to build our parser. The key parts of this file are described below, excluding any metadata that is just needed for publishing the package in the npm repository.
name: A unique name for the package, following Tree-sitter’s preferred convention.
version: The initial version of the package, following the conventions of semantic versioning. It’s a good idea to use a major version
0until the grammar is complete, finialized, and well-tested.
main: The path to the main file when the package is used as a node library or executable. The files in
bindings/nodewill be generated by the Tree-sitter CLI.
scripts: Allows defining aliases for commonly used commands. These can then be run using
npm <alias>. I’ve defined a
buildalias for generating and building the parser, and a
testalias for running the test cases that we’ll write later on.
dependencies: Dependencies that are required to use the generated parser. All Tree-sitter parsers have one required dependency,
devDependencies: Dependencies that are required to generate and build the parser. All Tree-sitter parsers have one required dev dependency,
tree-sitter: Configuration for using the grammar for syntax highlighting.
Testing the parser generator⌗
Now that we have the project set up, we can get to work on the grammar. The grammar is defined in a file called
grammar.js in the root directory of the project. We are going to start with a bare-bones grammar just to make sure the CLI is working as expected, and to get familiar with the different files that it generates.
|A bare-bones grammar|
Let’s walk through our initial grammar file line-by-line:
exportsfield contains an object that is exposed to a user of the module. It is initialized to an empty object, but often (as in the example above) it is reassigned to a class instance. The https://github.com/tree-sitter/tree-sitter/blob/master/cli/src/generate/dsl.js function is provided by Tree-sitter to convert a grammar definition into an object that is serialized to JSON.
nameis a required field whose value is the name of the language. Tree-sitter will use the langauge name in several places, e.g., the names of functions in the code it generates, to make them unique (versus other Tree-sitter-generated parsers).
rulesis a required field with an object value, whose keys are rule names and whose values are production rules.
- A WDL source file is called a “document”, and the grammar defines a corresponding
$by convention, and whose body is either a literal or a call to one of the functions provide by the DSL. The
$parameter is an object whose fields are the rules defined in the grammar. The body of the
documentrule specifies that a document contains a single element, which is defined by the
versionrule. Note that rules don’t have to be defined in any certain order - forward references are okay.
versionrule is defined using the
seqfunction, which specifies a sequence of rules that are expected to match in order. The
versionrule matches the tokens
1.1, with only whitespace allowed in between.
After creating the
grammar.js file, the structure of the project looks like this:
|Initial project structure|
You will only see
node_modules if you installed the Tree-sitter CLI during project setup (otherwise it will be installed automatically during the next step). Now, if we’ve done everything correctly, the following command should generate our parser:
|Generating the parser|
Gyp is a build system that builds native addon modules for node. Tree-sitter generates a build file and then uses
gyp to actually compile the generated parser and bindings.
Now our project directory has a lot more files:
We’ll come back to some of these files later. The important thing for now is that Tree-sitter did what we expected it to do. We should now be able to use the CLI to parse a WDL file that matches our grammar:
|Parsing a WDL file that matches the initial grammar|
parse command attempts to parse the input text, and, if successful, prints out the parse tree as an S-expression. Each node of an S-expression is of the form
(name children*), where
children* is zero or more child nodes. Note that the S-expression only shows the matched rules - any terminal nodes (i.e., tokens) are not shown.
Tree-sitter also prints out the span of each node - the
[line, column] positions of the start and end of the matching text within the input string. In Tree-sitter, both line and column positions start at
0, and column positions are end-exclusive. So in the example above, we see that the
version node matches the 11 characters starting at line
0, and ending at line
10 (inclusive). The
document node matches the entire input text, which also includes a newline, so it ends on line
Writing some simple grammar rules⌗
Right now our WDL grammar is correct, but it’s nowhere near complete. The
version statement must appear on the first non-comment line of a WDL document, but a valid WDL document must also contain a “body” consisting of one or more top-level elements in any order:
task: The fundamental unit of computation. A task defines its input and output variables, it encapsulates a Bash script that can reference those variables, and it specifies a Docker image in which to execute the script.
workflow: Represents a directed, acyclic graph (DAG) of tasks. The execution order of the tasks in a workflow is defined implicitly by the dependencies between the tasks.
struct: A user-defined type. WDL has built-in support for primitive types (e.g., strings, numbers, booleans, and files) and a few generic collection types; structs provide the ability to create more specialized compound types.
A WDL document may also include any number of top-level
import statements, which enable referencing the elements of one WDL document from another.
In other words, we want our grammar to express:
- Production rules for the 4 top-level variants
- That the
versionrule must come first
- That at least one of the
structvariants is required
- That any number of top-level items is allowed
The Tree-sitter grammar DSL provides a few functions to facilitate writing these rules:
seqtakes two or more rules (or literals) and matches them in order
choiceselects from among multiple variants
repeat1takes one rule or literal and matches it one or more times consecutively
With these functions, we can expand our grammar to meet almost all our goals.
|Our grammar with a few more rules|
todo function that we can use as a placeholder for the rules we need to implement. It prints a warning and returns the rule name as a literal, which will allow the grammar to build successfully even though it is incomplete.
The main problem with our current grammar is that it would parse the following as a valid document, when in fact it is not, since it doesn’t contain one of the three required top-level elements.
Expressing something like “allow rules A, B, C, and D to match in any order, but require at least one of A|B|C” is actually somewhat challenging. It can be done by employing another built-in fuction,
repeat, which takes one rule or literal and matches it zero or more times consecutively. However, it makes for a (perhaps overly) complex rule.
|A more correct and complex grammar|
Whether you want to do something like this in your own grammar is to some degree a matter of personal choice, although it can also slow down your parser. For the official WDL grammar, I opted to go for simpler rules and defer the validation of more complex requirements like “there must be at least one task, workflow, or struct” until after parsing the document.
Calling the parser from Rust⌗
Our grammar is still far from complete, but it’s at least interesting enough now that we can try using it from a Rust program.
Recall that when we generated our parser, Tree-sitter also generated a bunch of other files. The ones we’re interested in right now are:
These files enable Cargo (the Rust build tool) to compile the C parser (
src/parser.c), load the resulting object file (which will be found at
build/Release/obj.target/tree_sitter_wdl_binding/src/parser.o), provide a Rust function to access the parser as an instance of
tree_sitter::Language, and package all these pieces as a “crate” (an installable package).
build.rs file is just a Rust binary, i.e., it has a
main() function that is called when it is compiled and executed by Cargo. The generated
bindings/rust/build.rs uses the
cc crate to call the C compiler.
lib.rs file uses an
extern block to define a Rust function (
name is the language name from the grammar) that calls the function of the same name in the C object file produced by the build script. Rust considers calling any external function to be unsafe, so it is necessary to wrap any call to
tree_sitter_wdl in an
unsafe block. It is common practice to make external functions private and wrap them with a function that performs the unsafe call.
|The generated lib.rs|
tree_sitter::Language struct contains all the configuration specific to our language. We can use the instance returned by the
language function to create a
tree_sitter::Parser, which will actually enable us to parse a WDL document. We can add some additional functions to our
lib.rs to make things easier for a user of our crate. We can also use the
thiserror crate to create informative errors.
|lib.rs with errors and convenience functions|
Cargo.toml file is the Rust equivalent of
package.json - it contains the configuration necessary to compile the package’s code, including any dependencies on other crates, as well as metadata used for publishing the package to the
crates.io repository when we’re ready to make our parser available to others. The key parts of this file are shown below.
To build our package into a crate, we use the
cargo build command. The first time we run
cargo build in our project, Cargo downloads and compiles all the dependencies (and dependencies-of-dependencies, etc.) before compiling and packaging our package code.
|Building the Rust bindings|
We can find the binary in
target/debug/libtree_sitter_wdl.rlib. We can’t really do anything with this
rlib file - it’s a Rust-specific format and is only meant to be used as a dependency in another crate.
We could create another create to import and test out our parser, but an easier approach is to create unit tests directly in our parser
lib.rs. In Rust, unit tests are placed in a separate module annotated with
#[cfg(test)]. Any function in a test module that is annotated with
#[test] is executed by the
cargo test command, and any errors are reported as test failures.
|Basic unit tests|
Here we’ve created a
tests module with two test functions:
test_can_load_grammar, which just tries to instantiate a
tree_sitter::Parser with the WDL
tree_sitter::Language instance, and
test_parse, which actually tries to parse a WDL file. The
test_parse test is looking for a file at
resources/test/simple.wdl; let’s create one that matches our current grammar:
Now we can run our test:
Using the parse tree⌗
In the previous post, we successfully parsed a WDL file using our Tree-sitter parser from Rust, but we didn’t do anything with the
tree_sitter::Tree instance that our parser returned. Let’s expand our test case to actually navigate the parse tree.
Tree-sitter provides two different APIs for working with the parse tree:
tree_sitter::TreeCursor: Low-level API. A cursor maintains a pointer to a node in the tree, and has a minimal set of methods for moving the pointer. The
nodemethod returns an instance of
tree_sitter::Nodefor the node it’s currently pointing to.
tree_sitter::Node: High-level API. A
Nodecontains metadata (rule name, span, and string value) about a node in the tree and provides functions for accessing related nodes (parent, siblings, and children). Internally, a
Nodemaintains a reference to a
TreeCursor, and many of
Node’s methods require creating a copy of the cursor using the
The high-level API is more ergonomic, so we’ll stick to that for now.
Node provides idiomatic methods to navigate the parse tree using iterators. The drawback of this approach is that navigating multiple levels of the tree requires creating multiple copies of the
TreeCursor, and each of these copy operations is relatively slow. In a later post we’ll write a custom iterator that wraps a
TreeCursor to provide better ergonomics while maintaining good performance.
A node that corresponds to a production rule is called a “named” node. All non-terminal nodes are named; a terminal node may be named (e.g. the
workflow node) or not (e.g., the nodes corresponding to the
1.1 tokens). Each node has a
kind, which is either its rule name (for a named node) or its literal value (for a terminal token). Obtaining the text value of the node requires passing in a reference to the input text (as a byte array).
Let’s expand our test case to use the high-level API to validate the contents of the tree returned by the parser:
|Expanded test case|
In this second post on writing a parser in Rust, we’ve implemented a complete (although very simple) GLR parser in Rust using Tree-sitter. In Part 3, we’ll expand our WDL grammar, and cover more advanced aspects of Tree-sitter in the process.