Parsing with Rust - Part 2: writing an LR grammar with Tree-sitter
The second post in a series on writing programming-language parsers in Rust. In Part 1, we covered general parsing concepts and looked at some of the Rust crates available for generating a parser. In this post, we’ll walk through writing a Tree-sitter grammar for WDL - a domain-specific language for describing computational workflows - and using the parser in Rust.
Tree-sitter⌗
As we briefly covered in Part 1, Tree-sitter is a widely used parser generator framework that generates a Generalized LR (GLR) parser with Rust bindings from a grammar specification.
Tree-sitter is implemented using an interesting mix of languages:
- A Tree-sitter grammar is specified in JavaScript. More specifically, it is a node.js module that exports a grammar definition, which is serialized to JSON.
- The Tree-sitter command line interface (CLI) is written in Rust. It provides a
generate
command, which callsnode
with the grammar file, deserializes the generated JSON object, and uses it to generate the parser code inC
. - The generated
C
code depends on the Tree-sitter runtime, aC
library calledlibtree-sitter
. The compiled parser can be called from a C/C++ program. - The CLI also generates Rust and Node bindings to the parser that enable it to be used from those languages as well.
In the remainder of this post, we’ll walk through the steps of writing the grammar, generating the parser, and using it from a Rust program. We’ll be writing a grammar for Workflow Description Language (WDL), a domain-specific language for describing computational workflows. We’ll be focusing on just on WDL 1.1, the current version of the specification. WDL is useful as an example because it’s powerful enough to exercise some of the advanced features of parser generator frameworks yet simple enough that we can write a complete parser without getting bogged down with the ambiguity and context-specificity that are found in most general-purpose languages.
You can find all of the code and auxilliary files for the parser in the GitHub repository.
Getting set up⌗
The first step in developing our Tree-sitter grammar is to install the dependencies - node.js, the node package manager (npm
), and a C/C++ compiler - and setup a new project. These steps are covered in the Getting Started section of the Tree-sitter manual. It is recommended to install node.js using nvm
.
The project we’ll be creating is called tree-sitter-wdl
, which follows the preferred convention of naming the parser repository tree-sitter-<language>
. After completing the setup process, we have a project folder with a package.json
file that describes our project and provides the configuration required by npm
to build our parser. The key parts of this file are described below, excluding any metadata that is just needed for publishing the package in the npm repository.
package.json |
JSON |
|
|
name
: A unique name for the package, following Tree-sitter’s preferred convention.version
: The initial version of the package, following the conventions of semantic versioning. It’s a good idea to use a major version0
until the grammar is complete, finialized, and well-tested.main
: The path to the main file when the package is used as a node library or executable. The files inbindings/node
will be generated by the Tree-sitter CLI.scripts
: Allows defining aliases for commonly used commands. These can then be run usingnpm <alias>
. I’ve defined abuild
alias for generating and building the parser, and atest
alias for running the test cases that we’ll write later on.dependencies
: Dependencies that are required to use the generated parser. All Tree-sitter parsers have one required dependency,nan
.devDependencies
: Dependencies that are required to generate and build the parser. All Tree-sitter parsers have one required dev dependency,tree-sitter-cli
.tree-sitter
: Configuration for using the grammar for syntax highlighting.
Testing the parser generator⌗
Now that we have the project set up, we can get to work on the grammar. The grammar is defined in a file called grammar.js
in the root directory of the project. We are going to start with a bare-bones grammar just to make sure the CLI is working as expected, and to get familiar with the different files that it generates.
A bare-bones grammar |
JavaScript code |
Let’s walk through our initial grammar file line-by-line:
- Node.js implements the CommonJS standard, which (among other things) specifies how JavaScript modules are defined outside of a web browser context. The
module
object is defined in each JavaScript file, and itsexports
field contains an object that is exposed to a user of the module. It is initialized to an empty object, but often (as in the example above) it is reassigned to a class instance. The https://github.com/tree-sitter/tree-sitter/blob/master/cli/src/generate/dsl.js function is provided by Tree-sitter to convert a grammar definition into an object that is serialized to JSON. - The grammar definition is a JavaScript object whose fields are specified by the grammar DSL.
name
is a required field whose value is the name of the language. Tree-sitter will use the langauge name in several places, e.g., the names of functions in the code it generates, to make them unique (versus other Tree-sitter-generated parsers). rules
is a required field with an object value, whose keys are rule names and whose values are production rules.- A WDL source file is called a “document”, and the grammar defines a corresponding
document
rule. A production rule is written as a JavaScript function with a single parameter, called$
by convention, and whose body is either a literal or a call to one of the functions provide by the DSL. The$
parameter is an object whose fields are the rules defined in the grammar. The body of thedocument
rule specifies that a document contains a single element, which is defined by theversion
rule. Note that rules don’t have to be defined in any certain order - forward references are okay. - The
version
rule is defined using theseq
function, which specifies a sequence of rules that are expected to match in order. Theversion
rule matches the tokensversion
followed by1.1
, with only whitespace allowed in between.
After creating the grammar.js
file, the structure of the project looks like this:
Initial project structure |
Directory listing |
You will only see node_modules
if you installed the Tree-sitter CLI during project setup (otherwise it will be installed automatically during the next step). Now, if we’ve done everything correctly, the following command should generate our parser:
Generating the parser |
Shell session |
Gyp is a build system that builds native addon modules for node. Tree-sitter generates a build file and then uses gyp
to actually compile the generated parser and bindings.
Now our project directory has a lot more files:
Directory listing |
|
|
We’ll come back to some of these files later. The important thing for now is that Tree-sitter did what we expected it to do. We should now be able to use the CLI to parse a WDL file that matches our grammar:
Parsing a WDL file that matches the initial grammar |
Shell session |
The parse
command attempts to parse the input text, and, if successful, prints out the parse tree as an S-expression. Each node of an S-expression is of the form (name children*)
, where children*
is zero or more child nodes. Note that the S-expression only shows the matched rules - any terminal nodes (i.e., tokens) are not shown.
Tree-sitter also prints out the span of each node - the [line, column]
positions of the start and end of the matching text within the input string. In Tree-sitter, both line and column positions start at 0
, and column positions are end-exclusive. So in the example above, we see that the version
node matches the 11 characters starting at line 0
, column 0
, and ending at line 0
, column 10
(inclusive). The document
node matches the entire input text, which also includes a newline, so it ends on line 1
.
Writing some simple grammar rules⌗
Right now our WDL grammar is correct, but it’s nowhere near complete. The version
statement must appear on the first non-comment line of a WDL document, but a valid WDL document must also contain a “body” consisting of one or more top-level elements in any order:
task
: The fundamental unit of computation. A task defines its input and output variables, it encapsulates a Bash script that can reference those variables, and it specifies a Docker image in which to execute the script.workflow
: Represents a directed, acyclic graph (DAG) of tasks. The execution order of the tasks in a workflow is defined implicitly by the dependencies between the tasks.struct
: A user-defined type. WDL has built-in support for primitive types (e.g., strings, numbers, booleans, and files) and a few generic collection types; structs provide the ability to create more specialized compound types.
A WDL document may also include any number of top-level import
statements, which enable referencing the elements of one WDL document from another.
In other words, we want our grammar to express:
- Production rules for the 4 top-level variants
- That the
version
rule must come first - That at least one of the
task
,workflow
, orstruct
variants is required - That any number of top-level items is allowed
The Tree-sitter grammar DSL provides a few functions to facilitate writing these rules:
seq
takes two or more rules (or literals) and matches them in orderchoice
selects from among multiple variantsrepeat1
takes one rule or literal and matches it one or more times consecutively
With these functions, we can expand our grammar to meet almost all our goals.
Our grammar with a few more rules |
JavaScript code |
|
|
Note that we can take advantage of the fact that a grammar file is just a node module and add our own JavaScript functions. In this case, we’ve defined a todo
function that we can use as a placeholder for the rules we need to implement. It prints a warning and returns the rule name as a literal, which will allow the grammar to build successfully even though it is incomplete.
The main problem with our current grammar is that it would parse the following as a valid document, when in fact it is not, since it doesn’t contain one of the three required top-level elements.
invalid.wdl |
WDL |
Expressing something like “allow rules A, B, C, and D to match in any order, but require at least one of A|B|C” is actually somewhat challenging. It can be done by employing another built-in fuction, repeat
, which takes one rule or literal and matches it zero or more times consecutively. However, it makes for a (perhaps overly) complex rule.
A more correct and complex grammar |
JavaScript code |
|
|
Whether you want to do something like this in your own grammar is to some degree a matter of personal choice, although it can also slow down your parser. For the official WDL grammar, I opted to go for simpler rules and defer the validation of more complex requirements like “there must be at least one task, workflow, or struct” until after parsing the document.
Calling the parser from Rust⌗
Our grammar is still far from complete, but it’s at least interesting enough now that we can try using it from a Rust program.
Recall that when we generated our parser, Tree-sitter also generated a bunch of other files. The ones we’re interested in right now are:
Directory listing |
These files enable Cargo (the Rust build tool) to compile the C parser (src/parser.c
), load the resulting object file (which will be found at build/Release/obj.target/tree_sitter_wdl_binding/src/parser.o
), provide a Rust function to access the parser as an instance of tree_sitter::Language
, and package all these pieces as a “crate” (an installable package).
A build.rs
file is just a Rust binary, i.e., it has a main()
function that is called when it is compiled and executed by Cargo. The generated bindings/rust/build.rs
uses the cc
crate to call the C compiler.
build.rs |
rust code |
|
|
The lib.rs
file uses an extern
block to define a Rust function (tree_sitter_<name>
, where name
is the language name from the grammar) that calls the function of the same name in the C object file produced by the build script. Rust considers calling any external function to be unsafe, so it is necessary to wrap any call to tree_sitter_wdl
in an unsafe
block. It is common practice to make external functions private and wrap them with a function that performs the unsafe call.
The generated lib.rs |
rust code |
The tree_sitter::Language
struct contains all the configuration specific to our language. We can use the instance returned by the language
function to create a tree_sitter::Parser
, which will actually enable us to parse a WDL document. We can add some additional functions to our lib.rs
to make things easier for a user of our crate. We can also use the thiserror
crate to create informative errors.
lib.rs with errors and convenience functions |
rust code |
|
|
The Cargo.toml
file is the Rust equivalent of package.json
- it contains the configuration necessary to compile the package’s code, including any dependencies on other crates, as well as metadata used for publishing the package to the crates.io
repository when we’re ready to make our parser available to others. The key parts of this file are shown below.
Cargo.toml |
TOML |
|
|
To build our package into a crate, we use the cargo build
command. The first time we run cargo build
in our project, Cargo downloads and compiles all the dependencies (and dependencies-of-dependencies, etc.) before compiling and packaging our package code.
Building the Rust bindings |
Shell session |
|
|
We can find the binary in target/debug/libtree_sitter_wdl.rlib
. We can’t really do anything with this rlib
file - it’s a Rust-specific format and is only meant to be used as a dependency in another crate.
We could create another create to import and test out our parser, but an easier approach is to create unit tests directly in our parser lib.rs
. In Rust, unit tests are placed in a separate module annotated with #[cfg(test)]
. Any function in a test module that is annotated with #[test]
is executed by the cargo test
command, and any errors are reported as test failures.
Basic unit tests |
rust code |
|
|
Here we’ve created a tests
module with two test functions: test_can_load_grammar
, which just tries to instantiate a tree_sitter::Parser
with the WDL tree_sitter::Language
instance, and test_parse
, which actually tries to parse a WDL file. The test_parse
test is looking for a file at resources/test/simple.wdl
; let’s create one that matches our current grammar:
resources/test/simple.wdl |
WDL |
Now we can run our test:
Test output |
Shell session |
Using the parse tree⌗
In the previous post, we successfully parsed a WDL file using our Tree-sitter parser from Rust, but we didn’t do anything with the tree_sitter::Tree
instance that our parser returned. Let’s expand our test case to actually navigate the parse tree.
Tree-sitter provides two different APIs for working with the parse tree:
tree_sitter::TreeCursor
: Low-level API. A cursor maintains a pointer to a node in the tree, and has a minimal set of methods for moving the pointer. Thenode
method returns an instance oftree_sitter::Node
for the node it’s currently pointing to.tree_sitter::Node
: High-level API. ANode
contains metadata (rule name, span, and string value) about a node in the tree and provides functions for accessing related nodes (parent, siblings, and children). Internally, aNode
maintains a reference to aTreeCursor
, and many ofNode
’s methods require creating a copy of the cursor using thewalk
method.
The high-level API is more ergonomic, so we’ll stick to that for now. Node
provides idiomatic methods to navigate the parse tree using iterators. The drawback of this approach is that navigating multiple levels of the tree requires creating multiple copies of the TreeCursor
, and each of these copy operations is relatively slow. In a later post we’ll write a custom iterator that wraps a TreeCursor
to provide better ergonomics while maintaining good performance.
A node that corresponds to a production rule is called a “named” node. All non-terminal nodes are named; a terminal node may be named (e.g. the workflow
node) or not (e.g., the nodes corresponding to the version
and 1.1
tokens). Each node has a kind
, which is either its rule name (for a named node) or its literal value (for a terminal token). Obtaining the text value of the node requires passing in a reference to the input text (as a byte array).
Let’s expand our test case to use the high-level API to validate the contents of the tree returned by the parser:
Expanded test case |
rust code |
|
|
Conclusion⌗
In this second post on writing a parser in Rust, we’ve implemented a complete (although very simple) GLR parser in Rust using Tree-sitter. In Part 3, we’ll expand our WDL grammar, and cover more advanced aspects of Tree-sitter in the process.