Parsing with Nearley.js

Introduction to Nearley.js

This section describes the nearley grammar language, in which you can describe grammars for nearley to parse. Grammars are conventionally kept in .ne files. You can then use nearleyc to compile your .ne grammars to JavaScript modules.

You can find many examples of nearley grammars online, as well as some in the examples/ directory of the Github repository (opens in a new tab).

Vocabulary

A terminal is a single, constant string or a token. For example, the keyword "if" is a terminal.
A nonterminal describes a set of possible strings. For example, all "if" statements can be described by a single nonterminal whose value depends on the condition and body of the if statement.
A rule (or production rule) is a definition of a nonterminal. For example,
```
ifStatement -> "if" condition "then" statement "endif"
```
is the rule according to which the if statement nonterminal, ifStatement, is parsed. It depends on the nonterminals condition and statement. A nonterminal can be described by multiple rules. For example, we can add a second rule
```
ifStatement -> "if" condition "then" statement "else" statement "endif"
```
to support "else" clauses.

By default, nearley attempts to parse the first nonterminal defined in the grammar. In the following grammar, nearley will try to parse input text as an expression.

expression -> number "+" number
expression -> number "-" number
expression -> number "*" number
expression -> number "/" number
number -> [0-9]:+

You can use the pipe character | to separate alternative rules for a nonterminal. In the example below, expression has four different rules.

expression ->
    number "+" number
  | number "-" number
  | number "*" number
  | number "/" number

⚠️

Empty String

The keyword null stands for the epsilon rule $\epsilon$ , which matches nothing. The following nonterminal matches zero or more cows in a row, such as cowcowcow:

a -> null | a "cow"

Postprocessors

By default, nearley wraps everything matched by a rule into an array. For example, when rule -> "tick" "tock" matches the string "ticktock", it creates the "parse tree" ["tick", "tock"].

Most of the time, however, you need to process that data in some way: for example, you may want to filter out whitespace, or transform the results into a custom JavaScript object.

For this purpose, each rule can have a postprocessor: a JavaScript function that transforms the array and returns a "processed" version of the result. Postprocessors are wrapped in {% %}s:

ULL-ESIT-PL/learning-nearley/examples/postprocessors-example.ne

expression -> number "+" number {%
    function(data) {
        return {
            operator: "sum",
            leftOperand:  data[0],
            rightOperand: data[2] // data[1] is "+"
        };
    }
%}
number -> [0-9]:+ {% d => parseInt(d[0].join("")) %}

To compile it with nearley:

➜  examples git:(main) ✗ npx nearleyc postprocessors-example.ne -o postprocessors-example.js

and we can execute the resulting parser with nearley-test:

➜  examples git:(main) ✗ npx nearley-test -i '5+10' postprocessors-example.js               
Table length: 5
Number of parses: 1
Parse Charts
Chart: 0
0: {expression →  ● number "+" number}, from: 0
1: {number →  ● number$ebnf$1}, from: 0
2: {number$ebnf$1 →  ● /[0-9]/}, from: 0
3: {number$ebnf$1 →  ● number$ebnf$1 /[0-9]/}, from: 0

Chart: 1
0: {number$ebnf$1 → /[0-9]/ ● }, from: 0
1: {number$ebnf$1 → number$ebnf$1 ● /[0-9]/}, from: 0
2: {number → number$ebnf$1 ● }, from: 0
3: {expression → number ● "+" number}, from: 0

Chart: 2
0: {expression → number "+" ● number}, from: 0
1: {number →  ● number$ebnf$1}, from: 2
2: {number$ebnf$1 →  ● /[0-9]/}, from: 2
3: {number$ebnf$1 →  ● number$ebnf$1 /[0-9]/}, from: 2

Chart: 3
0: {number$ebnf$1 → /[0-9]/ ● }, from: 2
1: {number$ebnf$1 → number$ebnf$1 ● /[0-9]/}, from: 2
2: {number → number$ebnf$1 ● }, from: 2
3: {expression → number "+" number ● }, from: 0

Chart: 4
0: {number$ebnf$1 → number$ebnf$1 /[0-9]/ ● }, from: 2
1: {number$ebnf$1 → number$ebnf$1 ● /[0-9]/}, from: 2
2: {number → number$ebnf$1 ● }, from: 2
3: {expression → number "+" number ● }, from: 0

The rules above will parse the string 5+10 into

Parse results: 
[ { operator: 'sum', leftOperand: 5, rightOperand: 10 } ]

The postprocessor can be any function with signature

function(data, location, reject)

Here,

data: Array is an array that contains the results of parsing each part of the rule. Note that it is still an array, even if the rule only has one part! You can use the built-in {% id %} postprocessor to convert a one-item array into the item itself d => d[0].

For arrow function users, a convenient pattern is to decompose the data array within the argument of the arrow function:
```
expression ->
    number "+" number {% ([fst, _, snd]) => fst + snd %}
  | number "-" number {% ([fst, _, snd]) => fst - snd %}
  | number "*" number {% ([fst, _, snd]) => fst * snd %}
  | number "/" number {% ([fst, _, snd]) => fst / snd %}
```
location: number is the index (zero-based) at which the rule match starts. You might use this to show the location of an expression in an error message.

⚠️
Better forget
Note: Many tokenizers provide line, column, and offset information in the Token object. If you are using a tokenizer, then it is better to use that information than the nearley-provided variable, which would only tell you that it saw the nth token rather than the nth character in the string.
reject: Object is a unique object that you can return to signal that this rule doesn't actually match its input.

Reject is used in some edge cases. For example, suppose you want sequences of letters to match variables, except for the keyword if. In this case, your rule may be
```
variable -> [a-z]:+ {%
    function(d,l, reject) {
        const name = d[0].join('');
        if (name === 'if') {
            return reject;
        } else {
            return { name };
        }
    }
%}
```
⚠️
Reject is slow
Grammars using reject are not context-free, and are often much slower to parse.

nearley provides one built-in postprocessor:

id returns the first element of the data array. This is useful to extract the content of a single-element array: foo -> bar {% id %}

More syntax: tips and tricks

See Comments and Charsets in Nearley.js

EBNF: Repetitions and parenthesis

nearley supports the *, ?, and + operators from EBNF (opens in a new tab) as shown:

batman -> "na":* "batman" # nananana...nanabatman

You can also use capture groups with parentheses. Its contents can be anything that a rule can have:

banana -> "ba" ("na" {% id %} | "NA" {% id %}):+

Macros

See section Macros

Additional JS

For more intricate postprocessors, or any other functionality you may need, you can include chunks of JavaScript code between production rules by surrounding it with @{% ... %}:

@{%
const cowSays = require("./cow.js");
%}
 
cow -> "moo" {% ([moo]) => cowSays(moo) %}

Note that it doesn't matter where you add these; they all get hoisted to the top of the generated code.

Importing other grammars

See section Importing other grammars

Lexical Analysis with Moo

See section Lexical Analysis with Moo

Example: A Calculator

Grammar

See examples/calculator/arithmetic-lexer.ne (opens in a new tab)

examples/calculator/arithmetic-lexer.ne

# How to Implement Lexical Analysis in Tools with a separated Lexer
# To use it run: 
# nearleyc arithmetic-lexer.ne -o grammar.js  && node use-arithmetic.js 'ln (3 + 2*(8/e - sin(pi/5)))'
# It parses valid calculator input
#   ln (3 + 2*(8/e - sin(pi/5)))
# is valid input.
 
@{%
 
const lexer = require('./lex.js');
 
const bin = (([x, op, y]) => op(x,y));
const Null = (d => null);
const fac = n => (n===0)?1:n*fac(n-1);
const unaryPost = (([p, op]) => op(p));
const funApply = ([fun, arg]) => fun(arg);
%}
 
@lexer lexer
 
main => null {% d => "" %} # Allow for empty lines
    | AS {% function(d) {return d[0]; } %}
 
 
# Addition and subtraction
AS -> AS PLUS MD  {% bin %}  # Prefer this syntax
    | AS MINUS MD {% bin %}
    | MD          {% id %}
 
# Multiplication and division
MD -> MD MULT E  {% bin %}
    | MD DIV E   {% bin %}
    | E          {% id %}
 
# Exponents
E -> F EXP E    {% bin %}
    | F         {% id %}
 
# Factorial 
F ->  P FACTORIAL    {% unaryPost %}
    | P              {% id %} 
 
# Fixed "bug" sinpi
 
P -> Q
    | FLOAT     {% id %}
    | SIN  Q    {% funApply %}
    | COS Q     {% funApply %}
    | TAN Q     {% funApply %}
    | ASIN Q    {% funApply %}
    | ACOS Q    {% funApply %}
    | ATAN Q    {% funApply %}
    | PI        {% id %}
    | EULER     {% id %}
    | SQRT Q    {% funApply %}
    | LN Q      {% funApply %}
 
# Parentheses
Q ->  LP AS RP  {% ([lp, as, rp]) => as %}
 
 
##### LEXICAL ANALYSIS #################################################
 
FLOAT -> %number  
PLUS -> "+"      {% function(d) {return ((a,b) => a+b); } %}
MINUS -> "-"     {% function(d) {return ((a,b) => a-b); } %}
MULT -> "*"      {% function(d) {return ((a,b) => a*b); } %}
DIV -> "/"       {% function(d) {return ((a,b) => a/b); } %}
EXP -> "^"       {% function(d) {return ((a,b) => Math.pow(a,b)); } %}
FACTORIAL -> "!"   {% d => fac %}
LP -> "("         {% Null %}
RP -> ")"         {% Null %}
SIN -> "sin"      {% d => Math.sin %}
COS -> "cos"      {% d => Math.cos %}
TAN -> "tan"      {% d => Math.tan %}
ASIN -> "asin"    {% d => Math.asin %}
ACOS -> "acos"    {% d => Math.acos %}
ATAN -> "atan"    {% d => Math.atan %}
PI -> %pi         {% d => Math.PI %}
EULER -> %e       {% d => Math.E  %}
SQRT -> %sqrt     {% d => Math.sqrt %}
LN -> %ln         {% d => Math.log %}

Lexical Analyzer

See examples/calculator/lex.js (opens in a new tab)

examples/calculator/lex.js

const { makeLexer, moo } = require("moo-ignore");
 
const Tokens = {
  space: { match: /\s+/, lineBreaks: true },
  number: { match: /\d+(?:\.\d+)?\b/, value: x => Number(x) },
  "+": "+",
  "-": "-",
  "*": "*",
  "/": "/",
  "^": "^",
  "!": "!",
  "(": "(",
  ")": ")",
  id: {
    match: /[a-z_][a-z_0-9]*/,
    type: moo.keywords({
      sin: "sin",
      cos: "cos",
      tan: "tan",
      asin: "asin",
      acos: "acos",
      atan: "atan",
      pi: "pi",
      e: "e",
      sqrt: "sqrt",
      ln: "ln"
    }),
  },
};
 
let lexer = makeLexer(Tokens, ["space"]);
 
module.exports = lexer;

main

See examples/calculator/use-arithmetic.js (opens in a new tab)

examples/calculator/use-arithmetic.js

const nearley = require("nearley");
const grammar = require("./grammar.js");
 
let s = process.argv[2] || ` 4 * pi / e`;
 
try {
  const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
  parser.feed(s);
  console.log(parser.results[0]) 
} catch (e) {
    console.log(e);
}

Execution

nearleyc arithmetic-lexer.ne -o grammar.js && node use-arithmetic.js

Performance of NearleyJS

This study by Toby Ho shows a case in which NearleyJS is 200 times slower than a Recursive Descendant Functional parser :disappointed:.

The Earley Algorithm

See the section The Earley Algorithm Explained

References

Repo ULL-ESIT-PL/learning-nearley (opens in a new tab)
Nearley.JS Home Page (opens in a new tab)
Nearley.JS GitHub repo (opens in a new tab) is a JS parser generator using the Early algorithm
- Nearley.js Grammar in Nearley.js (opens in a new tab)
Toby Ho Course on How to Make a Parser with Nearley.JS
Tokenizers for nearley.js (opens in a new tab)
fun-lang in Nearley.js (opens in a new tab) GitHub Repo by Toby Ho
- scripts (opens in a new tab) folder: How to run the compiler
- Toby Ho has published a more advanced version of the language as a npm package: play-lang (opens in a new tab)
Parsing in JavaScript: Tools and Libraries (opens in a new tab)

moo

Earley Parsers The Nearley algorithm