Lexical Analysis with Moo and Moo-ignore
By default, nearley splits the input into a stream of characters. This is called scannerless parsing.
Moo
Lexing with Moo
The @lexer
directive instructs Nearley to use a lexer you've defined inside a
Javascript block in your grammar.
nearley supports and recommends Moo (opens in a new tab), a
super-fast lexer. Construct a lexer using moo.compile
.
When using a lexer, there are two ways to match tokens:
-
Use
%token
to match a token with typetoken
.line -> words %newline
-
Use
"foo"
to match a token with textfoo
.This is convenient for matching keywords:
ifStatement -> "if" condition "then" block
Here is an example of a simple grammar:
@{%
const moo = require("moo");
const lexer = moo.compile({
ws: /[ \t]+/,
number: /[0-9]+/,
word: { match: /[a-z]+/, type: moo.keywords({ times: "x" }) },
times: /\*/
});
%}
# Pass your lexer object using the @lexer option:
@lexer lexer
expr -> multiplication {% id %} | trig {% id %}
# Use %token to match any token of that type instead of "token":
multiplication -> %number %ws %times %ws %number {% ([first, , , , second]) => first * second %}
# Literal strings now match tokens with that text:
trig -> "sin" %ws %number {% ([, , x]) => Math.sin(x) %}
Compilation:
✗ nearleyc nearley-with-moo-example.ne -o nearley-with-moo-example.js
and execution:
✗ nearley-test -qi '2 * 3' nearley-with-moo-example.js
[ 6 ]
}
✗ nearley-test -qi 'sin 3' nearley-with-moo-example.js
[ 0.1411200080598672 ]
Note how the management of white spaces is cumbersome and leads to errors:
✗ nearley-test -qi 'sin 3 ' nearley-with-moo-example.js
Error: Syntax error at line 1 col 6
Have a look at the Moo documentation (opens in a new tab) to learn more about writing a tokenizer.
You use the parser as usual: call parser.feed(data)
, and nearley will give
you the parsed results in return.
Writing a Custom lexer for Nearley
nearley recommends using a moo (opens in a new tab)-based lexer. However, you can use any lexer that conforms to the following interface:
next()
returns a token object, which could have fields for line number, etc. Importantly, a token object must have avalue
attribute.save()
returns an info object that describes the current state of the lexer. nearley places no restrictions on this object.reset(chunk, info)
sets the internal buffer of the lexer tochunk
, and restores its state to a state returned bysave()
.formatError(token)
returns a string with an error message describing a parse error at that token (for example, the string might contain the line and column where the error was found).
Note: if you are searching for a lexer that allows indentation-aware grammars (like in Python), you can still use moo. See this example (opens in a new tab) or the moo-indentation-lexer (opens in a new tab) module.
moo: Examples
See repo ULL-ESIT-PL/moo-examples (opens in a new tab) for examples of use of Moo.
A moo lexer object is a Generator
A moo lexer object is a Generator (opens in a new tab), you can use filter()
and map()
which are built-in to JavaScript.
See moo issue: no-context/moo/issues/156 (opens in a new tab)
const moo = require('moo')
const lex = moo.compile({
// If one rule is /u then all must be
ws: { match: /\p{White_Space}+/u, lineBreaks: true },
word: /\p{XID_Start}\p{XID_Continue}*/u,
op: moo.fallback,
});
ID_Start
characters are derived from the Unicode General_Category
. In set notation:
/[\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]/u
ID_Continue characters
in set notation is:
/[\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]/
See https://unicode.org/reports/tr31/ (opens in a new tab).
The expression moo.fallback
matches anything else.
I believe is similar to:
{ match: /(?:.|\n)/u, lineBreaks: true}
Observe how we feed the lexer using the reset
method.
Using the spread operator on the returned generator we get an array with the token
objects:
const result = [...lex.reset('while ( a < 3 ) { a += 1; }')];
Something like:
[
{
type: 'word',
value: 'while',
text: 'while',
toString: [Function: tokenToString],
offset: 0,
lineBreaks: 0,
line: 1,
col: 1
},
{
type: 'ws',
value: ' ',
text: ' ',
toString: [Function: tokenToString],
offset: 5,
lineBreaks: 0,
line: 1,
col: 6
},
... etc.
]
We can filter the array:
let filtered = result.filter(t => t.type !== 'ws');
console.log(filtered.map(function (t) { return { type: t.type, value: t.value } }) );
No longer white spaces:
[
{ type: 'word', value: 'while' }, { type: 'op', value: '(' },
{ type: 'word', value: 'a' }, { type: 'op', value: '<' },
{ type: 'op', value: '3' }, { type: 'op', value: ')' },
{ type: 'op', value: '{' }, { type: 'word', value: 'a' },
{ type: 'op', value: '+=' }, { type: 'op', value: '1;' },
{ type: 'op', value: '}' }
]
See file ULL-ESIT-PL/moo-examples/skip-spaces.js (opens in a new tab) for the full code.
Not a Solution for Nearley.JS
The trick above does not work to build a lexer with token-skip for Nearley.js. Regrettably, Nearley.JS imposes the restrictions specified in section Writing a Custom lexer for Nearley and requires a Moo compatible lexer. That means we have to wrap the returned array in a lexer complaining with a Moo-like API!
moo-ignore
moo-ignore: Skipping Tokens in Moo
Use Moo-ignore (opens in a new tab).
Moo-ignore (🐄) is a wrapper around the moo (opens in a new tab) tokenizer/lexer generator that provides a nearley.js (opens in a new tab) compatible lexer with the capacity to ignore specified tokens.
You can use it in your Nearley.js program and ignore some tokens like white spaces and comments:
@{%
const tokens = require("./tokens");
const { makeLexer } = require("../index.js");
let lexer = makeLexer(tokens);
lexer.ignore("ws", "comment");
const getType = ([t]) => t.type;
%}
@lexer lexer
S -> FUN LP name COMMA name COMMA name RP
DO
DO END SEMICOLON
DO END
END
END
name -> %identifier {% getType %}
COMMA -> "," {% getType %}
LP -> "(" {% getType %}
RP -> ")" {% getType %}
END -> %end {% getType %}
DO -> %dolua {% getType %}
FUN -> %fun {% getType %}
SEMICOLON -> ";" {% getType %}
Alternatively, you can set to ignore some tokens in the call to makeLexer
:
let lexer = makeLexer(tokens, ["ws", "comment"]);
Or you can also combine both ways:
let lexer = makeLexer(tokens, ["ws"]);
lexer.ignore("comment");
For sake of completeness, here is the contents of the file tokens.js
we have used in the former code:
const { moo } = require("moo-ignore");
module.exports = {
ws: { match: /\s+/, lineBreaks: true },
comment: /#[^\n]*/,
lp: "(",
rp: ")",
comma: ",",
semicolon: ";",
identifier: {
match: /[a-z_][a-z_0-9]*/,
type: moo.keywords({
fun: "fun",
end: "end",
dolua: "do"
})
}
}
See the tests (opens in a new tab) folder in this distribution for more examples of use. Here is a program that tests the former example:
const nearley = require("nearley");
const grammar = require("./test-grammar.js");
let s = `
fun (id, idtwo, idthree)
do #hello
do end;
do end # another comment
end
end`;
try {
const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
parser.feed(s);
console.log(parser.results[0]) /* [ 'fun', 'lp', 'identifier', 'comma',
'identifier', 'comma', 'identifier', 'rp',
'dolua', 'dolua', 'end', 'semicolon',
'dolua', 'end', 'end', 'end' */
} catch (e) {
console.log(e);
}
moo-ignore: The eof option. Emitting a token to signal the End Of File
The last argument of makeLexer
is an object with configuration options:
let lexer = makeLexer(Tokens, [ tokens, to, ignore ], { options });
Currently, the only option
supported in this version is eof
.
Remember that lexers generated by moo emit undefined
when the end of the input is reached. This option changes this behavior.
If the option { eof : true }
is specified, and a token with the name EOF: "termination string"
appears in the tokens specification, moo-ignore
will concat the "termination string"
at the end of the input stream.
const { makeLexer } = require("moo-ignore");
const Tokens = {
EOF: "__EOF__",
WHITES: { match: /\s+/, lineBreaks: true },
/* etc. */
};
let lexer = makeLexer(Tokens, ["WHITES"], { eof: true });
The generated lexer will emit this EOF
token when the end of the input is reached.
Inside your grammar you'll have to explicit the use of the EOF
token. Something like this:
@{%
const { lexer } = require('./lex.js');
%}
@lexer lexer
program -> expression %EOF {% id %}
# ... other rules