The Earley Algorithm Explained
Let G = (N, Σ, P, S) be a grammar and x = x₁ x₂ … xₙ an input sequence of n tokens.
A state is an element [A → α • β, j] where
- A → αβ is a production in the set of grammar productions P,
- the dot • represents the current position in that production, and
- the origin position j is the position in the input at which the matching of this production began: 0 ≤ j ≤ n.
The set of active states when the input prefix x₁ … x_k is being analyzed is called S_k.
- Input position 0 is the position prior to the input.
- Input position n is the position after accepting the n-th token.
Informally, input positions can be thought of as locations at token boundaries.
More precisely, S_k is the set of states [A → α • β, j] whose production rule A → αβ appears in a derivation from the start symbol S and such that α has already derived the input segment x_{j+1} … x_k.
Observe that it holds that j ≤ k.
::: tip Idea
So the idea is that if the state [A → α • β, j] is in S_k (with j ≤ k) it is because A → αβ was used in some partial parse tree that produces the prefix x₁ … x_k and α has derived x_{j+1} … x_k onto the input.
:::
A state is said to be finished when its current position is the last position of the right-hand side of the production, that is, when there is no symbol to the right of the dot • in the visual representation of the state.
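The pseudocode below manipulates states through operations such as isFinished() and nextElement(). A minimal sketch of such a state object could look as follows; the field names lhs, rhs, dot and origin are hypothetical choices for this illustration, not part of the original text:

```javascript
// Sketch of a state [A → α • β, j]; the dot is stored as an index into rhs.
class State {
  constructor(lhs, rhs, dot, origin) {
    this.lhs = lhs;        // A: the left-hand side non-terminal
    this.rhs = rhs;        // αβ: the right-hand side, an array of symbols
    this.dot = dot;        // index of • inside rhs
    this.origin = origin;  // j: input position where the match began
  }
  isFinished()  { return this.dot === this.rhs.length; } // nothing right of •
  nextElement() { return this.rhs[this.dot]; }           // symbol right of •
}
```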
The parser is seeded with S_0 consisting only of the top-level rule. The parser then repeatedly executes three operations:
- prediction,
- scanning, and
- completion.
These three operations are repeated until no new states can be added to the set S_k.
Prediction
For every state in S_k of the form [A → α • Y β, j] (where j is the start of the substring) such that Y is in the set of non-terminals N:
- add the states [Y → • γ, k] to S_k for all grammar productions Y → γ having Y on the left-hand side.
Duplicate states are not added to the state set, only new ones.
Scanning
If a is the next terminal in the input stream, then for every state in S_k of the form [A → α • a β, j], add the state [A → α a • β, j] to S_{k+1}.
Completion
For every state in S_k of the form [B → γ •, s], find all states in S_s of the form [A → α • B β, j] and add [A → α B • β, j] to S_k.
Acceptance
The algorithm accepts if a state of the form [S → α •, 0] ends up in S_n, where S is the start symbol of the grammar and n is the input length.
If there are several of these states, or one of them can be completed in more than one way, the input has more than one parse tree and the grammar is ambiguous.
If there are none, the input is rejected.
Algorithm Pseudocode
We augment the initial grammar with the rule γ → • S. We also assume that the lexical analysis has already been made and that the parameter tokens contains an array with the token objects of the input string.
```js
function EarleyParse(tokens, grammar) {
  function init(length) {
    let S = [];
    for (let k = 0; k <= length; k++) {
      S[k] = new Set();
    }
    return S;
  }

  let S = init(tokens.length);

  function PREDICTOR([A → α•Bβ, j], k) {
    grammar.rules(B).forEach((B → γ) => S[k].add([B → •γ, k]))
  }

  function SCANNER([A → α•aβ, j], k) {
    // "a" is a terminal like ID while tokens[k] is a token object
    // with a "type" and a value like "temp"
    if (tokens[k].type == a) S[k+1].add([A → αa•β, j])
  }

  function COMPLETER([B → γ•, s], k) {
    // find the states in S[s] waiting for a B and advance their dots
    S[s].forEach(([A → α•Bβ, j]) => S[k].add([A → αB•β, j]))
  }

  S[0].add([γ → •grammar.start, 0]);
  for (let k = 0; k <= tokens.length; k++) {
    S[k].forEach(state => {
      if (!state.isFinished())
        if (state.nextElement() in grammar.nonTerminal)
          PREDICTOR(state, k) // non-terminal after the dot
        else
          SCANNER(state, k)   // terminal after the dot
      else
        COMPLETER(state, k)   // state matches the pattern [B → γ•, s]
    })
  }
  return S;
}
```
In this code γ denotes a new auxiliary syntactic variable that produces the start symbol grammar.start.
The acceptance function would be:
```js
function ACCEPTS(S, tokens) {
  return S[tokens.length].has([γ → S•, 0])
}
```
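To make the pseudocode concrete, here is a small runnable recognizer that follows the same three operations. It is a sketch under stated assumptions, not the author's implementation: states are encoded as JSON strings so that the Set deduplicates them, the grammar is a plain object mapping each non-terminal to its list of right-hand sides, terminals are matched against token.type, and the grammar is assumed to have no ε-productions (handling nullable rules correctly needs extra care):

```javascript
// Minimal Earley recognizer sketch. A state [A → α • β, j] is encoded as
// the JSON string of [lhs, rhs, dot, origin] so that Sets deduplicate states.
// Assumption: grammar = { NonTerminal: [rhs, ...] } with no ε-productions.
function earleyRecognize(tokens, grammar, start) {
  const GAMMA = 'γ'; // auxiliary start symbol, as in the augmented grammar
  const S = Array.from({ length: tokens.length + 1 }, () => new Set());
  const enc = (lhs, rhs, dot, origin) => JSON.stringify([lhs, rhs, dot, origin]);
  S[0].add(enc(GAMMA, [start], 0, 0));
  for (let k = 0; k <= tokens.length; k++) {
    for (const item of S[k]) { // for...of also visits items added during the loop
      const [lhs, rhs, dot, origin] = JSON.parse(item);
      if (dot < rhs.length) {
        const next = rhs[dot];
        if (grammar[next]) { // PREDICTOR: a non-terminal follows the dot
          for (const prod of grammar[next]) S[k].add(enc(next, prod, 0, k));
        } else if (k < tokens.length && tokens[k].type === next) {
          S[k + 1].add(enc(lhs, rhs, dot + 1, origin)); // SCANNER
        }
      } else { // COMPLETER: [lhs → rhs •, origin] is finished
        for (const prev of S[origin]) {
          const [plhs, prhs, pdot, porigin] = JSON.parse(prev);
          if (pdot < prhs.length && prhs[pdot] === lhs) {
            S[k].add(enc(plhs, prhs, pdot + 1, porigin));
          }
        }
      }
    }
  }
  return S[tokens.length].has(enc(GAMMA, [start], 1, 0)); // ACCEPTS
}
```

For instance, with a grammar like the Wikipedia example used later in this page, earleyRecognize accepts 2+3*4 and rejects an incomplete input such as 2+.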
Ambiguous Grammar Example
Consider the following grammar:
```
➜ examples git:(main) ✗ cat nearley-algorithm-ambiguous.ne
e -> e "-" e {% ([L, _, R]) => `minus(${L}, ${R})` %}
   | "1"     {% id %}
```
We compile it:
```
➜ examples git:(main) ✗ npx nearleyc nearley-algorithm-ambiguous.ne -o nearley-algorithm-ambiguous.js
```
and then trace the execution:
```
➜ examples git:(main) ✗ npx nearley-test nearley-algorithm-ambiguous.js -i '1-1-1'
Table length: 6
Number of parses: 2
Parse Charts
Chart: 0
0: {e → • e "-" e}, from: 0
1: {e → • "1"}, from: 0
Chart: 1
0: {e → "1" • }, from: 0
1: {e → e • "-" e}, from: 0
Chart: 2
0: {e → e "-" • e}, from: 0
1: {e → • e "-" e}, from: 2
2: {e → • "1"}, from: 2
Chart: 3
0: {e → "1" • }, from: 2
1: {e → e • "-" e}, from: 2
2: {e → e "-" e • }, from: 0
3: {e → e • "-" e}, from: 0
Chart: 4
0: {e → e "-" • e}, from: 0
1: {e → e "-" • e}, from: 2
2: {e → • e "-" e}, from: 4
3: {e → • "1"}, from: 4
Chart: 5
0: {e → "1" • }, from: 4
1: {e → e • "-" e}, from: 4
2: {e → e "-" e • }, from: 2
3: {e → e "-" e • }, from: 0
4: {e → e • "-" e}, from: 2
5: {e → e "-" e • }, from: 0
6: {e → e • "-" e}, from: 0
7: {e → e • "-" e}, from: 0
Parse results:
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]
```
To get the full derivations from the chart, you need to search the last chart for a completed state of the start symbol e with origin 0, that is, one matching the pattern {e → ⋯ • }, from: 0.
```
Chart: 5
0: {e → "1" • }, from: 4
1: {e → e • "-" e}, from: 4
2: {e → e "-" e • }, from: 2
3: {e → e "-" e • }, from: 0
4: {e → e • "-" e}, from: 2
6: {e → e • "-" e}, from: 0
```
The fact that state 3 {e → e "-" e • }, from: 0 exists means that the state {e → e "-" • e}, from: 0 must also exist somewhere. In this case such a state appears in Charts 2 and 4.
The advance from {e → e "-" • e}, from: 0 (Chart 4) to {e → e "-" e • }, from: 0 in Chart 5 can be produced by either of the completed states 0 and 2 in Chart 5, namely 0: {e → "1" • }, from: 4 and 2: {e → e "-" e • }, from: 2, which leads to two different syntax trees for 1-1-1:

```
Parse results:
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]
```
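This search of the final chart can be sketched in code. The state representation below ({ lhs, rhs, dot, origin }) is a hypothetical one chosen for the example, not nearley's internal format:

```javascript
// Return the completed states of the start symbol that span the whole input:
// the dot is at the end of the right-hand side and the origin is 0.
function completedStartStates(lastChart, start) {
  return lastChart.filter(st =>
    st.lhs === start &&           // a production of the start symbol
    st.dot === st.rhs.length &&   // finished: no symbol right of the dot
    st.origin === 0               // the match began at input position 0
  );
}
```

Each state returned this way is the root of at least one derivation of the whole input; walking back through the states that completed it yields the individual parse trees.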
Second Example
Consider this example from Wikipedia:
```
# https://en.wikipedia.org/wiki/Earley_parser#Example
P -> S
S -> S "+" M
   | M
M -> M "*" T
   | T
T -> "1"
   | "2"
   | "3"
   | "4"
```
We compile it and trace the execution with nearley-test wikipedia.js -i '2+3*4':

```
➜ examples git:(main) ✗ nearleyc wikipedia.ne -o wikipedia.js
➜ examples git:(main) ✗ nearley-test wikipedia.js -i '2+3*4'
Table length: 6
Number of parses: 1
Parse Charts
Chart: 0
0: {P → • S}, from: 0
1: {S → • S "+" M}, from: 0
2: {S → • M}, from: 0
3: {M → • M "*" T}, from: 0
4: {M → • T}, from: 0
5: {T → • "1"}, from: 0
6: {T → • "2"}, from: 0
7: {T → • "3"}, from: 0
8: {T → • "4"}, from: 0
Chart: 1
0: {T → "2" • }, from: 0
1: {M → T • }, from: 0
2: {M → M • "*" T}, from: 0
3: {S → M • }, from: 0
4: {S → S • "+" M}, from: 0
5: {P → S • }, from: 0
Chart: 2
0: {S → S "+" • M}, from: 0
1: {M → • M "*" T}, from: 2
2: {M → • T}, from: 2
3: {T → • "1"}, from: 2
4: {T → • "2"}, from: 2
5: {T → • "3"}, from: 2
6: {T → • "4"}, from: 2
Chart: 3
0: {T → "3" • }, from: 2
1: {M → T • }, from: 2
2: {M → M • "*" T}, from: 2
3: {S → S "+" M • }, from: 0
4: {S → S • "+" M}, from: 0
5: {P → S • }, from: 0
Chart: 4
0: {M → M "*" • T}, from: 2
1: {T → • "1"}, from: 4
2: {T → • "2"}, from: 4
3: {T → • "3"}, from: 4
4: {T → • "4"}, from: 4
Chart: 5
0: {T → "4" • }, from: 4
1: {M → M "*" T • }, from: 2
2: {M → M • "*" T}, from: 2
3: {S → S "+" M • }, from: 0
4: {S → S • "+" M}, from: 0
5: {P → S • }, from: 0
Parse results:
[
  [
    [ [ [ [ '2' ] ] ], '+', [ [ [ '3' ] ], '*', [ '4' ] ] ]
  ]
]
```
The state {P → S • }, from: 0 represents a completed parse. This state also appears in charts 1 (corresponding to 2 • + 3 * 4) and 3 (2 + 3 • * 4), since the prefixes 2 and 2 + 3 are themselves complete sentences.
References
- Loup's blog: "Earley Parsing Explained"
- "Advanced: Optimizing Right Recursion", Loup's blog
- Joop Leo's optimizations for right recursion: the original paper
- Toby Ho's video "on the Earley Parsing Algorithm", where Toby Ho illustrates the algorithm step by step
- Video on "Earley Parsing" by Natalie Parde
- "Earley Parser" in the Wikipedia
- Marpa: a Perl parser generator based on the Earley Algorithm. See "What is the Marpa algorithm?" by Jeffrey Kegler, November 18, 2011, in Jeffrey Kegler's Blog
Other Earley Parsers in JS
- NearleyJS: a parser generator based on the Earley Algorithm. See section NearleyJS
- lagodiuk/earley-parser-js: see the repo lagodiuk/earley-parser-js. The implementation is in the file earley-oop.js. A fork of this repo for ULL-ESIT-PL is in https://github.com/ULL-ESIT-PL/earley-parser-js
Others
See also npm packages