The Earley Algorithm Explained

Let $G = (\Sigma, V, P, Start)$ be a grammar and let $x = a_{1}a_{2} \ldots a_{n}$ be an input sequence of tokens.

A state is an element $(X \longrightarrow \alpha \bullet \beta, j)$ where

  1. $X \longrightarrow \alpha \beta \in P$ is a production in the set of grammar productions $P$,
  2. The dot β€’ represents the current position in that production, and
  3. The origin position $j \in \{0 \ldots n\}$ is the position in the input at which the matching of this production $X \longrightarrow \alpha \beta$ began: $x = a_{1}a_{2} \ldots a_{j} \ldots a_{n}$.
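
For instance, for the grammar of the second example below (which has the production S β†’ S "+" M) and the input 2+3*4, the state {S β†’ S "+" ● M}, from: 0 belongs to $S_2$: the matching of the production began at origin 0, its prefix S "+" has already derived the first two tokens 2 +, and an M is still expected to the right of the dot.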

The set of active states when the input prefix $a_1 \ldots a_k$ is being analyzed is called $S_k$.

  • Input position 0 is the position prior to the input.
  • Input position $n$ is the position after accepting the $n$-th token. Informally, input positions can be thought of as locations at token boundaries.

More precisely, $S_k$ is the set of states $(X \longrightarrow \alpha \bullet \beta, j)$ whose production rule $X \longrightarrow \alpha \beta$ appears in a derivation from the $Start$ symbol

$$Start \overset{\star}{\Longrightarrow} a_{1}a_{2} \ldots a_{j-1} X \omega \underset{X \longrightarrow \alpha \beta}{\Longrightarrow} a_{1}a_{2} \ldots a_{j-1} \alpha \beta \omega \overset{\star}{\Longrightarrow} a_{1}a_{2} \ldots a_j \ldots a_{k} \beta \omega$$

Observe that it holds that $\alpha \overset{\star}{\Longrightarrow} a_{j} \ldots a_k$.

::: tip Idea
So the idea is that a state $(X \longrightarrow \alpha \bullet \beta, j)$ is in $S_k$ (with $j \le k$) because the production $X \longrightarrow \alpha\beta$ was used in some partial parse tree that produces $a_1 \ldots a_j \ldots a_k$ and $\alpha$ has already derived $a_j \ldots a_k$.
:::

A state is said to be finished when its current position is the last position of the right-hand side of the production, $(Y \longrightarrow \gamma \bullet, j)$, that is, when there is no symbol to the right of the dot β€’ in the visual representation of the state.
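
For instance, in the second example below the state {T β†’ "2" ● }, from: 0 of Chart 1 is finished, while {S β†’ S ● "+" M}, from: 0 in the same chart is not.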

The parser is seeded with $S_0$ consisting only of the top-level rule. The parser then repeatedly executes three operations:

  1. prediction,
  2. scanning, and
  3. completion.

These three operations are repeated until no new states can be added to the set $S_k$.

Prediction

  1. $\forall s = (X \longrightarrow \alpha \bullet Y \beta, j) \in S_k$ (where $j$ is the origin position) such that $Y \in V$ is a non-terminal,
  2. add the states $(Y \longrightarrow \bullet \gamma, k)$ to $S_k$ for all the grammar productions $Y \longrightarrow \gamma$ having $Y$ on the left-hand side.

Duplicate states are not added to the state set, only new ones.
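
For example, in the second example below the state {P β†’ ● S}, from: 0 of Chart 0 has the non-terminal S to the right of the dot, so the predictor adds {S β†’ ● S "+" M}, from: 0 and {S β†’ ● M}, from: 0; predicting in turn on M and on T produces the remaining entries of Chart 0.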

Scanning

If $a_{k} \in \Sigma$ is the next terminal in the input stream, then $\forall s \in S_{k-1}$ of the form $s = (X \longrightarrow \alpha \bullet a_{k} \beta, j)$, add the state $(X \longrightarrow \alpha a_{k} \bullet \beta, j)$ to $S_{k}$.
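
For example, when the input 2+3*4 of the second example below is parsed, the state {T β†’ ● "2"}, from: 0 of Chart 0 has the terminal "2" to the right of the dot and the first token is 2, so the scanner adds {T β†’ "2" ● }, from: 0 to $S_1$ (entry 0 of Chart 1).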

Completion

$\forall s \in S_{k}$ of the form $s = (Y \longrightarrow \gamma \bullet, j)$, find all states in $S_j$ of the form $(X \longrightarrow \alpha \bullet Y \beta, i)$ and add $(X \longrightarrow \alpha Y \bullet \beta, i)$ to $S_k$.
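
For example, in the second example below the finished state {T β†’ "2" ● }, from: 0 of Chart 1 sends the completer back to $S_0$, where {M β†’ ● T}, from: 0 is waiting for a T; its advanced copy {M β†’ T ● }, from: 0 is added to $S_1$, which in turn completes {S β†’ ● M}, from: 0 into {S β†’ M ● }, from: 0, and so on up to {P β†’ S ● }, from: 0.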

Acceptance

The algorithm accepts if a state of the form $(Start \longrightarrow \gamma \bullet, 0)$ ends up in $S_n$, where $Start$ is the start symbol of the grammar $G$ and $n$ is the input length.

If there are several of these states, it means the grammar is ambiguous.

If there are none, the input is rejected.
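
In the second example below, for instance, once the five tokens of 2+3*4 have been read, Chart 5 contains the state {P β†’ S ● }, from: 0, so the input is accepted; since it is the only completed start state in that chart, there is exactly one parse. In the ambiguous grammar example, instead, the last chart contains the completed start state {e β†’ e "-" e ● }, from: 0 twice, one for each of the two parses of 1-1-1.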

Algorithm Pseudocode

We augment the initial grammar with the rule Ξ³ β†’ S and seed the chart with the initial state Ξ³ β†’ β€’S.

We also assume that lexical analysis has already been performed and that the parameter tokens contains an array with the token objects of the input string.
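
For instance, the tokens array could look like the following sketch. Only the type and value fields are assumed here, since they are the ones the SCANNER function below relies on:

// Hypothetical token objects for an input like "temp = 4":
// only the "type" and "value" fields are assumed by the pseudocode below
let tokens = [
  { type: "ID",  value: "temp" },
  { type: "=",   value: "="    },
  { type: "NUM", value: "4"    }
];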

function EarleyParse(tokens, grammar) {
 
    function init(length) {
        // One state set per input position: S[0] ... S[length]
        let S = [];
        for(let k = 0; k <= length; k++) {
          S[k] = new Set();
        }
        return S;
    }
 
    let S = init(tokens.length);
 
    function PREDICTOR([A β†’ Ξ±β€’BΞ², j], k) {
      // For every production B β†’ Ξ³ add the state [B β†’ β€’Ξ³, k], which starts at k
      grammar.rules(B).forEach((B β†’ Ξ³) => S[k].add([B β†’ β€’Ξ³, k]))
    }
 
    function SCANNER([A β†’ Ξ±β€’aΞ², j], k) {
      // "a" is a terminal like ID while tokens[k] is a token object with a "type" and a value like "temp"
      if (k < tokens.length && tokens[k].type == a) S[k+1].add([A β†’ Ξ±aβ€’Ξ², j])
    }
 
    function COMPLETER([B β†’ Ξ³β€’, s], k) {
      // Advance every state in S[s] that was waiting for a B right after its dot
      S[s].forEach(([A β†’ Ξ±β€’BΞ², j]) => S[k].add([A β†’ Ξ±Bβ€’Ξ², j]))
    }
 
    S[0].add([Ξ³ β†’ β€’grammar.start, 0]);  // seed the chart with the augmented rule
    for(let k = 0; k <= tokens.length; k++) {
        S[k].forEach(state => {  
            if (!state.isFinished()) 
                if (state.nextElement() in grammar.nonTerminal) 
                    PREDICTOR(state, k)         // non-terminal
                else
                    SCANNER(state, k)             // terminal
            else 
                COMPLETER(state, k)  // state matches the pattern [B β†’ Ξ³β€’, s]     
        })
    }
    return S;
}

In this code, Ξ³ denotes a new auxiliary syntactic variable that produces the start symbol grammar.start.

The acceptance function would be:

function ACCEPTS(S, tokens) {
  return S[tokens.length].has([Ξ³ β†’ Sβ€’, 0])
}
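
Putting both functions together, the overall use could be sketched as follows (a minimal sketch: the tokens array is assumed to come from the lexical analysis and grammar from the grammar description, neither of which is shown here):

// Hypothetical driver: "tokens" and "grammar" are built elsewhere
let S = EarleyParse(tokens, grammar);   // fills the sets S[0] ... S[tokens.length]
if (ACCEPTS(S, tokens)) console.log("The input is accepted");
else console.log("The input is rejected");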

Ambiguous Grammar Example

Consider the following grammar:

➜  examples git:(main) βœ— cat nearley-algorithm-ambiguous.ne 
e -> e "-" e {% ([L, _, R]) => `minus(${L}, ${R})` %}
   | "1"     {% id %}

We compile it:

➜  examples git:(main) βœ— npx nearleyc nearley-algorithm-ambiguous.ne -o nearley-algorithm-ambiguous.js

and trace the execution:

➜  examples git:(main) βœ— npx nearley-test nearley-algorithm-ambiguous.js -i '1-1-1'  
Table length: 6
Number of parses: 2
Parse Charts
Chart: 0
0: {e β†’  ● e "-" e}, from: 0
1: {e β†’  ● "1"}, from: 0
 
Chart: 1
0: {e β†’ "1" ● }, from: 0
1: {e β†’ e ● "-" e}, from: 0
 
Chart: 2
0: {e β†’ e "-" ● e}, from: 0
1: {e β†’  ● e "-" e}, from: 2
2: {e β†’  ● "1"}, from: 2
 
Chart: 3
0: {e β†’ "1" ● }, from: 2
1: {e β†’ e ● "-" e}, from: 2
2: {e β†’ e "-" e ● }, from: 0
3: {e β†’ e ● "-" e}, from: 0
 
Chart: 4
0: {e β†’ e "-" ● e}, from: 0
1: {e β†’ e "-" ● e}, from: 2
2: {e β†’  ● e "-" e}, from: 4
3: {e β†’  ● "1"}, from: 4
 
Chart: 5
0: {e β†’ "1" ● }, from: 4
1: {e β†’ e ● "-" e}, from: 4
2: {e β†’ e "-" e ● }, from: 2
3: {e β†’ e "-" e ● }, from: 0
4: {e β†’ e ● "-" e}, from: 2
5: {e β†’ e "-" e ● }, from: 0
6: {e β†’ e ● "-" e}, from: 0
7: {e β†’ e ● "-" e}, from: 0
 
 
Parse results: 
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]

To get the full derivations from the chart, you need to search for a completed state matching the pattern e β†’ Ξ± ●, from: 0 in the last chart.

Chart: 5
0: {e β†’ "1" ● }, from: 4
1: {e β†’ e ● "-" e}, from: 4
2: {e β†’ e "-" e ● }, from: 2
3: {e β†’ e "-" e ● }, from: 0
4: {e β†’ e ● "-" e}, from: 2
6: {e β†’ e ● "-" e}, from: 0

The fact that state 3 {e β†’ e "-" e ● }, from: 0 exists means that the state {e β†’ e "-" ● e }, from: 0 must also exist somewhere. In this case such a state appears in Charts 2 and 4.

The advance to {e β†’ e "-" e ● }, from: 0 in Chart 5 can be produced in two ways: from the copy of {e β†’ e "-" ● e }, from: 0 in Chart 4, completed by state 0 of Chart 5, {e β†’ "1" ● }, from: 4, or from the copy in Chart 2, completed by state 2 of Chart 5, {e β†’ e "-" e ● }, from: 2. This leads to two different syntax trees for 1-1-1:

Parse results: 
[ 'minus(minus(1, 1), 1)', 'minus(1, minus(1, 1))' ]

Second Example

Consider this example from Wikipedia:

# https://en.wikipedia.org/wiki/Earley_parser#Example

P -> S

S -> S "+" M
   | M

M -> M "*" T
   | T

T -> "1"
   | "2"
   | "3"
   | "4"

We compile it and execute it with nearley-test wikipedia.js -i '2+3*4':

➜  examples git:(main) βœ— nearleyc wikipedia.ne -o wikipedia.js 
➜  examples git:(main) βœ— nearley-test wikipedia.js -i '2+3*4'
Table length: 6
Number of parses: 1
Parse Charts
Chart: 0
0: {P β†’  ● S}, from: 0
1: {S β†’  ● S "+" M}, from: 0
2: {S β†’  ● M}, from: 0
3: {M β†’  ● M "*" T}, from: 0
4: {M β†’  ● T}, from: 0
5: {T β†’  ● "1"}, from: 0
6: {T β†’  ● "2"}, from: 0
7: {T β†’  ● "3"}, from: 0
8: {T β†’  ● "4"}, from: 0
 
Chart: 1
0: {T β†’ "2" ● }, from: 0
1: {M β†’ T ● }, from: 0
2: {M β†’ M ● "*" T}, from: 0
3: {S β†’ M ● }, from: 0
4: {S β†’ S ● "+" M}, from: 0
5: {P β†’ S ● }, from: 0
 
Chart: 2
0: {S β†’ S "+" ● M}, from: 0
1: {M β†’  ● M "*" T}, from: 2
2: {M β†’  ● T}, from: 2
3: {T β†’  ● "1"}, from: 2
4: {T β†’  ● "2"}, from: 2
5: {T β†’  ● "3"}, from: 2
6: {T β†’  ● "4"}, from: 2
 
Chart: 3
0: {T β†’ "3" ● }, from: 2
1: {M β†’ T ● }, from: 2
2: {M β†’ M ● "*" T}, from: 2
3: {S β†’ S "+" M ● }, from: 0
4: {S β†’ S ● "+" M}, from: 0
5: {P β†’ S ● }, from: 0
 
Chart: 4
0: {M β†’ M "*" ● T}, from: 2
1: {T β†’  ● "1"}, from: 4
2: {T β†’  ● "2"}, from: 4
3: {T β†’  ● "3"}, from: 4
4: {T β†’  ● "4"}, from: 4
 
Chart: 5
0: {T β†’ "4" ● }, from: 4
1: {M β†’ M "*" T ● }, from: 2
2: {M β†’ M ● "*" T}, from: 2
3: {S β†’ S "+" M ● }, from: 0
4: {S β†’ S ● "+" M}, from: 0
5: {P β†’ S ● }, from: 0
 
 
Parse results: 
[
  [
    [ [ [ [ '2' ] ] ], '+', [ [ [ '3' ] ], '*', [ '4' ] ] ]
  ]
]

The state {P β†’ S ● }, from: 0 represents a completed parse. This state also appears in Charts 1 (corresponding to 2 β€’ + 3 * 4) and 3 (2 + 3 β€’ * 4), since the prefixes 2 and 2 + 3 are themselves complete sentences of the grammar.

Loup Blog "Earley Parsing Explained"

Advanced: Optimizing Right Recursion. Loup blog

Toby Ho video "on the Earley Parsing Algorithm"

Toby Ho illustrates the algorithm step by step.

Video on "Earley Parsing" by Natalie Parde

"Early Parser" in the Wikipedia

Marpa: a Perl parser generator based on the Earley Algorithm

See What is the Marpa algorithm? by Jeffrey Kegler, November 18, 2011, in Jeffrey Kegler's Blog

Other Earley Parsers in JS

NearleyJS: A parser generator based on the Earley Algorithm

See section NearleyJS

lagodiuk/earley-parser-js

See the repo lagodiuk/earley-parser-js.

The implementation is in the file earley-oop.js.

A fork of this repo for ULL-ESIT-PL is at https://github.com/ULL-ESIT-PL/earley-parser-js

Others

See also npm packages