How to Write a Lexical Analyzer Generator

lastIndex

EJS: The lastIndex property

Regular expression objects have properties.

One such property is source, which contains the string that expression was created from.

Another property is lastIndex, which controls, in some limited circumstances, where the next match will start.

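If your regular expression uses the g flag, you can call the exec method repeatedly to find successive matches in the same string. Each call starts searching at the index stored in the regular expression's lastIndex property, and a successful match sets lastIndex to the position just after that match. When no further match is found, exec returns null and lastIndex is reset to 0.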

      > re = /d(b+)(d)/ig
      /d(b+)(d)/gi
      > z = "dBdxdbbdzdbd"
      'dBdxdbbdzdbd'
      > result = re.exec(z)
      [ 'dBd', 'B', 'd', index: 0, input: 'dBdxdbbdzdbd' ]
      > re.lastIndex
      3
      > result = re.exec(z)
      [ 'dbbd', 'bb', 'd', index: 4, input: 'dBdxdbbdzdbd' ]
      > re.lastIndex
      8
      > result = re.exec(z)
      [ 'dbd', 'b', 'd', index: 9, input: 'dBdxdbbdzdbd' ]
      > re.lastIndex
      12
      > z.length
      12
      > result = re.exec(z)
      null

Thus, we can write a loop like this:

let input = "A string with 3 numbers in it... 42 and 88.";
let number = /\b\d+\b/g;
let match;
while (match = number.exec(input)) {
  console.log("Found", match[0], "at", match.index);
}
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40
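The loop above can be packaged as a reusable function. The following is a minimal sketch: the name findAll and the shape of the returned objects are assumptions made for illustration. It resets lastIndex before starting, because a regexp object that was used earlier may still carry a non-zero value, and it steps past empty matches, which otherwise would not advance lastIndex.

```javascript
// Hypothetical helper: collect every match of a global regexp in `input`.
function findAll(re, input) {
  if (!re.global) throw new Error("findAll expects a regexp with the g flag");
  re.lastIndex = 0; // start from the beginning, whatever happened before
  const found = [];
  let match;
  while ((match = re.exec(input)) !== null) {
    found.push({ text: match[0], index: match.index });
    // An empty match leaves lastIndex unchanged; step forward to avoid looping forever.
    if (match[0].length === 0) re.lastIndex++;
  }
  return found;
}

const numbers = findAll(/\b\d+\b/g, "A string with 3 numbers in it... 42 and 88.");
console.log(numbers);
// → [ { text: '3', index: 14 }, { text: '42', index: 33 }, { text: '88', index: 40 } ]
```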

Sticky flag "y", searching at position

Regular expressions can have options, which are written after the closing slash.

The g option makes the expression global, which, among other things, causes the replace method to replace all instances instead of just the first.
The y option makes it sticky, which means that it will not search ahead and skip part of the string when looking for a match.

The difference between the global and the sticky options is that, when sticky is enabled, the match will succeed only if it starts directly at lastIndex, whereas with global, it will search ahead for a position where a match can start.

let global = /abc/g;
console.log(global.exec("xyz abc"));
// → ["abc"]
let sticky = /abc/y;
console.log(sticky.exec("xyz abc"));
// → null

let str = 'let varName = "value"';
 
let regexp = /\w+/y;
 
regexp.lastIndex = 3;
console.log( regexp.exec(str) ); // null (there's a space at position 3, not a word)
 
regexp.lastIndex = 4;
console.log( regexp.exec(str) ); // varName (word at position 4)
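The "match exactly at this position" idea can be wrapped in a small function, which is the basic operation a lexer performs at each step. This is only a sketch: matchAt is a hypothetical name, and copying the regexp with the y flag added is one design choice among several (it keeps the caller's regexp object untouched).

```javascript
// Hypothetical helper: try to match `re` exactly at position `pos` of `str`.
function matchAt(re, str, pos) {
  // Build a copy with the y flag added, so the caller's object is not mutated.
  const flags = re.flags.includes("y") ? re.flags : re.flags + "y";
  const sticky = new RegExp(re.source, flags);
  sticky.lastIndex = pos;
  const match = sticky.exec(str);
  return match ? match[0] : null;
}

const code = 'let varName = "value"';
console.log(matchAt(/\w+/, code, 3)); // null: position 3 holds a space
console.log(matchAt(/\w+/, code, 4)); // varName
console.log(matchAt(/\w+/, code, 0)); // let
```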

See also:

Sticky flag "y", searching at position

Named groups

Remembering groups by their numbers can be hard. An option is to give names to parentheses.

That's done by putting ?<name> immediately after the opening parenthesis of the capturing group, which is closed with ) as usual. For example, let's look for a date in the format "year-month-day":

let dateRegexp = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
let str = "2019-04-30";
 
let groups = str.match(dateRegexp).groups;
 
console.log(groups.year); // 2019
console.log(groups.month); // 04
console.log(groups.day); // 30

As you can see, the groups reside in the .groups property of the match.

To look for all dates, we can add the g flag to the pattern.

> dateRegexp = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g
> str = "2019-10-30 2020-01-01";
'2019-10-30 2020-01-01'

We can use matchAll to obtain full matches, together with groups.

The matchAll() method returns an iterator of all results matching a string against a regular expression, including capturing groups.

> results = str.matchAll(dateRegexp)
Object [RegExp String Iterator] {}
> for(let result of results) {
...   let {year, month, day} = result.groups;
...   console.log(`${day}.${month}.${year}`); }
30.10.2019
01.01.2020
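When the matches are needed as data rather than printed, the iterator can be turned into an array. A minimal sketch, in which the variable names and the shape of the resulting objects are chosen for illustration only. Two facts worth keeping in mind: matchAll throws a TypeError if the regexp lacks the g flag, and it works on an internal copy of the regexp, so the lastIndex of the original is left alone.

```javascript
const dateRegexp = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
const text = "2019-10-30 2020-01-01";

// Array.from consumes the iterator; the mapping function runs once per match.
const dates = Array.from(text.matchAll(dateRegexp), (m) => ({
  ...m.groups,    // year, month, day
  index: m.index, // where the match starts in `text`
}));

console.log(dates);
// → [ { year: '2019', month: '10', day: '30', index: 0 },
//     { year: '2020', month: '01', day: '01', index: 11 } ]
console.log(dateRegexp.lastIndex); // 0
```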

Hints for building buildLexer

The following code combines the sticky flag with named groups, which is the key idea for solving this lab:

const str = 'const varName = "value"';
console.log(str);
 
// Each token regexp wraps its pattern in a named group carrying the token's name.
const SPACE = /(?<SPACE>\s+)/;
const RESERVEDWORD = /(?<RESERVEDWORD>\b(const|let)\b)/;
const ID = /(?<ID>([a-zA-Z_]\w*))/;          // \w* so one-letter identifiers match too
const STRING = /(?<STRING>"([^\\"]|\\.)*")/; // \\. accepts any escaped character
const OP = /(?<OP>[+*\/=-])/;
 
// Order matters: RESERVEDWORD must come before ID, or "const" would be an ID.
const tokens = [
  ['SPACE', SPACE], ['RESERVEDWORD', RESERVEDWORD], ['ID', ID], 
  ['STRING', STRING], ['OP', OP] 
];
 
const tokenNames = tokens.map(t => t[0]);
const tokenRegs  = tokens.map(t => t[1]);
 
// Join the sources with | into one sticky regexp that tries every token at lastIndex.
const buildOrRegexp = (regexps) => {
  const sources = regexps.map(r => r.source);
  const union = sources.join('|');
  // console.log(union);
  return new RegExp(union, 'y');
};
 
const regexp = buildOrRegexp(tokenRegs);
 
// Only the named group of the alternative that matched is defined.
const getToken = (m) => tokenNames.find(tn => typeof m[tn] !== 'undefined');
 
let match;
while (match = regexp.exec(str)) {
  //console.log(match.groups);
  let t = getToken(match.groups);
  console.log(`Found token '${t}' with value '${match.groups[t]}'`);
}

The lab consists of writing a function buildLexer that receives as its argument a tokens array like the one in the example and returns a function that performs the lexical analysis for those tokens.
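One possible shape for such a function is sketched below. Everything beyond the ideas in the example is an assumption: the returned token objects of the form { type, value }, the decision to throw a SyntaxError on unrecognized input, and the token regexps themselves, which are a variant of the ones in the example. The position is tracked in a separate variable because a failed sticky match resets lastIndex to 0, which would lose the place where the error happened.

```javascript
// A sketch of buildLexer; the token shape and error handling are assumptions.
function buildLexer(tokens) {
  const tokenNames = tokens.map(([name]) => name);
  const union = tokens.map(([, re]) => re.source).join("|");
  const regexp = new RegExp(union, "y");

  return function lexer(input) {
    const result = [];
    let pos = 0;
    while (pos < input.length) {
      regexp.lastIndex = pos; // match must start exactly here (sticky)
      const match = regexp.exec(input);
      if (match === null || match[0].length === 0) {
        throw new SyntaxError(`Unexpected input at position ${pos}: '${input[pos]}'`);
      }
      // Only the named group of the alternative that matched is defined.
      const type = tokenNames.find((name) => match.groups[name] !== undefined);
      result.push({ type, value: match.groups[type] });
      pos = regexp.lastIndex;
    }
    return result;
  };
}

const lexer = buildLexer([
  ["SPACE", /(?<SPACE>\s+)/],
  ["RESERVEDWORD", /(?<RESERVEDWORD>\b(const|let)\b)/],
  ["ID", /(?<ID>[a-zA-Z_]\w*)/],
  ["STRING", /(?<STRING>"([^\\"]|\\.)*")/],
  ["OP", /(?<OP>[+*\/=-])/],
]);

const lexed = lexer('const varName = "value"');
console.log(lexed.map((t) => t.type).join(" "));
// → RESERVEDWORD SPACE ID SPACE OP SPACE STRING
```

Calling lexer("a ? b") with these tokens throws a SyntaxError reporting position 2, since no token pattern matches the question mark.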

How to get the name of a named RegExp

Consider a regular expression consisting of a named group:

> NUM = /(?<NUM>\d+)/
/(?<NUM>\d+)/

The following regular expression has two capturing parentheses: the first captures the name and the second captures the rest of the regexp:

> getName = /^[(][?]<(\w+)>(.+)[)]$/
/^[(][?]<(\w+)>(.+)[)]$/

Given a regexp Re, the property Re.source contains the string that defines the regular expression.

So when we evaluate getName.exec(NUM.source) we get:

> r = getName.exec(NUM.source)
[
  '(?<NUM>\\d+)',
  'NUM',
  '\\d+',
  index: 0,
  input: '(?<NUM>\\d+)',
  groups: undefined
]

The first parenthesis matches the name and the second matches the body of the regexp:

> r[1]
'NUM'
> r[2]
'\\d+'

For our lexical analyzer generator to work, each of the regexps provided must have exactly one named parenthesis. We can check whether the body of the regexp in r[2] contains further named parentheses with something like this:

> OP = /(?<OP>(?<OP2>[+*\/=-]))/
/(?<OP>(?<OP2>[+*\/=-]))/
> r = getName.exec(OP.source)
[
  '(?<OP>(?<OP2>[+*\\/=-]))',
  'OP',
  '(?<OP2>[+*\\/=-])',
  index: 0,
  input: '(?<OP>(?<OP2>[+*\\/=-]))',
  groups: undefined
]

> hasNamedParen = /[(][?]<(\w+)>(.+)[)]/
/[(][?]<(\w+)>(.+)[)]/
> hasNamedParen.exec(r[2])
[
  '(?<OP2>[+*\\/=-])',
  'OP2',
  '[+*\\/=-]',
  index: 0,
  input: '(?<OP2>[+*\\/=-])',
  groups: undefined
]
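The two checks above can be combined into a single validation step that the generator runs on each regexp it receives. A minimal sketch follows; the function name tokenNameOf and the error messages are assumptions, while getName and hasNamedParen are the regexps shown above.

```javascript
const getName = /^[(][?]<(\w+)>(.+)[)]$/;
const hasNamedParen = /[(][?]<(\w+)>(.+)[)]/;

// Hypothetical validator: returns the token name, or throws if the regexp is
// not wrapped in one named group or has a named group nested inside it.
function tokenNameOf(re) {
  const r = getName.exec(re.source);
  if (r === null) throw new Error(`${re} must be wrapped in one named group`);
  const [, name, body] = r;
  const inner = hasNamedParen.exec(body);
  if (inner !== null) throw new Error(`${re} has a nested named group '${inner[1]}'`);
  return name;
}

console.log(tokenNameOf(/(?<NUM>\d+)/)); // NUM

try {
  tokenNameOf(/(?<OP>(?<OP2>[+*\/=-]))/);
} catch (e) {
  console.log(e.message); // reports the nested group OP2
}
```

Note that this check inherits the limits of the two regexps: a source with two named groups side by side, such as /(?<A>a)(?<B>b)/, passes it and yields the name A, so a stricter test would be needed to reject that case.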

References

Additional References on Lexical Analysis

Example of a Lexical Analyzer for JS
Lab description: Lexical Analyzer for a Subset of JavaScript (gitbooks.io)
Compiler Construction by Wikipedians. Chapter Lexical Analysis
A case study: the npm module lexical-parser
Esprima. Chapter 3. Lexical Analysis (Tokenization)
Repo ULL-ESIT-GRADOII-PL/esprima-pegjs-jsconfeu-talk
jison-lex: A lexical analyzer generator used by jison. It takes a lexical grammar definition (either in JSON or Bison's lexical grammar format) and outputs a JavaScript lexer.
lexer: A JavaScript lexer modelled after flex
