Como Escribir un Generador de Analizadores Léxicos
lastIndex
Regular expression objects have properties.
One such property is source
, which contains the string that expression was created from.
Another property is lastIndex
, which controls, in some limited circumstances, where the next match will start.
If your regular expression uses the g
flag, you can use the exec
method multiple times to find successive matches in the same string.
When you do so, the search starts at the substring specified
by the regular expression’s lastIndex
property.
> re = /d(b+)(d)/ig
/d(b+)(d)/gi
> z = "dBdxdbbdzdbd"
'dBdxdbbdzdbd'
> result = re.exec(z)
[ 'dBd', 'B', 'd', index: 0, input: 'dBdxdbbdzdbd' ]
> re.lastIndex
3
> result = re.exec(z)
[ 'dbbd', 'bb', 'd', index: 4, input: 'dBdxdbbdzdbd' ]
> re.lastIndex
8
> result = re.exec(z)
[ 'dbd', 'b', 'd', index: 9, input: 'dBdxdbbdzdbd' ]
> re.lastIndex
12
> z.length
12
> result = re.exec(z)
null
Thus, we can write a loop like this:
let input = "A string with 3 numbers in it... 42 and 88.";
let number = /\b\d+\b/g;
let match;
while (match = number.exec(input)) {
console.log("Found", match[0], "at", match.index);
}
// → Found 3 at 14
// Found 42 at 33
// Found 88 at 40
Sticky flag "y", searching at position
Regular expressions can have options, which are written after the closing slash.
- The
g
option makes the expression global, which, among other things, causes thereplace
method to replace all instances instead of just the first. - The
y
option makes it sticky, which means that it will not search ahead and skip part of the string when looking for a match.
The difference between the global and the sticky options is that, when sticky is enabled, the match will succeed only if it starts directly at lastIndex
, whereas with global, it will search ahead for a position where a match can start.
let global = /abc/g;
console.log(global.exec("xyz abc"));
// → ["abc"]
let sticky = /abc/y;
console.log(sticky.exec("xyz abc"));
// → null
let str = 'let varName = "value"';
let regexp = /\w+/y;
regexp.lastIndex = 3;
console.log( regexp.exec(str) ); // null (there's a space at position 3, not a word)
regexp.lastIndex = 4;
console.log( regexp.exec(str) ); // varName (word at position 4)
Véase también:
Named groups
Remembering groups by their numbers can be hard. An option is to give names to parentheses.
That's done by starting the capture regexp parenthesis by (?<name>
and ending with )
.
For example, let's look for a date in the format "year-month-day":
let dateRegexp = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
let str = "2019-04-30";
let groups = str.match(dateRegexp).groups;
console.log(groups.year); // 2019
console.log(groups.month); // 04
console.log(groups.day); // 30
As you can see, the groups reside in the .groups
property of the match.
To look for all dates, we can add flag pattern:g
.
> dateRegexp = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g
> str = "2019-10-30 2020-01-01";
'2019-10-30 2020-01-01'
We can use matchAll
to obtain full matches, together with groups.
The matchAll()
method returns an iterator of all results matching a string against a regular expression, including capturing groups.
> results = str.matchAll(dateRegexp)
Object [RegExp String Iterator] {}
> for(let result of results) {
... let {year, month, day} = result.groups;
... console.log(`${day}.${month}.${year}`); }
30.10.2019
01.01.2020
Sugerencias para la construcción de buildLexer
El siguiente código ilustra el uso combinado de la opción sticky y los grupos con nombre para encontrar la solución a esta práctica:
const str = 'const varName = "value"';
console.log(str);
const SPACE = /(?<SPACE>\s+)/;
const RESERVEDWORD = /(?<RESERVEDWORD>\b(const|let)\b)/;
const ID = /(?<ID>([a-z_]\w+))/;
const STRING = /(?<STRING>"([^\\"]|\\.")*")/;
const OP = /(?<OP>[+*\/=-])/;
const tokens = [
['SPACE', SPACE], ['RESERVEDWORD', RESERVEDWORD], ['ID', ID],
['STRING', STRING], ['OP', OP]
];
const tokenNames = tokens.map(t => t[0]);
const tokenRegs = tokens.map(t => t[1]);
const buildOrRegexp = (regexps) => {
const sources = regexps.map(r => r.source);
const union = sources.join('|');
// console.log(union);
return new RegExp(union, 'y');
};
const regexp = buildOrRegexp(tokenRegs);
const getToken = (m) => tokenNames.find(tn => typeof m[tn] !== 'undefined');
let match;
while (match = regexp.exec(str)) {
//console.log(match.groups);
let t = getToken(match.groups);
console.log(`Found token '${t}' with value '${match.groups[t]}'`);
}
escribiendo una función buildLexer
que recibe como argumentos un array tokens
como en el ejemplo y retorna una función que hace el análisis léxico
correspondiente a esos tokens.
Como obtener el nombre de una RegExp con nombre
Consideremos una expresión regular con nombre:
> NUM = /(?<NUM>\d+)/
/(?<NUM>\d+)/
La siguiente expresión regular tiene dos paréntesis para capturar el nombre y el resto de la regexp:
> getName = /^[(][?]<(\w+)>(.+)[)]$/
/^[(][?]<(\w+)>(.+)[)]$/
Cuando se tiene una regexp Re
, el atributo Re.source
contiene la cadena que defina la expresión regular.
Asi pues, cuando hacemos getName.exec(NUM.source)
obtenemos:
> r = getName.exec(NUM.source)
[
'(?<NUM>\\d+)',
'NUM',
'\\d+',
index: 0,
input: '(?<NUM>\\d+)',
groups: undefined
]
En el primer paréntesis casamos con el nombre y en el segundo con la regexp:
> r[1]
'NUM'
> r[2]
'\\d+'
Para que nuestro generador de analizadores léxicos pueda funcionar cada una de las regexp proveídas debe tener un único paréntesis con nombre. Podemos comprobar si el cuerpo de la regexp en r[2]
contiene mas paréntesis con nombre haciendo algo como esto:
> OP = /(?<OP>(?<OP2>[+*\/=-]))/
/(?<OP>(?<OP2>[+*\/=-]))/
> r = getName.exec(OP.source)
[
'(?<OP>(?<OP2>[+*\\/=-]))',
'OP',
'(?<OP2>[+*\\/=-])',
index: 0,
input: '(?<OP>(?<OP2>[+*\\/=-]))',
groups: undefined
]
> hasNamedParen = /[(][?]<(\w+)>(.+)[)]/
/[(][?]<(\w+)>(.+)[)]/
> hasNamedParen.exec(r[2])
[
'(?<OP2>[+*\\/=-])',
'OP2',
'[+*\\/=-]',
index: 0,
input: '(?<OP2>[+*\\/=-])',
groups: undefined
]
Referencias
- Tema Expresiones Regulares y Análisis Léxico
- Lab lexer-generator
- Example: using sticky matching for tokenizing (opens in a new tab) inside the chapter New regular expression features in ECMAScript 6 (opens in a new tab)
Referencias Adicionales sobre Análisis Léxico
- Ejemplo de Analizador Léxico para JS (opens in a new tab)
- Descripción de la Práctica: Analizador Léxico para Un Subconjunto de JavaScript (opens in a new tab) gitbooks.io
- Compiler Construction by Wikipedians (opens in a new tab). Chapter Lexical Analysis
- Un caso a estudiar: El módulo npm lexical-parser (opens in a new tab)
- Esprima. Chapter 3. Lexical Analysis (Tokenization) (opens in a new tab)
- jison-lex (opens in a new tab): A lexical analyzer generator used by jison. It takes a lexical grammar definition (either in JSON or Bison's lexical grammar format) and outputs a JavaScript lexer.
- lexer (opens in a new tab). A JavaScript lexer modelled after flex