Gramáticas y Lenguajes Generados
Gramáticas Independientes del Contexto
Supongamos una gramática con alfabeto , conjunto de variables sintácticas (o no terminales) , reglas de producción y símbolo de arranque .
Por ejemplo, en la gramática de Egg este es el conjunto de reglas de producción:
expression: STRING
| NUMBER
| WORD apply
apply: /* vacio */
| '(' (expression ',')* expression? ')' apply
Sólo hay dos variables sintácticas . El símbolo de arranque es .
El conjunto de tokens es:
Observe que algunos de los tokens son a su vez lenguajes de cierta complejidad, cuya definición está en otro nivel de abstracción, el nivel léxico y que se pueden definir mediante un mecanismo mas secillo como son las expresiones regulares.
Por ejemplo, en una definición de Egg inicial podríamos definir así lo que entendemos por espacios o blancos, esto es, que partes del texto no son significativas para que nuestro programa pueda entender la estructura de la frase:
WHITES = /(\s|[#;].*|\/\*(.|\n)*?\*\/)*/
así como los tokens mas complejos:
STRING = /"((?:[^"\\]|\\.)*)"/
NUMBER = /([-+]?\d*\.?\d+([eE][-+]?\d+)?)/
WORD = /([^\s(),"]+)/
Ejercicio
Construye una derivación para la frase
print(**(g,f)(8))
Observa que el resultado del análisis léxico sería un stream como este:
WORD["print"] "(" WORD[**] "(" WORD[g] "," WORD[f] ")" "(" NUMBER[8] ")" ")"
Solución:
En la solución que sigue, abreviamos expression por , apply por , WORD por y NUMBER por :
(Aquí )
(Ya que )
(Ya que )
(Aquí hizo )
(Aquí hizo )
(La última a hizo )
(La última hace )
(después de aplicar reiteradas veces las reglas)
En forma gráfica, tenemos el árbol sintáctico concreto que sigue:
Este es el mismo diagrama hecho usando mermaid (opens in a new tab):
Lenguaje Generado por Una Gramática
Para cada variable sintáctica el lenguaje generado desde la variable se define como:
Esto es, es el conjunto de frases del alfabeto que derivan en varias substituciones desde la variable .
El lenguaje Egg es el conjunto de frases donde es la gramática definida arriba.
El problema a considerar es el de construir para un lenguaje una función parseA()
que reconozca las frases del lenguaje .
Siguiendo con el ejemplo de Egg, en tenemos frases como:
()
(4,b)
(4, +(5,c))
(4,)
/* nada */
Recuerda que:
apply: /* vacio */
| '(' (expression ',')* expression? ')' apply
y que:
ECMAScript A Complex Language Specification
This Ecma Standard (opens in a new tab) defines the ECMAScript 2022 Language.
It is the thirteenth edition of the ECMAScript Language Specification. Since publication of the first edition in 1997, ECMAScript has grown to be one of the world's most widely used general-purpose programming languages. It is best known as the language embedded in web browsers but has also been widely adopted for server and embedded applications.
Although ECMAScript started as a language with a simple design, over the years that design has become more and more complex. The following section is just an illustration of how some design decisions have led to increased complexity in interpreting and implementing the language.
Lexical Ambiguity Example
The source text of an ECMAScript is first converted into a sequence of input elements, which are
- tokens,
- line terminators,
- comments, or
- white space.
The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.
In ECMAScript, there are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements.
This requires multiple goal symbols for the lexical grammar. The use of multiple lexical goals ensures that there are no lexical ambiguities that would affect automatic semicolon insertion.
For example, there are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral (opens in a new tab) are permitted.
This is not affected by semicolon insertion (see 12.5 (opens in a new tab)); in examples such as lines 4 and 5 in the following code:
let {a, b, hi, g, c, d} = require('./hidden-amb')
a = b
/hi/g.exec(c).map(d)
console.log(a);
where the first non-whitespace, non-comment code point after a LineTerminator (opens in a new tab) is the /
(U+002F unicode name SOLIDUS) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator
!.
That is, the above example is interpreted in the same way as:
a = b / hi / g.exec(c).map(d);
When we run the code above, we get:
➜ prefix-lang git:(master) ✗ node examples/lexical-ambiguity.js
1
The contents of file examples/hidden-amb.js
explain why the output is 1
:
let tutu = { map(_) { return 2}}
let a = 5, b = 8, hi = 4, c = "hello", d =
g = { exec(_) { return tutu; }}
module.exports = {a, b, hi, c, d, g}
See the code in the repo crguezl/js-lexical-ambiguity (opens in a new tab)
ECMAScript Language: Grammar
- A Grammar Summary (opens in a new tab) (Appendix with the whole grammar)
- [A.2] Expressions (opens in a new tab)
- [A.3] Statements (opens in a new tab)
- [A.4] Functions and Classes (opens in a new tab)
- [A.5] Scripts and Modules (opens in a new tab)
- [A.6] Number Conversions (opens in a new tab)
- [A.7] Universal Resource Identifier Character Classes (opens in a new tab)
- [A.8] Regular Expressions (opens in a new tab)
ECMAScript Language: Lexical Specification
- [A.1] Lexical Grammar (opens in a new tab)
- 12 ECMAScript Language: Lexical Grammar (opens in a new tab)
- 11 ECMAScript Language: Source Text (opens in a new tab)
ECMA TC39 at GitHub
- Github Organization Ecma TC39: Ecma International, Technical Committee 39 - ECMAScript (opens in a new tab)
- This Github repository contains the source for the current draft of ECMA-262 (opens in a new tab)
The Design of Programming Languages
See section The Design of Programming Languages