LR(1) grammar/parser and lookahead-restrictions

classic Classic list List threaded Threaded
35 messages Options
12
Reply | Threaded
Open this post in threaded view
|

LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
In the past, it has been said (usually by Brendan Eich) that TC39 intends
that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so
much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <!
terminals]) fit into the LR approach. It seems like there are three
possibilities:
   - modify the grammar (before the parser is constructed),
   - modify the construction of the parser, and/or
   - modify the operation of the parser.
It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If
so, how did you you handle lookahead-restrictions?

-Michael
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
One yet has to check the pure ES6+ grammar to be LR(1) compatible (the Cover grammar with nodes reinterpretations makes it harder), but for a very practical use-case, e.g. combined with Flow, it'll not be easy to have an automated parser (since Flow's parser uses backtracking for some edge cases which is already not LR-parser). 

I have some subset of ES6 working as an example language for Syntax tool[1] (since I just took a bunch of productions from JS), but I haven't tried the full ES6 grammar (not having time to work on this project, since it'll be not a straightforward process of just matching the grammar 1:1 to the parser generator -- based on the above reasons).

Long time ago I also had a brief discussion with authors of another popular parser generators for JS. Both didn't try it as well, but mentioned that even for ES5 it was hard to parse e.g. regexp, etc.

Someone (with enough time for the project) should just sit, and spend some time actually porting the grammar from the spec on some powerful parsers generator.


On Wed, Jan 11, 2017 at 10:28 AM, Michael Dyck <[hidden email]> wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:
  - modify the grammar (before the parser is constructed),
  - modify the construction of the parser, and/or
  - modify the operation of the parser.
It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

-Michael
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Isiah Meadows-2

Heuristically, I doubt it's even context-free at this point, considering the concept and widespread prevalence of early errors now. I suspect it's mildly context-sensitive (maybe tree-adjoining?), but I'm no formal language expert here.


On Wed, Jan 11, 2017, 14:23 Dmitry Soshnikov <[hidden email]> wrote:
One yet has to check the pure ES6+ grammar to be LR(1) compatible (the Cover grammar with nodes reinterpretations makes it harder), but for a very practical use-case, e.g. combined with Flow, it'll not be easy to have an automated parser (since Flow's parser uses backtracking for some edge cases which is already not LR-parser). 

I have some subset of ES6 working as an example language for Syntax tool[1] (since I just took a bunch of productions from JS), but I haven't tried the full ES6 grammar (not having time to work on this project, since it'll be not a straightforward process of just matching the grammar 1:1 to the parser generator -- based on the above reasons).

Long time ago I also had a brief discussion with authors of another popular parser generators for JS. Both didn't try it as well, but mentioned that even for ES5 it was hard to parse e.g. regexp, etc.

Someone (with enough time for the project) should just sit, and spend some time actually porting the grammar from the spec on some powerful parsers generator.


On Wed, Jan 11, 2017 at 10:28 AM, Michael Dyck <[hidden email]> wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:
  - modify the grammar (before the parser is constructed),
  - modify the construction of the parser, and/or
  - modify the operation of the parser.
It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

-Michael
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
On 17-01-11 09:09 PM, Isiah Meadows wrote:
> Heuristically, I doubt it's even context-free at this point, considering the
> concept and widespread prevalence of early errors now. I suspect it's mildly
> context-sensitive (maybe tree-adjoining?), but I'm no formal language expert
> here.

I think the claim of/desire for LR(1)-ness only applies to the language
defined by the BNF-ish rules, i.e. ignoring the effects of prose-y things
like early errors and ASI. Anyhow, that's the language I'm asking about.

-Michael
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
In reply to this post by Isiah Meadows-2
Well, it's still context-free (contex-sensitive may have terminals on LHS too, JS cannot). It requires extra cover grammar, and additional static semantics, but still potentially can be parsed by a LALR/CLR parser - yes, you'll have to do parsed nodes reinterpretation in semantics handlers (e.g. validate and transform parsed ObjectLiteral into an ObjectPattern when parsing destructuring), but should be doable in general. It won't be doable if lookaheafs won't be enough, and only a backtracking can recover (like the cases with Flow parser).

Dmitry


On Wed, Jan 11, 2017 at 6:09 PM Isiah Meadows <[hidden email]> wrote:

Heuristically, I doubt it's even context-free at this point, considering the concept and widespread prevalence of early errors now. I suspect it's mildly context-sensitive (maybe tree-adjoining?), but I'm no formal language expert here.




On Wed, Jan 11, 2017, 14:23 Dmitry Soshnikov <[hidden email]> wrote:
One yet has to check the pure ES6+ grammar to be LR(1) compatible (the Cover grammar with nodes reinterpretations makes it harder), but for a very practical use-case, e.g. combined with Flow, it'll not be easy to have an automated parser (since Flow's parser uses backtracking for some edge cases which is already not LR-parser). 

I have some subset of ES6 working as an example language for Syntax tool[1] (since I just took a bunch of productions from JS), but I haven't tried the full ES6 grammar (not having time to work on this project, since it'll be not a straightforward process of just matching the grammar 1:1 to the parser generator -- based on the above reasons).

Long time ago I also had a brief discussion with authors of another popular parser generators for JS. Both didn't try it as well, but mentioned that even for ES5 it was hard to parse e.g. regexp, etc.

Someone (with enough time for the project) should just sit, and spend some time actually porting the grammar from the spec on some powerful parsers generator.


On Wed, Jan 11, 2017 at 10:28 AM, Michael Dyck <[hidden email]> wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)





If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:


  - modify the grammar (before the parser is constructed),


  - modify the construction of the parser, and/or


  - modify the operation of the parser.


It's not clear to me that any of these (or any combination of these) will work.





Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?





-Michael


_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss





_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss





_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Isiah Meadows-2

Okay. I see now. How do early errors factor into this?


On Wed, Jan 11, 2017, 21:49 Dmitry Soshnikov <[hidden email]> wrote:
Well, it's still context-free (contex-sensitive may have terminals on LHS too, JS cannot). It requires extra cover grammar, and additional static semantics, but still potentially can be parsed by a LALR/CLR parser - yes, you'll have to do parsed nodes reinterpretation in semantics handlers (e.g. validate and transform parsed ObjectLiteral into an ObjectPattern when parsing destructuring), but should be doable in general. It won't be doable if lookaheafs won't be enough, and only a backtracking can recover (like the cases with Flow parser).

Dmitry


On Wed, Jan 11, 2017 at 6:09 PM Isiah Meadows <[hidden email]> wrote:

Heuristically, I doubt it's even context-free at this point, considering the concept and widespread prevalence of early errors now. I suspect it's mildly context-sensitive (maybe tree-adjoining?), but I'm no formal language expert here.




On Wed, Jan 11, 2017, 14:23 Dmitry Soshnikov <[hidden email]> wrote:
One yet has to check the pure ES6+ grammar to be LR(1) compatible (the Cover grammar with nodes reinterpretations makes it harder), but for a very practical use-case, e.g. combined with Flow, it'll not be easy to have an automated parser (since Flow's parser uses backtracking for some edge cases which is already not LR-parser). 

I have some subset of ES6 working as an example language for Syntax tool[1] (since I just took a bunch of productions from JS), but I haven't tried the full ES6 grammar (not having time to work on this project, since it'll be not a straightforward process of just matching the grammar 1:1 to the parser generator -- based on the above reasons).

Long time ago I also had a brief discussion with authors of another popular parser generators for JS. Both didn't try it as well, but mentioned that even for ES5 it was hard to parse e.g. regexp, etc.

Someone (with enough time for the project) should just sit, and spend some time actually porting the grammar from the spec on some powerful parsers generator.


On Wed, Jan 11, 2017 at 10:28 AM, Michael Dyck <[hidden email]> wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)





If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:


  - modify the grammar (before the parser is constructed),


  - modify the construction of the parser, and/or


  - modify the operation of the parser.


It's not clear to me that any of these (or any combination of these) will work.





Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?





-Michael


_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss





_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss





_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
The "early errors" are just parser errors which are enforced not by grammar rules, but by additional validation algorithms which follow the parsed node. E.g. `eval` cannot be assigned in strict mode: it's not a grammar issue (grammar allows `eval` on LHS), and it's not a runtime issue, it's an "early error" enforced by static semantics, which still issues it as a `ParseError`. Example in handling it in a semantic action for a production (on a simplified `AssignmentExpression`):

```
%%

AssignmentExpression
  : IDENTIFIER '=' Expression {
      const id = $1;
      
      // Static semantics: early parser error:

      if (mode.isStrict && (id === 'eval' || id == 'arguments')) {
        throw new ParseError(
          '`eval` and `arguments` cannot be assigned in strict mode.'
        );
      }

      return {
        type: 'AssigmentExpression',
        left: $1,
        right: $3,
      };
    }
```

Similarly other "early errors".

BTW, here is the example I mentioned above of how Flow has to do backtracking in parsing in some edge cases, making the grammar completely not LR(1) anymore: https://github.com/babel/babel/issues/1991#issuecomment-121708242

Dmitry


On Wed, Jan 11, 2017 at 7:13 PM, Isiah Meadows <[hidden email]> wrote:

Okay. I see now. How do early errors factor into this?


On Wed, Jan 11, 2017, 21:49 Dmitry Soshnikov <[hidden email]> wrote:
Well, it's still context-free (contex-sensitive may have terminals on LHS too, JS cannot). It requires extra cover grammar, and additional static semantics, but still potentially can be parsed by a LALR/CLR parser - yes, you'll have to do parsed nodes reinterpretation in semantics handlers (e.g. validate and transform parsed ObjectLiteral into an ObjectPattern when parsing destructuring), but should be doable in general. It won't be doable if lookaheafs won't be enough, and only a backtracking can recover (like the cases with Flow parser).

Dmitry


On Wed, Jan 11, 2017 at 6:09 PM Isiah Meadows <[hidden email]> wrote:

Heuristically, I doubt it's even context-free at this point, considering the concept and widespread prevalence of early errors now. I suspect it's mildly context-sensitive (maybe tree-adjoining?), but I'm no formal language expert here.




On Wed, Jan 11, 2017, 14:23 Dmitry Soshnikov <[hidden email]> wrote:
One yet has to check the pure ES6+ grammar to be LR(1) compatible (the Cover grammar with nodes reinterpretations makes it harder), but for a very practical use-case, e.g. combined with Flow, it'll not be easy to have an automated parser (since Flow's parser uses backtracking for some edge cases which is already not LR-parser). 

I have some subset of ES6 working as an example language for Syntax tool[1] (since I just took a bunch of productions from JS), but I haven't tried the full ES6 grammar (not having time to work on this project, since it'll be not a straightforward process of just matching the grammar 1:1 to the parser generator -- based on the above reasons).

Long time ago I also had a brief discussion with authors of another popular parser generators for JS. Both didn't try it as well, but mentioned that even for ES5 it was hard to parse e.g. regexp, etc.

Someone (with enough time for the project) should just sit, and spend some time actually porting the grammar from the spec on some powerful parsers generator.


On Wed, Jan 11, 2017 at 10:28 AM, Michael Dyck <[hidden email]> wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)





If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:


  - modify the grammar (before the parser is constructed),


  - modify the construction of the parser, and/or


  - modify the operation of the parser.


It's not clear to me that any of these (or any combination of these) will work.





Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?





-Michael


_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss





_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss






_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Isiah Meadows-2

Okay. I see now. Lots of backtracking in the case of arrow functions and ternary expressions vs type annotations.


On Thu, Jan 12, 2017, 00:45 Dmitry Soshnikov <[hidden email]> wrote:
The "early errors" are just parser errors which are enforced not by grammar rules, but by additional validation algorithms which follow the parsed node. E.g. `eval` cannot be assigned in strict mode: it's not a grammar issue (grammar allows `eval` on LHS), and it's not a runtime issue, it's an "early error" enforced by static semantics, which still issues it as a `ParseError`. Example in handling it in a semantic action for a production (on a simplified `AssignmentExpression`):

```
%%

AssignmentExpression
  : IDENTIFIER '=' Expression {
      const id = $1;
      
      // Static semantics: early parser error:

      if (mode.isStrict && (id === 'eval' || id == 'arguments')) {
        throw new ParseError(
          '`eval` and `arguments` cannot be assigned in strict mode.'
        );
      }

      return {
        type: 'AssigmentExpression',
        left: $1,
        right: $3,
      };
    }
```

Similarly other "early errors".

BTW, here is the example I mentioned above of how Flow has to do backtracking in parsing in some edge cases, making the grammar completely not LR(1) anymore: https://github.com/babel/babel/issues/1991#issuecomment-121708242

Dmitry


On Wed, Jan 11, 2017 at 7:13 PM, Isiah Meadows <[hidden email]> wrote:

Okay. I see now. How do early errors factor into this?


On Wed, Jan 11, 2017, 21:49 Dmitry Soshnikov <[hidden email]> wrote:
Well, it's still context-free (contex-sensitive may have terminals on LHS too, JS cannot). It requires extra cover grammar, and additional static semantics, but still potentially can be parsed by a LALR/CLR parser - yes, you'll have to do parsed nodes reinterpretation in semantics handlers (e.g. validate and transform parsed ObjectLiteral into an ObjectPattern when parsing destructuring), but should be doable in general. It won't be doable if lookaheafs won't be enough, and only a backtracking can recover (like the cases with Flow parser).

Dmitry


On Wed, Jan 11, 2017 at 6:09 PM Isiah Meadows <[hidden email]> wrote:

Heuristically, I doubt it's even context-free at this point, considering the concept and widespread prevalence of early errors now. I suspect it's mildly context-sensitive (maybe tree-adjoining?), but I'm no formal language expert here.




On Wed, Jan 11, 2017, 14:23 Dmitry Soshnikov <[hidden email]> wrote:
One yet has to check the pure ES6+ grammar to be LR(1) compatible (the Cover grammar with nodes reinterpretations makes it harder), but for a very practical use-case, e.g. combined with Flow, it'll not be easy to have an automated parser (since Flow's parser uses backtracking for some edge cases which is already not LR-parser). 

I have some subset of ES6 working as an example language for Syntax tool[1] (since I just took a bunch of productions from JS), but I haven't tried the full ES6 grammar (not having time to work on this project, since it'll be not a straightforward process of just matching the grammar 1:1 to the parser generator -- based on the above reasons).

Long time ago I also had a brief discussion with authors of another popular parser generators for JS. Both didn't try it as well, but mentioned that even for ES5 it was hard to parse e.g. regexp, etc.

Someone (with enough time for the project) should just sit, and spend some time actually porting the grammar from the spec on some powerful parsers generator.


On Wed, Jan 11, 2017 at 10:28 AM, Michael Dyck <[hidden email]> wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)





If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:


  - modify the grammar (before the parser is constructed),


  - modify the construction of the parser, and/or


  - modify the operation of the parser.


It's not clear to me that any of these (or any combination of these) will work.





Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?





-Michael


_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss





_______________________________________________


es-discuss mailing list


[hidden email]


https://mail.mozilla.org/listinfo/es-discuss






_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Waldemar Horwat
In reply to this post by Michael Dyck
On 01/11/2017 10:28, Michael Dyck wrote:
> In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)
>
> If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:
>   - modify the grammar (before the parser is constructed),
>   - modify the construction of the parser, and/or
>   - modify the operation of the parser.
> It's not clear to me that any of these (or any combination of these) will work.
>
> Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

I'm the source of the LR(1) condition.  I've been doing that ever since ECMAScript Edition 3, and in fact am using the LR parser to help design and debug the spec.  I have an implementation of the parser with a few extensions to the LR grammar, including support for parametrized productions, lookahead restrictions, and lexer-parser feedback used to disambiguate things such as what token / will start.  I validate various things such as making sure that there is no place where both an identifier and a regular expression can occur.

     Waldemar

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov

On Mon, Jan 23, 2017 at 5:24 PM, Waldemar Horwat <[hidden email]> wrote:
On 01/11/2017 10:28, Michael Dyck wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:
  - modify the grammar (before the parser is constructed),
  - modify the construction of the parser, and/or
  - modify the operation of the parser.
It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

I'm the source of the LR(1) condition.  I've been doing that ever since ECMAScript Edition 3, and in fact am using the LR parser to help design and debug the spec.  I have an implementation of the parser with a few extensions to the LR grammar, including support for parametrized productions, lookahead restrictions, and lexer-parser feedback used to disambiguate things such as what token / will start.  I validate various things such as making sure that there is no place where both an identifier and a regular expression can occur.


Thanks for the confirmation. 

lookahead restrictions

Yeah, all the lookahead restrictions, such as `lookahead ∉ {{, function}` add a lot of extra "No<Lookahead>" productions, which makes the grammar (and hence the parsing table) bigger unfortunately. Example is e.g. `ExpressionNoIn`, `VariableDeclarationNoIn` from the spec, and similar for `...NoLeftCurrly`, and `...NoFunction` to support those lookahead restrictions, and allow `ExpressionStatement` to be parsed without "reduce/reduce" conflicts.

That's why in the mentioned "sub-set" of the ECMAScript example language [1] I chose not to use the same syntax for function expression and function declaration: `fn foo() {}` for declaration, and just an arrow function for function-expressions :) -- no ambiguity, no lookahead restrictions, hence no larger grammar, hence a faster parser.

 > lexer-parser feedback

Yes, having a lexer/parser state solves it more elegantly. Also, if a parser tool allows semantic action not only at the end (i.e on reduce), but for each symbol on RHS (i.e. "S-attributed" definition, and not "L-attributed"), handling the `lookahead ∉ {{, function}` could be even easier: on matching the `{` from the statement position, one can track the parser state that the block should be parsed, and not an object literal. This probably can be solved with lexer state too -- to avoid again a bunch of extra "No<lookahead>" productions. And such cases are simpler in manual recursive decent.

[1] Example lang based on subset of ECMAScript grammar: https://github.com/DmitrySoshnikov/syntax/blob/master/examples/lang.bnf

Dmitry

 
    Waldemar


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
In reply to this post by Waldemar Horwat
On 17-01-23 08:24 PM, Waldemar Horwat wrote:

> On 01/11/2017 10:28, Michael Dyck wrote:
>>
>> If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <!
>> terminals]) fit into the LR approach. It seems like there are three
>> possibilities:
>>   - modify the grammar (before the parser is constructed),
>>   - modify the construction of the parser, and/or
>>   - modify the operation of the parser.
>> It's not clear to me that any of these (or any combination of these) will
>> work.
>>
>> Has anyone algorithmically generated an LR parser from the spec grammar?
>> If so, how did you you handle lookahead-restrictions?
>
> I have an implementation of the parser with a few extensions to the LR
> grammar, including support for ... lookahead  restrictions, ...

Great! So how do you handle lookahead restrictions?

-Michael

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
On Tue, Jan 24, 2017 at 8:23 AM, Michael Dyck <[hidden email]> wrote:
On 17-01-23 08:24 PM, Waldemar Horwat wrote:
On 01/11/2017 10:28, Michael Dyck wrote:

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <!
terminals]) fit into the LR approach. It seems like there are three
possibilities:
  - modify the grammar (before the parser is constructed),
  - modify the construction of the parser, and/or
  - modify the operation of the parser.
It's not clear to me that any of these (or any combination of these) will
work.

Has anyone algorithmically generated an LR parser from the spec grammar?
If so, how did you you handle lookahead-restrictions?

I have an implementation of the parser with a few extensions to the LR
grammar, including support for ... lookahead  restrictions, ...

Great! So how do you handle lookahead restrictions?


As mentioned above, there are two ways to handle lookahead restrictions in LR-parser:

1. Using "No<Lookahead>" productions. This unfortunately may double number of productions in some sub-grammar. E.g. to handle Block vs. ObjectLiteral in ECMAScript, one needs to double all the expression productions to correctly parse `ExpressionStatement`. As an example in the spec itself, take a look e.g. at `RelationalExpression` and `RelationalExpressionNoIn`, and how it repeats (doubles) all the productions but with "NoIn" suffix https://es5.github.io/#x11.8
 
2. Another approach is to use Cover grammar. It's when you parse a node with two possible ways, and enforce (reinterpret) needed sub-nodes in the cover-grammar post-processor (after having parsed a generic node). E.g. I chose this approach to parse generically `ObjectLiteral`, and `Block` to handle `{` lookahead restriction for `ExpressionStatement`. This approach allows not to double grammar (hence making parsing table smaller), however adds some post-processing overhead. https://github.com/DmitrySoshnikov/syntax/blob/master/examples/lang.bnf#L206-L222

P.S.: there is actually option #3 -- to use "S-attributed" parser actions, it's when you may execute handling action not only at the end of production (i.e. not only on "reduce"), but after each element on RHS:

```
statement
options
{
k = 1 ;
}
: { input.LA(1) == LBRACE }? block
| statementTail
;
statementTail
: variableStatement
| emptyStatement
| expressionStatement
        ...
```

This option is probably the best alternative.

Dmitry

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:

>
> As mentioned above, there are two ways to handle lookahead restrictions in
> LR-parser:
>
> 1. Using "No<Lookahead>" productions. This unfortunately may double number
> of productions in some sub-grammar. E.g. to handle Block vs. ObjectLiteral
> in ECMAScript, one needs to double all the expression productions to
> correctly parse `ExpressionStatement`. As an example in the spec itself,
> take a look e.g. at `RelationalExpression` and `RelationalExpressionNoIn`,
> and how it repeats (doubles) all the productions but with "NoIn" suffix
> https://es5.github.io/#x11.8

So, e.g., one could take the production
     IterationStatement : ...  [lookahead != 'let'] LHSExpr ...
and transform it into
     IterationStatement : ... LHSExpr-that-does-not-start-with-let ...
and then generate productions for the new nonterminal, e.g.
     LHSExpr-that-does-not-start-with-let :
         New-Expression-that-does-not-start-with-let
         Call-Expression-that-does-not-start-with-let
etc, (eventually bottoming out when a RHS of the original grammar will
either always or never start with 'let').

The problem with this approach is that, in general, a lookahead-restriction
is not a restriction on the nonterminal that immediately follows it in the
production, it's a restriction on the next few input elements, which might
be generated by more than just that nonterminal. I'm not saying this is
insurmountable, but it does seem to get messy fairly quickly.

And it's unclear how this approach would handle a lookahead-restriction at
the end of a production.


> 2. Another approach is to use Cover grammar.

Is there a way to automatically (a) construct the cover grammar and (b)
reconstruct the parse tree as if parsed by the original grammar, or do you
code it manually?


> P.S.: there is actually option #3 -- to use "S-attributed" parser actions,
> it's when you may execute handling action not only at the end of production
> (i.e. not only on "reduce"), but after each element on RHS:

I'm doubtful this would help. I wouldn't expect the LR-parser-generator to
'understand' the content of actions, so it can't use them to alter the
generation of the parser, and so the LR automaton will have lots of
conflicts (because it can't 'separate' the alternatives that the
lookahead-restrictions are there to separate).

-Michael

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov

On Mon, Jan 30, 2017 at 9:26 AM, Michael Dyck <[hidden email]> wrote:
On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:

As mentioned above, there are two ways to handle lookahead restrictions in
LR-parser:

1. Using "No<Lookahead>" productions. This unfortunately may double number
of productions in some sub-grammar. E.g. to handle Block vs. ObjectLiteral
in ECMAScript, one needs to double all the expression productions to
correctly parse `ExpressionStatement`. As an example in the spec itself,
take a look e.g. at `RelationalExpression` and `RelationalExpressionNoIn`,
and how it repeats (doubles) all the productions but with "NoIn" suffix
https://es5.github.io/#x11.8

So, e.g., one could take the production
    IterationStatement : ...  [lookahead != 'let'] LHSExpr ...
and transform it into
    IterationStatement : ... LHSExpr-that-does-not-start-with-let ...
and then generate productions for the new nonterminal, e.g.
    LHSExpr-that-does-not-start-with-let :
        New-Expression-that-does-not-start-with-let
        Call-Expression-that-does-not-start-with-let
etc, (eventually bottoming out when a RHS of the original grammar will either always or never start with 'let').



Correct, and this is how ES spec itself handles all those "No<lookahead>" duplicated productions.

 
The problem with this approach is that, in general, a lookahead-restriction is not a restriction on the nonterminal that immediately follows it in the production, it's a restriction on the next few input elements, which might be generated by more than just that nonterminal.

Well, in this case it's not a "lookahead" anymore. You could have k=2,3,...N for lookahead in parser generators, but it's not like parse the whole non-terminal, and then see a lookahead after it. This "parse whole non-terminal till the end" might be parse the whole program. In this case a cover grammar is simpler, and that's why it's used starting since ES6 spec. Take a look e.g. at `CoverParenthesizedExpressionAndArrowParameterList` from ES2015.


By just looking at:

```
(x, y)
```

you can't tell whether it's a grouping operator (`SequenceExpression` node), or a beginning of a lambda function (i.e. its parameters node). So in the generic `ParenthisizedExpression` production, you reinterpret it either to lambda parameters, or keep just a grouping operator, if there is no `LambdaTail`. Here it's a function of course, though for this you have to parse `(x, y)` not as `LambdaParameters`, and not as `SequenceExpression`, but as a generic `ParenthisizedExpression`, and using Cover grammar:

```
(x, y) -> { return x + y; }
```

 
I'm not saying this is insurmountable, but it does seem to get messy fairly quickly.

And it's unclear how this approach would handle a lookahead-restriction at the end of a production.

Yes, it adds a bunch of duplicated productions to the grammar, which on practice may increases number of parsing states simnifically. As a real practical example: ES5 spec in LALR(1) mode would increase from ~350 states to 470.
 



2. Another approach is to use Cover grammar.

Is there a way to automatically (a) construct the cover grammar and (b) reconstruct the parse tree as if parsed by the original grammar, or do you code it manually?


Usually it's a manual process of designing a grammar, and defining post-processing static semantics. Take a look again in the example language I gave above, it have couple of case of cover grammar.
 

P.S.: there is actually option #3 -- to use "S-attributed" parser actions,
it's when you may execute handling action not only at the end of production
(i.e. not only on "reduce"), but after each element on RHS:

I'm doubtful this would help. I wouldn't expect the LR-parser-generator to 'understand' the content of actions, so it can't use them to alter the generation of the parser, and so the LR automaton will have lots of conflicts (because it can't 'separate' the alternatives that the lookahead-restrictions are there to separate).


Oh, it'll work for sure, depending on how powerful a parser generator is. Yes, technically it'll be a "reduce-reduce" conflict in the table, but parser will never enter this state based on the semantic action executed *before* entering some non-terminal (not after). In this "before" action you can communicated to lexer, and check the lookahead.

Dmitry
 

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
On 17-01-30 03:00 PM, Dmitry Soshnikov wrote:

>
> On Mon, Jan 30, 2017 at 9:26 AM, Michael Dyck <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:
>
>
>         As mentioned above, there are two ways to handle lookahead
>         restrictions in
>         LR-parser:
>
>         1. Using "No<Lookahead>" productions.
>
>     So, e.g., one could take the production
>         IterationStatement : ...  [lookahead != 'let'] LHSExpr ...
>     and transform it into
>         IterationStatement : ... LHSExpr-that-does-not-start-with-let ...
>     and then generate productions for the new nonterminal, e.g.
>         LHSExpr-that-does-not-start-with-let :
>             New-Expression-that-does-not-start-with-let
>             Call-Expression-that-does-not-start-with-let
>     etc, (eventually bottoming out when a RHS of the original grammar will
>     either always or never start with 'let').
>
> Correct, and this is how ES spec itself handles all those "No<lookahead>"
> duplicated productions.

If you're talking about the 'NoIn' productions in ES5, that seems only
marginally relevant, since they don't address a lookahead problem (i.e., a
problem that could have been solved via a lookahead-restriction).

>     The problem with this approach is that, in general, a
>     lookahead-restriction is not a restriction on the nonterminal that
>     immediately follows it in the production, it's a restriction on the next
>     few input elements, which might be generated by more than just that
>     nonterminal.
>
>
> Well, in this case it's not a "lookahead" anymore.

It isn't a "lookahead" in the resulting grammar, but it is in the original,
and the two grammars should define the same language. What I'm saying is
that it's incorrect to assume that the effect of the lookahead (in the
original) can be confined to the nonterminal that immediately follows, which
is what the suggested transformation does.

> You could have k=2,3,...N for lookahead in parser generators, but it's
> not like parse the whole non-terminal, and then see a lookahead after it.
> This "parse whole non-terminal till the end" might be parse the whole
> program.

I don't really follow what you're suggesting there.


> In this case a cover grammar is simpler, and that's why it's used
> starting since ES6 spec.

But note that none of the ES5 lookahead-restrictions was replaced with a
cover grammar in ES6. They're all still lookahead-restrictions.

>
>     I'm not saying this is insurmountable, but it does seem to get messy
>     fairly quickly.
>
>     And it's unclear how this approach would handle a lookahead-restriction
>     at the end of a production.
>
>
> Yes, it adds a bunch of duplicated productions to the grammar, which on
> practice may increases number of parsing states simnifically. As a real
> practical example: ES5 spec in LALR(1) mode would increase from ~350 states
> to 470.

By "messy", I wasn't talking about the duplicated productions, but rather
the process of generating them.


>         2. Another approach is to use Cover grammar.
>
>
>     Is there a way to automatically (a) construct the cover grammar and (b)
>     reconstruct the parse tree as if parsed by the original grammar, or do
>     you code it manually?
>
>
> Usually it's a manual process of designing a grammar, and defining
> post-processing static semantics.

Okay, I'm not really interested then. I'm looking for an algorithm (i.e., a
modification of the LR algorithm) that starts with a
grammar-with-lookahead-restrictions and ends with an LR parser.


>         P.S.: there is actually option #3 -- to use "S-attributed" parser
>         actions,
>         it's when you may execute handling action not only at the end of
>         production
>         (i.e. not only on "reduce"), but after each element on RHS:
>
>
>     I'm doubtful this would help. I wouldn't expect the LR-parser-generator
>     to 'understand' the content of actions, so it can't use them to alter
>     the generation of the parser, and so the LR automaton will have lots of
>     conflicts (because it can't 'separate' the alternatives that the
>     lookahead-restrictions are there to separate).
>
>
> Oh, it'll work for sure, depending on how powerful a parser generator is.
> Yes, technically it'll be a "reduce-reduce" conflict in the table,

So when your generated parser has conflicts, is it easy to tell which you
don't have to worry about and which you do?

> but parser will never enter this state based on the semantic action executed
> *before* entering some non-terminal (not after). In this "before" action you
> can communicated to lexer, and check the lookahead.

It seems to me that the parser *will* enter such a state, but always via
only one of the conflicting 'paths' (the other(s) being prevented at
parse-time by the semantic actions). So only one of the conflicting actions
will be valid, but you've still got to figure out which one that is.

-Michael
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
On Mon, Jan 30, 2017 at 2:22 PM, Michael Dyck <[hidden email]> wrote:
On 17-01-30 03:00 PM, Dmitry Soshnikov wrote:

On Mon, Jan 30, 2017 at 9:26 AM, Michael Dyck <[hidden email]
<mailto:[hidden email]>> wrote:

    On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:


        As mentioned above, there are two ways to handle lookahead
        restrictions in
        LR-parser:

        1. Using "No<Lookahead>" productions.

    So, e.g., one could take the production
        IterationStatement : ...  [lookahead != 'let'] LHSExpr ...
    and transform it into
        IterationStatement : ... LHSExpr-that-does-not-start-with-let ...
    and then generate productions for the new nonterminal, e.g.
        LHSExpr-that-does-not-start-with-let :
            New-Expression-that-does-not-start-with-let
            Call-Expression-that-does-not-start-with-let
    etc, (eventually bottoming out when a RHS of the original grammar will
    either always or never start with 'let').

Correct, and this is how ES spec itself handles all those "No<lookahead>"
duplicated productions.

If you're talking about the 'NoIn' productions in ES5, that seems only marginally relevant, since they don't address a lookahead problem (i.e., a problem that could have been solved via a lookahead-restriction).

    The problem with this approach is that, in general, a
    lookahead-restriction is not a restriction on the nonterminal that
    immediately follows it in the production, it's a restriction on the next
    few input elements, which might be generated by more than just that
    nonterminal.


Well, in this case it's not a "lookahead" anymore.

It isn't a "lookahead" in the resulting grammar, but it is in the original, and the two grammars should define the same language. What I'm saying is that it's incorrect to assume that the effect of the lookahead (in the original) can be confined to the nonterminal that immediately follows, which is what the suggested transformation does.

A "lookahead" is a terminal token. A non-terminal is not a "lookahead".

To reiterate, imaging you have:

```
• (x, y)
```

and 

```
• (x, y) -> { return x + y; }
```

where `•` is a cursor position. Your lookaheads are "(", "x", "," "y", etc. Usually 1 lookahead token is enough to route the parser, in this case "(".

What you're talking that you need not these lookaheads (at the beginning of the cursor), but *after* the parsed non-terminal (i.e. after fully parsed `(x, y)`), i.e. you're looking for "->", which goes at the end of the non-terminal. This is not possible, since it's not a lookahead at that specific cursor position. It will be a lookahead after you have parsed the non-terminal corresponding to "(x, y)".

So you can't have two non-terminals for this, since it'll be a "reduce/reduce" conflict, and you need a cover grammar here.

 


You could have k=2,3,...N for lookahead in parser generators, but it's
not like parse the whole non-terminal, and then see a lookahead after it.
This "parse whole non-terminal till the end" might be parse the whole
program.

I don't really follow what you're suggesting there.

Like mentioned above:

```
• (x, y) -> { return x + y; }
```

In order to determine whether it's a grouping operator, or arrow function params, it's not enough analyzing a "lookahead" being that the cursor position. You need to actually fully parse this non-terminal, and advance to this position:

```
(x, y) • -> { return x + y; }
```

and from there you may already analyze a real lookahead. However, the problem is the "(x, y)" might the full program (imagine million of entries on several GBs file source code -- what kind of "lookahead" can you analyze here? :))

 



In this case a cover grammar is simpler, and that's why it's used
starting since ES6 spec.

But note that none of the ES5 lookahead-restrictions was replaced with a cover grammar in ES6. They're all still lookahead-restrictions.

Sure, because it's normally possible to do with the "No<Lookahead>" doubled productions. So this actually answers your initial question (it's a real practice).
 



    I'm not saying this is insurmountable, but it does seem to get messy
    fairly quickly.

    And it's unclear how this approach would handle a lookahead-restriction
    at the end of a production.


Yes, it adds a bunch of duplicated productions to the grammar, which on
practice may increases number of parsing states simnifically. As a real
practical example: ES5 spec in LALR(1) mode would increase from ~350 states
to 470.

By "messy", I wasn't talking about the duplicated productions, but rather the process of generating them.


        2. Another approach is to use Cover grammar.


    Is there a way to automatically (a) construct the cover grammar and (b)
    reconstruct the parse tree as if parsed by the original grammar, or do
    you code it manually?


Usually it's a manual process of designing a grammar, and defining
post-processing static semantics.

Okay, I'm not really interested then. I'm looking for an algorithm (i.e., a modification of the LR algorithm) that starts with a grammar-with-lookahead-restrictions and ends with an LR parser.

OK, yeah, at this point I think we already clarified that it is possible to parse ES grammar, and handle it's lookahead restrictions with several techniques described. If you're interested in generic LR parsing, feel free to contact me (we can omit generic off-topic discussions on this list).

 
Dmitry

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
On 17-01-30 06:20 PM, Dmitry Soshnikov wrote:

> On Mon, Jan 30, 2017 at 2:22 PM, Michael Dyck <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>
>                 1. Using "No<Lookahead>" productions.
>
>             The problem with this approach is that, in general, a
>             lookahead-restriction is not a restriction on the nonterminal that
>             immediately follows it in the production, it's a restriction on
>             the next few input elements, which might be generated by more
>             than just that nonterminal.
>
>         Well, in this case it's not a "lookahead" anymore.
>
>     It isn't a "lookahead" in the resulting grammar, but it is in the
>     original, and the two grammars should define the same language. What I'm
>     saying is that it's incorrect to assume that the effect of the lookahead
>     (in the original) can be confined to the nonterminal that immediately
>     follows, which is what the suggested transformation does.
>
>
> A "lookahead" is a terminal token. A non-terminal is not a "lookahead".

Okay, so then in my previous statement, read "lookahead" as
"lookahead-restriction". (But I did say "lookahead-restriction" in the
statement before that.)


> To reiterate, imaging you have:
>
> ```
> • (x, y)
> ```
>
> and
>
> ```
> • (x, y) -> { return x + y; }
> ```
>
> where `•` is a cursor position. Your lookaheads are "(", "x", "," "y", etc.
> Usually 1 lookahead token is enough to route the parser, in this case "(".
>
> What you're talking that you need not these lookaheads (at the beginning of
> the cursor), but *after* the parsed non-terminal (i.e. after fully parsed
> `(x, y)`), i.e. you're looking for "->", which goes at the end of the
> non-terminal.

I'm not saying I don't need the lookahead tokens right after the cursor. But
I am saying that the lookahead tokens that the hypothetical parser needs to
examine (in order to enforce a lookahead-restriction), even if it's only 1
or 2, might extend past the 'next' nonterminal.

But that's an awkward way to express my objection.

To recap, I'm talking about the suggestion to replace:
     [lookahead != <some sequence of terminals>] Nonterminal
in a production in the original grammar with
     Nonterminal-that-does-not-start-with-that-sequence
(and a bunch of new productions) in the transformed grammar, and whether
that's a valid transformation.

I'm saying it isn't valid when the Nonterminal can derive phrases with fewer
terminals than the prohibited lookahead-sequence. (E.g., if Nonterminal is
nullable, or if it can derive a phrase of length 1 when the prohibited
lookahead-sequence is length 2.)

> This is not possible, since it's not a lookahead at that
> specific cursor position. It will be a lookahead after you have parsed the
> non-terminal corresponding to "(x, y)".
>
> So you can't have two non-terminals for this, since it'll be a
> "reduce/reduce" conflict, and you need a cover grammar here.

Well, I'm saying you'd run into problems before you even tried to construct
the LR automaton, but I think you're agreeing with me that this approach is
not generally applicable?


>         In this case a cover grammar is simpler, and that's why it's used
>         starting since ES6 spec.
>
>     But note that none of the ES5 lookahead-restrictions was replaced with a
>     cover grammar in ES6. They're all still lookahead-restrictions.
>
> Sure, because it's normally possible to do with the "No<Lookahead>" doubled
> productions.

But they weren't replaced with "No" productions (or grammatical parameters)
either! All the lookahead-restrictions in ES5 are still
lookahead-restrictions in the current draft.

> So this actually answers your initial question (it's a real practice).

No, I don't think it does.

----------------------------

>                 2. Another approach is to use Cover grammar.
>
>         Usually it's a manual process of designing a grammar, and defining
>         post-processing static semantics.
>
>     Okay, I'm not really interested then. I'm looking for an algorithm
>     (i.e., a modification of the LR algorithm) that starts with a
>     grammar-with-lookahead-restrictions and ends with an LR parser.
>
> OK, yeah, at this point I think we already clarified that it is possible to
> parse ES grammar,  and handle it's lookahead restrictions with several
> techniques described.

Well, ES implementations exist, so clearly there are techniques for handling
those aspects of the ES grammar. But that doesn't mean that any of them are
parsing *according to* the ES grammar. E.g., it's unlikely that they're
finding exactly the parse tree indicated by the ES grammar. (Nor is there
any requirement that they do so.)

----

By way of contrast, consider "opt" subscripts and grammatical parameters. If
you look up the LR parser-construction algorithm, it doesn't talk about how
to handle such things, but the spec says how to transform them into plain
context-free productions, which the LR algorithm can handle. So it's an easy
shorthand to say that the ES grammar is LR(1) iff the 'expanded' grammar is
LR(1).

But the spec doesn't give that kind of CFG-equivalence for
lookahead-restrictions. Until there's agreement about what they mean in a
CFG or LR sense, it isn't really meaningful to ask if the ES grammar is LR(1).

-Michael

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
The spec is a formal language definition (including for syntactic grammar). It doesn't talk about specifics of implementation much, nor it's supposed to do so (so demanding this from spec isn't appropriate).

It doesn't tell one to implement an LR(1), neither it describes the techniques *how* one going to do so (even though as was confirmed in this thread it's LR(1)-parsable). It just uses some formal notation for this, e.g. `lookahead ∉ {{, function}`. Specific techniques *how exactly* you going to achieve this, is not the purpose of the specification.

(but it's possible to clarify such techniques on discussions which we have now -- please note though, the techniques described above are not an "assumption", or a "guess", they are actual techniques used on practice to achieve this in LR(1)).

The spec in definitions is not consistent though. In one place it uses explicit *technique* for lookahead restriction via the "No<lookahead>" productions. In some cases it just uses formal description as  `lookahead ∉ {{, function}`, leaving a specific technique to achieve this up to implementations.

In fact (for consistency), the rule from ES5:

```
for ( ExpressionNoIn ; Expression ; Expression ) Statement
```

could be described as:

```
for ( `lookahead ∉ {in}` Expression ; Expression ; Expression ) Statement
```

and would mean the same.

You may search for ES5 LR(1) construction (there are several over internets), they use the described above techniques for `lookahead ∉ {{, function}` (adding `NoCurlyBrace`, or `NoFunction` productions). That's also said, it could be rewritten using Cover grammar which adds though post-processing overhead -- that might be the reason why some of the `No<lookahead>` productions were removed from ES6 spec, but `lookahead ∉ {{, function}` is still there. More complex cases like Arrow functions params is still implemented as a Cover grammar.

Dmitry


On Tue, Jan 31, 2017 at 7:56 AM, Michael Dyck <[hidden email]> wrote:
On 17-01-30 06:20 PM, Dmitry Soshnikov wrote:
On Mon, Jan 30, 2017 at 2:22 PM, Michael Dyck <[hidden email]
<mailto:[hidden email]>> wrote:


                1. Using "No<Lookahead>" productions.

            The problem with this approach is that, in general, a
            lookahead-restriction is not a restriction on the nonterminal that
            immediately follows it in the production, it's a restriction on
            the next few input elements, which might be generated by more
            than just that nonterminal.

        Well, in this case it's not a "lookahead" anymore.

    It isn't a "lookahead" in the resulting grammar, but it is in the
    original, and the two grammars should define the same language. What I'm
    saying is that it's incorrect to assume that the effect of the lookahead
    (in the original) can be confined to the nonterminal that immediately
    follows, which is what the suggested transformation does.


A "lookahead" is a terminal token. A non-terminal is not a "lookahead".

Okay, so then in my previous statement, read "lookahead" as "lookahead-restriction". (But I did say "lookahead-restriction" in the statement before that.)


To reiterate, imaging you have:

```
• (x, y)
```

and

```
• (x, y) -> { return x + y; }
```

where `•` is a cursor position. Your lookaheads are "(", "x", "," "y", etc.
Usually 1 lookahead token is enough to route the parser, in this case "(".

What you're talking that you need not these lookaheads (at the beginning of
the cursor), but *after* the parsed non-terminal (i.e. after fully parsed
`(x, y)`), i.e. you're looking for "->", which goes at the end of the
non-terminal.

I'm not saying I don't need the lookahead tokens right after the cursor. But I am saying that the lookahead tokens that the hypothetical parser needs to examine (in order to enforce a lookahead-restriction), even if it's only 1 or 2, might extend past the 'next' nonterminal.

But that's an awkward way to express my objection.

To recap, I'm talking about the suggestion to replace:
    [lookahead != <some sequence of terminals>] Nonterminal
in a production in the original grammar with
    Nonterminal-that-does-not-start-with-that-sequence
(and a bunch of new productions) in the transformed grammar, and whether that's a valid transformation.

I'm saying it isn't valid when the Nonterminal can derive phrases with fewer terminals than the prohibited lookahead-sequence. (E.g., if Nonterminal is nullable, or if it can derive a phrase of length 1 when the prohibited lookahead-sequence is length 2.)

This is not possible, since it's not a lookahead at that
specific cursor position. It will be a lookahead after you have parsed the
non-terminal corresponding to "(x, y)".

So you can't have two non-terminals for this, since it'll be a
"reduce/reduce" conflict, and you need a cover grammar here.

Well, I'm saying you'd run into problems before you even tried to construct the LR automaton, but I think you're agreeing with me that this approach is not generally applicable?


        In this case a cover grammar is simpler, and that's why it's used
        starting since ES6 spec.

    But note that none of the ES5 lookahead-restrictions was replaced with a
    cover grammar in ES6. They're all still lookahead-restrictions.

Sure, because it's normally possible to do with the "No<Lookahead>" doubled
productions.

But they weren't replaced with "No" productions (or grammatical parameters) either! All the lookahead-restrictions in ES5 are still lookahead-restrictions in the current draft.

So this actually answers your initial question (it's a real practice).

No, I don't think it does.

----------------------------

                2. Another approach is to use Cover grammar.

        Usually it's a manual process of designing a grammar, and defining
        post-processing static semantics.

    Okay, I'm not really interested then. I'm looking for an algorithm
    (i.e., a modification of the LR algorithm) that starts with a
    grammar-with-lookahead-restrictions and ends with an LR parser.

OK, yeah, at this point I think we already clarified that it is possible to
parse ES grammar,  and handle it's lookahead restrictions with several
techniques described.

Well, ES implementations exist, so clearly there are techniques for handling those aspects of the ES grammar. But that doesn't mean that any of them are parsing *according to* the ES grammar. E.g., it's unlikely that they're finding exactly the parse tree indicated by the ES grammar. (Nor is there any requirement that they do so.)

----

By way of contrast, consider "opt" subscripts and grammatical parameters. If you look up the LR parser-construction algorithm, it doesn't talk about how to handle such things, but the spec says how to transform them into plain context-free productions, which the LR algorithm can handle. So it's an easy shorthand to say that the ES grammar is LR(1) iff the 'expanded' grammar is LR(1).

But the spec doesn't give that kind of CFG-equivalence for lookahead-restrictions. Until there's agreement about what they mean in a CFG or LR sense, it isn't really meaningful to ask if the ES grammar is LR(1).

-Michael



_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
On 17-01-31 02:17 PM, Dmitry Soshnikov wrote:
> The spec is a formal language definition (including for syntactic grammar).

(Well, the extent to which the spec is formal is debatable. The grammar is
the most formal part of it, but that's true of most programming language
definitions.)

> It doesn't talk about specifics of implementation much, nor it's supposed to
> do so (so demanding this from spec isn't appropriate).

Right (and I'm pretty sure I didn't make such a demand).

> It doesn't tell one to implement an LR(1),

Nor should it.

> neither it describes the
> techniques *how* one going to do so (even though as was confirmed in this
> thread it's LR(1)-parsable). It just uses some formal notation for this,
> e.g. `lookahead ∉ {{, function}`. Specific techniques *how exactly* you
> going to achieve this, is not the purpose of the specification.

Right.

> (but it's possible to clarify such techniques on discussions which we have
> now -- please note though, the techniques described above are not an
> "assumption", or a "guess", they are actual techniques used on practice to
> achieve this in LR(1)).

I'm not disputing the actualness of the techniques. I'm saying:

  - technique #1 (duplicated productions): As stated, looks insufficient for
the ES grammar. Which may just mean that there's a more complicated version
of the approach that *is* sufficient for ES.

  - technique #2 (cover grammar): Can you algorithmically (a) construct the
cover grammar and (b) recover the parse according to the original grammar?

  - technique #3 (parse-time actions): In the LR automaton, can you
algorithmically distinguish action-avoided conflicts from 'true' conflicts?


> The spec in definitions is not consistent though. In one place it uses
> explicit *technique* for lookahead restriction via the "No<lookahead>"
> productions. In some cases it just uses formal description as  `lookahead ∉
> {{, function}`, leaving a specific technique to achieve this up to
> implementations.
>
> In fact (for consistency), the rule from ES5:
>
> ```
> for ( ExpressionNoIn ; Expression ; Expression ) Statement
> ```
>
> could be described as:
>
> ```
> for ( `lookahead ∉ {in}` Expression ; Expression ; Expression ) Statement
> ```
>
> and would mean the same.

No, they wouldn't. ExpressionNoIn (or Expression[~In]) is a version of
Expression that doesn't ("openly") use the 'in' operator (at the
RelationalExpression level), i.e. it doesn't have the 'in' operator
somewhere in the *middle*, whereas
    [lookahead ∉ {in}] Expression
prohibits an Expression that *starts* with the token 'in'. You can't express
the 'NoIn' restriction with any finite amount of lookahead, because a
RelationalExpression can be arbitrarily large.


> You may search for ES5 LR(1) construction (there are several over
> internets), they use the described above techniques for `lookahead ∉ {{,
> function}` (adding `NoCurlyBrace`, or `NoFunction` productions).

But the chances are low that they've been algorithmically generated from the
ES spec. And while it's conceivable I could figure out how to do it
algorithmically by examining examples of it done manually, I'm doubtful of
that too.

> That's also said, it could be rewritten using Cover grammar which adds
> though post-processing overhead -- that might be the reason why some of
> the `No<lookahead>` productions were removed from ES6 spec,

What removed productions are you thinking of? ES5's "NoIn" productions
became ES6+'s "[In]" parameter (which is basically the same thing with a
more compact notation). And ES5's "NoDash" productions are still in the
current draft.

-Michael

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
On Wed, Feb 1, 2017 at 7:37 AM, Michael Dyck <[hidden email]> wrote:
On 17-01-31 02:17 PM, Dmitry Soshnikov wrote:
The spec is a formal language definition (including for syntactic grammar).

(Well, the extent to which the spec is formal is debatable. The grammar is the most formal part of it, but that's true of most programming language definitions.)

It doesn't talk about specifics of implementation much, nor it's supposed to
do so (so demanding this from spec isn't appropriate).

Right (and I'm pretty sure I didn't make such a demand).

It doesn't tell one to implement an LR(1),

Nor should it.

neither it describes the
techniques *how* one going to do so (even though as was confirmed in this
thread it's LR(1)-parsable). It just uses some formal notation for this,
e.g. `lookahead ∉ {{, function}`. Specific techniques *how exactly* you
going to achieve this, is not the purpose of the specification.

Right.


OK, I mostly meant that a spec can be fully abstract, even in abstract definitions themselves, and use different notations instead of providing fully "CFG-equivalence for lookahead restrictions".
 


(but it's possible to clarify such techniques on discussions which we have
now -- please note though, the techniques described above are not an
"assumption", or a "guess", they are actual techniques used on practice to
achieve this in LR(1)).

I'm not disputing the actualness of the techniques. I'm saying:

 - technique #1 (duplicated productions): As stated, looks insufficient for the ES grammar. Which may just mean that there's a more complicated version of the approach that *is* sufficient for ES.


Here's an example: https://raw.githubusercontent.com/zaach/jison/master/examples/jscore.jison (see how lookahead restriction for `{` in `ExpressionStatement` is implemented using `NoBF` productions).

 

 - technique #2 (cover grammar): Can you algorithmically (a) construct the cover grammar and (b) recover the parse according to the original grammar?

And this approach I chose (instead of duplicating productions) to handle the same `{` lookahead restriction, what this comment explains: https://github.com/DmitrySoshnikov/syntax/blob/master/examples/lang.bnf#L206-L222

As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

 

 - technique #3 (parse-time actions): In the LR automaton, can you algorithmically distinguish action-avoided conflicts from 'true' conflicts?


The spec in definitions is not consistent though. In one place it uses
explicit *technique* for lookahead restriction via the "No<lookahead>"
productions. In some cases it just uses formal description as  `lookahead ∉
{{, function}`, leaving a specific technique to achieve this up to
implementations.

In fact (for consistency), the rule from ES5:

```
for ( ExpressionNoIn ; Expression ; Expression ) Statement
```

could be described as:

```
for ( `lookahead ∉ {in}` Expression ; Expression ; Expression ) Statement
```

and would mean the same.

No, they wouldn't. ExpressionNoIn (or Expression[~In]) is a version of Expression that doesn't ("openly") use the 'in' operator (at the RelationalExpression level), i.e. it doesn't have the 'in' operator somewhere in the *middle*, whereas
   [lookahead ∉ {in}] Expression
prohibits an Expression that *starts* with the token 'in'. You can't express the 'NoIn' restriction with any finite amount of lookahead, because a RelationalExpression can be arbitrarily large.


I agree, the example not the best, I was leaning to, that a spec could use even more abstract definition for this instead of explicit productions.

 

You may search for ES5 LR(1) construction (there are several over
internets), they use the described above techniques for `lookahead ∉ {{,
function}` (adding `NoCurlyBrace`, or `NoFunction` productions).

But the chances are low that they've been algorithmically generated from the ES spec. And while it's conceivable I could figure out how to do it algorithmically by examining examples of it done manually, I'm doubtful of that too.


I'm not sure about "algorithmic" transformations, but yeah, how the grammar is described in the spec, and which extra transformation you have to do in the resulting LR(1), might be different things. Does it mean the spec doesn't define an LR(1) grammar? Perhaps, but it doesn't really matter, since again the spec uses formal notation, and for shortness may omit actual techniques.

 
That's also said, it could be rewritten using Cover grammar which adds
though post-processing overhead -- that might be the reason why some of
the `No<lookahead>` productions were removed from ES6 spec,


Yeah, seems most of them were replaced with inverted notation (using [In] suffix) -- still leaving the interpretation of this abstract notation up to implementations.

P.S:

All this reduces to that practically today it might be easier to do just a manual recursive decent/accent. Yes, it's a lot of recursion, yes, it might be slower than table-driven parser. But gives you much control, and avoid all those described techniques (hence a manual recursive descent is chosen on practice by most of the implementations today). Or, the syntax of a language could be different itself, and avoid ambiguities.

Dmitry

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
12