LR(1) grammar/parser and lookahead-restrictions


Re: LR(1) grammar/parser and lookahead-restrictions

Waldemar Horwat
On 02/01/2017 10:25, Dmitry Soshnikov wrote:
> As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

Cover grammars are an ugly hack that we had to add when there was no other practical choice.  Avoid them as much as possible.  They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.

When designing the grammar, the preferences are:

- Use standard LR(1) productions
- Use parametrized productions
- Use lookahead restrictions if parametrized productions would introduce too many parameters into rules
- Use a cover grammar if the grammar can't be reasonably expressed in any other way.  They're a last resort with lots of problems.

Lookahead restrictions fit very well into an LR(1) engine as long as they're limited to only one token, and that's what I've implemented in the validator.  You need to be very careful with them if looking more than one token ahead because lexing of the tokens can vary based on context.  For example, if the next few characters in front of the cursor are )/abc/i+, then what is the second token?  What is the third token?  It's actually context-dependent.

The same problem is even worse for cover grammars.
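The `)/abc/i+` example is easy to reproduce in JavaScript itself. A small runnable sketch (the variable names are invented for illustration): the same characters after a `)` lex as division in one syntactic context and as a regular-expression literal in another.

```javascript
// The characters after ')' are lexed differently depending on what
// the ')' closes (illustrative example):
const abc = 2, i = 3, s = "abc";

// Here ')' ends a parenthesized expression, so '/' is the division operator:
const viaDivision = (12)/abc/i + 1;   // (12 / 2) / 3 + 1 === 3

// Here ')' ends an 'if' condition, so '/' begins the regexp literal /abc/i:
let viaRegexp = false;
if (s) viaRegexp = /abc/i.test(s);    // true
```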

     Waldemar

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss

Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <[hidden email]> wrote:
On 02/01/2017 10:25, Dmitry Soshnikov wrote:
As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

Cover grammars are an ugly hack that we had to add when there was no other practical choice.  Avoid them as much as possible.  They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.


Yeah, absolutely agreed. The reason I used a cover grammar in that example to parse `{` as a `Block` vs. an `ObjectLiteral`, and to handle `ExpressionStatement`, was to keep the resulting grammar short and not introduce a bunch of extra `NoCurly`-style productions for this. That said, this ugly hack also forces post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) into other nodes. In my case I have to actually check that all nodes between `{` and `}` are properties, or, vice versa, statements, depending on the expression/statement position.

In practice, a cover grammar can still be slower because of this post-processing overhead (imagine a giant file consisting of just one object literal -- all of those nodes now need to be validated as property nodes). OTOH, the trade-off with lookahead restrictions is roughly 1.5x more parsing states.
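The post-processing step described here can be sketched in miniature. The node shapes below are invented for illustration, not from any real parser:

```javascript
// Hypothetical re-interpretation of a cover node after parsing: in
// expression position every child must be a property; in statement
// position the same cover node becomes a block.
function reinterpretCover(coverNode, inExpressionPosition) {
  if (inExpressionPosition) {
    for (const child of coverNode.children) {
      if (child.type !== "Property") {
        throw new SyntaxError("Unexpected " + child.type + " in ObjectLiteral");
      }
    }
    return { type: "ObjectLiteral", properties: coverNode.children };
  }
  return { type: "BlockStatement", body: coverNode.children };
}

const cover = { type: "BlockOrObject",
                children: [{ type: "Property", key: "x", value: 1 }] };
const obj = reinterpretCover(cover, true);   // → ObjectLiteral
```

Note that this extra pass touches every child node, which is exactly the overhead a giant object literal would incur.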
 
When designing the grammar, the preferences are:

- Use standard LR(1) productions
- Use parametrized productions
- Use lookahead restrictions if parametrized productions would introduce too many parameters into rules
- Use a cover grammar if the grammar can't be reasonably expressed in any other way.  They're a last resort with lots of problems.


Just to double-check, by "parametrized productions" you mean the `No<lookahead>` extra productions?
 
Lookahead restrictions fit very well into an LR(1) engine as long as they're limited to only one token, and that's what I've implemented in the validator.  You need to be very careful with them if looking more than one token ahead because lexing of the tokens can vary based on context.  For example, if the next few characters in front of the cursor are )/abc/i+, then what is the second token?  What is the third token?  It's actually context-dependent.

The same problem is even worse for cover grammars.


Yeah, that's also true. Thanks again for the confirmation that such a validator even exists -- at least we know the grammar is parsable in principle. That said, in practice, implementing a manual recursive-descent parser for ECMAScript is the more approachable option now.

Dmitry 


Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
On Thu, Feb 2, 2017 at 3:56 PM, Dmitry Soshnikov <[hidden email]> wrote:
On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <[hidden email]> wrote:
On 02/01/2017 10:25, Dmitry Soshnikov wrote:
As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

Cover grammars are an ugly hack that we had to add when there was no other practical choice.  Avoid them as much as possible.  They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.


Yeah, absolutely agreed. The reason why I used cover grammar in that example to parse `{` as a `Block` vs. `ObjectLiteral`, and handle `ExpressionStatement` is to make the resulting grammar short, and don't introduce a bunch of `NoCurly`, etc extra productions for this. That's also said, this ugly hack also forces you to do post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) to other nodes -- in my case I have to actually check that all nodes between `{` and `}` are properties, or vice-versa, statements, based on the expression/statement position.

Practically, a cover grammar still can be slower because of such post-processing overhead (imaging you have a giant file with just object -- all those now are needed to be validated to contain property nodes). OTOH, a trade off with lookahead restrictions is introducing ~1.5x more parsing states.
 
When designing the grammar, the preferences are:

- Use standard LR(1) productions
- Use parametrized productions
- Use lookahead restrictions if parametrized productions would introduce too many parameters into rules
- Use a cover grammar if the grammar can't be reasonably expressed in any other way.  They're a last resort with lots of problems.


Just to double-check, by "parametrized productions" you mean the `No<lookahead>` extra productions?
 

Or do you actually mean "lookahead-parametrized" productions (as in LALR(1) and CLR(1), which use them, vs. SLR(1), which doesn't)?
 
Dmitry


Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
In reply to this post by Waldemar Horwat
On 17-02-02 06:23 PM, Waldemar Horwat wrote:
>
> Lookahead restrictions fit very well into an LR(1) engine

Again: Great, but how? E.g., do you pre-process the grammar, modify the
construction of the automaton, and/or modify the operation of the parser?

> as long as they're limited to only one token, and that's what I've
> implemented in the validator.

So when the grammar has a prohibited lookahead-sequence with more than one
token (in ExpressionStatement, IterationStatement, and ExportDeclaration),
does the validator just use the first token?

> You need to be very careful with them if looking more than one token
> ahead because lexing of the tokens can vary based on context. For
> example, if the next few characters in front of the cursor are )/abc/i+,
> then what is the second token? What is the third token? It's actually
> context-dependent.

But the context-dependentness of lexing is a parse-time problem, not a
validation-time problem, right? The syntactic level can just assume a stream
of (correctly lexed) input elements. Or do you validate deeper than the
syntactic grammar?

(And it seems to me that multi-token lookahead-restrictions would be hard
for a validator to handle even if lexing *wasn't* context-dependent, but
maybe that depends on how you handle them.)

-Michael

Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
In reply to this post by Dmitry Soshnikov
On 17-02-02 06:56 PM, Dmitry Soshnikov wrote:
>
> Just to double-check, by "parametrized productions" you mean the
> `No<lookahead>` extra productions?

See the 'Grammar Notation' section of ES6+, at the paragraph beginning
"A production may be parameterized by a subscripted annotation".

They were introduced in draft 20 of ES6, and *replaced* the use of the
"NoIn" productions of the previous draft (and ES3-5).

(which, strictly speaking, weren't "No<lookahead> productions", because they
didn't express a restriction on lookahead)

-Michael

Re: LR(1) grammar/parser and lookahead-restrictions

Waldemar Horwat
In reply to this post by Dmitry Soshnikov
On 02/02/2017 15:56, Dmitry Soshnikov wrote:

> On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <[hidden email] <mailto:[hidden email]>> wrote:
>
>     On 02/01/2017 10:25, Dmitry Soshnikov wrote:
>
>         As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).
>
>
>     Cover grammars are an ugly hack that we had to add when there was no other practical choice.  Avoid them as much as possible.  They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.
>
>
> Yeah, absolutely agreed. The reason why I used cover grammar in that example to parse `{` as a `Block` vs. `ObjectLiteral`, and handle `ExpressionStatement` is to make the resulting grammar short, and don't introduce a bunch of `NoCurly`, etc extra productions for this. That's also said, this ugly hack also forces you to do post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) to other nodes -- in my case I have to actually check that all nodes between `{` and `}` are properties, or vice-versa, statements, based on the expression/statement position.

Don't use a cover grammar to unify blocks with object literals.  That's a really bad idea and you'll likely get it wrong.  If the `{` starts a block, then if the next token is a keyword such as `if` then it's parsed as a keyword.  If the `{` starts an object literal, then if the next token is a keyword then it's parsed as an identifiername.  As we extend the language, the expansions of these can later diverge to the point where you won't know whether a `/` starts a regular expression or is a division symbol.

     Waldemar


Re: LR(1) grammar/parser and lookahead-restrictions

Waldemar Horwat
In reply to this post by Michael Dyck
On 02/03/2017 08:17, Michael Dyck wrote:
> On 17-02-02 06:23 PM, Waldemar Horwat wrote:
>>
>> Lookahead restrictions fit very well into an LR(1) engine
>
> Again: Great, but how? E.g., do you pre-process the grammar, modify the construction of the automaton, and/or modify the operation of the parser?

For each state × token combination, the automaton describes what happens when you're in state S and see a token T.  The lookahead restrictions remove possible transitions; without them there would be ambiguities where a given state × token combination would want to do two incompatible things.

That's different from parametrized rules, which simply macro-expand into lots of individual rules.
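The macro-expansion of parametrized rules can be sketched as follows. This is a toy expander over rule names only, not the spec's algorithm; the naming scheme is invented:

```javascript
// Toy macro-expansion of a parameterized rule name over its parameters,
// in the spirit of ES6's Expression[In, Yield] notation: each parameter
// doubles the number of concrete rules.
function expand(name, params) {
  let names = [name];
  for (const p of params) {
    names = names.flatMap(n => [n, n + "_" + p]);
  }
  return names;
}

expand("Expression", ["In", "Yield"]);
// → ["Expression", "Expression_Yield", "Expression_In", "Expression_In_Yield"]
```

With several parameters on a rule this blow-up is exponential, which is why too many parameters makes lookahead restrictions the better choice.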

>> as long as they're limited to only one token, and that's what I've
>> implemented in the validator.
>
> So when the grammar has a prohibited lookahead-sequence with more than one token (in ExpressionStatement, IterationStatement, and ExportDeclaration), does the validator just use the first token?

My LR(1) validator can't actually handle the case of multiple tokens of lookahead, and I didn't feel like doing an LR(2) grammar.  I had to check these by hand.

>> You need to be very careful with them if looking more than one token
>> ahead because lexing of the tokens can vary based on context. For
>> example, if the next few characters in front of the cursor are )/abc/i+,
>> then what is the second token? What is the third token? It's actually
>> context-dependent.
>
> But the context-dependentness of lexing is a parse-time problem, not a validation-time problem, right?

No.

> The syntactic level can just assume a stream of (correctly lexed) input elements.

No!  It's impossible to create a stream of correctly lexed input elements without doing syntactic level parsing.  See the example I gave above.

> Or do you validate deeper than the syntactic grammar?

Yes.  The syntactic grammar controls the lexer.  The validator looks for problems such as the syntactic grammar giving the lexer contradictory instructions.  An example would be any syntactic grammar automaton state where one outgoing syntactic transition would swallow a regexp token and another outgoing syntactic transition from the same state would swallow a / token.  If any such state exists, the grammar is broken.
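That consistency check can be sketched in a few lines. The state representation below is made up; the real validator's data structures are not described in this thread:

```javascript
// Sketch of the check described above: no automaton state may expect
// both a regexp literal and a '/' token from the lexer.
function findBrokenStates(states) {
  return states.filter(s =>
    s.expected.has("RegularExpressionLiteral") && s.expected.has("/"));
}

const states = [
  { id: 0, expected: new Set(["Identifier", "RegularExpressionLiteral"]) },
  { id: 1, expected: new Set(["/", "Identifier"]) },
  { id: 2, expected: new Set(["RegularExpressionLiteral", "/"]) }, // broken
];
findBrokenStates(states).map(s => s.id);   // → [2]
```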

> (And it seems to me that multi-token lookahead-restrictions would be hard for a validator to handle even if lexing *wasn't* context-dependent, but maybe that depends on how you handle them.)
>
> -Michael
>


Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
In reply to this post by Waldemar Horwat
On Fri, Feb 3, 2017 at 2:32 PM, Waldemar Horwat <[hidden email]> wrote:
On 02/02/2017 15:56, Dmitry Soshnikov wrote:
On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <[hidden email] <mailto:[hidden email]>> wrote:

    On 02/01/2017 10:25, Dmitry Soshnikov wrote:

        As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).


    Cover grammars are an ugly hack that we had to add when there was no other practical choice.  Avoid them as much as possible.  They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.


Yeah, absolutely agreed. The reason why I used cover grammar in that example to parse `{` as a `Block` vs. `ObjectLiteral`, and handle `ExpressionStatement` is to make the resulting grammar short, and don't introduce a bunch of `NoCurly`, etc extra productions for this. That's also said, this ugly hack also forces you to do post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) to other nodes -- in my case I have to actually check that all nodes between `{` and `}` are properties, or vice-versa, statements, based on the expression/statement position.

Don't use a cover grammar to unify blocks with object literals.  That's a really bad idea and you'll likely get it wrong.  If the `{` starts a block, then if the next token is a keyword such as `if` then it's parsed as a keyword.  If the `{` starts an object literal, then if the next token is a keyword then it's parsed as an identifiername.  As we extend the language, the expansions of these can later diverge to the point where you won't know whether a `/` starts a regular expression or is a division symbol.


Ah, good catch. In fact, this can be fixed by allowing keywords as property names, as in JS (and in fact, I have just fixed it after your comment -- see the example code changes in this commit: https://github.com/DmitrySoshnikov/syntax/commit/ccb3029cf7515bc078436c37831e027b50cb7fa4 -- there are no ambiguities, and it's still correctly parsed as an `ObjectLiteral`, or as a `Block` where needed).

Just to note, this example language is a smaller subset of the ECMAScript grammar, not the ES grammar itself, so, so far, the cover grammar works ;) (it's still a "dirty hack", since the language syntax was deliberately designed to allow the same construct in logically different positions).

P.S.:

Another observation to note: using a cover grammar for this specific case with `{` requires parsing at least in LALR(1) mode; SLR(1) will not work, eventually reaching a "reduce-reduce" conflict when parsing e.g. an empty block inside the empty `{ { } }`:

(output from the parsing states analysis tool)

```
- BlockStatement -> BlockOrObject • (kernel, reduce by production 13)
- ObjectLiteral -> BlockOrObject • (kernel, reduce by production 68)
```

LALR(1) uses per-state lookahead sets and is free from this issue. It's also the mode used most in practice anyway (though it takes more time to calculate the table).

Dmitry


Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
In reply to this post by Waldemar Horwat
On 17-02-03 05:32 PM, Waldemar Horwat wrote:

> On 02/03/2017 08:17, Michael Dyck wrote:
>> On 17-02-02 06:23 PM, Waldemar Horwat wrote:
>>>
>>> Lookahead restrictions fit very well into an LR(1) engine
>>
>> Again: Great, but how? E.g., do you pre-process the grammar, modify the
>> construction of the automaton, and/or modify the operation of the parser?
>
> For each state × token combination, the automaton describes what happens
> when you're in state S and see a token T.  The lookahead restrictions remove
> possible transitions; without them there would be ambiguities where a given
> state × token combination would want to do two incompatible things.

Okay, so do you generate the automaton (ignoring lookahead restrictions) and
then remove transitions (using lookahead restrictions)? Or do you integrate
the lookahead-restrictions into the generation of the automaton?

> That's different from parametrized rules, which simply macro-expand into
> lots of individual rules.

Yup.


>> But the context-dependentness of lexing is a parse-time problem, not a
>> validation-time problem, right?
>
> No.
>
>> The syntactic level can just assume a stream of (correctly lexed) input
>> elements.

(I should have said "*Validation* of the syntactic level [etc]")

> No!  It's impossible to create a stream of correctly lexed input elements
> without doing syntactic level parsing.

I quite agree. I didn't mean to suggest otherwise. What I mean is that, once
you've generated the automaton for the syntactic grammar, you can just look
at each state's set of expected terminals, and from that deduce the goal
symbol that the lexer will need to use to get the next token when the parser
is in that state. The point being that you can do that *after* generating
the syntactic automaton. So the context-dependentness of lexing doesn't have
to affect the process of generating the syntactic automaton.

(That's assuming an LR(1)-ish parser, and an approach where you don't try to
combine the syntactic and lexical grammars to generate a single
[scannerless] automaton. Which may not be your approach.)

> The validator looks for problems such as the syntactic grammar giving the
> lexer contradictory instructions. An example would be any syntactic
> grammar automaton state where one outgoing syntactic transition would
> swallow a regexp token and another outgoing syntactic transition from the
> same state would swallow a / token. If any such state exists, the grammar
> is broken.

Yup, the spec asserts the non-existence of such states ("syntactic grammar
contexts").

-Michael


Re: LR(1) grammar/parser and lookahead-restrictions

Waldemar Horwat
On 02/04/2017 07:20, Michael Dyck wrote:

> On 17-02-03 05:32 PM, Waldemar Horwat wrote:
>> On 02/03/2017 08:17, Michael Dyck wrote:
>>> On 17-02-02 06:23 PM, Waldemar Horwat wrote:
>>>>
>>>> Lookahead restrictions fit very well into an LR(1) engine
>>>
>>> Again: Great, but how? E.g., do you pre-process the grammar, modify the
>>> construction of the automaton, and/or modify the operation of the parser?
>>
>> For each state × token combination, the automaton describes what happens
>> when you're in state S and see a token T.  The lookahead restrictions remove
>> possible transitions; without them there would be ambiguities where a given
>> state × token combination would want to do two incompatible things.
>
> Okay, so do you generate the automaton (ignoring lookahead restrictions) and then remove transitions (using lookahead restrictions)? Or do you integrate the lookahead-restrictions into the generation of the automaton?

It's integrated.  You can't generate a valid automaton without the lookahead restrictions.

>> That's different from parametrized rules, which simply macro-expand into
>> lots of individual rules.
>
> Yup.
>
>
>>> But the context-dependentness of lexing is a parse-time problem, not a
>>> validation-time problem, right?
>>
>> No.
>>
>>> The syntactic level can just assume a stream of (correctly lexed) input
>>> elements.
>
> (I should have said "*Validation* of the syntactic level [etc]")
>
>> No!  It's impossible to create a stream of correctly lexed input elements
>> without doing syntactic level parsing.
>
> I quite agree. I didn't mean to suggest otherwise. What I mean is that, once you've generated the automaton for the syntactic grammar, you can just look at each state's set of expected terminals, and from that deduce the goal symbol that the lexer will need to use to get the next token when the parser is in that state. The point being that you can do that *after* generating the syntactic automaton. So the context-dependentness of lexing doesn't have to affect the process of generating the syntactic automaton.

That's correct.

> (That's assuming an LR(1)-ish parser, and an approach where you don't try to combine the syntactic and lexical grammars to generate a single [scannerless] automaton. Which may not be your approach.)

The parser and lexer stay separate, other than the lexer providing tokens to the parser and the parser selecting one of several top-level lexer goal symbols for lexing the next token.  I do not use any kind of unified parser-lexer grammar; that could run into issues such as the syntactic grammar making lexing non-greedy: a+++++b lexing as a ++ + ++ b instead of the correct a ++ ++ + b (which would then generate a syntax error at the syntactic level).
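The greedy-lexing point is easy to check in JavaScript itself:

```javascript
// 'a+++++b' lexes greedily (maximal munch) as 'a ++ ++ + b', and applying
// ++ to the result of a++ is an early syntax error, so parsing must reject it:
let rejected = false;
try {
  new Function("let a = 1, b = 1; a+++++b;");
} catch (e) {
  rejected = e instanceof SyntaxError;
}

// With explicit spacing the intended reading parses fine:
let a = 1, b = 1;
const sum = a++ + ++b;   // 1 + 2 === 3
```

A unified scannerless grammar might instead find the non-greedy `a ++ + ++ b` reading and silently accept the input, which is the divergence being warned about.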

>> The validator looks for problems such as the syntactic grammar giving the
>> lexer contradictory instructions. An example would be any syntactic
>> grammar automaton state where one outgoing syntactic transition would
>> swallow a regexp token and another outgoing syntactic transition from the
>> same state would swallow a / token. If any such state exists, the grammar
>> is broken.
>
> Yup, the spec asserts the non-existence of such states ("syntactic grammar contexts").
>
> -Michael

     Waldemar



Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
On 17-02-06 07:32 PM, Waldemar Horwat wrote:

> On 02/04/2017 07:20, Michael Dyck wrote:
>> On 17-02-03 05:32 PM, Waldemar Horwat wrote:
>>> On 02/03/2017 08:17, Michael Dyck wrote:
>>>> On 17-02-02 06:23 PM, Waldemar Horwat wrote:
>>>>>
>>>>> Lookahead restrictions fit very well into an LR(1) engine
>>>>
>>>> Again: Great, but how? E.g., do you pre-process the grammar, modify the
>>>> construction of the automaton, and/or modify the operation of the parser?
>>>
>>> For each state × token combination, the automaton describes what happens
>>> when you're in state S and see a token T.  The lookahead restrictions remove
>>> possible transitions; without them there would be ambiguities where a given
>>> state × token combination would want to do two incompatible things.
>>
>> Okay, so do you generate the automaton (ignoring lookahead restrictions)
>> and then remove transitions (using lookahead restrictions)? Or do you
>> integrate the lookahead-restrictions into the generation of the automaton?
>
> It's integrated.  You can't generate a valid automaton without the lookahead
> restrictions.

So, when you're making LR items for a production with a
lookahead-restriction (l-r), do you:

-- treat the l-r as a (nullable) symbol, so that you get an item with the
dot before the l-r, and one with the dot after, or

-- treat the l-r as occurring between two adjacent symbols, so that you get
an item where the dot is "at" the l-r, or

-- something else?

-Michael


Re: LR(1) grammar/parser and lookahead-restrictions

Isiah Meadows-2
Just out of curiosity, what's the best way to handle async arrow
function arguments vs `async` function calls? Is it term rewriting, or
is there a better way of handling that? I'm much more familiar with
recursive-descent and LL parsers than with LR or shift-reduce ones, so
I'm not as sure how to handle these cases.
-----

Isiah Meadows
[hidden email]


On Tue, Feb 7, 2017 at 9:39 AM, Michael Dyck <[hidden email]> wrote:

> On 17-02-06 07:32 PM, Waldemar Horwat wrote:
>>
>> On 02/04/2017 07:20, Michael Dyck wrote:
>>>
>>> On 17-02-03 05:32 PM, Waldemar Horwat wrote:
>>>>
>>>> On 02/03/2017 08:17, Michael Dyck wrote:
>>>>>
>>>>> On 17-02-02 06:23 PM, Waldemar Horwat wrote:
>>>>>>
>>>>>>
>>>>>> Lookahead restrictions fit very well into an LR(1) engine
>>>>>
>>>>>
>>>>> Again: Great, but how? E.g., do you pre-process the grammar, modify the
>>>>> construction of the automaton, and/or modify the operation of the
>>>>> parser?
>>>>
>>>>
>>>> For each state × token combination, the automaton describes what happens
>>>> when you're in state S and see a token T.  The lookahead restrictions
>>>> remove
>>>> possible transitions; without them there would be ambiguities where a
>>>> given
>>>> state × token combination would want to do two incompatible things.
>>>
>>>
>>> Okay, so do you generate the automaton (ignoring lookahead restrictions)
>>> and then remove transitions (using lookahead restrictions)? Or do you
>>> integrate the lookahead-restrictions into the generation of the
>>> automaton?
>>
>>
>> It's integrated.  You can't generate a valid automaton without the
>> lookahead
>> restrictions.
>
>
> So, when you're making LR items for a production with a
> lookahead-restriction (l-r), do you:
>
> -- treat the l-r as a (nullable) symbol, so that you get an item with the
> dot before the l-r, and one with the dot after, or
>
> -- treat the l-r as occurring between two adjacent symbols, so that you get
> an item where the dot is "at" the l-r, or
>
> -- something else?
>
>
> -Michael
>

Re: LR(1) grammar/parser and lookahead-restrictions

Waldemar Horwat
In reply to this post by Michael Dyck
On 02/07/2017 06:39, Michael Dyck wrote:

> On 17-02-06 07:32 PM, Waldemar Horwat wrote:
>> On 02/04/2017 07:20, Michael Dyck wrote:
>>> On 17-02-03 05:32 PM, Waldemar Horwat wrote:
>>>> On 02/03/2017 08:17, Michael Dyck wrote:
>>>>> On 17-02-02 06:23 PM, Waldemar Horwat wrote:
>>>>>>
>>>>>> Lookahead restrictions fit very well into an LR(1) engine
>>>>>
>>>>> Again: Great, but how? E.g., do you pre-process the grammar, modify the
>>>>> construction of the automaton, and/or modify the operation of the parser?
>>>>
>>>> For each state × token combination, the automaton describes what happens
>>>> when you're in state S and see a token T.  The lookahead restrictions remove
>>>> possible transitions; without them there would be ambiguities where a given
>>>> state × token combination would want to do two incompatible things.
>>>
>>> Okay, so do you generate the automaton (ignoring lookahead restrictions)
>>> and then remove transitions (using lookahead restrictions)? Or do you
>>> integrate the lookahead-restrictions into the generation of the automaton?
>>
>> It's integrated.  You can't generate a valid automaton without the lookahead
>> restrictions.
>
> So, when you're making LR items for a production with a lookahead-restriction (l-r), do you:
>
> -- treat the l-r as a (nullable) symbol, so that you get an item with the dot before the l-r, and one with the dot after, or
>
> -- treat the l-r as occurring between two adjacent symbols, so that you get an item where the dot is "at" the l-r, or
>
> -- something else?

It's something else — it's directly tied into the generation of the automaton states.  Each automaton state contains a collection of possible places in the expansions of grammar rules that the state can represent.  Following a terminal symbol T from a state A leads to a state B.  A lookahead restriction prevents state B's collection from including expansions that would have been prohibited by the lookahead restriction on T.  If that generates an inconsistency (for example, if there are two ways to get to an identical state, one with a lookahead restriction and one without), the grammar validation fails.
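The pruning mechanism can be shown in miniature. The item and state structures below are toy inventions, not the actual validator's representation:

```javascript
// Toy illustration: following terminal T from state A, any item whose
// lookahead restriction prohibits T is excluded from successor state B.
function successor(state, terminal) {
  const kernel = [];
  for (const item of state.items) {
    if (item.next !== terminal) continue;                            // item doesn't shift T
    if (item.restricted && item.restricted.has(terminal)) continue;  // pruned by restriction
    kernel.push({ rule: item.rule, next: item.after });
  }
  return { items: kernel };
}

// ExpressionStatement carries [lookahead ∉ {'{'}], so its item is pruned
// when the automaton follows '{', leaving only the Block interpretation:
const stateA = { items: [
  { rule: "Block",               next: "{", after: "StatementList" },
  { rule: "ExpressionStatement", next: "{", after: "ObjectTail",
    restricted: new Set(["{"]) },
]};
successor(stateA, "{").items.map(i => i.rule);   // → ["Block"]
```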

     Waldemar


Re: LR(1) grammar/parser and lookahead-restrictions

Michael Dyck
On 17-02-07 06:02 PM, Waldemar Horwat wrote:

>
> Lookahead restrictions fit very well into an LR(1) engine [...]
> it's directly tied into the generation of the automaton states. Each
> automaton state contains a collection of possible places in the
> expansions of grammar rules that the state can represent. Following a
> terminal symbol T from a state A leads to a state B. A lookahead
> restriction prevents state B's collection from including expansions that
> would have been prohibited by the lookahead restriction on T. If that
> generates an inconsistency (for example, if there are two ways to get to
> an identical state, one with a lookahead restriction and one without),
> the grammar validation fails.

So that sounds like you form the closure of state A in the standard way, and
then use the lookahead-restriction to limit the transitions out of A (and
the items that A contributes to the kernel of B). But it seems to me that
forming the closure is also affected by the lookahead-restriction. Is that
true in your validator?

-Michael


Re: LR(1) grammar/parser and lookahead-restrictions

Dmitry Soshnikov
In reply to this post by Dmitry Soshnikov
On Mon, Jan 23, 2017 at 10:46 PM, Dmitry Soshnikov <[hidden email]> wrote:

On Mon, Jan 23, 2017 at 5:24 PM, Waldemar Horwat <[hidden email]> wrote:
On 01/11/2017 10:28, Michael Dyck wrote:
In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:
  - modify the grammar (before the parser is constructed),
  - modify the construction of the parser, and/or
  - modify the operation of the parser.
It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you handle lookahead-restrictions?

I'm the source of the LR(1) condition.  I've been doing that ever since ECMAScript Edition 3, and in fact am using the LR parser to help design and debug the spec.  I have an implementation of the parser with a few extensions to the LR grammar, including support for parametrized productions, lookahead restrictions, and lexer-parser feedback used to disambiguate things such as what token / will start.  I validate various things such as making sure that there is no place where both an identifier and a regular expression can occur.


Thanks for the confirmation. 

lookahead restrictions

Yeah, all the lookahead restrictions, such as `lookahead ∉ {{, function}`, add a lot of extra "No<Lookahead>" productions, which unfortunately makes the grammar (and hence the parsing table) bigger. Examples are `ExpressionNoIn` and `VariableDeclarationNoIn` from the spec, and similarly `...NoLeftCurly` and `...NoFunction` to support those lookahead restrictions and allow `ExpressionStatement` to be parsed without "reduce/reduce" conflicts.

That's why, in the mentioned "sub-set" of the ECMAScript example language [1], I chose not to use the same syntax for function expressions and function declarations: `fn foo() {}` for declarations, and just arrow functions for function expressions :) -- no ambiguity, no lookahead restrictions, hence no larger grammar, hence a faster parser.

> lexer-parser feedback

Yes, having a lexer/parser state solves it more elegantly.


FYI: I just built this as a proof of concept for ECMAScript's `{` and `}`, parsing `{ }` in statement position as a `BlockStatement` and in expression position as an `ObjectLiteral`. With special "activation productions" one can change the lexer state and yield different token types for the same characters.
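The feedback direction can be sketched like this. The state objects are invented, but `InputElementDiv` and `InputElementRegExp` are the spec's actual lexical goal symbols:

```javascript
// Sketch of lexer-parser feedback (hypothetical API): the parser picks
// the lexer's goal symbol from what the current automaton state accepts.
function goalFor(state) {
  return state.expected.has("RegularExpressionLiteral")
    ? "InputElementRegExp"     // '/' would start a regexp literal here
    : "InputElementDiv";       // '/' is a division punctuator here
}

const afterExpression = { expected: new Set(["/", "+", ";"]) };
const statementStart  = { expected: new Set(["Identifier",
                                             "RegularExpressionLiteral"]) };
goalFor(afterExpression);   // → "InputElementDiv"
goalFor(statementStart);    // → "InputElementRegExp"
```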



Dmitry
