ES RegExp parser

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

ES RegExp parser

Dmitry Soshnikov
I started working on a ECMAScript regular expressions parser[1] with an AST format similar to Mozilla's Parser API. This might later be extended to support more powerful constructs, like lookbehind assertions, multiline regexes, groups naming, comments, etc.

And while this is mostly an FYI post (probably someone will find it useful for regexes analysis in source transformation tools, or source code editors), I'd appreciate any feedback on the specification of AST nodes (currently totally made up by myself).

E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't tell anything (and shouldn't of course) which AST node this quantifier node produces.

This leaves open questions like "whether a quantifier should be a part of the parsed expression, or should it vice-versa be a main node itself, and have the expression as a sub-node?":

In other words, which format is more appropriate (taking into account AST traversal tools in order to implement NFA/DFA engine for it later):

```
/a+?/
```

Char is main, quantifier is a sub-node:

```
{
  type: 'Char',
  value: 'a',
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

The quantifier is main (creating `Repetition` AST node), char is the `expression` sub-node:


```
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a'
  },
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

Currently I chose the second approach (with `Repetition` node) as more practical when building an AST traversal -- it may have `onRepetition` generic handler, and call `onChar` internally for its `expression`, instead of making `onChar` (or any other node) to check, and handle its `quantifier`, and do a repeat.

Anyways, if you have any thought or feedback on AST nodes format, please feel free to contact me.

Dmitry


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: ES RegExp parser

Jason Orendorff
The second approach, hands down.

1. With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check `re.quantifier` anywhere, your code will run, but you have a bug. Much easier to only have to check `re.type`.

2. If you have a regexp `re` and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write `{type: '+', expression: re}` than to have to examine `re.quantifier` and (if it's already present) figure out how to modify it.

3. With the first approach, you don't have to represent `(?:)` group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: `/(?:\d{4,6})?/`

To me this is not even a question.

-j



On Sun, Mar 19, 2017 at 4:52 PM, Dmitry Soshnikov <[hidden email]> wrote:
I started working on a ECMAScript regular expressions parser[1] with an AST format similar to Mozilla's Parser API. This might later be extended to support more powerful constructs, like lookbehind assertions, multiline regexes, groups naming, comments, etc.

And while this is mostly an FYI post (probably someone will find it useful for regexes analysis in source transformation tools, or source code editors), I'd appreciate any feedback on the specification of AST nodes (currently totally made up by myself).

E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't tell anything (and shouldn't of course) which AST node this quantifier node produces.

This leaves open questions like "whether a quantifier should be a part of the parsed expression, or should it vice-versa be a main node itself, and have the expression as a sub-node?":

In other words, which format is more appropriate (taking into account AST traversal tools in order to implement NFA/DFA engine for it later):

```
/a+?/
```

Char is main, quantifier is a sub-node:

```
{
  type: 'Char',
  value: 'a',
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

The quantifier is main (creating `Repetition` AST node), char is the `expression` sub-node:


```
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a'
  },
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

Currently I chose the second approach (with `Repetition` node) as more practical when building an AST traversal -- it may have `onRepetition` generic handler, and call `onChar` internally for its `expression`, instead of making `onChar` (or any other node) to check, and handle its `quantifier`, and do a repeat.

Anyways, if you have any thought or feedback on AST nodes format, please feel free to contact me.

Dmitry


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss



_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: ES RegExp parser

Jason Orendorff
> `{type: '+', expression: re}`

Sorry, I guess the real expression would be something more like:

    {
      type: 'Repetition',
      expression: re,
      quantifier: {
        type: '+',
        greedy: true
      }
    }

The point stands.

-j


On Mon, Mar 20, 2017 at 10:36 AM, Jason Orendorff <[hidden email]> wrote:
The second approach, hands down.

1. With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check `re.quantifier` anywhere, your code will run, but you have a bug. Much easier to only have to check `re.type`.

2. If you have a regexp `re` and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write `{type: '+', expression: re}` than to have to examine `re.quantifier` and (if it's already present) figure out how to modify it.

3. With the first approach, you don't have to represent `(?:)` group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: `/(?:\d{4,6})?/`

To me this is not even a question.

-j



On Sun, Mar 19, 2017 at 4:52 PM, Dmitry Soshnikov <[hidden email]> wrote:
I started working on a ECMAScript regular expressions parser[1] with an AST format similar to Mozilla's Parser API. This might later be extended to support more powerful constructs, like lookbehind assertions, multiline regexes, groups naming, comments, etc.

And while this is mostly an FYI post (probably someone will find it useful for regexes analysis in source transformation tools, or source code editors), I'd appreciate any feedback on the specification of AST nodes (currently totally made up by myself).

E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't tell anything (and shouldn't of course) which AST node this quantifier node produces.

This leaves open questions like "whether a quantifier should be a part of the parsed expression, or should it vice-versa be a main node itself, and have the expression as a sub-node?":

In other words, which format is more appropriate (taking into account AST traversal tools in order to implement NFA/DFA engine for it later):

```
/a+?/
```

Char is main, quantifier is a sub-node:

```
{
  type: 'Char',
  value: 'a',
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

The quantifier is main (creating `Repetition` AST node), char is the `expression` sub-node:


```
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a'
  },
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

Currently I chose the second approach (with `Repetition` node) as more practical when building an AST traversal -- it may have `onRepetition` generic handler, and call `onChar` internally for its `expression`, instead of making `onChar` (or any other node) to check, and handle its `quantifier`, and do a repeat.

Anyways, if you have any thought or feedback on AST nodes format, please feel free to contact me.

Dmitry


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss




_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: ES RegExp parser

Dmitry Soshnikov
In reply to this post by Jason Orendorff

On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <[hidden email]> wrote:
The second approach, hands down.

1. With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check `re.quantifier` anywhere, your code will run, but you have a bug. Much easier to only have to check `re.type`.

2. If you have a regexp `re` and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write `{type: '+', expression: re}` than to have to examine `re.quantifier` and (if it's already present) figure out how to modify it.

3. With the first approach, you don't have to represent `(?:)` group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: `/(?:\d{4,6})?/`

To me this is not even a question.

Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.

Dmitry


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: ES RegExp parser

Dmitry Soshnikov
On Mon, Mar 20, 2017 at 9:19 AM, Dmitry Soshnikov <[hidden email]> wrote:

On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <[hidden email]> wrote:
The second approach, hands down.

1. With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check `re.quantifier` anywhere, your code will run, but you have a bug. Much easier to only have to check `re.type`.

2. If you have a regexp `re` and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write `{type: '+', expression: re}` than to have to examine `re.quantifier` and (if it's already present) figure out how to modify it.

3. With the first approach, you don't have to represent `(?:)` group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: `/(?:\d{4,6})?/`

To me this is not even a question.

Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.


OK, I added docs and specs for AST node types, which can also be a good learning material: https://www.npmjs.com/package/regexp-tree#ast-nodes-specification

Any feedback is welcome!

Parsing regexes is fun :) with this you realize that these are completely valid regexp patterns:

```
/^^^^$$$/.test(''); // true

/$^/.test(''); // true

/[---]/.test('-'); // true, a range from '-' to '-'!
/[-]/.test('-'); // true
``` 

Dmitry

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Loading...