Identifying ECMAScript identifiers

Identifying ECMAScript identifiers

Norbert Lindenberg
ECMAScript is used to implement a variety of tools that check code for conformance with the ECMAScript specification, minimize it, perform other transformations, or generate ECMAScript code. These tools have to be able to recognize ECMAScript identifiers, taking the identifier specification and the underlying Unicode specification into consideration - not quite easy given the ever-growing Unicode character set.

While looking at support for Unicode character properties in general, I realized that this use case is shaped differently from others, fundamental to ECMAScript, and amenable to a fairly simple solution, and so there's now a strawman:
http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification

I'd like to discuss this at next week's TC 39 meeting, but also invite earlier comments.

Thanks,
Norbert

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss

Re: Identifying ECMAScript identifiers

gaz Heyes
You forgot to include MentalJS. I can parse 120k identifiers in 5ms in Firefox on my crappy machine. My method is much faster than any of the parsers you listed, and I handle Unicode escapes too.
http://businessinfo.co.uk/labs/MentalJS/MentalJS.html

Re: Identifying ECMAScript identifiers

Yusuke Suzuki
> These tools have to be able to recognize ECMAScript identifiers, taking the identifier specification and the underlying Unicode specification into consideration - not quite easy given the ever-growing Unicode character set.
 
Yeah. We, the Esprima developers, parse UnicodeData.txt to generate the identifier identification functions.
I wrote a simple UnicodeData.txt parser and generated a RegExp [1]. These functions are also used in Acorn.

In Esprima and Acorn, for performance reasons, the identifier identification functions take a code point as a number, not a string [2][3].
So I suggest accepting a code point number as an argument.
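
As a rough illustration of the generation step described above (a sketch under stated assumptions, not Esprima's actual script): the rows below are a hand-picked sample in the UnicodeData.txt layout, truncated to the first three semicolon-separated fields, with the CJK "Last" value as in Unicode 6.2. The names collectRanges and toCharacterClass are hypothetical, and the sketch only handles BMP code points.

```javascript
// Sample rows, truncated to the first three fields of UnicodeData.txt
// (code point; name; general category). Not the full file.
const sampleRows = [
  "0030;DIGIT ZERO;Nd",
  "0041;LATIN CAPITAL LETTER A;Lu",
  "0042;LATIN CAPITAL LETTER B;Lu",
  "0043;LATIN CAPITAL LETTER C;Lu",
  "005F;LOW LINE;Pc",
  "00B5;MICRO SIGN;Ll",
  "4E00;<CJK Ideograph, First>;Lo",
  "9FCC;<CJK Ideograph, Last>;Lo",
];

// General categories that ES5 allows at the start of an identifier
// (UnicodeLetter); "$" and "_" are handled separately by the grammar.
const START_CATEGORIES = new Set(["Lu", "Ll", "Lt", "Lm", "Lo", "Nl"]);

// Hypothetical helper: collect [first, last] ranges for the wanted categories.
function collectRanges(rows, categories) {
  const ranges = [];
  let pendingFirst = null; // start of a "<..., First>" / "<..., Last>" pair
  for (const row of rows) {
    const [hex, name, category] = row.split(";");
    const cp = parseInt(hex, 16);
    if (name.endsWith(", First>")) {
      pendingFirst = cp;
      continue;
    }
    const isLast = name.endsWith(", Last>") && pendingFirst !== null;
    const first = isLast ? pendingFirst : cp;
    pendingFirst = null;
    if (!categories.has(category)) continue;
    const prev = ranges[ranges.length - 1];
    if (prev && prev[1] + 1 === first) prev[1] = cp; // extend adjacent range
    else ranges.push([first, cp]);
  }
  return ranges;
}

// Hypothetical helper: turn BMP ranges into RegExp character class source.
function toCharacterClass(ranges) {
  const esc = (cp) => "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");
  const body = ranges
    .map(([a, b]) => (a === b ? esc(a) : esc(a) + "-" + esc(b)))
    .join("");
  return "[" + body + "]";
}

const ranges = collectRanges(sampleRows, START_CATEGORIES);
const identifierStart = new RegExp(toCharacterClass(ranges));
```

With the full data file as input, the same two steps would yield one long character class; the sample here only covers A-C, the micro sign, and the CJK block.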



--
Regards,
Yusuke Suzuki

Re: Identifying ECMAScript identifiers

gaz Heyes
On 8 March 2013 10:35, Yusuke SUZUKI <[hidden email]> wrote:
> Yeah. We, Esprima developers, parse UnicodeData.txt to generate identifier identification functions.
> I wrote simple UnicodeData.txt parser and generated RegExp[1]. These functions are also used in Acorn.

RegEx is slower. I suggest using if statements on char codes, with < and > comparisons to check that a code is within a range such as a-z, and then separate functions that handle non-ASCII characters only when needed and compare the char codes against the ranges of allowed identifier characters.

https://code.google.com/p/mentaljs/source/browse/trunk/MentalJS/javascript/Mental.js#504

I still have to optimize that function further by removing <= and >= and maybe separating each identifier range into its own function, since identifiers with higher non-alpha characters take longer to parse because their ranges are at the end of the if statement.
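
One way to avoid the cost of ranges that sit at the end of a long if chain is a sorted range table searched by bisection. This is a different technique from the MentalJS function linked above, sketched here only for comparison. The table is an illustrative excerpt of the first few non-ASCII identifier-start ranges, and the names START_RANGES, inRanges, and isIdentifierStartCode are hypothetical.

```javascript
// Illustrative excerpt only: sorted ranges flattened as
// [start0, end0, start1, end1, ...]. A real table would be generated
// from the Unicode data files and would be much longer.
const START_RANGES = [
  0x00aa, 0x00aa,
  0x00b5, 0x00b5,
  0x00ba, 0x00ba,
  0x00c0, 0x00d6,
  0x00d8, 0x00f6,
  0x00f8, 0x02c1,
];

// Binary search: every range costs O(log n) comparisons, wherever it sits.
function inRanges(code, table) {
  let lo = 0;
  let hi = table.length / 2 - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (code < table[2 * mid]) hi = mid - 1;
    else if (code > table[2 * mid + 1]) lo = mid + 1;
    else return true;
  }
  return false;
}

function isIdentifierStartCode(code) {
  // ASCII fast path: plain comparisons on the char code.
  if (code < 0x80) {
    return (code >= 0x61 && code <= 0x7a) || // a-z
           (code >= 0x41 && code <= 0x5a) || // A-Z
           code === 0x24 || code === 0x5f;   // $ and _
  }
  // Non-ASCII: handled separately, only when needed.
  return inRanges(code, START_RANGES);
}
```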

Re: Identifying ECMAScript identifiers

Norbert Lindenberg
In reply to this post by Yusuke Suzuki

On Mar 8, 2013, at 2:35 , Yusuke SUZUKI wrote:

> In Esprima and Acorn, because of performance issue, their identifier identification functions require a code point as number, not string[2][3].
> So I suggest accepting a code point number as an argument.

The functions I proposed accept both numbers and strings.

http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification

Norbert
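
To make the "numbers and strings" point concrete, here is a minimal sketch of one way a function could accept either form. It is not the strawman's algorithm: the names toCodePoint and isIdentifierStartSketch, the RangeError cases, and the use of the engine's built-in ID_Start property are all assumptions made for illustration.

```javascript
// Hypothetical helper: normalize "code point as number" or "string" input.
function toCodePoint(value) {
  if (typeof value === "number") {
    if (!Number.isInteger(value) || value < 0 || value > 0x10ffff) {
      throw new RangeError("not a Unicode code point: " + value);
    }
    return value;
  }
  const s = String(value);
  // Assumption: an empty string is treated as an error.
  if (s.length === 0) throw new RangeError("empty string");
  return s.codePointAt(0); // combines a surrogate pair into one code point
}

// Hypothetical function: accepts a code point number or a string.
function isIdentifierStartSketch(value) {
  const cp = toCodePoint(value);
  if (cp === 0x24 || cp === 0x5f) return true; // $ and _
  return /^\p{ID_Start}$/u.test(String.fromCodePoint(cp));
}
```

A parser that already works on numeric code points can call this without allocating a string for the argument, which is the concern raised above; the string conversion inside the sketch is only there to keep it short.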

Re: Identifying ECMAScript identifiers

Ariya Hidayat
In reply to this post by gaz Heyes
> RegEx is slower. I suggest using if statements on char codes and < and > to
> check it's within the range of z-a etc ...

If you check Yusuke's links, that is exactly what Esprima is doing.
The use of regular expressions is reserved only for the slow/uncommon
code path.
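
The fast-path/slow-path split can be sketched roughly as follows. This is not Esprima's source: the regular expression is a stand-in built from Unicode property escapes rather than a generated one, it follows the ES5 IdentifierPart categories, and the names NON_ASCII_PART and isIdentifierPart are assumptions.

```javascript
// Stand-in for a generated RegExp: ES5 IdentifierPart categories
// (letters, Nl, combining marks, digits, connector punctuation) plus
// ZWNJ and ZWJ.
const NON_ASCII_PART =
  /[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\u200C\u200D]/u;

function isIdentifierPart(ch) {
  // Common path: ASCII code units are decided by comparisons alone.
  if (ch < 0x80) {
    return (ch >= 0x61 && ch <= 0x7a) || // a-z
           (ch >= 0x41 && ch <= 0x5a) || // A-Z
           (ch >= 0x30 && ch <= 0x39) || // 0-9
           ch === 0x24 || ch === 0x5f;   // $ and _
  }
  // Uncommon path: only non-ASCII code units reach the regular expression.
  return NON_ASCII_PART.test(String.fromCharCode(ch));
}
```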


--
Ariya Hidayat, http://ariya.ofilabs.com
http://twitter.com/ariyahidayat
http://gplus.to/ariyahidayat

Re: Identifying ECMAScript identifiers

Yusuke Suzuki
In reply to this post by Norbert Lindenberg

> The functions I proposed accept both numbers and strings.
> http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification

Ah, I see. I missed what is intended at step 2. Looks very nice, thanks.


--
Regards,
Yusuke Suzuki

Re: Identifying ECMAScript identifiers

gaz Heyes
In reply to this post by Ariya Hidayat
On 9 March 2013 01:59, Ariya Hidayat <[hidden email]> wrote:
> If you check Yusuke's links, that is exactly what Esprima is doing.
> The use of regular expression is reserved only for the slow/uncommon
> code path.

Yeah, I can see you are converting a char code into a string using fromCharCode and testing it against a regex, which is slower, and I showed you a function that checks non-alpha characters using char codes. BTW, your isWhiteSpace function also calls indexOf and fromCharCode when it doesn't need to, and since indexOf positions start at 0, you have to check that the result is greater than -1.
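
The indexOf point can be shown with a small, self-contained sketch. The functions below are hypothetical and are not taken from Esprima; the whitespace list covers only the individually named ES5 WhiteSpace characters and leaves out the other Zs space separators.

```javascript
// Tab, VT, FF, space, no-break space, BOM.
const WHITESPACE = "\u0009\u000B\u000C\u0020\u00A0\uFEFF";

// Pitfall: indexOf returns 0 for a match at the first position, so a
// "> 0" test misses the first character of the list (the tab here).
function isWhiteSpaceOffByOne(code) {
  return WHITESPACE.indexOf(String.fromCharCode(code)) > 0;
}

// Correct indexOf test: "not found" is -1, so compare against that.
function isWhiteSpaceViaIndexOf(code) {
  return WHITESPACE.indexOf(String.fromCharCode(code)) > -1;
}

// Same check with comparisons only: no string allocation, no search.
function isWhiteSpace(code) {
  return code === 0x09 || code === 0x0b || code === 0x0c ||
         code === 0x20 || code === 0xa0 || code === 0xfeff;
}
```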

Re: Identifying ECMAScript identifiers

Allen Wirfs-Brock
In reply to this post by Norbert Lindenberg
Norbert,

Can you explain why you think these should be functions on String rather than part of a more general character classification facility that might be associated with some more specialized object? The latter approach would seem to have modularity advantages at both the implementation and usage level.

Allen





Re: Identifying ECMAScript identifiers

Peter van der Zee
Norbert, for the sake of completeness:

ZeParser (http://github.com/qfox/zeparser) does support complete
Unicode identifiers.
ZeParser2 (http://github.com/qfox/zeparser2) doesn't (I simply didn't bother).

- peter

Re: Identifying ECMAScript identifiers

Norbert Lindenberg
In reply to this post by Allen Wirfs-Brock
I added these functions to String because that seems the best place for them in the current arrangement. I'm aware of the proposal to modularize the standard library [1] and can well imagine that these functions will find a better home in that new scheme.

The other character classification scheme I'm looking into is based on Unicode character properties. The reasons why I separated out this proposal are:

- Tools operating on ECMAScript source code need to be aware of the ECMAScript version they use, for syntax, semantics, keywords, and, well, the characters allowed in identifiers. Some tools let their clients specify an ECMAScript version (e.g., "es5" in JSLint and JSHint), others may assume a fixed version. The characters in turn are tied to both Unicode versions and ECMAScript versions - for example, SpiderMonkey currently supports Unicode 6.2 characters, but restricted to the BMP because it hasn't been upgraded to ES6 identifiers yet.

- For Unicode character properties, on the other hand, clients generally need only the properties as of the latest known version, and in the few exceptions that I know of (such as the 2003 version of IDNA) only specific Unicode versions are needed. Requiring that a general API for Unicode character properties provide access to Unicode version-specific information would create a huge burden on implementors, but benefit no-one.

- It's difficult for tools developers to determine the correct set of characters to include as identifier characters. One particular difficulty is that the Unicode general category of a character can change in rare cases, so a character can move into or out of the categories that the ES3/ES5 specifications reference. For compatibility, characters shouldn't move out of the set of characters allowed for identifiers. (It turns out that browsers also get this wrong - all of them.) ES6 solves this problem by basing its identifier definition on Unicode Standard Annex 31, Unicode Identifier and Pattern Syntax, which defines special sets of characters Other_ID_Start and Other_ID_Continue and treats these characters as identifier characters even though their current general categories no longer qualify them as such [2].

- For general Unicode processing, I think it's important to have support in regular expressions, because that's what many applications use for text processing. For tools operating on ECMAScript source code that seems less important, based on the data I collected [3].

So, rather than having one grand unified character classification API with support for both Unicode versions and regular expressions I think it's better to provide tailored APIs for different purposes.
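
The difference between the ES3/ES5 and ES6 definitions described in the list above can be sketched as follows. The function name, signature, and BMP restriction are assumptions for illustration, not the strawman's algorithm; the RangeError mirrors the edition check quoted later in this thread. Both regular expressions use whatever Unicode version the engine ships, so the sketch does not pin a Unicode version per edition, which is exactly the part a general property API does not provide.

```javascript
// Hypothetical function, for illustration only.
function isIdentifierStartForEdition(cp, edition) {
  if (edition !== 3 && edition !== 5 && edition !== 6) {
    throw new RangeError("unsupported edition: " + edition);
  }
  if (cp === 0x24 || cp === 0x5f) return true; // $ and _
  const ch = String.fromCodePoint(cp);
  if (edition === 6) {
    // ES6: UAX #31 ID_Start, which includes Other_ID_Start.
    return /^\p{ID_Start}$/u.test(ch);
  }
  // ES3/ES5: general categories Lu, Ll, Lt, Lm, Lo, Nl; assumed BMP only.
  return cp <= 0xffff && /^[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]$/u.test(ch);
}
```

U+2118 (SCRIPT CAPITAL P) shows the difference: its general category is Sm, so the category-based rule rejects it with current Unicode data, while ID_Start accepts it through Other_ID_Start.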

[1] http://wiki.ecmascript.org/doku.php?id=harmony:modules_standard
[2] http://www.unicode.org/reports/tr31/#Backward_Compatibility
[3] http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification

Norbert



Re: Identifying ECMAScript identifiers

Mathias Bynens-2
Great proposal, Norbert!

Another tool that uses JavaScript to identify identifiers as per ECMAScript 5.1 / Unicode 6.2 is http://mothereff.in/js-variables.

For a list of bug reports regarding identifier handling in browsers / JavaScript engines, see http://mathiasbynens.be/notes/javascript-identifiers (look for “Some of these don’t work in all browsers/environments”).

I’m a bit confused by step 7.2, though: “If edition is not 3, 5, or 6, throw a RangeError exception.” Does this mean only integers are accepted? E.g. you can specify `5` as the ECMAScript version, but not `5.1`? I would suggest adding `5.1` to the list (even if it’s just an alias to `5`), but perhaps I’m missing something.

Also, how about adding `String.isIdentifier(string)` as well?

On 12 Mar 2013, at 02:45, Norbert Lindenberg <[hidden email]> wrote:

> So, rather than having one grand unified character classification API with support for both Unicode versions and regular expressions I think it's better to provide tailored APIs for different purposes.

+1

> - For general Unicode processing, I think it's important to have support in regular expressions, because that's what many applications use for text processing. For tools operating on ECMAScript source code that seems less important, based on the data I collected [3].

Agreed it would be nice. In the meantime, to polyfill this functionality, tools that take a list of code points / symbols / ranges (like http://mths.be/regenerate) could be used.

Regards,
Mathias


Re: Identifying ECMAScript identifiers

Mathias Bynens-2
Also, what about the non-reserved words that act like reserved words, i.e. the immutable `NaN`, `Infinity`, and `undefined` properties of the global object, or `eval` and `arguments` which are disallowed as identifiers (see section 12.2.1) in strict mode? IMHO, these are examples of why it would be useful to add a robust `String.isIdentifier` to the proposal.
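
A minimal sketch of what such a whole-string check might look like follows. Everything here is hypothetical: the name isIdentifierSketch, the options.strict flag, and the choice of the ES6-style ID_Start/ID_Continue properties are assumptions, the reserved word lists follow ES5.1, and Unicode escape sequences inside identifiers are not handled.

```javascript
// ES5.1 keywords, future reserved words, and literals.
const RESERVED = new Set([
  "break", "case", "catch", "continue", "debugger", "default", "delete", "do",
  "else", "finally", "for", "function", "if", "in", "instanceof", "new",
  "return", "switch", "this", "throw", "try", "typeof", "var", "void",
  "while", "with",
  "class", "const", "enum", "export", "extends", "import", "super",
  "null", "true", "false",
]);

// Additionally rejected in strict mode: strict-mode future reserved words,
// plus eval and arguments, which are restricted as binding names.
const STRICT_ONLY = new Set([
  "implements", "interface", "let", "package", "private", "protected",
  "public", "static", "yield",
  "eval", "arguments",
]);

const NAME = /^[$_\p{ID_Start}][$\u200C\u200D\p{ID_Continue}]*$/u;

function isIdentifierSketch(string, options = {}) {
  if (!NAME.test(string)) return false;
  if (RESERVED.has(string)) return false;
  if (options.strict && STRICT_ONLY.has(string)) return false;
  return true;
}
```

Note that NaN, Infinity, and undefined pass this check, since they are syntactically valid identifiers; whether a robust API should flag them is the open question raised above.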