Q: Lonely surrogates and unicode regexps

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Q: Lonely surrogates and unicode regexps

Marja Hölttä
Hello es-discuss,

TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:


E.g.,
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.

The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." - it's also unclear what this means if no valid character has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?? // Should we allow splitting the surrogate pair like this?

Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

If we convert the string into UC-32 before matching, then the "lonely surrogate is a character" behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UC-16. Converting strings to UC-32 before matching with /u regexps would require an additional pass over the string which we'd want to avoid, and converting only when strictly needed for the "lonely surrogate is a character" implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and also most input strings, even for /u regexps, probably won't contain surrogates (to find that out we'd also need to scan the whole string, or some logic to fall back to UC-32 matching when we see a surrogate).

BR,
Marja


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Mathias Bynens-2

> On 28 Jan 2015, at 11:36, Marja Hölttä <[hidden email]> wrote:
>
> TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?
>
> The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:
>
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation
>
> E.g.,
> “Let A be the set of all *characters* except LineTerminator.”
> “Let ch be the *character* Input[e].”
>
> But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?
>
> Q1: Are lonely surrogates allowed in /u regexps?
>
> E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?
>
> Suggestion: we shouldn't allow lonely surrogates in /u regexps.
>
> If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.

You’re proposing to define “characters” in terms of Unicode scalar values in the case `/u` is used. I could get behind that — it reinforces the idea that `/u` is like a strict mode for regular expressions.

Playing devil’s advocate, the problem is that regular expressions and strings go hand in hand, and there is no guarantee that JavaScript strings only consist of valid code points. Making `.` not match lone surrogates breaks the developer expectation that `(.)` matches every “part” of the string. Having to avoid `/u` to prevent this seems like a potentially bad thing.

> The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." - it's also unclear what this means if no valid character has this code.
>
> Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)
>
> Test cases:
> /foo.bar/u.test("foo\uD83Dbar") == ?
> /foo.bar/u.test("foo\uDC00bar") == ?
> /foo[^a]bar/u.test("foo\uD83Dbar") == ?
> /foo[^a]bar/u.test("foo\uDC00bar") == ?
> /foo/u.test("bar\uD83Dbarfoo") == ?
> /foo/u.test("bar\uDC00barfoo") == ?
> /foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
> /^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?? // Should we allow splitting the surrogate pair like this?
>
> Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.
>
> If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.
>
> For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

This is indeed an incomplete solution. The lack of lookbehind support in ES makes this hard to transpile correctly. Ideas welcome!

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Mark Davis ☕
In reply to this post by Marja Hölttä

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <[hidden email]> wrote:
The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

​Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous term "code point". For example, U+0378​ is not (currently) an encoded character according to Unicode, but it would certainly be a terrible idea to disregard it, or not match it. It is a reserved code point that may be assigned as an encoded character in the future. So both U+D83D and U+0378 are not characters.

If a ES spec uses the term "character" instead of "code point", then at some point in the text it needs to disambiguate what is meant.

As to how this should be handled in regex expressions, I'd suggest looking at Java's approach.


— Il meglio è l’inimico del bene —

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Erik Corry-2
In reply to this post by Mathias Bynens-2


On Wed, Jan 28, 2015 at 11:45 AM, Mathias Bynens <[hidden email]> wrote:

> On 28 Jan 2015, at 11:36, Marja Hölttä <[hidden email]> wrote:
>
> For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

This is indeed an incomplete solution. The lack of lookbehind support in ES makes this hard to transpile correctly. Ideas welcome!

I don't think your transpiler can work without lookbehind.  If you could guarantee that none of your transpiled regexp matches a substring that ends in the middle of a pair, then I think you could get it right without lookbehind, but consider:

"TxL-TxLT".test(/(...)-\1./);

Where L stands for a lead surrogate, and T stands for a trailing surrogate.  There's no way to stop the backreference from swallowing the last L, and without lookbehind there is no way to stop the . from matching the final T.  A second issue is having a match that starts in the middle of a pair. You could test for this after the matching if JS gave you the index of the match in the string, but I don't think it does.

Ignoring the start-of-match-in-the-middle-of-a-pair issue, and the backreferences case, I think you can do without the backreference.  Assuming the lonely-surrogates-are-a-character scenario, the period (.) transpiles to (ignore spaces added for readability):

(?:  \L(?!\T)  | \L\T  |  \T  |  [^\L\T\N])

where \L means leading surrogates, \T means trailing surrogates, \N means all newlines.  Whatever comes before the . is not allowed to match a half

As an optimization, .x can transpile to (?: \L\T | . )x where the x stands in for any literal characters.

For a JS engine implementor, like Marja, it is of course possible to add 1-character negative lookbehind (\b already has elements of this).  Then your in-engine transpiler turns . into

(?:  \L(?!\T)  | \L\T  |  (?<!\L)\T  |  [^\L\T\N])

Which is going to be truly horrible in terms of code size and performance.  It's not like the period operator is a rare thing in a regexp, and other common things like [^a-z] and [^\d] will expand into similar horrors.

On the other hand, in the lonely-surrogates-match-nothing scenario, the . transpiles to

(?: \l\t  |  [^\l\t\n] )

which is quite a lot nicer and faster.  In this scenario, .x expands to (?: \L\T | [^\T\L\N )  which still has no lookaheads and lookbehinds.
 


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

André Bargull-2
In reply to this post by Mark Davis ☕
On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <marja at chromium.org> wrote:

> The ES6 unicode regexp spec is not very clear regarding what should happen
> if the regexp or the matched string contains lonely surrogates (a lead
> surrogate without a trail, or a trail without a lead). For example, for the
> . operator, the relevant parts of the spec speak about characters:
>

​Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous
term "code point". For example, U+0378​ is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So both U+D83D and U+0378
are not characters.

If a ES spec uses the term "character" instead of "code point", then at
some point in the text it needs to disambiguate what is meant.

"character" is defined in 21.2.2 Pattern Semantics [1]:

In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point.


[1] https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Wes Garland
In reply to this post by Marja Hölttä
Some interesting questions here.

1 - What is a character? Is it a Unicode Code Point?
2 - Should we be able to match all possible JS Strings?
3 - Should we be able to match all possible Unicode Strings?
4 - What do we do if there is a character in a String we cannot match?
5 - Do unmatchable characters match . ?
6 - Are subsections of unmatchable strings matchable if they contain only matchable characters?

It is important to remember in these discussions that the Unicode specification allows strings which contain unmatched surrogates. Do we want regular expressions that can't match some Unicode strings? Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?  How about reserved code points?  What happens when they become assigned?


On 28 January 2015 at 05:36, Marja Hölttä <[hidden email]> wrote:
Hello es-discuss,

TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:


E.g.,
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.

The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." - it's also unclear what this means if no valid character has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?? // Should we allow splitting the surrogate pair like this?

Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

If we convert the string into UC-32 before matching, then the "lonely surrogate is a character" behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UC-16. Converting strings to UC-32 before matching with /u regexps would require an additional pass over the string which we'd want to avoid, and converting only when strictly needed for the "lonely surrogate is a character" implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and also most input strings, even for /u regexps, probably won't contain surrogates (to find that out we'd also need to scan the whole string, or some logic to fall back to UC-32 matching when we see a surrogate).

BR,
Marja


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss




--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Mark Davis ☕
In reply to this post by André Bargull-2
Good, that sounds right.



— Il meglio è l’inimico del bene —

On Wed, Jan 28, 2015 at 12:57 PM, André Bargull <[hidden email]> wrote:
On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <marja at chromium.org> wrote:

> The ES6 unicode regexp spec is not very clear regarding what should happen
> if the regexp or the matched string contains lonely surrogates (a lead
> surrogate without a trail, or a trail without a lead). For example, for the
> . operator, the relevant parts of the spec speak about characters:
>

​Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous
term "code point". For example, U+0378​ is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So both U+D83D and U+0378
are not characters.

If a ES spec uses the term "character" instead of "code point", then at
some point in the text it needs to disambiguate what is meant.

"character" is defined in 21.2.2 Pattern Semantics [1]:

In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point.


[1] https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Marja Hölttä
In reply to this post by Mark Davis ☕
For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense.


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Mark Davis ☕
I think the cleanest mental model is where UTF-16 or UTF-8 strings are interpreted as if they were transformed into UTF-32.

While that is generally feasible, it often represents a cost in performance which is not acceptable in practice. So you see various approaches that involve some deviation from that mental model.



— Il meglio è l’inimico del bene —

On Wed, Jan 28, 2015 at 2:15 PM, Marja Hölttä <[hidden email]> wrote:
For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense.



_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

André Bargull-2
In reply to this post by Marja Hölttä
> For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
> 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:
>
> foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
> generally, lonely surrogates match /./.
>
> Backreferences are allowed to consume the leading surrogate of a valid
> surrogate pair:
>
> Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
>
> But surprisingly:
>
> Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$
>
> ... So Ex2 works as if the input string was converted to UTF-32 before
> matching, but Ex1 works as if it was def not. Idk what's the correct mental
> model where both Ex1 and Ex2 would make sense.

java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as
a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.

[1]
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
[2]
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

André Bargull-2
On 1/28/2015 2:51 PM, André Bargull wrote:

>> For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
>> 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:
>>
>> foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
>> generally, lonely surrogates match /./.
>>
>> Backreferences are allowed to consume the leading surrogate of a valid
>> surrogate pair:
>>
>> Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
>>
>> But surprisingly:
>>
>> Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$
>>
>> ... So Ex2 works as if the input string was converted to UTF-32 before
>> matching, but Ex1 works as if it was def not. Idk what's the correct mental
>> model where both Ex1 and Ex2 would make sense.
>
> java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns
> as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.
>
> [1]
> http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
> [2]
> http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671

Err, the part about how patterns are read is not important here. What I should have written is that
the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00` is read
as a single code point (and not split into \uD834 and \uDC00 during backtracking).

[3]
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Marja Hölttä
Based on Ex1, looks like the input string is not read as a sequence of code points when we try to find a match for \1. So it's mostly read as a sequence of code points except when it's not. :/

On Wed, Jan 28, 2015 at 3:11 PM, André Bargull <[hidden email]> wrote:
On 1/28/2015 2:51 PM, André Bargull wrote:
For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid
surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before
matching, but Ex1 works as if it was def not. Idk what's the correct mental
model where both Ex1 and Ex2 would make sense.

java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.

[1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
[2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671

Err, the part about how patterns are read is not important here. What I should have written is that the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point (and not split into \uD834 and \uDC00 during backtracking).

[3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

André Bargull-2
On 1/28/2015 3:36 PM, Marja Hölttä wrote:
> Based on Ex1, looks like the input string is not read as a sequence of code points when we try to
> find a match for \1. So it's mostly read as a sequence of code points except when it's not. :/

Yep, back references are matched as a sequence of code units. The first link I've posted points to
the relevant method in java.util.regex.Pattern. I've got no idea why it's implemented that way, for
example when you enable case-insensitive matching, back references are no longer matched as a
sequence of code units:

---
int[] flags = { 0, Pattern.CASE_INSENSITIVE, Pattern.UNICODE_CASE,
         Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE };

// Prints true, false, true, false
Arrays.stream(flags).mapToObj(f -> Pattern.compile("foo(.+)bar\\1", f))
         .map(p -> p.matcher("foo\uD834bar\uD834\uDC00").find())
         .forEach(System.out::println);
---



>
> On Wed, Jan 28, 2015 at 3:11 PM, André Bargull <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 1/28/2015 2:51 PM, André Bargull wrote:
>
>             For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
>             1.7.0_65) Pattern.UNICODE_CHARACTER___CLASS works:
>
>             foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
>             generally, lonely surrogates match /./.
>
>             Backreferences are allowed to consume the leading surrogate of a valid
>             surrogate pair:
>
>             Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
>
>             But surprisingly:
>
>             Ex2: \uDC00foobar\uD834\__uDC00foobar\uD834 doesn't match ^(.+)\1$
>
>             ... So Ex2 works as if the input string was converted to UTF-32 before
>             matching, but Ex1 works as if it was def not. Idk what's the correct mental
>             model where both Ex1 and Ex2 would make sense.
>
>
>         java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads
>         patterns as a sequence of code points [2]. That should help to explain the differences
>         between ex1 and ex2.
>
>         [1]
>         http://hg.openjdk.java.net/__jdk8u/jdk8u/jdk/file/__c46daef6edb5/src/share/__classes/java/util/regex/__Pattern.java#l4890
>         <http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890>
>         [2]
>         http://hg.openjdk.java.net/__jdk8u/jdk8u/jdk/file/__c46daef6edb5/src/share/__classes/java/util/regex/__Pattern.java#l1671
>         <http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671>
>
>
>     Err, the part about how patterns are read is not important here. What I should have written is
>     that the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00`
>     is read as a single code point (and not split into \uD834 and \uDC00 during backtracking).
>
>     [3]
>     http://hg.openjdk.java.net/__jdk8u/jdk8u/jdk/file/__c46daef6edb5/src/share/__classes/java/util/regex/__Pattern.java#l3773
>     <http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773>
>
>
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Allen Wirfs-Brock
In reply to this post by Marja Hölttä

On Jan 28, 2015, at 2:36 AM, Marja Hölttä <[hidden email]> wrote:

Hello es-discuss,

TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

TL;DR: in a unicode regexp lonely surrogates are considered to be a single “character”. 

As André has already covered “character” has a very specific meaning within the context of the ES6 RegExp specification in the second paragraph of  http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . The specification uses the same set of algorithms to describe both BCP (i.e., 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching semantics.  “Character” is used in those algorithm to refer to a single element of the mode that is currently operating within.

I think the ambiguity you find is in step 2.1 of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :

2.  Return an internal closure that takes two arguments, a String str and an integer index, and performs the following:    
1. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.         

Apparently I don’t have an adequate definition of “interpreted as a UTF-16 encoded Unicode string”. If you submit a bug to bugs.emncascript.org) I will provided one in the next spec. revisions.  The intended semantics is that:
   In ascending string index order:
Each valid UTF-16 surrogate pair is interpreted as a signal code point that is the UTF-16 encoded value
        Each “lonely” surrogate is interpreted as  single code point that is the surrogate value
        Every other 16-bit code unit is interpreted as a single code point.

Allen








E.g.,
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.

The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." - it's also unclear what this means if no valid character has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?? // Should we allow splitting the surrogate pair like this?

Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

If we convert the string into UC-32 before matching, then the "lonely surrogate is a character" behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UC-16. Converting strings to UC-32 before matching with /u regexps would require an additional pass over the string which we'd want to avoid, and converting only when strictly needed for the "lonely surrogate is a character" implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and also most input strings, even for /u regexps, probably won't contain surrogates (to find that out we'd also need to scan the whole string, or some logic to fall back to UC-32 matching when we see a surrogate).

BR,
Marja

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Allen Wirfs-Brock
In reply to this post by Mark Davis ☕

On Jan 28, 2015, at 5:26 AM, Mark Davis ☕️ <[hidden email]> wrote:

I think the cleanest mental model is where UTF-16 or UTF-8 strings are interpreted as if they were transformed into UTF-32.

This is exactly the approach used  in the ES6 spec (except that it doesn’t deal with UTF-8)


While that is generally feasible, it often represents a cost in performance which is not acceptable in practice. So you see various approaches that involve some deviation from that mental model.

While ES6 uses this approach in its specification, implementations are free to use any implementation technique that produces the same result.

Allen



_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Marja Hölttä
In reply to this post by Allen Wirfs-Brock
Cool, thanks for clarifications!

To make sure, as per the "intended semantics", we never allow splitting a valid surrogate pair (= matching only one of the surrogates but not the other), and thus we'll differ from the Java implementation here:

/foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00"); we say "false", Java says "true".

(In addition, /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") == false.)

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Allen Wirfs-Brock
In reply to this post by Wes Garland

On Jan 28, 2015, at 4:54 AM, Wes Garland <[hidden email]> wrote:

Some interesting questions here.

These aren't discussion points.  These are all things that must have answers that are directly derivable from the ES6 spec.  If, after developing an adequate understand of that part of the specification, you can’t find the answer to these questions then there is probably something that needs to be clarified in the spec.

1 - What is a character? Is it a Unicode Code Point?
defined in: paragraph 2 http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics 
2 - Should we be able to match all possible JS Strings?
yes, there is nothing in the algorithms that restrict JS String values
3 - Should we be able to match all possible Unicode Strings?
yes, subject to what you mean but “Unicode Strings” as within JS Strings supplemental code points must be UTF-16 encoded.
4 - What do we do if there is a character in a String we cannot match?
RegExp.exec returns null if a string cannot be matched by a pattern
5 - Do unmatchable characters match . ?
there is no concept in the specification of an “unmatchable” character
6 - Are subsections of unmatchable strings matchable if they contain only matchable characters?
there is no concept in the specification of an “unmatchable” character

It is important to remember in these discussions that the Unicode specification allows strings which contain unmatched surrogates.
and ES6 //u patterns can match them
Do we want regular expressions that can't match some Unicode strings?
No, the ES6 specificaiton can match all possible strings
Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?
we already have it: \u{D83D}
  How about reserved code points?  What happens when they become assigned?
Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec. applies no semantics to any code points in the string that is being matched.

Allen


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

André Bargull-2
In reply to this post by Marja Hölttä
> Cool, thanks for clarifications!
>
> To make sure, as per the "intended semantics", we never allow splitting a
> valid surrogate pair (= matching only one of the surrogates but not the
> other), and thus we'll differ from the Java implementation here:
>
> /foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00"); we say "false", Java says
> "true".

Correct, the captures List entry is [\uD834], so when performing 21.2.2.9 AtomEscape, \uD834 is
matched against \uD834\uDC00 in step 8 which results in a failure state.


>
> (In addition, /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") ==
> false.)

Yes, this expression also returns false.
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Norbert Lindenberg
In reply to this post by Allen Wirfs-Brock

> On Jan 28, 2015, at 8:11 , Allen Wirfs-Brock <[hidden email]> wrote:
>
>
> On Jan 28, 2015, at 2:36 AM, Marja Hölttä <[hidden email]> wrote:
>
>> Hello es-discuss,
>>
>> TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?
>>
>> The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:
>
> TL;DR: in a unicode regexp lonely surrogates are considered to be a single “character”.
>
> As André has already covered “character” has a very specific meaning within the context of the ES6 RegExp specification in the second paragraph of  http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . The specification uses the same set of algorithms to describe both BCP (i.e., 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching semantics.  “Character” is used in those algorithm to refer to a single element of the mode that is currently operating within.
>
> I think the ambiguity you find is in step 2.1 of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :
>
> 2.  Return an internal closure that takes two arguments, a String str and an integer index, and performs the following:    
> 1. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.        
>
> Apparently I don’t have an adequate definition of “interpreted as a UTF-16 encoded Unicode string”. If you submit a bug to bugs.emncascript.org) I will provided one in the next spec. revisions.  The intended semantics is that:
>    In ascending string index order:
> Each valid UTF-16 surrogate pair is interpreted as a signal code point that is the UTF-16 encoded value
>         Each “lonely” surrogate is interpreted as  single code point that is the surrogate value
>         Every other 16-bit code unit is interpreted as a single code point.

That definition is in section 6.1.4:
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type

A cross-reference would be useful.

Norbert
_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
Reply | Threaded
Open this post in threaded view
|

Re: Q: Lonely surrogates and unicode regexps

Norbert Lindenberg
In reply to this post by Allen Wirfs-Brock

> On Jan 28, 2015, at 8:30 , Allen Wirfs-Brock <[hidden email]> wrote:
>
>
> On Jan 28, 2015, at 4:54 AM, Wes Garland <[hidden email]> wrote:
>

>> Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?
> we already have it: \u{D83D}

Or to match any unpaired surrogate: /[\u{D800}-\u{DFFF}]/u

>>   How about reserved code points?  What happens when they become assigned?
> Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec. applies no semantics to any code points in the string that is being matched.

There are a few places where RegExp applies Unicode semantics:

– //ui uses Unicode case folding to compare case-insensitively. If the comparison involves code points that are unassigned in the Unicode version assumed by an ECMAScript implementation and in a later version get assigned to characters that are case-variants of each other, then the RegExp behavior can change. See section 21.2.2.8.2.

– RegExp knows a few character classes: \d, \D, \s, \S, \w, \W. \d, \D, \w, \W are defined by character lists that cannot change, but \s and therefore \S could change if Unicode assigns new characters with the category “Separator, space”. See section 21.2.2.12.

But in general //u is defined based on code points and doesn’t care whether code points are assigned or reserved.

Norbert

_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss
12