BOM in script sources

Hallvord Reiar Michaelsen Steen-3
Hi,
I've come across an incompatibility between Opera and some other browsers:
if a script source contains a Unicode Zero Width No-Break Space character
(U+FEFF), the script will not compile in Opera. This character is usually
known as the Unicode Byte Order Mark (BOM). If it is at the start of a
script file sent as UTF-8, it is removed before compilation, but if it
appears inside the script and not within a string, it breaks the script.
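
To make the difference between a leading and an interior U+FEFF concrete, here is a minimal Python sketch. It assumes the script arrives as UTF-8 bytes and uses Python's `utf-8-sig` codec as a stand-in for whatever BOM handling a browser performs before compilation; the sample script text and the helper name `feff_positions` are made up for illustration.

```python
# Sketch: a leading BOM is dropped during decoding, an interior U+FEFF is not.
# The byte string below is a made-up example, not taken from a real page.
BOM_UTF8 = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF

raw_script = BOM_UTF8 + b"var x" + BOM_UTF8 + b" = 1;"

# "utf-8-sig" removes a single BOM at the very start of the input only.
decoded = raw_script.decode("utf-8-sig")


def feff_positions(text):
    """Return the indexes at which U+FEFF occurs in text."""
    return [i for i, ch in enumerate(text) if ch == "\ufeff"]


print(repr(decoded))
print(feff_positions(decoded))
```

Whatever `feff_positions` still reports after decoding is the interior character that the tokenizer has to deal with.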

According to ECMA-262 "Any other Unicode space separator <USP>" should be  
treated as whitespace. But apparently that only covers the Zs class in  
Unicode, which currently consists of the following code points:

   0020;SPACE;Zs;0;WS;;;;;N;;;;;
   00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
   1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
   180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
   2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
   2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
   2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
   2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
   205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
   3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

FEFF has the class "Cf" which means "Other, format".
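
The classification can be checked against a Unicode database. The following is a small Python sketch using the standard `unicodedata` module; the helper names are hypothetical, and only a sample of the code points listed above is included. The results depend on the Unicode version bundled with the Python build, which is newer than the data quoted above, so individual entries may differ (U+180E, for instance, was later reclassified as Cf).

```python
import unicodedata

# A sample of the code points listed above as class Zs, plus U+FEFF.
CODE_POINTS = [0x0020, 0x00A0, 0x1680, 0x2000, 0x200A,
               0x202F, 0x205F, 0x3000, 0xFEFF]


def general_category(code_point):
    """Return the two-letter Unicode general category, e.g. 'Zs' or 'Cf'."""
    return unicodedata.category(chr(code_point))


def is_space_separator(code_point):
    """True if the code point is in class Zs (the <USP> set of ECMA-262)."""
    return general_category(code_point) == "Zs"


for cp in CODE_POINTS:
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {general_category(cp)} {name}")
```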

Hence, Opera is compliant with the ECMA-262 spec in not considering the
U+FEFF character a "white space" character in script source. Is this
something Firefox would consider a bug and fix, or would it be better to
spec ES4 to allow the U+FEFF character inside script source?

--
Hallvord R. M. Steen
Core QA JavaScript tester, Opera Software
http://www.opera.com/
Opera - simply the best Internet experience

Re: BOM in script sources

Lars T Hansen-2
Section 7.1 of E262-3 requires all format control (class Cf)  
characters to be stripped from the source before the program is  
compiled.  Opera has never done this, and is actually at fault here.  
Mea culpa.
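
As a rough illustration of that requirement, here is a Python sketch of such a pre-pass. It assumes "stripped" means deleting every class Cf character wherever it occurs in the source text; the function name `strip_format_controls` is hypothetical and this is not how any particular engine implements it.

```python
import unicodedata


def strip_format_controls(source):
    """Remove every character whose Unicode general category is Cf."""
    return "".join(ch for ch in source if unicodedata.category(ch) != "Cf")


# Made-up script text with U+FEFF between an identifier and the operator.
sample = "var x\ufeff = 1;"
print(repr(strip_format_controls(sample)))
```

With a pass like this in front of the tokenizer, the interior U+FEFF never reaches the compiler, which matches the behaviour of the browsers that accept such scripts.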

The ECMAScript 4 committee has since concluded that the requirement
to strip class Cf characters is a bug in the spec (people want to
have regexes and strings containing those characters literally) and
ECMAScript 4 will not contain that requirement.  See
http://developer.mozilla.org/es4/proposals/update_unicode.html.
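
The problem with blanket stripping can be seen in a short Python sketch. It assumes the same naive "delete every Cf character" pass as a model of the ES3 rule; the helper name and the sample script, which holds a literal U+200D (ZERO WIDTH JOINER, class Cf) inside a string, are made up for illustration.

```python
import unicodedata


def strip_format_controls(source):
    """Naive model of the ES3 rule: delete every class Cf character."""
    return "".join(ch for ch in source if unicodedata.category(ch) != "Cf")


# Hypothetical script text: the string literal contains a literal U+200D.
script = 'var s = "a\u200db";'
stripped = strip_format_controls(script)

# The string literal has silently lost a character the author put there.
print(len(script), len(stripped))
print("\u200d" in script, "\u200d" in stripped)
```

Under this model the only way to keep such a character in a string or regex is to write it as an escape sequence, which is the inconvenience the committee's conclusion refers to.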

--lars


On Jan 9, 2007, at 3:08 PM, Hallvord R. M. Steen wrote:

> [...]
> _______________________________________________
> Es4-discuss mailing list
> https://mail.mozilla.org/listinfo/es4-discuss