Re: BOMs

Re: BOMs

Bjoern Hoehrmann
* Martin J. Dürst wrote:
>As for what to say about whether to accept BOMs or not, I'd really want
>to know what the various existing parsers do. If they accept BOMs, then
>we can say they should accept BOMs. If they don't accept BOMs, then we
>should say that they don't.

Unicode signatures are not useful for application/json resources and are
likely to break existing and future code; it is not at all uncommon to
construct JSON text by concatenating, say, string literals with some web
service response without passing the data through a JSON parser. And as
RFC 4627 makes no mention of them, there is little reason to think that
implementations tolerate them.
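
For instance, a minimal sketch (the upstream service and its response
are made up) of how a signature that looks harmless at the start of a
resource becomes a hard error once that resource is spliced into a
larger JSON text:

  // Hypothetical: an upstream service returns JSON with a signature prepended.
  var response = '\uFEFF{"id": 1}';
  // Composing a larger document by plain string concatenation, without
  // running the response through a JSON parser first:
  var doc = '{"item": ' + response + '}';
  JSON.parse(doc); // SyntaxError: U+FEFF is not allowed at that position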

Perl's JSON module gives me

  malformed JSON string, neither array, object, number, string
  or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}[]")

Python's json module gives me

  ValueError: No JSON object could be decoded

Go's "encoding/json" package gives me

  invalid character 'ï' looking for beginning of value

http://shadowregistry.org/js/misc/#t2ea25a961255bb1202da9497a1942e09 is
another example of what kinds of bugs await us if we were to specify the
use of Unicode signatures for JSON, essentially

  new DOMParser().parseFromString("\uBBEF\u3CBF\u7979\u3E2F","text/xml")

Now U+BBEF U+3CBF U+7979 U+3E2F is not an XML document but Firefox and
Internet Explorer treat it as if it were equivalent to "<yy/>".
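
For reference, those code points are simply the UTF-8 bytes of a
signature followed by "<yy/>" read as UTF-16LE code units; a small
sketch of the reinterpretation:

  // Bytes of a UTF-8 signature (EF BB BF) followed by "<yy/>" ...
  var bytes = [0xEF, 0xBB, 0xBF, 0x3C, 0x79, 0x79, 0x2F, 0x3E];
  // ... paired up little-endian yield the string from the test above.
  var units = [];
  for (var i = 0; i < bytes.length; i += 2)
    units.push(String.fromCharCode(bytes[i] | (bytes[i + 1] << 8)));
  units.join(""); // "\uBBEF\u3CBF\u7979\u3E2F"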
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Re: BOMs

Bjoern Hoehrmann
* Henry S. Thompson wrote:
>I'm curious to know what level you're invoking the parser at.  As
>implied by my previous post about the Python 'requests' package, it
>handles application/json resources by stripping any initial BOM it
>finds -- you can try this with
>
>>>> import requests
>>>> r=requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
>>>> r.json()

The Perl code was

  perl -MJSON -MEncode -e
    "my $s = encode_utf8(chr 0xFEFF) . '[]'; JSON->new->decode($s)"

The Python code was

  import json
  json.loads(u"\uFEFF[]".encode('utf-8'))

The Go code was

  package main
 
  import "encoding/json"
  import "fmt"
 
  func main() {
    r := "\uFEFF[]"
 
    var f interface{}
    err := json.Unmarshal([]byte(r), &f)
   
    fmt.Println(err)
  }

In other words, in each case a UTF-8 encoded byte string is passed to
the byte-string-parsing part of the JSON implementation. RFC 4627 is the
only specification for the application/json on-the-wire format and it
does not mention anything about Unicode signatures. Looking for certain
byte sequences at the beginning and treating them as a Unicode signature
is the same as looking for `/* ... */` and treating it as a comment.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Re: [Json] BOMs

Bjoern Hoehrmann
In reply to this post by Bjoern Hoehrmann
* Tatu Saloranta wrote:
>Dominant Java implementations support UTF-16 with BOM; either directly or
>through Java's Reader implementations that handle BOMs.
>String concatenation case seems irrelevant, since BOMs are not included in
>in-memory representation anyway, as opposed to byte stream serialization.

HTTP implementations cannot correctly determine whether an entity body
is text in a single character encoding and, if so, what that encoding
is; accordingly, the dominant API deals in byte[] arrays, not text
Strings. Furthermore, many programming languages default to byte[]
arrays for string literals. That often combines into forms of

  byte[] json = sprintf('{"x": %s, "y": %s}', GET(...), GET(...));

which works fine if all three byte[] arrays are UTF-8 encoded and use
no Unicode signature, which is the case 99% of the time.
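
A byte-level sketch of that pattern (illustrative only; TextEncoder and
TextDecoder stand in for whatever byte-string facilities the environment
offers, and the data is made up):

  var enc = new TextEncoder(), dec = new TextDecoder();
  var x = enc.encode('1');                     // response without a signature
  var y = enc.encode('\uFEFF2');               // response with a UTF-8 signature
  var parts = [enc.encode('{"x": '), x, enc.encode(', "y": '), y, enc.encode('}')];
  var len = parts.reduce(function (n, p) { return n + p.length; }, 0);
  var json = new Uint8Array(len), off = 0;
  parts.forEach(function (p) { json.set(p, off); off += p.length; });
  // The bytes EF BB BF now sit in the middle of the document, where no
  // parser is entitled to strip them.
  JSON.parse(dec.decode(json));                // throws SyntaxError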
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Re: [Json] BOMs

Allen Wirfs-Brock
In reply to this post by Bjoern Hoehrmann

On Nov 19, 2013, at 3:09 AM, Martin J. Dürst wrote:
...
As for JSON, it doesn't have the problem of legacy encodings. JSON by definition is encoded in a Unicode encoding form, and it's easy to distinguish these because of the restrictions on character sequences in JSON. And this can be done without a BOM (or with a BOM).

What's most important now is to know what receivers actually accept. We are not in a design phase, we are just updating the definition of JSON and making sure we fix problems if there are problems, but we have to use the installed base for the main guidance, not other protocols or formats.

There can be no doubt that the most widely deployed JSON parsers are those that are built into the browser JavaScript implementations.  The ECMAScript 5 specification for JSON.parse that they implement says a BOM is an illegal character.  But what do the browsers actually implement?  This:

//Firefox 25 scratchpad execution:
JSON.parse('\ufeff {"abc": 0} ')
/*
Exception: JSON.parse: unexpected character
@Scratchpad/1:1
*/

//Safari 5.1.9 JS console
JSON.parse('\ufeff {"abc": 0} ')
/*
JSON Parse error: Unrecognized token '?'
*/

//Chrome 31 JS console
JSON.parse('\ufeff {"abc": 0} ')
/*
SyntaxError: Unexpected token
*/

Unfortunately, I don't have access to IE right now, but the trend is clear.

Allen




Re: [Json] BOMs

Allen Wirfs-Brock

On Nov 19, 2013, at 10:18 PM, Martin J. Dürst wrote:

Hello Henry, others,

On 2013/11/20 3:55, Henry S. Thompson wrote:
Allen Wirfs-Brock writes:

There can be no doubt that the most widely deployed JSON parsers are
those that are built into the browser JavaScript implementations.
The ECMAScript 5 specification for JSON.parse that they implement
says a BOM is an illegal character.  But what do the browsers actually
implement?  This:

No, try e.g. jsonviewer.stack.hu [1] (works in Chrome, Safari, Opera,
not in IE or Firefox)

In Firefox, I got some garbled characters, in particular some question marks for each of the two bytes of the BOM and one question mark for the e-acute. Because of the type of the errors, I strongly suspect it is related to what we are trying to investigate, and so I don't think this can be taken as evidence one way or another.

or feed [2] to www.jsoneditoronline.org (Use
Open/Url) (works in Chrome, IE, Firefox, ran out of time to test more).

The fact that some libraries or Web sites accept a BOM for JSON isn't a proof that all (well, let's say the majority) accept a BOM.

Just to be clear about this.  My tests directly tested the JavaScript built-in JSON parsers' BOM support in three major browsers.  The tests directly invoked the built-in JSON.parse functions and directly passed to them source strings that were explicitly constructed to contain a BOM code point.  This was done to ensure that all transport layers (and any transcodings they might perform) were bypassed and that we were actually testing the real built-in JSON.parse functions.

Neither of the sites referenced above performs a comparable test.  They take user-input text, which is then passed through who knows what layers of browser and application preprocessing, and then they present something derived from that original user input to a JSON parser.  In both cases the actual parser does not appear to be the built-in JavaScript JSON.parse function that I was testing.

jsonviewer.stack.hu uses Ext.util.JSON.decode, whose documentation describes it as a "Modified version of Douglas Crockford's json.js".  In other words, not the built-in JSON.parse function.

www.jsoneditoronline.org uses a library called JSONLint in preference to the built-in JSON.parse function.  It does not conform to the ECMAScript 5 JSON.parse specification.

So testing via either of these sites says nothing relevant about my observation concerning BOM handling by the most widely deployed JSON parsers (the ones that are built into browser JavaScript implementations).

Allen
 


Re: [Json] BOMs

Allen Wirfs-Brock

On Nov 21, 2013, at 5:28 AM, Henri Sivonen wrote:

> On Thu, Nov 21, 2013 at 7:53 AM, Allen Wirfs-Brock
> <[hidden email]> wrote:
>> Just to be clear about this.  My tests directly tested the JavaScript built-in
>> JSON parsers' BOM support in three major browsers.  The tests directly
>> invoked the built-in JSON.parse functions and directly passed to them
>> source strings that were explicitly constructed to contain a BOM code point.
>> This was done to ensure that all transport layers (and any transcodings
>> they might perform) were bypassed and that we were actually testing the real
>> built-in JSON.parse functions.
>
> It would be surprising if JSON.parse() accepted a BOM, since it
> doesn't take bytes as input.

ECMAScript's JSON.parse accepts an ECMAScript string value as its input.  ECMAScript strings are sequences of 16-bit values.  JSON.parse (and most other ECMAScript functions) interpret those values as Unicode code units.  The value U+FEFF can appear at any position within a string. When defining a string as an ECMAScript literal, a sequence like \ufeff is an escape sequence that means place the code unit value 0xFEFF into the string at this position in the sequence. Also note that the actual strings passed below to JSON.parse contain the actual code point value U+FEFF, not the escape sequence that was used to express it.  To include the actual escape sequence characters in the string it would have to be expressed as '\\ufeff'.

JSON.parse('\ufeff ["XYZ"]');  //note outer quotes delimit an ECMAScript string, the inner quotes are a JSON string.  

throws a runtime SyntaxError exception because the JSON grammar does not allow U+FEFF to appear at that position.

JSON.parse('["\ufeffXYZ"]');

operates without error and returns an Array containing a four-element ECMAScript string (U+FEFF, X, Y, Z).  This works because the JSON grammar allows any code unit except for " and \ and the ASCII control characters to appear literally in a JSON string.
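
Both cases can be reproduced in a single console session:

try { JSON.parse('\ufeff ["XYZ"]'); } catch (e) { console.log(e.name); } // "SyntaxError"
console.log(JSON.parse('["\ufeffXYZ"]')[0].length);                      // 4 (U+FEFF, X, Y, Z)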


>
> However, XHR's responseType = "json" exercises browsers in a way where
> the input is bytes from the network. From the perspective of JSON
> support in XHR,
> http://lists.w3.org/Archives/Public/www-tag/2013Nov/0149.html (which
> didn't reach the es-discuss part of this thread previously) applies.

Right, JSON use via XHR is a different usage scenario, and that probably involves encoding and decoding steps. It has very little to do with the JSON syntax as defined in ECMA-404. It's all about how the bits that represent a string are interchanged, not the eventual semantic processing of the string (i.e., processing by JSON.parse or some other JSON parser).
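
For concreteness, a sketch of that byte-level path (the URL is hypothetical):

var xhr = new XMLHttpRequest();
xhr.open("GET", "/data.json");
xhr.responseType = "json";   // the browser decodes the response bytes itself
xhr.onload = function () {
  // Whether a leading BOM in the response bytes is tolerated here is a
  // property of this decoding step, not of JSON.parse or of ECMA-404.
  console.log(xhr.response); // the parsed value, or null on parse failure
};
xhr.send();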

Allen


Re: [Json] BOMs

Bjoern Hoehrmann
* Allen Wirfs-Brock wrote:
>On Nov 21, 2013, at 5:28 AM, Henri Sivonen wrote:
>> On Thu, Nov 21, 2013 at 7:53 AM, Allen Wirfs-Brock
>> <[hidden email]> wrote:
>>> Just to be clear about this.  My tests directly tested the JavaScript built-in
>>> JSON parsers' BOM support in three major browsers.  The tests directly
>>> invoked the built-in JSON.parse functions and directly passed to them
>>> source strings that were explicitly constructed to contain a BOM code point.

>> It would be surprising if JSON.parse() accepted a BOM, since it
>> doesn't take bytes as input.
>
>ECMAScript's JSON.parse accepts an ECMAScript string value as its input.
>ECMAScript strings are sequences of 16-bit values.  JSON.parse (and most
>other ECMAScript functions) interpret those values  as Unicode code
>units.  The value U+FEFF can appear at any position within a string.
>When defining a string as an ECMAScript literal, a sequence like \ufeff
>is an escape sequence that means place the code unit value 0xFEFF into
>the string at this position in the sequence. Also note that the actual
>strings passed below to JSON.parse contain the actual code point value
>U+FEFF not the escape sequence that was used to express it.  To include
>the actual escape sequence characters in the string it would have to be
>expressed as '\\ufeff'.

A byte order mark indicates the order of bytes in a sequence of bytes.
An ECMAScript string is not a sequence of bytes and therefore it cannot
have a byte order mark inside it. Your test is not for BOM support but
for an egregious semantic error in the implementation of JSON.parse.

  http://shadowregistry.org/js/misc/#t2ea25a961255bb1202da9497a1942e09

That is a similar test. It makes Firefox see UTF-8 BOMs in ECMAScript
strings. Firefox is not supposed to look for UTF-8 BOMs in ECMAScript
strings because ECMAScript strings are not sequences of bytes at that
level of reasoning.

Is there any chance, by the way, to change `JSON.stringify` so it does
not output strings that cannot be encoded using UTF-8? Specifically,

  JSON.stringify(JSON.parse("\"\uD800\""))

would need to escape the surrogate instead of emitting it literally.
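
A minimal sketch of the kind of post-processing I have in mind (the
function name is invented; this is illustration, not proposed spec
text):

  // Escape any unpaired surrogate in JSON.stringify output so the
  // result can always be encoded as UTF-8.
  function stringifyUtf8Safe(value) {
    return JSON.stringify(value).replace(/[\uD800-\uDFFF]/g,
      function (c, i, s) {
        var code = c.charCodeAt(0);
        var paired =
          (code <= 0xDBFF && /[\uDC00-\uDFFF]/.test(s.charAt(i + 1))) ||
          (code >= 0xDC00 && /[\uD800-\uDBFF]/.test(s.charAt(i - 1)));
        return paired ? c : "\\u" + code.toString(16);
      });
  }
  stringifyUtf8Safe(JSON.parse('"\uD800"')); // '"\\ud800"' instead of a lone surrogate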
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Re: [Json] BOMs

Mathias Bynens-2
On 21 Nov 2013, at 09:41, Bjoern Hoehrmann <[hidden email]> wrote:

> Is there any chance, by the way, to change `JSON.stringify` so it does
> not output strings that cannot be encoded using UTF-8? Specifically,
>
>  JSON.stringify(JSON.parse("\"\uD800\""))
>
> would need to escape the surrogate instead of emitting it literally.

Previous discussion: http://esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content-14


Re: [Json] BOMs

Bjoern Hoehrmann
In reply to this post by Bjoern Hoehrmann
* John Cowan wrote:

>Bjoern Hoehrmann scripsit:
>
>> Is there any chance, by the way, to change `JSON.stringify` so it does
>> not output strings that cannot be encoded using UTF-8? Specifically,
>>
>>   JSON.stringify(JSON.parse("\"\uD800\""))
>>
>> would need to escape the surrogate instead of emitting it literally.
>
>No, there isn't.  We've been down this road repeatedly.  People can and
>do use JSON strings to encode arbitrary sequences of unsigned 16-bit integers.

The output of JSON.stringify("\uD800") contains no backslash character;
if you call `utf8_encode(JSON.stringify("\uD800"))` you get an exception
because UTF-8 cannot encode the lone surrogate and `utf8_encode` does
not know it could encode it as the escape sequence `\uD800` without loss
of information. If `JSON.stringify` produced an escape sequence instead,
there would be no problem passing the output to `utf8_encode`.
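
In ECMAScript terms a strict encoder would have to do something like the
following (a sketch; `strictUtf8Encode` is invented and stands for
whatever `utf8_encode` is in your environment):

  // Refuse to UTF-8 encode a string that contains an unpaired surrogate.
  function strictUtf8Encode(s) {
    for (var i = 0; i < s.length; i++) {
      var c = s.charCodeAt(i), next = s.charCodeAt(i + 1), prev = s.charCodeAt(i - 1);
      if (c >= 0xD800 && c <= 0xDBFF && !(next >= 0xDC00 && next <= 0xDFFF))
        throw new Error("lone high surrogate at " + i);
      if (c >= 0xDC00 && c <= 0xDFFF && !(prev >= 0xD800 && prev <= 0xDBFF))
        throw new Error("lone low surrogate at " + i);
    }
    return unescape(encodeURIComponent(s)); // UTF-8 as a byte string
  }
  // Throws in engines where stringify emits the surrogate literally:
  strictUtf8Encode(JSON.stringify("\uD800"));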
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Re: [Json] BOMs

Bjoern Hoehrmann
In reply to this post by Bjoern Hoehrmann
* Matt Miller (mamille2) wrote:
>There does not appear to be any consensus on explicitly allowing or
>disallowing of a Byte Order Mark (BOM).  Neither RFC4627 nor the current
>draft mention BOM anywhere, and the modus operandi of the JSON Working
>Group has been to leave text unchanged unless there was wide support.

To be clear, that means an application/json entity that starts with a
byte sequence matching U+FEFF encoded in UTF-8/16/32 is malformed,
because the ABNF does not allow a U+FEFF at that position (and
interpreting such a sequence as anything other than ordinary character
data would require explicit specification). I do think an informational
note saying as much could be useful.
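
Such a note could also spell out the byte sequences involved; a sketch
of the check a strict receiver might perform on the raw entity body:

  // Flag the signature byte sequences at the start of an application/json
  // entity; per the ABNF none of these can begin a valid JSON-text.
  function startsWithSignature(bytes) {
    return (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) || // UTF-8
           (bytes[0] === 0xFE && bytes[1] === 0xFF) ||                      // UTF-16BE
           (bytes[0] === 0xFF && bytes[1] === 0xFE) ||                      // UTF-16LE / UTF-32LE
           (bytes[0] === 0x00 && bytes[1] === 0x00 &&
            bytes[2] === 0xFE && bytes[3] === 0xFF);                        // UTF-32BE
  }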
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 