Re: es-discuss Digest, Vol 81, Issue 82

Mihai Niță
I would add my two cents here.


Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used.


And there is something in RFC 4627 that tells me JSON is not BOM-aware:
==================
   JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8.

   Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8
==================
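The table quoted above can be read as a small detection routine. What follows is only a minimal sketch, assuming Python 3. The name `detect_json_encoding` is made up for illustration, and falling back to UTF-8 for every pattern not listed is my reading of the "xx xx xx xx" row, not something the RFC spells out.

```python
def detect_json_encoding(octets: bytes) -> str:
    """Guess the encoding from the null pattern of the first four octets.

    Hypothetical helper that follows the table in RFC 4627, section 3.
    """
    # Two ASCII characters take at least four octets in UTF-16 and eight
    # in UTF-32, so anything shorter than four octets can only be UTF-8.
    if len(octets) < 4:
        return "UTF-8"
    # True where the octet is 00, False where it is anything else ("xx").
    nulls = tuple(octet == 0 for octet in octets[:4])
    patterns = {
        (True, True, True, False): "UTF-32BE",   # 00 00 00 xx
        (True, False, True, False): "UTF-16BE",  # 00 xx 00 xx
        (False, True, True, True): "UTF-32LE",   # xx 00 00 00
        (False, True, False, True): "UTF-16LE",  # xx 00 xx 00
    }
    # Assumption: every other pattern is treated as UTF-8.
    return patterns.get(nulls, "UTF-8")


if __name__ == "__main__":
    for codec in ("utf-32-be", "utf-16-be", "utf-32-le", "utf-16-le", "utf-8"):
        print(codec, "->", detect_json_encoding('{"a":1}'.encode(codec)))
```

Note that the routine never looks for FE FF or FF FE; it only looks at where the nulls are.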
These patterns are not BOMs; otherwise they would look something like this:
           00 00 FE FF  UTF-32BE
           FE FF xx xx  UTF-16BE
           FF FE 00 00  UTF-32LE
           FF FE xx xx  UTF-16LE
           EF BB BF xx  UTF-8

It is somewhat unfortunate that "the precise type of the data stream" is not determined in advance, and yet a BOM is not accepted.

But the RFC does specify a mechanism for deciding the encoding, and that mechanism does not include a BOM. In fact, it prevents the use of one
(00 00 FE FF does not match the 00 00 00 xx pattern, for instance).


So, "by the RFC", a BOM is neither expected nor understood.
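To make the mismatch concrete, here is a rough sketch (Python 3 assumed) that prefixes the text `[]` with each BOM and checks the result against the RFC pattern for the same encoding. The helper names `matches` and `bom_results` are hypothetical, and reading "xx" as "any non-zero octet" is my interpretation of the table.

```python
import codecs

# (label, RFC 4627 null pattern, BOM octets, Python codec name)
CASES = [
    ("UTF-32BE", "00 00 00 xx", codecs.BOM_UTF32_BE, "utf-32-be"),
    ("UTF-16BE", "00 xx 00 xx", codecs.BOM_UTF16_BE, "utf-16-be"),
    ("UTF-32LE", "xx 00 00 00", codecs.BOM_UTF32_LE, "utf-32-le"),
    ("UTF-16LE", "xx 00 xx 00", codecs.BOM_UTF16_LE, "utf-16-le"),
    ("UTF-8", "xx xx xx xx", codecs.BOM_UTF8, "utf-8"),
]


def matches(pattern: str, octets: bytes) -> bool:
    """True if the leading octets fit a pattern such as "00 xx 00 xx".

    Assumption: "00" means a null octet and "xx" means any non-zero octet.
    """
    tokens = pattern.split()
    if len(octets) < len(tokens):
        return False
    return all((octet == 0) == (token == "00")
               for token, octet in zip(tokens, octets))


def bom_results(text: str = "[]") -> dict:
    """For each encoding, does BOM + encoded text still fit its own pattern?"""
    return {label: matches(pattern, bom + text.encode(codec))
            for label, pattern, bom, codec in CASES}


if __name__ == "__main__":
    for label, ok in bom_results().items():
        print(label, "fits its pattern" if ok else "does not fit its pattern")
```

With a BOM in front, none of the UTF-16 and UTF-32 streams fits its own pattern. The UTF-8 stream still fits "xx xx xx xx", but its first character is then U+FEFF rather than `[` or `{`.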

-----

Although I am afraid that the RFC has a problem:
I think "日本語" (U+0022 U+65E5 U+672C U+8A9E U+0022) is valid JSON (same as "foo").

The first four bytes are:
           00 00 00 22  UTF-32BE
           00 22 65 E5  UTF-16BE
           22 00 00 00  UTF-32LE
           22 00 E5 65  UTF-16LE
           22 E6 97 A5  UTF-8
The UTF-16 bytes don't match the patterns in the RFC, so UTF-16 streams would (wrongly) be detected as UTF-8 if one strictly follows the RFC.
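The leading octets can be checked mechanically. This is a small sketch, assuming Python 3 and its built-in codecs; the helper names are made up. It prints the first four octets of the quoted string in each encoding and counts the nulls among them.

```python
# The JSON string "日本語": U+0022 U+65E5 U+672C U+8A9E U+0022,
# written with escapes so the source file stays ASCII.
TEXT = '"\u65e5\u672c\u8a9e"'

CODECS = [
    ("UTF-32BE", "utf-32-be"),
    ("UTF-16BE", "utf-16-be"),
    ("UTF-32LE", "utf-32-le"),
    ("UTF-16LE", "utf-16-le"),
    ("UTF-8", "utf-8"),
]


def first_four_octets(codec: str) -> bytes:
    """First four octets of TEXT in the given Python codec (no BOM added)."""
    return TEXT.encode(codec)[:4]


def null_count(octets: bytes) -> int:
    """Number of 00 octets in the given bytes."""
    return sum(1 for octet in octets if octet == 0)


if __name__ == "__main__":
    for label, codec in CODECS:
        octets = first_four_octets(codec)
        hex_form = " ".join(f"{octet:02X}" for octet in octets)
        print(f"{hex_form}  {label}  (nulls: {null_count(octets)})")
```

In both UTF-16 forms only one of the four octets is null, so neither fits 00 xx 00 xx or xx 00 xx 00.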

Regards,
Mihai


======================================================
From: Bjoern Hoehrmann <[hidden email]>
To: [hidden email] (Henry S. Thompson)
Cc: IETF Discussion <[hidden email]>, JSON WG <[hidden email]>, "Martin J. Dürst" <[hidden email]>, [hidden email], es-discuss <[hidden email]>
Date: Mon, 18 Nov 2013 14:48:19 +0100
Subject: Re: BOMs
* Henry S. Thompson wrote:
>I'm curious to know what level you're invoking the parser at.  As
>implied by my previous post about the Python 'requests' package, it
>handles application/json resources by stripping any initial BOM it
>finds -- you can try this with
>
>>>> import requests
>>>> r=requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
>>>> r.json()

The Perl code was

  perl -MJSON -MEncode -e
    "my $s = encode_utf8(chr 0xFEFF) . '[]'; JSON->new->decode($s)"

The Python code was

  import json
  json.loads(u"\uFEFF[]".encode('utf-8'))

The Go code was

  package main

  import "encoding/json"
  import "fmt"

  func main() {
    r := "\uFEFF[]"

    var f interface{}
    err := json.Unmarshal([]byte(r), &f)

    fmt.Println(err)
  }

In other words, always passing a UTF-8 encoded byte string to the byte
string parsing part of the JSON implementation. RFC 4627 is the only
specification for the application/json on-the-wire format and it does
not mention anything about Unicode signatures. Looking for certain byte
sequences at the beginning and treating them as a Unicode signature is
the same as looking for `/* ... */` and treating it as a comment.
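
The difference between the two levels can be sketched as follows. This assumes Python 3, whereas the one-liner above is Python 2 style, and both function names are hypothetical. Decoding as plain UTF-8 leaves U+FEFF in the text, which the parser rejects; a caller who decides to treat EF BB BF as a signature has to strip it as a separate step before parsing.

```python
import json

# "[]" preceded by U+FEFF, encoded as UTF-8: EF BB BF 5B 5D
DATA = "\ufeff[]".encode("utf-8")


def parse_strict(octets: bytes):
    """Decode as plain UTF-8; a leading U+FEFF stays in the text."""
    return json.loads(octets.decode("utf-8"))


def parse_stripping_signature(octets: bytes):
    """Caller's choice: drop a leading EF BB BF before parsing."""
    return json.loads(octets.decode("utf-8-sig"))


if __name__ == "__main__":
    try:
        parse_strict(DATA)
    except ValueError as error:  # json.JSONDecodeError is a ValueError
        print("strict:", error)
    print("stripped:", parse_stripping_signature(DATA))
```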
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


_______________________________________________
es-discuss mailing list
[hidden email]
https://mail.mozilla.org/listinfo/es-discuss

Bjoern Hoehrmann
* [hidden email] wrote:

>The first four bytes are:
>
>           00 00 00 22  UTF-32BE
>           00 22 E5 65  UTF-16BE
>           22 00 00 00  UTF-32LE
>           22 00 65 E5  UTF-16LE
>           22 E6 97 A5  UTF-8
>
>The UTF-16 bytes don't match the patterns in RFC, so UTF-16 streams would
>(wrongly) be detected as UTF-8, if one strictly follows the RFC.

RFC 4627 does not allow string literals at the top level.
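
As a rough illustration (Python 3 assumed, helper name made up): Python's `json.loads` accepts a bare string at the top level, so an RFC 4627 style check has to be added by the caller. Once the string is wrapped in an array, the first two characters are `[` and `"`, both ASCII, and the null pattern holds again.

```python
import json


def loads_rfc4627(text: str):
    """Parse JSON, accepting only an object or an array at the top level."""
    value = json.loads(text)
    if not isinstance(value, (dict, list)):
        raise ValueError("RFC 4627 JSON text must be an object or array")
    return value


# The string "日本語" wrapped in an array, written with escapes.
WRAPPED = '["\u65e5\u672c\u8a9e"]'

if __name__ == "__main__":
    print(loads_rfc4627(WRAPPED))
    # First four octets in UTF-16BE: 00 5B 00 22, i.e. 00 xx 00 xx.
    print(" ".join(f"{octet:02X}" for octet in WRAPPED.encode("utf-16-be")[:4]))
    try:
        loads_rfc4627('"\u65e5\u672c\u8a9e"')
    except ValueError as error:
        print("rejected:", error)
```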
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Mihai Niță
Sorry, I only took the first sentence here into account (the second one added to my confusion rather than clarifying it, but that is probably just me):
A JSON text is a sequence of tokens.  The set of tokens includes six structural characters, strings, numbers, and three literal names.
A JSON text is a serialized object or array.

Anyway, this is good. It means that the RFC has no problem, it's just me :-)

But the conclusion that the RFC does not allow a BOM is independent of this, and I think it stands.

Mihai



On Mon, Nov 18, 2013 at 9:50 AM, Bjoern Hoehrmann <[hidden email]> wrote:
* [hidden email] wrote:
>The first four bytes are:
>
>           00 00 00 22  UTF-32BE
>           00 22 E5 65  UTF-16BE
>           22 00 00 00  UTF-32LE
>           22 00 65 E5  UTF-16LE
>           22 E6 97 A5  UTF-8
>
>The UTF-16 bytes don't match the patterns in RFC, so UTF-16 streams would
>(wrongly) be detected as UTF-8, if one strictly follows the RFC.

RFC 4627 does not allow string literals at the top level.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

