Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Mark A. Ziesemer-3
Use case:

Assume JavaScript in a Mozilla Firefox extension.
1) Create or open a new XML document for modification using DOM.
2) Set an XML attribute value containing a tab character, e.g.:
node.setAttribute("someName", "some \t Value");
3) Save the XML using XMLSerializer.
4) Re-read the XML using XMLHttpRequest.

The problem is that the value is currently read back without the tab
(it comes back as a space) instead of as "some \t Value".

Verified that the saved XML contains 13 characters, including
[space,tab,space].

Missing the tab when reading this is somewhat expected, per what I
understand of the W3C specification regarding attribute normalization:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-white-space

When manually changing the saved XML to instead use a character entity
("some &#9; Value"), the read from XMLHttpRequest works as desired,
returning "some \t Value" for the "someName" attribute.

I am unable to determine how to save values with character entities
using XMLSerializer.  It properly escapes values such as '&', replacing
it with "&amp;".  However, as such, trying to save "&#9;" results in
the entity being escaped as "&amp;#9;".  This should be expected, but the
question remains: How can a character entity be written using XMLSerializer?

(I just verified that Apache Xalan-J (the default JAXP transformer
under Java 6) serializes '\t' using a character entity.)

--
Mark A. Ziesemer
www.ziesemer.com
_______________________________________________
dev-tech-xml mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-tech-xml

Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Boris Zbarsky
Mark A. Ziesemer wrote:
> Verified that the saved XML contains 13 characters, including
> [space,tab,space].

Parsing that should give a 3-letter attr value in Gecko.  As a simple test:

   data:text/xml,<root%20attr="%20%09%20"/>

   javascript:alert(document.documentElement.getAttribute("attr").length)

> http://www.w3.org/TR/2000/REC-xml-20001006#sec-white-space

I don't believe we perform such normalization.

Is there a testcase somewhere that shows your problem?  You should be able to
use XMLHttpRequest in a web page just as well as in an extension to construct
such a testcase...

> This should be expected, but the
> question remains: How can a character entity be written using
> XMLSerializer?

It can't.  The point of XMLSerializer is to expose a very simple easy to use
serialization model.  Trusted callers who need something more complicated can
use nsIDocumentEncoder with all its complexity directly (though I'm not sure
even that will let you entity-encode tabs, necessarily).

-Boris


Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Christian Biesinger
Boris Zbarsky wrote:
> Parsing that should give a 3-letter attr value in Gecko.  As a simple test:
>
>   data:text/xml,<root%20attr="%20%09%20"/>
>
>   javascript:alert(document.documentElement.getAttribute("attr").length)

But
javascript:alert(document.documentElement.getAttribute("attr").charCodeAt(1).toString(16))
alerts (hex) 20, i.e. a space.

Which is consistent with http://www.w3.org/TR/xml11/#AVNormalize (and
http://www.w3.org/TR/xml/#AVNormalize), which are better references than
the xml:space one from the original post.
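
The behaviour observed above can be approximated in a few lines. This is only a rough sketch of the attribute-value normalization rule for CDATA-type attributes, not Gecko's parser code: normalizeAttributeValue is a hypothetical helper, and it assumes the raw value contains only literal text and numeric character references (named entities such as &amp; are not handled).

```javascript
// Hypothetical helper, not part of any browser API: approximates XML
// attribute-value normalization for a CDATA-type attribute.
function normalizeAttributeValue(raw) {
  return raw
    // Line ends are normalized first: CRLF and lone CR become LF.
    .replace(/\r\n?/g, "\n")
    // Single pass: character references are decoded and kept as-is,
    // while literal tab/LF/CR characters are replaced by a space.
    .replace(/&#x([0-9a-fA-F]+);|&#([0-9]+);|[\t\n\r]/g, (match, hex, dec) => {
      if (hex !== undefined) return String.fromCodePoint(parseInt(hex, 16));
      if (dec !== undefined) return String.fromCodePoint(parseInt(dec, 10));
      return " ";
    });
}

// A literal tab is lost, while a character reference survives.
const fromLiteralTab = normalizeAttributeValue(" \t ");
const fromReference = normalizeAttributeValue(" &#9; ");
console.log(fromLiteralTab.charCodeAt(1).toString(16)); // 20
console.log(fromReference.charCodeAt(1).toString(16));  // 9
```

Under these assumptions the value keeps its length of 3 but the middle character becomes a space, which matches the alert above; only a tab written as a character reference reaches the DOM.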




Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Mark A. Ziesemer-3
Christian Biesinger wrote:

> Boris Zbarsky wrote:
>> Parsing that should give a 3-letter attr value in Gecko.  As a simple
>> test:
>>
>>   data:text/xml,<root%20attr="%20%09%20"/>
>>
>>   javascript:alert(document.documentElement.getAttribute("attr").length)
>
> But
> javascript:alert(document.documentElement.getAttribute("attr").charCodeAt(1).toString(16))
> alerts (hex) 20, i.e. a space.
>
> Which is consistent with http://www.w3.org/TR/xml11/#AVNormalize (and
> http://www.w3.org/TR/xml/#AVNormalize), which are better references than
> the xml:space one from the original post.

I am confused by the above example, mostly by how "document" is
being populated from the given "root" example element.  Actually, per my
testing (as mentioned in the original post), XMLHttpRequest returns the
data as expected, including a tab character, IF the XML was saved using
a character entity for the tab.

I think the assumption should be that XMLHttpRequest is working as
expected.  I think that the XMLSerializer is what needs to be looked at.
As mentioned previously, it is operating differently than Apache
Xalan-J, which seems to be a pretty good reference implementation.

I'll also respond to Boris's post regarding nsIDocumentEncoder.

Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Mark A. Ziesemer-3
In reply to this post by Boris Zbarsky
Boris Zbarsky wrote:
> Is there a testcase somewhere that shows your problem?  You should be
> able to use XMLHttpRequest in a web page just as well as in an extension
> to construct such a testcase...

I'll work on creating a simple test case and including it here.
However, I'm pretty sure that this is a problem with the functionality
of the XMLSerializer, not the XMLHttpRequest.

>> This should be expected, but the question remains: How can a character
>> entity be written using XMLSerializer?
>
> It can't.  The point of XMLSerializer is to expose a very simple easy to
> use serialization model.  Trusted callers who need something more
> complicated can use nsIDocumentEncoder with all its complexity directly
> (though I'm not sure even that will let you entity-encode tabs,
> necessarily).

I'm fine with "very simple" and "easy to use", but then it should work
by default - even for the not-so-common but within-spec cases.  Without
any options to set, it seems impossible to call it a configuration/usage
issue.  The fact that it uses some character entities (e.g. "&amp;") but
not others seems to be a little bit of an issue.  (Are there any
comments with how the Xalan-J implementation handles this?)

(While I'm looking to support a tab character, it seems that this
could also affect a number of other non-"word" characters.  In this
particular case, I'm trying to back up lines out of the hostperm.1 file,
which is clearly specified to contain tab-separated fields.  Replacing
with spaces on both the read/write ends seems to be a bad idea, in case
a space would become valid within one of the fields.)

I'm open to using an alternate serialization method, as long as it is
guaranteed to be available for use within a Firefox 2.x extension.  Per
the note at http://developer.mozilla.org/en/docs/XMLSerializer, "it's
not part of any standard and level of support in other browsers is unknown".

Unfortunately, your suggestion to use nsIDocumentEncoder seems much
easier said than done.  Searching for "nsIDocumentEncoder" or even
"DocumentEncoder" on the MDC Wiki currently yields 0 results.  A global
Google search seems almost as disappointing.  A simple example of
serializing an XML document using it from within JavaScript would be
ideal.

The only other alternative I can think of at the moment is to place
these types of values inside of CDATA elements rather than attributes,
which will end up making my XML structure much longer and more complex.

Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Boris Zbarsky
Mark A. Ziesemer wrote:
> I'm fine with "very simple" and "easy to use", but then it should work
> by default - even for the not-so-common but within-spec cases.  Without
> any options to set, it seems impossible to call it a configuration/usage
> issue.  The fact that it uses some character entities (e.g. "&amp;") but
> not others seems to be a little bit of an issue.

You have to use &amp; to get well-formed XML out, basically.  The default
serialization mode is to use as few entities as possible while still producing a
correct and valid serialization of the DOM.

If this process is lossy for the tab character in XML because the XML parser
normalizes away literal tabs, then a bug should be filed on changing how it's
handled.

> I'm open to using an alternate serialization method

For what you're doing, why not just serialize directly into a text file?  It
doesn't sound like you're starting with XML, right?  So why try to shoehorn your
data into XML, with its funny whitespace rules?

Alternately, you could base64-encode your data to eliminate any
possibly-problematic characters.
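
As a rough illustration of the base64 idea, here is a minimal sketch. It assumes the btoa/atob globals are available and that the data is plain ASCII (btoa throws on characters above U+00FF); encodeField, decodeField and the sample line are hypothetical and only stand in for a tab-separated record.

```javascript
// Hypothetical helpers: wrap a tab-separated line so that it can be stored
// in an XML attribute without any whitespace for the parser to normalize.
function encodeField(line) {
  return btoa(line); // output uses only A-Z, a-z, 0-9, "+", "/" and "="
}

function decodeField(encoded) {
  return atob(encoded);
}

// Illustrative tab-separated record, not taken from a real file.
const sampleLine = "host\tcookie\t1\texample.com";
const stored = encodeField(sampleLine);
const restored = decodeField(stored);
console.log(restored === sampleLine); // true
```

The trade-off is that the stored value is no longer human-readable and cannot be matched directly with XPath string tests.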

> Unfortunately, your suggestion to use nsIDocumentEncoder seems much
> easier said than done.  Searching for "nsIDocumentEncoder" or even
> "DocumentEncoder" on the MDC Wiki currently yields 0 results.

You can't use it from script in Firefox 2 anyway.  I keep forgetting that not
everyone is working against the trunk...

-Boris


Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Mark A. Ziesemer-3
Boris Zbarsky wrote:
> You have to use &amp; to get well-formed XML out, basically.  The
> default serialization mode is to use as few entities as possible while
> still producing a correct and valid serialization of the DOM.
>
> If this process is lossy for the tab character in XML because the XML
> parser normalizes away literal tabs, then a bug should be filed on
> changing how it's handled.

Now that we've pretty much confirmed that round-tripping tab characters
is pretty much impossible, I probably will open a bug report, at least
for tracking purposes in case others are running into similar issues.

> For what you're doing, why not just serialize directly into a text
> file?  It doesn't sound like you're starting with XML, right?  So why
> try to shoehorn your data into XML, with its funny whitespace rules?

This is just one of many data elements I'm saving.  It's working
great for everything else.  I can then also use XPath against it for
searching later.

> Alternately, you could base64-encode your data to eliminate any
> possibly-problematic characters.

I probably will just use some XML related workaround - base64, CDATA, or
otherwise.  Following XML ideals, I'd actually split up the tabbed
fields and store them as separate XML entities - though not doing so
makes the code more resilient to change if the data format is added to.

>> Unfortunately, your suggestion to use nsIDocumentEncoder seems much
>> easier said than done.  Searching for "nsIDocumentEncoder" or even
>> "DocumentEncoder" on the MDC Wiki currently yields 0 results.
>
> You can't use it from script in Firefox 2 anyway.  I keep forgetting
> that not everyone is working against the trunk...

Thanks for that time-saving note!  :-)  Like I said above, I'll just use
an XML workaround for now.  Yeah, even working from trunk and the alpha
builds may be preferred for developers, for the latest features and
fixes, but that's quite a minimum dependency for an extension!  :-)

Thanks for all the information!

Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Jonas Sicking-2
In reply to this post by Boris Zbarsky
Boris Zbarsky wrote:
>> This should be expected, but the question remains: How can a character
>> entity be written using XMLSerializer?
>
> It can't.  The point of XMLSerializer is to expose a very simple easy to
> use serialization model.  Trusted callers who need something more
> complicated can use nsIDocumentEncoder with all its complexity directly
> (though I'm not sure even that will let you entity-encode tabs,
> necessarily).

Isn't it a bug that we're not creating a character entity in this case
though? We're creating entities for characters that would produce
invalid XML (<, &, >, etc) while serializing textnodes, so why not
create them for whitespace when serializing attributes.
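
A minimal sketch of what such escaping might look like when an attribute value is written out by hand. This is not how XMLSerializer is implemented; escapeAttributeValue is a hypothetical helper, and it assumes the value will be wrapped in double quotes.

```javascript
// Hypothetical helper: escapes an attribute value so that literal
// whitespace survives attribute-value normalization when re-parsed.
function escapeAttributeValue(value) {
  return value.replace(/[&<>"\t\n\r]/g, (ch) => {
    switch (ch) {
      case "&": return "&amp;";
      case "<": return "&lt;";
      case ">": return "&gt;";
      case '"': return "&quot;";
      // Tab, LF and CR become numeric character references
      // (&#9;, &#10;, &#13;), which the parser does not turn into spaces.
      default: return "&#" + ch.charCodeAt(0) + ";";
    }
  });
}

const markup = '<root someName="' + escapeAttributeValue("some \t Value") + '"/>';
console.log(markup); // <root someName="some &#9; Value"/>
```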

/ Jonas

Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Boris Zbarsky
Jonas Sicking wrote:
> Isn't it a bug that we're not creating a character entity in this case
> though? We're creating entities for characters that would produce
> invalid XML (<, &, >, etc) while serializing textnodes, so why not
> create them for whitespace when serializing attributes.

Well.  Putting a literal tab in an attribute does not produce "invalid" XML by
any means.

But yes, I think that the lack of round-trippability is a bug.

-Boris



Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Jonas Sicking-2
Boris Zbarsky wrote:
> Jonas Sicking wrote:
>> Isn't it a bug that we're not creating a character entity in this case
>> though? We're creating entities for characters that would produce
>> invalid XML (<, &, >, etc) while serializing textnodes, so why not
>> create them for whitespace when serializing attributes.
>
> Well.  Putting a literal tab in an attribute does not produce "invalid"
> XML by any means.

Right, but it also doesn't correctly serialize the DOM.

> But yes, I think that the lack of round-trippability is a bug.

Mark, please file a bug on this so that we can track it. Unfortunately
it is unlikely to get fixed for Firefox 3 unless someone steps up and
provides a patch.

/ Jonas

Re: Unable to round-trip tab character using XMLHttpRequest and XMLSerializer

Mark-306
On Oct 1, 7:40 pm, Jonas Sicking <[hidden email]> wrote:

> Mark, please file a bug on this so that we can track it. Unfortunately
> it is unlikely to get fixed for Firefox 3 unless someone steps up and
> provides a patch.

Done.  https://bugzilla.mozilla.org/show_bug.cgi?id=398272
