Tag-Soup- versus XML-Parser

Ernst Beiglböck-2
Hi folks,
as far as I can see, there are at least two different methods of parsing an
(X)HTML document: a "tag-soup" parser, which is very error-prone and makes
the best of non-valid websites, and an XML parser, which is only
activated if a document is served with MIME type application/xhtml+xml.
IMHO an XML parser should be a lot faster, because it just stops if
there is an error and does not have to concern itself with error handling.

I would be very interested in a performance comparison between the
most commonly used tag-soup parser and the XML parser for XHTML 1.x
documents that are correctly served as application/xhtml+xml.

Can you maybe give me any metrics?

Or, better, is it possible to extract the Gecko document parser and
benchmark it standalone with various documents?

The answer to this question matters for deciding whether it is preferable
to build valid XHTML documents served as XML or to just stick to HTML 4.01.

thank you very much!
_______________________________________________
dev-performance mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-performance
Re: Tag-Soup- versus XML-Parser

Boris Zbarsky
[Please don't cross-post without setting followup-to; set that to
.performance]

Ernst Bauernfeind wrote:
> as far as I can see, there are at least two different methods of parsing an
> (X)HTML document: a "tag-soup" parser, which is very error-prone and makes
> the best of non-valid websites, and an XML parser, which is only
> activated if a document is served with MIME type application/xhtml+xml.

You mean "error-correcting", not "error-prone", right?

> IMHO an XML parser should be a lot faster, because it just stops if
> there is an error and does not have to concern itself with error handling.

It also needs to implement very different parsing rules from HTML in
general, keep track of namespaces, etc.
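
To make those two points concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It is only a stand-in for an XML parser in general and says nothing about Gecko's implementation; the sample documents and the helper name parses_as_xml are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Two hypothetical documents: one well-formed, one with an unclosed <p>.
well_formed = f'<html xmlns="{XHTML_NS}"><body><p>Hi</p></body></html>'
malformed = f'<html xmlns="{XHTML_NS}"><body><p>Hi</body></html>'

# Namespace tracking: every element name comes back qualified with the
# namespace URI that the parser resolved from the xmlns declaration.
root = ET.fromstring(well_formed)
tags = [el.tag for el in root.iter()]


def parses_as_xml(text):
    """Return True if text is well-formed XML, False if the parser gives up."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # An XML parser reports the first well-formedness error and stops;
        # it does not try to repair the markup the way a tag-soup parser does.
        return False


print(tags)
print(parses_as_xml(well_formed), parses_as_xml(malformed))
```

So "just stops on error" does save the recovery logic, but the namespace bookkeeping shown in tags is work a tag-soup HTML parser does not have to do.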

> I would be very interested in a performance comparison between the
> most commonly used tag-soup parser and the XML parser for XHTML 1.x
> documents that are correctly served as application/xhtml+xml.
>
> can you maybe give me any metrics?

Parser performance per se is more or less a wash, from what I've seen.
It's also generally been a small enough component of pageload time (15%
or less) that this aspect should be about the last factor in deciding
whether to use XML or not.
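
A rough back-of-the-envelope sketch of why a 15% share puts a ceiling on the gain, using Amdahl's law. The 2x parser speedup is a made-up figure for illustration, not a measurement, and the function name is hypothetical.

```python
def overall_speedup(parse_fraction, parser_speedup):
    """Amdahl's law: whole-pageload speedup when only parsing gets faster."""
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parser_speedup)


PARSE_FRACTION = 0.15  # the "15% or less" upper bound quoted above

# Hypothetical: one parser is twice as fast as the other.
twice_as_fast = overall_speedup(PARSE_FRACTION, 2.0)

# Limit case: parsing costs nothing at all.
limit = 1.0 / (1.0 - PARSE_FRACTION)

print(f"2x faster parser: {twice_as_fast:.3f}x faster pageload")
print(f"free parsing:     {limit:.3f}x faster pageload")
```

Even with parsing made free, the pageload improves by less than 18% under these assumptions, and by about 8% for a parser that is twice as fast.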

> or, better, is it possible to extract the gecko document parser and
> benchmark it standalone with various documents?

Why?  That would give you a somewhat useless number for your purposes
(see below).

> The answer to this question matters for deciding whether it is preferable
> to build valid XHTML documents served as XML or to just stick to HTML 4.01.

That's an entirely different question from a performance standpoint.
Typically, the HTML codepath receives more attention in terms of
optimization and profiling; if we have to sacrifice XHTML-as-XML
performance in favor of HTML performance, we do so.

There are also common markup constructs that actually do significantly
different things in XHTML-as-XML and in HTML.  A good example:

   <table>
     <tr><td>Text</td></tr>
   </table>

This produces different DOMs in HTML and in XHTML-as-XML; the HTML one
is faster to lay out, especially if you plan to do any dynamic addition
or removal of rows in that table.
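
A minimal sketch of the XML side of that difference, again using Python's xml.etree.ElementTree as a stand-in for an XML parser. The assumption here, not spelled out in the message, is that the difference meant is the implied <tbody>: an HTML parser inserts one between <table> and <tr>, while an XML parser builds exactly the tree that was written.

```python
import xml.etree.ElementTree as ET

markup = """
<table>
  <tr><td>Text</td></tr>
</table>
"""

# Parsed as XML, the tree is exactly what the markup says: <tr> is a
# direct child of <table>, and no <tbody> element exists.
table = ET.fromstring(markup.strip())
xml_children = [child.tag for child in table]

print(xml_children)
print(table.find("tbody"))

# Assumption (not computed here): an HTML parser given the same markup
# builds table > tbody > tr, because <tbody> is implied in HTML.
```

If that assumption holds, script that walks the table sees a different child structure depending on how the document was served.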

-Boris