Unicode in the DOM ?

Unicode in the DOM ?

Emile Kroeger
Hello, I'm working on a Firefox extension and I'm not entirely sure
this is the best place to post, but:

I'd like to access text from the webpage in unicode format. However,
when I parse the DOM tree of the browser content, I get another format
(and the format seems to depend on the Character Encoding I chose in
the browser, but it looks like it's never given in unicode).

Now, there are some XPCOM charset conversion utilities that would
theoretically do the job, but those would require knowing the charset
used in the page (which can probably be retrieved, but it sounds like
a pain).

I read that Gecko's internal representation is in unicode; accessing
that would be neat. Does Gecko then create the "content" DOM tree in a
more specific encoding for cleaner display ? Would there be a way to
access the unicode representation Gecko creates ?

(Please forgive my approximate terminology, I'm still very much a
newbie to all this; feel free to correct me)

Thank you,

Emile

_______________________________________________
mozilla-layout mailing list
[hidden email]
http://mail.mozilla.org/listinfo/mozilla-layout

Re: Unicode in the DOM ?

Boris Zbarsky
Emile Kroeger wrote:
> I'd like to access text from the webpage in unicode format. However,
> when I parse the DOM tree of the browser content

I'm not sure what you mean.  Parsing is the process that turns a stream of
characters into a DOM tree.  You don't parse the DOM tree.

All data in the DOM in Mozilla is stored either in UTF16, UTF8, or ASCII
(depending on the exact piece of data).

All data in JavaScript is in UCS-2.
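
To make that concrete, here is a minimal sketch of how script sees string data as 16-bit code units, whatever encoding the page was served in. The helper name `codeUnits` and the sample text are made up for illustration. Note that a character outside the Basic Multilingual Plane shows up as two code units (a surrogate pair), so "UCS-2" is best read loosely here.

```javascript
// Hypothetical helper (not part of any Mozilla API): list the 16-bit
// code units that make up a JavaScript string.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i));
  }
  return units;
}

// Sample text: "h", e-acute, "llo", a space, then two CJK characters.
const sample = "h\u00e9llo \u4e16\u754c";
const sampleUnits = codeUnits(sample);

// Every element is a number in the range 0..0xFFFF, never a raw byte.
console.log(sample.length); // 8
console.log(sampleUnits.map((u) => u.toString(16)).join(" "));

// U+1F600 lies outside the BMP, so it takes two code units.
const astral = "\u{1F600}";
console.log(astral.length); // 2
```

If text read from the DOM looks like anything other than this, the conversion is probably happening somewhere after the DOM.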

So what are you doing exactly, to get something in a different encoding than one
of those?  ;)

-Boris

Re: Unicode in the DOM ?

James Ross
Boris Zbarsky wrote:

> Emile Kroeger wrote:
>> I'd like to access text from the webpage in unicode format. However,
>> when I parse the DOM tree of the browser content
>
> I'm not sure what you mean.  Parsing is the process that turns a stream
> of characters into a DOM tree.  You don't parse the DOM tree.
>
> All data in the DOM in Mozilla is stored either in UTF16, UTF8, or ASCII
> (depending on the exact piece of data).
>
> All data in JavaScript is in UCS-2.
>
> So what are you doing exactly, to get something in a different encoding
> than one of those?  ;)

Sounds like XMLHttpRequest to me, which does some *really* dodgy
character encoding stuff.

Emile, if you are using XMLHttpRequest, and you absolutely know the
character encoding the data is in (and its MIME type!), you can do this
just after calling open():

request.overrideMimeType("mime/type; charset=utf-8");

(assuming the data is in UTF-8 - use whatever charset is appropriate)
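
In context, the call order might look like the sketch below. The helper name `fetchWithCharset`, its parameters, and the example URL are invented for illustration; the only calls assumed on `request` are the standard XMLHttpRequest methods open(), overrideMimeType() and send().

```javascript
// Hypothetical helper: `request` is expected to be an XMLHttpRequest
// (or any object exposing the same three methods).
function fetchWithCharset(request, url, mimeType, charset) {
  request.open("GET", url, true);
  // Override the type after open() and before send(), so the response
  // is decoded with the charset we claim instead of a guessed one.
  request.overrideMimeType(mimeType + "; charset=" + charset);
  request.send(null);
}

// Possible usage in a browser or extension context (URL is a placeholder):
//   const req = new XMLHttpRequest();
//   req.onload = function () { /* read req.responseText here */ };
//   fetchWithCharset(req, "http://example.invalid/data.txt",
//                    "text/plain", "utf-8");
```

This only helps when the charset really is known; declaring the wrong one just moves the garbling somewhere else.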

Of course, the *best* way to fix it is to make the server send the
character encoding with the content...

--
James Ross <[hidden email]>
ChatZilla Developer

Re: Unicode in the DOM ?

Emile Kroeger
> > So what are you doing exactly, to get something in a different encoding
> > than one of those?  ;)
>
> Sounds like XMLHttpRequest to me, which does some *really* dodgy
> character encoding stuff.

Nono, I was just reading off a normal webpage - greasemonkey-like stuff :P

I read the text from the page, sent it to Python, and Python said "Wah
! It's not unicode !"

Since I noticed that I'd get different strings when I selected
different encodings for the page (which was originally in unicode), I
assumed that the charset was being converted from unicode to whatever
was convenient for the display.

Turns out I was wrong: the JavaScript strings were indeed in
two-byte unicode (at least, when I had selected the correct encoding
in the browser), but they were being cast into one-byte
strings when I sent them to Python, because I was using the wrong
character type ("string" instead of "wstring") in the XPIDL interface
definitions. I guess that's what I get when I work with two
dynamically-typed languages with a glue layer of statically typed
language in between :P (And when my JavaScript skills are not that
great)
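
For anyone hitting the same thing, here is a rough simulation of what that kind of narrowing does to the text. It assumes each 16-bit code unit is simply cut down to its low byte, which is an assumption about the glue layer and not something checked against the XPConnect source; `narrowToBytes` is a made-up name.

```javascript
// Hypothetical simulation: squeeze each 16-bit code unit into one byte
// by keeping only the low 8 bits (assumed behaviour, see note above).
function narrowToBytes(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    out += String.fromCharCode(str.charCodeAt(i) & 0xff);
  }
  return out;
}

const original = "caf\u00e9 \u4e16\u754c"; // "cafe" with e-acute, then two CJK characters
const narrowed = narrowToBytes(original);

// ASCII and Latin-1 characters survive because they fit in one byte.
console.log(narrowed.slice(0, 4) === original.slice(0, 4)); // true

// The CJK characters are destroyed: only their low bytes remain.
console.log(narrowed.charCodeAt(5).toString(16)); // 16
console.log(narrowed.charCodeAt(6).toString(16)); // 4c
```

Pages containing only ASCII text pass through unharmed, which is why a mistake like this can go unnoticed for a while.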

So, it had nothing to do with Gecko at all, sorry for the disturbance.

Emile
