Hello, I'm working on a Firefox extension and I'm not entirely sure
this is the best place to post, but:
I'd like to access text from the webpage in unicode format. However,
when I parse the DOM tree of the browser content, I get another format
(and the format seems to depend on the Character Encoding I chose in
the browser, but it looks like it's never given in unicode).
Now, there are some XPCOM charset conversion utilities that would
theoretically do the job, but those would require knowing the charset
used in the page (which can probably be retrieved, but that sounds
fragile). I read that Gecko's internal representation is unicode, so
accessing that directly would be neat. Does Gecko then create the
"content" DOM tree in a more specific encoding for cleaner display?
Would there be a way to access the unicode representation Gecko creates?
(Please forgive my approximate terminology, I'm still very much a
newbie to all this; feel free to correct me)
> Emile Kroeger wrote:
>> I'd like to access text from the webpage in unicode format. However,
>> when I parse the DOM tree of the browser content
> I'm not sure what you mean. Parsing is the process that turns a stream
> of characters into a DOM tree. You don't parse the DOM tree.
> All data in the DOM in Mozilla is stored either in UTF-16, UTF-8, or ASCII
> (depending on the exact piece of data).
> So what are you doing exactly, to get something in a different encoding
> than one of those? ;)
Sounds like XMLHttpRequest to me, which does some *really* dodgy
character encoding stuff.
Emile, if you are using XMLHttpRequest, and you absolutely know the
character encoding the data is in (and its MIME type!), you can do this
just after calling open():
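The snippet itself isn't preserved in this excerpt, but assuming it is Gecko's overrideMimeType() call (the standard way to force a MIME type and charset on an XMLHttpRequest response, called right after open()), the decoding problem it works around can be sketched in Python; the text and charsets here are just illustrative:

```python
# Bytes fetched off the wire, actually encoded as UTF-8.
raw = "héllo".encode("utf-8")

# If the decoder guesses the wrong charset, the text is mangled:
wrong = raw.decode("latin-1")   # -> 'hÃ©llo'

# Forcing the charset you *know* the data is in (which is what
# overrideMimeType() lets you do for XMLHttpRequest) recovers it:
right = raw.decode("utf-8")     # -> 'héllo'

print(wrong, right)
```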
> > So what are you doing exactly, to get something in a different encoding
> > than one of those? ;)
> Sounds like XMLHttpRequest to me, which does some *really* dodgy
> character encoding stuff.
No no, I was just reading off a normal webpage - greasemonkey-like stuff :P
I read the text from the page, sent it to Python, and Python said "Wah!
It's not unicode!"
Since I noticed that I'd get different strings when I selected
different encodings for the page (which was originally in unicode), I
assumed that the charset was being converted from unicode to whatever
was convenient for the display.
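For what it's worth, seeing different strings under different encodings is expected: the same text serializes to different byte sequences depending on the encoding. A quick Python illustration (the sample text is made up):

```python
text = "héllo"
for enc in ("utf-8", "latin-1", "utf-16-le"):
    # Each encoding produces a different byte string for the same text.
    print(enc, text.encode(enc))
# utf-8     b'h\xc3\xa9llo'
# latin-1   b'h\xe9llo'
# utf-16-le b'h\x00\xe9\x00l\x00l\x00o\x00'
```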
It turned out that the strings actually were in two-byte unicode (at
least, when I had selected the correct encoding in the browser), but
that they were being cast into one-byte strings when I sent them to
Python, because I was using the wrong character type ("string" instead
of "wstring") in the XPIDL interface definitions. I guess that's what I
get when I glue two dynamically-typed languages together through a
statically-typed interface layer.
So, it had nothing to do with Gecko at all, sorry for the disturbance.
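To make that failure mode concrete: when a two-byte (UTF-16) string is pushed through a narrow one-byte `string` parameter instead of `wstring`, each two-byte code unit has to be squeezed into a single byte somehow. A Python sketch of one plausible mangling, low-byte truncation (the exact behaviour of the XPConnect marshaling layer may differ):

```python
text = "日本語"  # three characters, each a UTF-16 code unit above 0xFF

# Keep only the low byte of each code unit -- one plausible way a
# two-byte string gets "cast" down to a one-byte string:
narrow = "".join(chr(ord(ch) & 0xFF) for ch in text)

print(repr(narrow))  # garbage: the original text is unrecoverable
```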