Proposed work for Universal Charset Detector

Proposed work for Universal Charset Detector

John Gardiner Myers-2
I need to adapt the Universal Charset Detector for use in an email
application, SpamAssassin.  I am proposing to do the following work:

1) Further separate the base charset detector code from the xpcom glue.
  The base code would be compiled into a static library which would then
be linked together with the xpcom glue into a shared library.

2) Modify UniversalChardetTest so that it links against the static
library, removing the dependency on xpcom.

3) Create the framework for a regression test suite.

4) Fix the outstanding detection bugs.  The list I have is 168526,
178495, 181344, 177505, 285435, 301915, 306224, and 306272.

5) Extend class nsUniversalDetector to take an optional charset label
from its caller.  In email applications, the caller would pass in a
value derived from the MIME charset label.  Should the confidence of the
detection be sufficiently low, this label could affect the eventual
report.  (In email, it is not uncommon for the MIME charset label to be
incorrect.)

I will also be creating and submitting to CPAN a Perl module that
exposes the universal charset detector.
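
As a rough illustration of item 5, the sketch below shows one way a caller-supplied label could affect the report when detection confidence is low. It is written in Python for brevity rather than in the detector's C++. The function name `choose_charset`, the `(charset, confidence)` input shape, and the 0.5 threshold are all assumptions made for this example; none of them is part of nsUniversalDetector.

```python
from typing import Optional, Tuple

# Hypothetical threshold below which the detector's answer is not trusted.
LOW_CONFIDENCE = 0.5


def choose_charset(detected: Optional[Tuple[str, float]],
                   label: Optional[str] = None) -> Optional[str]:
    """Pick the charset to report.

    detected: (charset, confidence) from the detector, or None if the
              detector reached no conclusion.
    label:    optional charset derived from the MIME charset label.
    """
    if detected is None:
        # Nothing detected: the label (possibly None) is all we have.
        return label
    charset, confidence = detected
    if label is not None and confidence < LOW_CONFIDENCE:
        # Detection is weak, so let the caller's label decide.
        return label
    # Detection is strong enough (or no label was given): trust it.
    return charset
```

With this shape, a confidently detected EUC-KR body still wins over an ISO-8859-1 label, while a weak guess defers to the label.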
_______________________________________________
mozilla-i18n mailing list
[hidden email]
http://mail.mozilla.org/listinfo/mozilla-i18n

Re: Proposed work for Universal Charset Detector

Jean-Marc Desperrier
John Gardiner Myers wrote:
> 1) Further separate the base charset detector code from the xpcom glue.
>  The base code would be compiled into a static library which would then
> be linked together with the xpcom glue into a shared library.

Great :-)

> 2) Modify UniversalChardetTest so that it links against the static
> library, removing the dependency on xpcom.

Could you do another version? Keep UniversalChardetTest as an
example of how to do it with xpcom, and have another file that links
directly?

> 3) Create the framework for a regression test suite.

Shanjian had sample files that could be used to test regressions.
Simon Montagu, on Mozilla's current i18n team, is in contact with him
(see https://bugzilla.mozilla.org/show_bug.cgi?id=86999#c23), and I think
I remember he talked about trying to get them.

> 4) Fix the outstanding detection bugs.  The list I have is 168526,
> 178495, 181344, 177505, 285435, 301915, 306224, and 306272.

Good list. If you find more, can you add them as dependencies of bug
264871 ?

> 5) Extend class nsUniversalDetector to take an optional charset label
> from its caller.  In email applications, the caller would pass in a
> value derived from the MIME charset label.  Should the confidence of the
> detection be sufficiently low, this label could affect the eventual
> report.  (In email, it is not uncommon for the MIME charset label to be
> incorrect.)

I think that point is not something really interesting for Mozilla.

First, in my experience it's quite rare nowadays that the MIME charset
label is present and incorrect; second, doing something like that
encourages people not to care about correctly labeling email and helps
propagate that error, when the ultimate aim is that everybody puts
correct labels on mail and web pages.

So I don't think your algorithm where you use MIME charset should be in
the Mozilla code.

But nsUniversalDetector is a virtual class, its member mCharSetProbers
that has the detailed results before the selection of one charset is
protected not private, therefore couldn't you put that modification in
your class that instantiates nsUniversalDetector and be free to do what
you want without changing what Mozilla uses?

I don't see a problem if that code gets in the non-xpcom replacement for
UniversalChardetTest.

> I will also be creating and submitting to CPAN a Perl module that
> exposes the universal charset detector.

I foresee it will become popular :-)

Re: Proposed work for Universal Charset Detector

John Gardiner Myers-2
Jean-Marc Desperrier wrote:

> John Gardiner Myers wrote:
>
>> 5) Extend class nsUniversalDetector to take an optional charset label
>> from its caller.  In email applications, the caller would pass in a
>> value derived from the MIME charset label.  Should the confidence of
>> the detection be sufficiently low, this label could affect the
>> eventual report.  (In email, it is not uncommon for the MIME charset
>> label to be incorrect.)
>
>
> I think that point is not something really interesting for Mozilla.

I could see Thunderbird wanting to use it.

>
> First, in my experience it's quite rare nowadays that the MIME charset
> label is present and incorrect; second, doing something like that
> encourages people not to care about correctly labeling email and helps
> propagate that error, when the ultimate aim is that everybody puts
> correct labels on mail and web pages.

For a long time it has been common for Korean e-mail to be in the EUC-KR
charset but with a MIME label of ISO-8859-1.  This derived from the use
of metamail, which decoded MIME encoded-words but ignored the label and
encoded with a default label of ISO-8859-1.

In my corpus of Japanese spam, there is a substantial portion of
messages with an incorrect MIME charset label.  Using the charset label
instead of the detected charset in a spam filtering application actually
encourages such incorrect labeling, as it provides the spammers an
effective mechanism to obscure their text from the spam filter.

>
> So I don't think your algorithm where you use MIME charset should be
> in the Mozilla code.

I don't see that there is a substantial disadvantage to Mozilla of
having the detector class support a feature that Mozilla itself does not
use.

The disadvantage to Mozilla of refusing this feature is that it makes it
much more likely that I will fork the charset detector code.  Mozilla
would then miss out on the bug fixes and other improvements I make to
the detection features it does use.

>
> But nsUniversalDetector is a virtual class, its member
> mCharSetProbers that has the detailed results before the selection of
> one charset is protected not private, therefore couldn't you put that
> modification in your class that instantiates nsUniversalDetector and be
> free to do what you want without changing what Mozilla uses?

I don't consider that a tenable approach.  The top-level and group
prober classes would have to be substantially duplicated, using
implementation knowledge that really shouldn't be so visible.

>
> I don't see a problem if that code gets in the non-xpcom replacement
> for UniversalChardetTest.

I don't see exposing the caller-supplied charset feature through xpcom
until Thunderbird or some other xpcom client asks for it.

>> I will also be creating and submitting to CPAN a Perl module that
>> exposes the universal charset detector.
>
>
> I foresee it will become popular :-)

It's now available as Encode::Detect.

Re: Proposed work for Universal Charset Detector

Shoshannah Forbes-2
In reply to this post by Jean-Marc Desperrier
--- Jean-Marc Desperrier wrote:
> First, in my experience it's quite rare nowadays
> that the MIME charset label is present and
> incorrect,

It is still a problem with two major
webmail providers: Yahoo and Hotmail, at least when
sending Hebrew email from them (the mail gets marked
as latin). Since they are so large, this isn't so
rare to see, unfortunately.

> second doing
> something like that
> encourages people not to care about correctly
> labeling email and helps
> propagate that error, when the ultimate aim is
> that everybody puts
> correct labels on mail and web pages

Agreed here :-)

Xslf
http://www.xslf.com
Yahoo Messenger ID: Xslf


       
               

Re: Proposed work for Universal Charset Detector

Jean-Marc Desperrier
In reply to this post by John Gardiner Myers-2
John Gardiner Myers wrote:
> In my corpus of Japanese spam, there is a substantial portion of
> messages with an incorrect MIME charset label.

I forgot to add that this is probably considerably more frequent with spam, and
IMO would make a valid rule to slightly raise the SpamAssassin score.

> [...] Using the charset label
> instead of the detected charset in a spam filtering application actually
> encourages such incorrect labeling, as it provides the spammers an
> effective mechanism to obscure their text from the spam filter.

Well, yes, but you don't need to change the current code to avoid that.
You just need to run the UCD on all messages, even those with a label.
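
A minimal sketch of that idea: run detection on every message and only compare the result with the label afterwards. It is written in Python for brevity. The `detect` callable stands in for whatever binding exposes the detector, and `label_mismatch` and `normalize` are hypothetical names made up for this example. Only `codecs.lookup` is a real standard-library call; it is used here to fold charset aliases together.

```python
import codecs
from typing import Callable, Optional


def normalize(name: str) -> str:
    """Map charset aliases (e.g. 'latin1', 'ISO-8859-1') to one spelling."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to Python's codec registry: fall back to a plain fold.
        return name.strip().lower()


def label_mismatch(body: bytes, label: Optional[str],
                   detect: Callable[[bytes], Optional[str]]) -> bool:
    """Return True when a label is present and the detector disagrees with it."""
    detected = detect(body)  # always run detection, labeled or not
    if label is None or detected is None:
        return False
    return normalize(label) != normalize(detected)
```

A filter could feed the boolean into a scoring rule. A real rule would need extra care with pure-ASCII bodies, since ASCII text is compatible with most labels and a naive comparison would flag it.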

>> So I don't think your algorithm where you use MIME charset should be
>> in the Mozilla code.
>
> I don't see that there is a substantial disadvantage to Mozilla of
> having the detector class support a feature that Mozilla itself does not
> use.

There's a philosophy of not including code that is not used, and I
thought the alternative was viable. But after all, if what you add to
nsUniversalDetector is negligible, it may be acceptable.

> The disadvantage to Mozilla of refusing this feature is that it makes it
> much more likely that I will fork the charset detector code.

I'm presenting you arguments against your option, but only an i18n
peer could make that decision. Go ahead and ask if you want to
include it.

>> But nsUniversalDetector is a virtual class, its member
>> mCharSetProbers that has the detailed results before the selection of
>> one charset is protected not private, therefore couldn't you put that
>> modification in your class that instantiates nsUniversalDetector and be
>> free to do what you want without changing what Mozilla uses?
>
> I don't consider that a tenable approach.  The top-level and group
> prober classes would have to be substantially duplicated, using
> implementation knowledge that really shouldn't be so visible.

Hmm, I'm beginning to wonder if I understood what you want to do.
IMO the knowledge that you would need is only GetConfidence and
GetCharSetName from nsCharSetProber and that is public already in
nsCharSetProber.h.

I'm thinking of an alternative.
What if nsCharSetProber just returns the best charset and its
probability in the form of a double? Is that enough for your needs?

>> I don't see a problem if that code gets in the non-xpcom replacement
>> for UniversalChardetTest.
>
> I don't see exposing the caller-supplied charset feature through xpcom
> until Thunderbird or some other xpcom client asks for it.

OK.

If you answer, note that I will be gone until Tuesday.

Re: Proposed work for Universal Charset Detector

John Gardiner Myers-2
Jean-Marc Desperrier wrote:

> Hmm, I'm beginning to wonder if I understood what you want to do.
> IMO the knowledge that you would need is only GetConfidence and
> GetCharSetName from nsCharSetProber and that is public already in
> nsCharSetProber.h.
>
> I'm thinking of an alternative.
> What if nsCharSetProber just returns the best charset and its
> probability in the form of a double? Is that enough for your needs?

I don't think it is.  Among the single-byte charsets, I think I'd need
the confidence of the prober for the labeled charset, to see if it is
close to the confidence of the detected charset.  Put another way, I'd
like the caller to be able to boost the confidence of the labeled
charset, to make it more likely to be the detected charset.
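
To make the difference from a "best charset plus one double" interface concrete, here is a small sketch of the boosting idea in Python. The per-prober confidences are modeled as a plain dictionary, and `pick_with_boost` and the 1.2 factor are assumptions chosen for illustration. They do not describe how nsUniversalDetector or its probers are actually structured.

```python
from typing import Dict, Optional, Tuple


def pick_with_boost(confidences: Dict[str, float],
                    label: Optional[str] = None,
                    boost: float = 1.2) -> Optional[Tuple[str, float]]:
    """Return the top (charset, score) after boosting the labeled charset.

    confidences: per-prober confidence in [0, 1], keyed by charset name.
    label:       caller-supplied charset; ignored if no prober reports it.
    boost:       hypothetical multiplier applied to the labeled charset.
    """
    if not confidences:
        return None
    scores = dict(confidences)  # do not mutate the caller's data
    if label in scores:
        # Cap at 1.0 so the boosted value stays a valid confidence.
        scores[label] = min(1.0, scores[label] * boost)
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With confidences of 0.50 for windows-1251 and 0.45 for KOI8-R, a KOI8-R label tips the result to KOI8-R, while a KOI8-R label cannot overturn a 0.90 versus 0.30 split. The decision depends on the labeled charset's own confidence, which a single best-charset result would not expose.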