I’ve been looking through the source code of Thunderbird’s spam filter
and have a couple of questions on how it works. Note my understanding of
the code is listed after the questions.
(1) Why is it necessary to convert to exponential notation? Is that much
precision really needed? (lines 1161-1170)
(2) Why was a custom inverse chi-squared function designed rather than
using a math package? What issues should I be aware of in choosing a
Java math package to calculate my inverse chi-squared functions?
(3) Are there known improvements in the algorithm that are not
implemented yet? What are the plans for the future of the spam filtering?
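
To make questions (1) and (2) concrete, here is a minimal Java sketch of how I understand the calculation. It is my own reconstruction based on Robinson's article, not Thunderbird's code: the class and method names are hypothetical, and reading the "exponential notation" step as a guard against underflow is my assumption. It keeps the running product as a mantissa plus a separate power-of-two exponent, and writes the inverse chi-squared for an even number of degrees of freedom as a short series.

```java
class Chi2Sketch {
    /** Naive running product; underflows to 0.0 once it drops below roughly 1e-324. */
    static double naiveProduct(double[] probs) {
        double product = 1.0;
        for (double p : probs) {
            product *= p;
        }
        return product;
    }

    /**
     * ln(p1 * p2 * ... * pn), keeping the running product as mantissa * 2^exponent.
     * Assumes every p is in (0, 1] and is not itself near the underflow limit.
     */
    static double logProduct(double[] probs) {
        double mantissa = 1.0;
        long exponent = 0;
        for (double p : probs) {
            mantissa *= p;
            if (mantissa < 1e-200) {
                // Move the binary exponent out of the double before it underflows.
                int e = Math.getExponent(mantissa);
                mantissa = Math.scalb(mantissa, -e);
                exponent += e;
            }
        }
        return Math.log(mantissa) + exponent * Math.log(2.0);
    }

    /**
     * P(X >= chi2) for a chi-squared variable with an even number of degrees
     * of freedom. Note that Math.exp(-m) itself underflows for very large chi2.
     */
    static double chi2Q(double chi2, int degreesOfFreedom) {
        if (degreesOfFreedom <= 0 || degreesOfFreedom % 2 != 0) {
            throw new IllegalArgumentException("degrees of freedom must be positive and even");
        }
        double m = chi2 / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < degreesOfFreedom / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    /** Robinson-style combined indicator: near 1 suggests spam, near 0 suggests ham. */
    static double indicator(double[] f) {
        int n = f.length;
        double[] oneMinusF = new double[n];
        for (int i = 0; i < n; i++) {
            oneMinusF[i] = 1.0 - f[i];
        }
        double h = chi2Q(-2.0 * logProduct(f), 2 * n);
        double s = chi2Q(-2.0 * logProduct(oneMinusF), 2 * n);
        return (1.0 + h - s) / 2.0;
    }

    public static void main(String[] args) {
        double[] probs = new double[400];
        java.util.Arrays.fill(probs, 0.01);
        System.out.println("naive product: " + naiveProduct(probs));
        System.out.println("log product:   " + logProduct(probs));
    }
}
```

With 400 token probabilities of 0.01 the naive product underflows to zero while logProduct still returns a usable value, which may be the reason for the conversion. If that is right, a Java math package would need to cope with the same range of inputs.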
My analysis of how the source code works:
First, if there are no good tokens, Thunderbird assumes the message is
junk mail. Likewise, if there are no bad tokens, Thunderbird assumes it is
a good message. The rest seems to be heavily based on Gary Robinson’s
Linux Journal article, "A Statistical Approach to the Spam Problem"
( http://www.linuxjournal.com/article/6467 ), with some minor
modifications. Thunderbird considers the 150 most interesting tokens
rather than the top 15. Thunderbird ignores all tokens less interesting
than 0.1, presumably to speed things up. You chose different default
parameters (line 1130) for the strength of background info (s = 45%) and
the assumed probability that new words will appear in spam (x = 50%).
Finally, any email that ends up with a spam certainty > 99% gets marked
as spam.
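
The per-token part of that analysis could be sketched in Java roughly as follows. This is only my reading, not the actual implementation: the class and method names are hypothetical, the formula f(w) = (s*x + n*p(w)) / (s + n) is taken from Robinson's article, and I am assuming that "interesting" means the distance of f(w) from 0.5.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TokenSelectionSketch {
    static final double S = 0.45;            // strength of background info
    static final double X = 0.5;             // assumed spam probability of an unseen word
    static final double MIN_DISTANCE = 0.1;  // tokens closer to 0.5 than this are ignored
    static final int MAX_TOKENS = 150;       // number of most interesting tokens kept

    /**
     * Robinson's f(w). Assumes goodMessages and badMessages are both positive;
     * the "no good tokens" and "no bad tokens" shortcuts are handled elsewhere.
     */
    static double tokenProbability(int goodCount, int badCount,
                                   int goodMessages, int badMessages) {
        int n = goodCount + badCount;
        if (n == 0) {
            return X; // never seen before: fall back to the assumed probability
        }
        double goodRatio = (double) goodCount / goodMessages;
        double badRatio = (double) badCount / badMessages;
        double p = badRatio / (goodRatio + badRatio);
        return (S * X + n * p) / (S + n);
    }

    /** Drops uninteresting tokens, then keeps at most MAX_TOKENS, most extreme first. */
    static List<Double> selectInteresting(List<Double> probabilities) {
        List<Double> kept = new ArrayList<>();
        for (double f : probabilities) {
            if (Math.abs(f - 0.5) >= MIN_DISTANCE) {
                kept.add(f);
            }
        }
        kept.sort(Comparator.comparingDouble((Double f) -> -Math.abs(f - 0.5)));
        return new ArrayList<>(kept.subList(0, Math.min(MAX_TOKENS, kept.size())));
    }
}
```

The selected probabilities would then be combined into a single score, and the message marked as spam if that score exceeds 0.99.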
Also just out of general interest:
(4) Why was it decided to integrate the spam filter directly into
Thunderbird rather than having it as a pluggable component? I.e., why not
just create a SpamBayes extension that is bundled with Thunderbird by
default?