Thunderbird 3.0 global / full-text search support for CJK languages landed, will show up in nightlies tomorrow, requires a new database.

Andrew Sutherland-3
== The Bullet Point Version

* Makoto Kato is awesome.

* CJK support has landed.

* It shows up in nightlies tomorrow.

* You need to delete global-messages-db.sqlite in your profile directory
to start using the new CJK support.

* Deleting global-messages-db.sqlite will cause gloda to reindex all of
your messages again.  This will use a lot of CPU and take a while, and
it is not without bugs...

* We will automatically delete global-messages-db.sqlite for you when we
land the fixes for the bugs mentioned in the previous bullet point.

* Read the detailed version before reporting any bugs.


== The Detailed Version

Thanks to Makoto Kato we have just landed a fix for bug 472764
(https://bugzilla.mozilla.org/show_bug.cgi?id=472764) providing global
search support for CJK text.

In short, the global search in Thunderbird 3.0 previously was limited to
searching text that was separated by whitespace or punctuation.  When a
run of CJK (http://en.wikipedia.org/wiki/CJK) characters was
encountered, the code would see these as a single giant word.  The
result was that, for all intents and purposes, it was impossible to
search messages containing CJK text.

The new patch changes the logic so that it recognizes CJK characters and
indexes them specially.  Each pair of CJK characters (a 'bi-gram') is
emitted as a 'word'.  There must be at least two CJK characters in a row
for them to be emitted.  Normal whitespace or punctuation[1] is
significant, so a single CJK character on its own cannot be searched for.
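
The bi-gram rule described above can be sketched roughly as follows.  This
is only an illustration in Python, not the code from the patch: the
function names are made up, the character ranges are a rough guess at
what counts as "CJK", and it assumes the bi-grams overlap (each character
is paired with the one that follows it).  Non-CJK words are ignored here.

```python
# Assumption: a rough CJK test covering a few common Unicode blocks.
# The character classes used by the real tokenizer may differ.
_CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if ch falls in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _CJK_RANGES)


def cjk_bigrams(text):
    """Return the bi-gram 'words' for every run of CJK characters in text."""
    tokens = []
    run = []
    for ch in text + " ":  # the trailing space flushes the final run
        if is_cjk(ch):
            run.append(ch)
            continue
        # Any non-CJK character (whitespace, punctuation, Latin text) ends
        # the run.  Assumption: bi-grams overlap, so a run of n characters
        # yields n - 1 words, and a run of one character yields none.
        tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run = []
    return tokens
```

With these assumptions a run of three characters yields two words, and a
CJK character standing alone between spaces yields nothing, which is why
it cannot be found by a search.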

Even though messages are indexed as 'bi-grams', you can search for a CJK
string of any length, even one character.  (We automatically convert a
query for a single character "z" into a wildcarded "z*".)  The previous
caveat related to bi-gram indexing still applies, however: "z" must have
been found at the start of the bi-gram, which means it has to have been
followed by some other CJK character.  ("z*" finds "za" but not "xyz".)
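
A small Python sketch of the single-character case, using the same
placeholder letters as above.  The function names are hypothetical and
the index is just a list of words; only the one-character rewrite comes
from the description above, and the handling of longer queries here
(plain exact match) is a simplifying assumption.

```python
def rewrite_query(term):
    """Turn a one-character query into a wildcarded prefix query."""
    return term + "*" if len(term) == 1 else term


def matches(query, indexed_words):
    """Check a (possibly wildcarded) query against a list of indexed words."""
    if query.endswith("*"):
        prefix = query[:-1]
        return any(word.startswith(prefix) for word in indexed_words)
    # Assumption: anything without a wildcard is an exact word match.
    return query in indexed_words


# "za" is indexed as the single bi-gram "za".
# "xyz" is assumed to be indexed as the overlapping bi-grams "xy" and "yz".
found_in_za = matches(rewrite_query("z"), ["za"])
found_in_xyz = matches(rewrite_query("z"), ["xy", "yz"])
```

Here found_in_za is True and found_in_xyz is False, because no bi-gram
from "xyz" starts with "z".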

Unless you manually delete the global-messages-db.sqlite file in your
profile directory, you will not be using the new CJK support.  You should
only delete the file when Thunderbird is not running and you should
delete only that file.  Thunderbird will re-create the database and
enable CJK support when you next start it.  The global database indexing
will take a while but it should not interfere with your use of
Thunderbird or the machine.
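
If you would rather script the deletion, something like the following
Python sketch would do it.  The function name is made up, and you have to
supply the path to your own profile directory; the sketch does not try to
locate the profile or to check whether Thunderbird is running, so close
Thunderbird first.

```python
from pathlib import Path


def remove_gloda_db(profile_dir):
    """Delete only global-messages-db.sqlite from the given profile directory.

    Returns True if the file existed and was removed, False otherwise.
    Run this only while Thunderbird is not running.
    """
    db_path = Path(profile_dir) / "global-messages-db.sqlite"
    if not db_path.is_file():
        return False
    db_path.unlink()  # nothing else in the profile is touched
    return True
```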

There are still bugs in the indexing process; some messages may not get
indexed.  If a message does not get indexed, you cannot search for it.
You can tell if a message is actually indexed by installing the
glodaquilla extension by the excellent R Kent James and adding the
"gloda id" and "gloda dirty" columns to your thread pane.  The message
is probably indexed if "gloda id" has a value and "gloda dirty" has a
value other than 2 and the activity manager does not indicate indexing
is in progress.  The add-on can be found here:
https://addons.mozilla.org/en-US/thunderbird/addon/9873
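
Written out as a check, the rule of thumb above looks like this.  It is
only a Python restatement of the three conditions; the function and
parameter names are hypothetical, and representing an empty "gloda id"
column as None is an assumption.

```python
def probably_indexed(gloda_id, gloda_dirty, indexing_in_progress):
    """Apply the rule of thumb for whether a message is probably indexed.

    gloda_id: value of the "gloda id" column, or None if it is empty.
    gloda_dirty: value of the "gloda dirty" column.
    indexing_in_progress: True if the activity manager shows indexing.
    """
    return (
        gloda_id is not None
        and gloda_dirty != 2
        and not indexing_in_progress
    )
```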

The indexing problems are being worked on, and fixes should arrive in the
near future.  Once those fixes land we will automatically purge the
global-messages-db.sqlite so everyone gets the benefits of those fixes
plus the CJK support.  We are not doing it for this change because we do
not want to reset the databases of people who are not interested in
testing the CJK support.

There are still very real limitations to the full-text search
functionality, especially involving non-ASCII non-CJK characters like
letters with accents.  There is a little bit of time left for code fixes
if there are people out there who can provide patches in the very near
future.  The interesting bits can be found in:
http://hg.mozilla.org/comm-central/file/tip/mailnews/extensions/fts3/src/fts3_porter.c

Before filing any bugs, especially any bugs about search failing to find
any messages, please make sure the message is indexed.  You can do this
using the previously mentioned method.  A good trick to make sure a
message is indexed and to make a good test case is to send yourself an
e-mail with the search string in it when the indexing process is
inactive.  Once you receive the message and read it, star it.  Then do a
global search for it again.  If it doesn't show up in the results, you
can use the "File... Save as... File" menu option to save it to disk.
You can then attach that e-mail to a Bugzilla bug.  Obviously, don't
choose text that you don't want made public.

Andrew

1: Unicode provides for many different types of whitespace characters.
We treat characters that are ASCII space characters or are analogous to
ASCII space characters as whitespace.  There are some special variants
that I'm not sure we actually know quite what to do with; we may luck
out and be doing the right thing or have it completely wrong.
_______________________________________________
dev-l10n mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-l10n

Re: Thunderbird 3.0 global / full-text search support for CJK languages landed, will show up in nightlies tomorrow, requires a new database.

Tim Chien (MozTW)
Kudos to Makoto Kato!!

I will surely pass this information to our local community.

Tim
Mozilla Taiwan Community

On Wed, Oct 21, 2009 at 10:58 AM, Andrew Sutherland
<[hidden email]> wrote:

> [...]

Re: Thunderbird 3.0 global / full-text search support for CJK languages landed, will show up in nightlies tomorrow, requires a new database.

Channy Yun-3
In reply to this post by Andrew Sutherland-3
Hi, all.

According to testing by my peer Jungkyun, Korean search worked successfully.

It took one hour to index 5,000 emails, and search works.

Thanks,

Channy
---------------------
http://www.linkedin.com/in/channy

* Biomedical Knowledge Engineering Laboratory http://bike.snu.ac.kr
* Daum Developers Network & Affiliates http://dna.daum.net


2009/10/22 Gen Kanai <[hidden email]>

> Channy,
>
> Can you make sure to please test these builds with Hangul?
>
> We are seeing no problems with Japanese but I am not sure if there has been
> any testing with Korean.
>
> Thank you,
>
> Gen
>
> Begin forwarded message:
>
>> From: Andrew Sutherland <[hidden email]>
>> Date: October 21, 2009 11:58:28 AM JST
>> To: [hidden email]
>> Subject: Thunderbird 3.0 global / full-text search support for CJK
>> languages landed, will show up in nightlies tomorrow, requires a new
>> database.
>>
>>
>> [...]