Building new open source spellcheck dictionary

Building new open source spellcheck dictionary

Michal Stanke-2
Hi.

We have realized that the current spellcheck dictionary for Czech
really sucks. It's over 10 years old, was generated from a specific set
of works and has no maintainer. We do not even know much about the
original author... As a result, the dictionary includes archaisms while
producing a lot of false positives at the same time.

We have got in touch with the Institute for the Czech language and are
looking for sources of data that could be used to build a new
dictionary, hopefully not just for spellchecking but also for grammar.
People from the institute have referred us to some other university
institutions, and also mentioned Merlin (merlin-platform.eu), Eurovoc
(eurovoc.europa.eu) and CLARIN (clarin.eu).

Do you have any experience with building a similar dictionary or with
the projects mentioned above, if they may have useful data that are
worth further investigation?

Thanks,
--
Michal Stanke

_______________________________________________
dev-l10n mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-l10n

Re: Building new open source spellcheck dictionary

Стоян Димитров
To build a good spellchecker you'll need the largest possible amount of text in a convenient format, say a Wikipedia archive (but not only that, read on). You'll also need texts on different subjects, say politics, economics, children's books, biology, chemistry… you get it. The greater the diversity, the more subjects the spellchecker will cover, making it useful for more people.

After collecting all those texts (called a corpus) you'll have to process them to form one huge list with exactly one word per line. From that list you build another one that has the frequency of each word (how often a given word is used in the corpus). The top 30% of that list is your spellchecker content.

Then comes the fun part. It really is language dependent, but in a nutshell you separate this list of words by some common criteria to form smaller lists, say by part of speech. The point is to keep splitting into smaller and smaller lists whose words share common suffixes and/or prefixes. Having those will make it easier to build the Hunspell dictionary that Firefox uses.
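The corpus-to-frequency-list step described above could be sketched roughly as follows. This is only an illustration: the function names, the tokenizing regular expression and the lowercasing of every word are my own assumptions, not something prescribed in this thread. Lowercasing in particular is a simplification, since a real dictionary has to keep capitalized proper nouns.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(texts):
    """Count how often each (lowercased) word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def top_fraction(counts, fraction=0.3):
    """Return the most frequent `fraction` of the distinct words."""
    ranked = [word for word, _ in counts.most_common()]
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Tiny stand-in corpus; a real one would be read from files.
    corpus = ["Pes a kočka.", "Pes běží, pes spí."]
    freqs = word_frequencies(corpus)
    for word in top_fraction(freqs):
        print(word, freqs[word])
```

The 0.3 default mirrors the "top 30%" rule of thumb from the message; on a real corpus that threshold would need tuning, because rare but valid words and frequent misspellings both sit near the cut-off.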

I guess there is a shortcut somewhere.

On 7 May 2017 14:19:04 EEST, Michal Stanke <[hidden email]> wrote:

>Do you have any experience with building a similar dictionary or with
>the projects mentioned above, if they may have useful data that are
>worth further investigation?

Re: Building new open source spellcheck dictionary

Michael Bauer-11
It depends, I would say, on the language, the available resources (incl.
yours) and how mature the existing spellchecker is.

We feed our spellchecker via an online dictionary from which we export
the data now and then to build a new version of the spellchecker. The
dictionary itself has a morphological generator which generates the
inflected forms for new entries which are then checked by an editor
(i.e. most forms are predictable but each is checked by a human to avoid
errors - but this still cuts down time hugely). This approach worked
well for us since we were going to do the dictionary project anyway but
it's a long and slow process. It took us about a year before we had a
spellchecker that was even vaguely usable. Collaborating with an
existing dictionary project might help but only if their data actually
contains the inflected forms the spellchecker needs, so it might not be
that workable.
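To make the "generate, then have a human check" workflow a bit more concrete, here is a toy sketch. It is not the generator we use. The paradigm table, the function names and the review-queue structure are all hypothetical, and the single paradigm shown (singular forms of a Czech noun declined like "žena") is only there so the example has something to run on.

```python
# Hypothetical paradigm table: name -> (ending to strip, endings to attach).
PARADIGMS = {
    "zena_like": ("a", ["a", "y", "ě", "u", "o", "ě", "ou"]),
}


def generate_forms(lemma, paradigm):
    """Generate candidate inflected forms, without duplicates, in order."""
    strip, suffixes = PARADIGMS[paradigm]
    if not lemma.endswith(strip):
        raise ValueError(f"{lemma!r} does not fit paradigm {paradigm!r}")
    stem = lemma[: len(lemma) - len(strip)]
    forms = []
    for suffix in suffixes:
        form = stem + suffix
        if form not in forms:
            forms.append(form)
    return forms


def review_queue(lemma, paradigm):
    """Wrap each candidate so an editor can approve or reject it."""
    return [
        {"lemma": lemma, "form": form, "approved": False}
        for form in generate_forms(lemma, paradigm)
    ]


if __name__ == "__main__":
    for entry in review_queue("žena", "zena_like"):
        print(entry)
```

Nothing leaves the queue as approved by default, which is the point of the design: plain suffix rules like these break on stem changes and irregular words, so only forms an editor has confirmed should be exported to the spellchecker.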

But I think given the scenario you described, I would take a different
approach. If the spellchecker overall is fine but just has gaps, you
could just dump some modern Czech texts from newspapers etc into a
LibreOffice document, see which entries are underlined and add those to
the spellchecker as appropriate.

Unless the gaps are massive, I'm not sure if overall you'd save time by
building something elaborate to expand it. If the gaps are massive, then
you probably need to do something more elaborate.

If this is the spellchecker:
https://extensions.openoffice.org/en/project/czech-dictionary-pack-ceske-slovniky-cs-cz
then I think the manual approach should work. I put an article from a
Czech newspaper through it and, some inflected personal names aside
(like Penové and Mélenchona), kohabitace and kanidátkách seemed to be
the only two words missing. If that's an indication of the average rate
at which you'll get missing words, the manual approach ought to work.

You could also do the following: Either collect a web corpus of Czech
texts from the internet manually or in some automated way by crawling
the web, rank the words by occurrence and then basically diff the web
corpus against the spellchecker. This should spit out a file where the
most commonly used words which are missing from the spellchecker should
be at the top. Of course this will contain some (or a lot, depending on
how well people spell Czech online) noise but a human should be fairly
quick at going through this and discarding all the false positives.
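A rough sketch of that "diff the corpus against the spellchecker" idea is below. It assumes you already have two things: corpus word counts with lowercase keys, and the spellchecker's vocabulary as a plain list of fully expanded words. Expanding a Hunspell .dic/.aff pair into such a list is not covered here. The function name and the `min_count` cut-off are my own inventions for the example.

```python
from collections import Counter


def missing_words(corpus_counts, known_words, min_count=2):
    """Return (word, count) pairs absent from the spellchecker.

    Sorted with the most frequent missing words first, ties alphabetical.
    Words seen fewer than `min_count` times are dropped as likely noise.
    """
    known = {word.lower() for word in known_words}
    missing = [
        (word, count)
        for word, count in corpus_counts.items()
        if word not in known and count >= min_count
    ]
    missing.sort(key=lambda pair: (-pair[1], pair[0]))
    return missing


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    counts = Counter({"pes": 10, "kohabitace": 5, "kandidátkách": 3, "xyzz": 1})
    for word, count in missing_words(counts, ["Pes"]):
        print(count, word)
```

The output is the candidate list for a human to go through; the frequency cut-off only trims the long tail of one-off typos and does not replace that review.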

Just some ideas. Many ways of doing this.

Michael

> Do you have any experience with building a similar dictionary or with
> the projects mentioned above, if they may have useful data that are
> worth further investigation?
