(For the impatient ones: go directly to the Download section. But I encourage you to read this article anyway...)
A method for conflation of different inflected word forms is an important component of many Information Retrieval systems. It helps to improve the system's recall and can significantly reduce the index size. This is especially true for highly-inflectional languages like those from the Slavic language family (Czech, Slovak, Polish, Russian, Bulgarian, etc).
This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables. The stemmer code is taken virtually unchanged from the Egothor project.
You can download both the Java software distribution and the stemmer tables prepared using an extensive corpus of Polish language (see details below).
This work is available under Apache-style Open Source license - the stemmer code is covered by Egothor License, the tables and other additions are covered by Apache License 2.0. Both licenses allow to use the code in Open Source as well as commercial (closed source) projects.
A short explanation is in order about the terminology used in this text.
In the following sections I make a distinction between stem and lemma.
Lemma is a base grammatical form (dictionary form, headword) of a word. Lemma is an existing, grammatically correct word in some human language.
Stem on the other hand is just a unique token, not necessarily making any sense in any human language, but which can serve as a unique label instead of lemma for the same set of inflected forms. Quite often stem is referred to as a "root" of the word - which is incorrect and misleading (stems sometimes have very little to do with the linguistic root of a word, i.e. a pattern found in a word which is common to all inflected forms or within a family of languages).
For an IR system stems are usually sufficient, for a morphological analysis system obviously lemmas are a must. In practice, various stemmers produce a mix of stems and lemmas, as is the case with the stemmer described here. Additionally, for some languages, which use suffix-based inflection rules many stemmers based on suffix-stripping will produce a large percentage of stems equivalent to lemmas. This is however not the case for languages with complex, irregular inflection rules (such as Slavic languages) - here simplistic suffix-stripping stemmers produce very poor results.
Lemmatization is a process of finding the base, non-inflected form
of a word. The result of lemmatization is a correct existing word,
often in nominative case for nouns and infinitive form for verbs. A
given inflected form may correspond to several lemmas (e.g. "found"
-> find, found) - the correct choice depends on the context.
Stemming is concerned mostly with finding a unique "root" of a word,
which not necessarily results in any existing word or lemma. The
quality of stemming is measured by the rate of collisions (overstemming
- which causes words with different lemmas to be incorrectly conflated
into one "root"), and the rate of superfluous word "roots"
(understemming - which assigns several "roots" to words with the same
lemma).
Both stemmer and lemmatizer can be implemented in various ways. The two
most common approaches are:
(to be completed...)
The following Polish corpora have been used:
This step was the most time-consuming - and it would probably be
even more tedious and difficult if not for the
help of
Python. The source texts had to be
brought to a common encoding (UTF-8) - some of them used quite ancient
encodings like Mazovia or DHN - and then scripts were written to
collect all lemmas and
inflected forms from the source texts. In cases when the source text
was not
tagged,
I used the SAM analyzer to produce lemmas. In cases of ambiguous
lemmatization I decided to put references to inflected forms from all
base forms.
All grammatical categories were allowed to appear in the corpus,
i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting
corpus consisted of roughly 87,000+ inflection sets, i.e. each set
consisted of one base form (lemma) and many inflected forms. However,
because of the nature of the training method I restricted these sets to
include only those where there were at least 4 inflected forms. Sets
with 3 or less inflected forms were removed, so that the final corpus
consisted of ~69,000 unique sets, which in turn contained ~1.5 mln
inflected forms.
I tested the stemmer tables produced using the implementation described above. The following sections give some details about the testing setup.
The testing procedure was as follows:
The following table summarizes test results for varying sizes of training samples. The meaning of the table columns is described below:
Training sets | Testing forms | Stem OK | Lemma OK | Missing | Stem Bad | Lemma Bad | Table size [B] |
---|---|---|---|---|---|---|---|
100 | 1022985 | 842209 | 593632 | 172711 | 22331 | 256642 | 28438 |
200 | 1022985 | 862789 | 646488 | 153288 | 16306 | 223209 | 48660 |
500 | 1022985 | 885786 | 685009 | 130772 | 14856 | 207204 | 108798 |
700 | 1022985 | 909031 | 704609 | 107084 | 15442 | 211292 | 139291 |
1000 | 1022985 | 926079 | 725720 | 90117 | 14941 | 207148 | 183677 |
2000 | 1022985 | 942886 | 746641 | 73429 | 14903 | 202915 | 313516 |
5000 | 1022985 | 954721 | 759930 | 61476 | 14817 | 201579 | 640969 |
7000 | 1022985 | 956165 | 764033 | 60364 | 14620 | 198588 | 839347 |
10000 | 1022985 | 965427 | 775507 | 50797 | 14662 | 196681 | 1144537 |
12000 | 1022985 | 967664 | 782143 | 48722 | 14284 | 192120 | 1313508 |
15000 | 1022985 | 973188 | 788867 | 43247 | 14349 | 190871 | 1567902 |
17000 | 1022985 | 974203 | 791804 | 42319 | 14333 | 188862 | 1733957 |
20000 | 1022985 | 976234 | 791554 | 40058 | 14601 | 191373 | 1977615 |
I also measured the time to produce a stem (which involves
traversing a trie,
retrieving a patch command and applying the patch command to the input
string).
On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03
HotSpot),
for tables ranging in size from 1,000 to 20,000 cells, the time to
produce a
single stem varies between 5-10 microseconds.
This means that the stemmer can process up to 200,000 words per second, an
outstanding result when compared to other stemmers (Morfeusz - ~2,000
w/s, FormAN (MS Word analyzer) - ~1,000 w/s).
The package contains a class org.getopt.stempel.Benchmark
,
which you can use to produce reports
like the one below:
--------- Stemmer benchmark report: -----------
Stemmer table: /res/tables/stemmer_2000.out
Input file: ../test3.txt
Number of runs: 3
RUN NUMBER: 1 2 3
Total input words 1378176 1378176 1378176
Missed output words 112 112 112
Time elapsed [ms] 6989 6940 6640
Hit rate percent 99.99% 99.99% 99.99%
Miss rate percent 00.01% 00.01% 00.01%
Words per second 197192 198584 207557
Time per word [us] 5.07 5.04 4.82
The results of these tests are very encouraging. It seems that using the training corpus and the stemming algorithm described above results in a high-quality stemmer useful for most applications. Moreover, it can also be used as a better than average lemmatizer.
Both the author of the implementation (Leo Galambos, <leo.galambos AT egothor DOT org>) and the author of this compilation (Andrzej Bialecki <ab AT getopt DOT org>) would appreciate any feedback and suggestions for further improvements.
You can download the full source code here:
NOTE: due to licensing restrictions I am unable to provide downloads for the original corpora.
You will need Jakarta Ant to build the JAR file. You can also download a pre-compiled JAR here (includes stemming tables):
NEW: as an experiment in using GCJ, here you can download a
native Windows binary, which executes the main() method of
org.getopt.stempel.Stemmer
class.
JavaDoc API documentation can be found here.
You can also download it as a ZIP archive.
Most of the code is covered by Egothor Open Source License, an Apache-style license. The rest of the code is covered by the Apache License 2.0.
The Open Source distribution contains pre-built stemmer tables
trained with
a random well-balanced sample of up to 2000 sets. Other tables are
available
commercially for a modest price - please contact Andrzej Bialecki
<ab AT getopt DOT
org> for more details.