Stempel - Algorithmic Stemmer for Polish Language

"The aim is separation of the stemmer execution code from the data structures [...]. In other words, a static algorithm configurable by data must be developed. The word transformations that happen in the stemmer must be then encoded to the data tables.

The tacit input of our method is a sample set (a so-called dictionary) of words (as keys) and their stems. Each record can be equivalently stored as a key and the record of the key's transformation to its respective stem. The transformation record is termed a patch command (P-command). It must be ensured that P-commands are universal, and that P-commands can transform any word to its stem. Our solution [6,8] is based on the Levenshtein metric [10], which produces a P-command as the minimum cost path in a directed graph.

One can imagine the P-command as an algorithm for an operator (editor) that rewrites a string to another string. The operator can use these instructions (PP-commands):

  removal - deletes a sequence of characters starting at the current cursor position and moves the cursor to the next character. The length of this sequence is the parameter;
  insertion - inserts a character ch, without moving the cursor. The character ch is a parameter;
  substitution - rewrites a character at the current cursor position to the character ch and moves the cursor to the next character. The character ch is a parameter;
  no operation (NOOP) - skips a sequence of characters starting at the current cursor position. The length of this sequence is the parameter.

The P-commands are applied from the end of a word (right to left). This assumption can reduce the set of P-commands, because the last NOOP, moving the cursor to the end of a string without any changes, need not be stored."
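
The passage quoted above can be illustrated with a minimal sketch of a P-command interpreter. The tuple encoding of the instructions, the function name apply_patch and the PATCHES table are assumptions made for this example only; they are not the encoding used by Stempel or by the quoted paper. The cursor starts on the last character and "next character" is taken to mean the next one to the left, since commands run right to left. The two Polish word forms are illustrative.

```python
def apply_patch(word, commands):
    """Apply a list of (instruction, parameter) pairs to word, right to left.

    Hypothetical encoding, for illustration only:
      ("remove", n)      delete n characters ending at the cursor
      ("insert", ch)     insert ch just right of the cursor, cursor stays
      ("substitute", ch) overwrite the character under the cursor, move left
      ("noop", n)        skip n characters to the left
    """
    chars = list(word)
    pos = len(chars) - 1  # cursor starts on the last character

    for op, arg in commands:
        if op == "remove":
            start = pos - arg + 1
            if start < 0:
                raise ValueError("removal runs past the start of the word")
            del chars[start:pos + 1]
            pos = start - 1
        elif op == "insert":
            # The cursor keeps pointing at the same character, so a later
            # insert lands to the left of an earlier one.
            chars.insert(pos + 1, arg)
        elif op == "substitute":
            if pos < 0:
                raise ValueError("no character under the cursor")
            chars[pos] = arg
            pos -= 1
        elif op == "noop":
            pos -= arg
        else:
            raise ValueError(f"unknown instruction: {op!r}")

    return "".join(chars)


# Each dictionary record is stored as a key plus the patch that turns the
# key into its stem, rather than as a key plus the stem itself.
PATCHES = {
    "kotami": [("remove", 3)],
    "psa": [("remove", 1), ("noop", 1), ("insert", "e"), ("insert", "i")],
}

for key, patch in PATCHES.items():
    print(key, "->", apply_patch(key, patch))
```

Neither patch ends with a NOOP that walks the cursor over the unchanged prefix of the word; that is the saving the quoted passage attributes to applying commands from the end of the word.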

Training sets	Testing forms	Stem OK	Lemma OK	Missing	Stem Bad	Lemma Bad	Table size [B]
100	1022985	842209	593632	172711	22331	256642	28438
200	1022985	862789	646488	153288	16306	223209	48660
500	1022985	885786	685009	130772	14856	207204	108798
700	1022985	909031	704609	107084	15442	211292	139291
1000	1022985	926079	725720	90117	14941	207148	183677
2000	1022985	942886	746641	73429	14903	202915	313516
5000	1022985	954721	759930	61476	14817	201579	640969
7000	1022985	956165	764033	60364	14620	198588	839347
10000	1022985	965427	775507	50797	14662	196681	1144537
12000	1022985	967664	782143	48722	14284	192120	1313508
15000	1022985	973188	788867	43247	14349	190871	1567902
17000	1022985	974203	791804	42319	14333	188862	1733957
20000	1022985	976234	791554	40058	14601	191373	1977615
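
The counts in the table are easier to compare as shares of the tested forms. The following sketch does that arithmetic for three of the rows above. In these three rows, Lemma OK, Missing and Lemma Bad add up exactly to Testing forms, so the three shares sum to one. The function name summarize and the field names are assumptions made for this example.

```python
# (training sets, testing forms, lemma OK, missing, lemma bad, table size in bytes)
# Values copied from the rows for 100, 1000 and 20000 training sets.
ROWS = [
    (100, 1022985, 593632, 172711, 256642, 28438),
    (1000, 1022985, 725720, 90117, 207148, 183677),
    (20000, 1022985, 791554, 40058, 191373, 1977615),
]


def summarize(row):
    """Convert one table row into shares of the tested forms."""
    training_sets, total, lemma_ok, missing, lemma_bad, table_bytes = row
    return {
        "training_sets": training_sets,
        "lemma_ok_share": lemma_ok / total,
        "missing_share": missing / total,
        "lemma_bad_share": lemma_bad / total,
        "table_kib": table_bytes / 1024,
    }


for row in ROWS:
    s = summarize(row)
    print(
        f"{s['training_sets']:>6}  "
        f"lemma OK {s['lemma_ok_share']:.1%}  "
        f"missing {s['missing_share']:.1%}  "
        f"lemma bad {s['lemma_bad_share']:.1%}  "
        f"table {s['table_kib']:.0f} KiB"
    )
```

Read this way, the rows show the share of missing forms falling as the number of training sets grows, while the table size grows by roughly two orders of magnitude between the first and last row.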

Introduction

Terminology

Background

Algorithm and implementation

Corpus

Testing

Testing procedure

Test results

Summary

Download

License

Bibliography