Icelandic word completion with stateful RNN models
June 26, 2022
Why are there so few word completion systems for the Icelandic language? In this paper I set out to investigate natural language processing tools for Icelandic and to try to develop a machine learning model for autocompleting Icelandic texts.
As of this writing, there is no freely available word completion system for the Icelandic language, probably because relatively few people speak Icelandic. Nevertheless, there are around 370 000 people in Iceland, many of whom would benefit from word completion models. Such a model could fix typos and speed up the writing process. With enough data it could even tailor itself to one’s writing style and suggest more personalized word completions.
With that in mind, I created my own word completion machine learning model. After thoroughly searching the internet for modern digital Icelandic text sets and finding nothing, I resorted to scraping 99 000+ words, or around 620 000 characters, from recent online news articles. See text example below:
The first layer in the model, the text vectorization layer, processes the text and turns it into integer sequences. These sequences are learned by four RNN layers with GRU cells. During training the RNN layers look at 100 characters of the text at a time and predict what the next one will be. This gives the model a character-level understanding of the text, so it is less vulnerable to typos than word-level models, but it has a harder time learning patterns such as nouns following adjectives (in Icelandic). However, it converges much faster, since the training set is too small for word-level models to become reasonably accurate.
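The windowing described above can be sketched in plain Python. This is a minimal illustration, not the code used for the model: build_vocab, encode, and make_windows are hypothetical helper names, the character-to-integer mapping only stands in for the text vectorization layer, and the window is shortened from 100 to 10 characters so that the example fits a short sample sentence.

```python
def build_vocab(text):
    """Map each distinct character to an integer id; 0 is reserved for unknown characters."""
    chars = sorted(set(text))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a string into a list of integer ids, using 0 for characters not in the vocabulary."""
    return [vocab.get(ch, 0) for ch in text]


def make_windows(ids, window=100):
    """Build (input_window, next_id) pairs: `window` ids followed by the id to predict."""
    pairs = []
    for start in range(len(ids) - window):
        pairs.append((ids[start:start + window], ids[start + window]))
    return pairs


# Assumed sample text; the real training set was scraped news articles.
sample = "með öðrum orðum, það eru margir sem tala íslensku"
vocab = build_vocab(sample)
ids = encode(sample, vocab)
pairs = make_windows(ids, window=10)  # the post uses a window of 100

first_input, first_target = pairs[0]
print(len(pairs), "training pairs")
print("first input ids:", first_input, "-> target id:", first_target)
```

Each pair is one training example: the network sees the window and is scored on how well it predicts the single character that follows.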
After training, the model reached 97% accuracy and can complete one’s words in real time, using up to 100 previously typed characters as context. For example, when given the input “Með öðrum orðum” the model suggests “með öðrum orðum, það eru”.
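A rough sketch of how such a completion step might look is given below. It assumes a function that returns the most likely next character for a given context; toy_predict_next is a hypothetical stand-in built on a three-word list, not the trained RNN, and complete_word and MAX_CONTEXT are illustrative names.

```python
MAX_CONTEXT = 100  # the model uses up to 100 previously typed characters

# Hypothetical stand-in vocabulary; the real model learns from news text.
WORDS = ["orðum", "öðrum", "með"]


def toy_predict_next(context):
    """Stand-in for the trained model: return the next character of the
    first known word that extends the partial word, or a space if none does."""
    partial = context.split(" ")[-1]
    for word in WORDS:
        if word.startswith(partial) and len(word) > len(partial):
            return word[len(partial)]
    return " "


def complete_word(typed, predict_next, max_new=20):
    """Extend the word being typed one character at a time until the
    predictor emits whitespace or `max_new` characters have been added."""
    context = typed[-MAX_CONTEXT:]  # keep only the most recent characters
    completion = ""
    for _ in range(max_new):
        ch = predict_next(context)
        if ch.isspace():
            break
        completion += ch
        context = (context + ch)[-MAX_CONTEXT:]
    return completion


typed = "með öðrum or"
print(typed + complete_word(typed, toy_predict_next))
```

Trimming the context on every step keeps the input no longer than what the network saw during training, however much text has been typed.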
In addition, it can, to some extent, generate new original texts. Instead of feeding the model user input and asking it to finish the word or sentence, we can give it a random letter and ask it to finish the word, then feed it the generated word and ask it for the next, until the model has generated a sentence or even a paragraph.
Using this technique, the model generated some interesting texts, such as:
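The generation loop can be sketched as follows, assuming a function that takes the text so far and returns the rest of the current word (or the whole next word when the text ends in a space). The names generate_text and toy_complete_word, and the two lookup tables, are hypothetical stand-ins for the trained model.

```python
import random

# Hypothetical stand-in data: how to finish a word from its first letter,
# and which word follows which.
TOY_WORDS = {"þ": "það", "e": "eru", "m": "margir"}
TOY_NEXT = {"það": "eru", "eru": "margir", "margir": "það"}


def toy_complete_word(text):
    """Stand-in for the trained model: finish the partial last word,
    or propose a whole next word if the text ends with a space."""
    words = text.split(" ")
    partial = words[-1]
    if partial:
        return TOY_WORDS[partial[0]][len(partial):]
    previous = words[-2] if len(words) > 1 else ""
    return TOY_NEXT.get(previous, "það")


def generate_text(complete_word, alphabet, num_words=5, rng=None):
    """Start from a random letter, finish that word, then keep feeding the
    generated text back in to obtain one new word per step."""
    rng = rng or random.Random()
    text = rng.choice(alphabet)      # random seed letter
    text += complete_word(text)      # finish the first word
    for _ in range(num_words - 1):
        text += " "
        text += complete_word(text)  # next word, given everything so far
    return text


print(generate_text(toy_complete_word, alphabet="þem", num_words=4))
```

With the real model, the seed letter and everything generated so far become the context for each new word, so early choices shape the rest of the sentence.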
“verði ekki fyrir ábyrgð og segja að það sem fyrir hrifa betir þau rekstur samfylkingarinnar”
“að bíla um eins og til að gæra alls og felgi í kærandi háska reyndi að vera fréttablaðið”
“þá svaraði Oddur: ‘við þeir segir að taki leyta fólk í kostnaði´ segir dag í hvort verður ekki um hverfis og sögundi á náttúrunni’”
The texts it generates are surprisingly readable and extremely diverse! With more training data and computing power the model could do a lot more than predict what word you are typing and what the next word could be. More training data would allow it to learn longer patterns, not just ones that are 100 characters long. Hence, it could generate proper sentences and suggest more accurate words.