Multi-class text classification for the Icelandic language

To keep up with the modern era we need a simple, automatic way to sort information. Many text classification algorithms rely on word search, where the machine looks through the writing for a keyword. Once it finds the keyword, it tags the text with the appropriate class. However, this method is limited because it cannot understand context (keep in mind that the explanation above is heavily oversimplified).
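The keyword approach described above can be sketched in a few lines of Python. This is only an illustration: the keyword lists and the function name `tag_by_keyword` are made up for the example and do not come from any real system.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
KEYWORDS = {
    "cars": {"bíll", "bílar"},
    "global": {"erlent"},
    "local": {"reykjavík", "alþingi"},
}

def tag_by_keyword(text):
    """Return the first class whose keyword occurs in the text, or None."""
    words = set(re.findall(r"\w+", text.lower()))
    for label, keywords in KEYWORDS.items():
        if words & keywords:
            return label
    return None

# A text that mentions both parliament and a car is tagged "cars" simply
# because that class is checked first: the tagger has no notion of context.
print(tag_by_keyword("Alþingi bíll"))
```

The last call shows the limitation: the tag depends on which keyword list happens to match first, not on what the text is about.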

Thankfully, due to recent advancements in Machine Learning technologies we can finally teach computers to classify text accurately, but these technologies have not yet been developed for the Icelandic language. Thus, to show that we should invest more in ML technologies, and that anything is possible, I trained an ML model to create word embeddings for the Icelandic language and use them to classify Icelandic text.

To train the model I first needed to gather lots of data, specifically Icelandic writing that had already been tagged or sorted. Initially I contacted the Árni Magnússon Institute for Icelandic Studies, an independently funded academic research institute at the University of Iceland. They happily gave me access to lots of Icelandic literature, but it was too specialised: the model had a hard time generalising and classifying new, unseen text. For this reason, I turned to the local news agencies. They were the perfect choice! The news articles were written in modern Icelandic, covered current events, and were sorted into broad categories with hundreds of articles in each.

I got to work and programmed a bot that would go through online Icelandic news articles and download their contents onto my computer. After letting it run for an hour I had 152 articles, averaging about 600 words each, belonging to three classes: "local", "global", and "cars". These classes contained local Icelandic news, global news, and news from car dealerships, as these were the most popular categories.
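A bot of this kind might look roughly like the sketch below. It uses only Python's standard library and assumes, purely for illustration, that the article text sits inside `<p>` tags; the names `ParagraphExtractor`, `extract_text` and `download_article` are hypothetical, and real news sites will need their own extraction rules.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (an assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data

def extract_text(html):
    """Return the paragraph text of an HTML page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())

def download_article(url, label):
    """Fetch one page and pair its text with the category it was listed under."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_text(html), label
```

Calling `download_article` for every article link in a category page, and saving the results, would give a labelled data set like the one described above.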

As stated above, the key feature of the Machine Learning model is the word embeddings it creates and uses. A word embedding is a way to transform words into numbers, which is crucial since ML models work with numbers. One can think of a word embedding as a set of coordinates, where the position of the word encodes its meaning. For example, a house cat and a dog can be represented on a 2D plane where the x-axis represents their size and the y-axis their loyalty towards their owner. See the figure below.

Intuition of word embeddings
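The same intuition can be written down as numbers. In this small sketch the coordinates are invented for the example (size and loyalty on a 0-to-1 scale); they are not values from the trained model.

```python
import math

# Made-up coordinates: (size, loyalty), each on a 0-to-1 scale.
pets = {
    "cat": (0.3, 0.4),
    "dog": (0.6, 0.9),
}

def distance(a, b):
    """Euclidean distance between two words in the embedding space."""
    return math.dist(pets[a], pets[b])

size_cat, loyalty_cat = pets["cat"]
size_dog, loyalty_dog = pets["dog"]
print("dog is more loyal:", loyalty_dog > loyalty_cat)
print("cat is smaller:", size_cat < size_dog)
print("distance between cat and dog:", distance("cat", "dog"))
```

Words with similar meanings end up close together, so the distance between two points becomes a rough measure of how related the words are.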

As you can see, we can quickly compare the two pets: the dog is more loyal towards its owner, and the cat is smaller. Each dimension adds information and new ways to connect and compare words with each other. I configured the word embeddings to have 16 dimensions and a maximum vocabulary size of 10,000 words. After training, the embeddings looked something like this (for simplicity's sake, I reduced the visualisation to three dimensions).

Visualisation of word embeddings
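To make the configuration concrete, here is a plain-Python sketch of a capped vocabulary and an embedding table with 16 dimensions and at most 10,000 entries. The helper names, the `<pad>` and `<unk>` entries and the random initialisation range are assumptions for illustration; the article does not say which library or settings were used, and in practice the table values are learned during training.

```python
import random
from collections import Counter

EMBEDDING_DIM = 16       # dimensions per word, as in the article
MAX_VOCAB_SIZE = 10000   # maximum vocabulary size, as in the article

def build_vocab(texts, max_size=MAX_VOCAB_SIZE):
    """Map the most frequent words to integer ids, capped at max_size entries."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}  # assumed reserved entries
    for word, _ in counts.most_common(max_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def init_embeddings(vocab_size, dim=EMBEDDING_DIM, seed=0):
    """One small random vector per word; training would adjust these values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def encode(text, vocab):
    """Turn a text into word ids, using <unk> for words outside the vocabulary."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

vocab = build_vocab(["a b a", "c a"], max_size=4)
table = init_embeddings(len(vocab))
vectors = [table[i] for i in encode("a c", vocab)]
```

Each word id simply selects one row of the table, so a text becomes a list of 16-dimensional vectors.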

As seen in the image above, the words form three big groups, one per category. If the majority of the words in a text lie in or around a group, the text belongs to that group's category. The second layer of the model learns what counts as a majority and when there are exceptions.
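The majority idea can be approximated with a simple vote, sketched below. This is not the trained second layer, which learns its own rule and its exceptions; it is a hand-written stand-in using hypothetical 2-D group centres, where each word votes for its nearest group and the most common vote wins.

```python
import math
from collections import Counter

# Hypothetical 2-D group centres; the real embeddings have 16 dimensions.
CENTROIDS = {
    "local": (0.0, 1.0),
    "global": (1.0, 0.0),
    "cars": (-1.0, -1.0),
}

def nearest_group(vector):
    """Return the label of the group centre closest to a word vector."""
    return min(CENTROIDS, key=lambda label: math.dist(vector, CENTROIDS[label]))

def classify(word_vectors):
    """Each word votes for its nearest group; the most common vote wins."""
    if not word_vectors:
        raise ValueError("cannot classify an empty text")
    votes = Counter(nearest_group(v) for v in word_vectors)
    return votes.most_common(1)[0][0]

# Two words sit near the "local" group and one near "global".
print(classify([(0.1, 0.9), (0.2, 0.8), (0.9, 0.1)]))
```

A learned layer can go beyond this by weighting some words more heavily than others, which is how it handles the exceptions a plain vote gets wrong.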

In conclusion, I built a basic model that can classify Icelandic writing into three categories. I trained a Machine Learning model to generate Icelandic word embeddings from scratch and to use them to predict the meaning behind the words, effectively classifying the text. With only 152 articles and a vocabulary of 8,156 words, the model reached an accuracy of 89.99% on new input text. With more computing power and patience, I could have let the scraper collect more data from the news articles and trained the model for longer. It could then learn to classify text into new categories and build up a larger vocabulary, and hence become truly useful.