
finnlem

finnlem is a neural network based lemmatizer model for the Finnish language.

A trained neural network can map Finnish words to their base forms with reasonably good accuracy. Here are some examples of the model output:

[ORIGINAL]                  --> [BASE FORM]
Kiinalaisessa               --> kiinalainen
osinkotulojen               --> osinko#tulo
Rajoittavalla               --> rajoittaa
multimediaopetusmateriaalia --> multi#media#opetus#materiaali
ei-rasistisella             --> ei-rasistinen

The model is a TensorFlow implementation of a sequence-to-sequence (Seq2Seq) recurrent neural network. This repository contains the code and data needed for training the model and making predictions with it. The datasets contain over 2M samples in total.
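The task is framed at the character level: the encoder reads an inflected word one character at a time, and the decoder emits the characters of its base form. Conceptually (illustration only, not library code):

source = list('koiran')  # ['k','o','i','r','a','n'] -> encoder input
target = list('koira')   # ['k','o','i','r','a']     -> expected decoder output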

Features

[Image: TensorBoard view of model training]

  • Easy-to-use Python wrapper for sequence-to-sequence modeling
  • Automatic session handling, model checkpointing and logging
  • Support for TensorBoard
  • Sequence-to-sequence model features: Bahdanau and Luong attention, residual connections, dropout, beam search decoding, ... (illustrated below)
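The sketch below shows how these options might be passed to the Seq2Seq wrapper. The keyword arguments are assumptions based on the feature list above, not the verified signature; see the source code documentation for the actual parameter names:

from model_wrappers import Seq2Seq

# Hypothetical option names -- check model_wrappers.py for the real signature
model = Seq2Seq(model_dir='./data/models/lemmatizer',
                dict_path='./data/dictionaries/lemmatizer.dict',
                attention_type='bahdanau',  # or 'luong'
                use_residual=True,          # residual connections between layers
                dropout_rate=0.2,           # dropout regularization
                beam_width=5)               # beam search decoding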

Installation

You should have the latest versions of the following packages (as of 7/2017):

  • keras
  • nltk
  • numpy
  • pandas
  • tensorflow (1.3.0 or greater, with CUDA 8.0 and cuDNN 6.0 or greater)
  • unidecode
  • sacremoses (see issue regarding this)

After this, clone this repository to your local machine.

Update 10.9.2020: You can also try cloning the repository first and then running pip install -r requirements.txt at its root. This installs the latest versions of the required packages automatically, but note that the very latest versions of some packages may be incompatible with the source code provided here. Feel free to make a pull request with pinned package versions, in case you manage to run the source code successfully :)
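For example, assuming a standard Git and pip setup:

git clone https://github.com/jmyrberg/finnlem.git
cd finnlem
pip install -r requirements.txt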

Example usage

Three steps are required to get from zero to making predictions with a trained model:

  1. Dictionary training: A dictionary is created from training documents, which are processed the same way as the Seq2Seq model inputs later on. The dictionary handles the vocabulary/integer mappings required by the Seq2Seq model.
  2. Model training: The Seq2Seq model is trained in batches with training documents that contain a source and a target.
  3. Model decoding: Unseen source documents are fed into the trained Seq2Seq model, which makes predictions on the targets.

The following is a simple example of using some of the features in the Python API. See the source code documentation for more detailed descriptions of the available functions and parameters.

1. Dictionary training - fit a dictionary with default parameters

from dictionary import Dictionary

# Documents to fit in dictionary
docs = ['abcdefghijklmnopqrstuvwxyz','åäö','@?*#-']

# Create a new Dictionary object
d = Dictionary()

# Fit characters of each document
d.fit(docs)

# Save for later usage
d.save('./data/dictionaries/lemmatizer.dict')
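The saved dictionary can later be loaded back from disk. The load call below is an assumed counterpart to save; check dictionary.py for the exact interface:

from dictionary import Dictionary

# Assumed counterpart to Dictionary.save -- see dictionary.py
d = Dictionary.load('./data/dictionaries/lemmatizer.dict')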

2. Model training - create and train a Seq2Seq model with default parameters

from model_wrappers import Seq2Seq

# Create a new model
model = Seq2Seq(model_dir='./data/models/lemmatizer',
                dict_path='./data/dictionaries/lemmatizer.dict')

# Create some documents to train on
source_docs = ['koira','koiran','koiraa','koirana','koiraksi','koirassa']*128
target_docs = ['koira','koira','koira','koira','koira','koira']*128

# Train 100 batches, save checkpoint every 25th batch
for i in range(100):
    loss, global_step = model.train(source_docs, target_docs, save_every_n_batch=25)
    print('Global step %d loss: %f' % (global_step, loss))

3. Model decoding - make predictions on test data

test_docs = ['koiraa','koirana','koiraksi']
pred_docs = model.decode(test_docs)
print(pred_docs) # --> [['koira'],['koira'],['koira']]
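Note that decode returns a list of predictions for each source document; with beam search decoding, each inner list may contain more than one candidate base form.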

The following demonstrates the usage of the command line for training and predicting from files.

1. Dictionary training - fit a dictionary with default parameters

python -m dict_train
		--dict-save-path ./data/dictionaries/lemmatizer.dict
		--dict-train-path ./data/dictionaries/lemmatizer.vocab

The dictionary train path file(s) should contain one document per line (example).
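For instance, a hypothetical lemmatizer.vocab with one document per line:

koira
koiran
kiinalainen
multimediaopetusmateriaalia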

2. Model training - create and train a Seq2Seq model with default parameters

python -m model_train
		--model-dir ./data/models/lemmatizer
		--dict-path ./data/dictionaries/lemmatizer.dict
		--train-data-path ./data/datasets/lemmatizer_train.csv

The model train and validation data path file(s) should contain one source and target document per line, separated by a comma (example).
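For instance, a few hypothetical lines of lemmatizer_train.csv:

koiran,koira
koiraa,koira
Kiinalaisessa,kiinalainen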

3. Model decoding - make predictions on test data

python -m model_decode
		--model-dir ./data/models/lemmatizer
		--test-data-path ./data/datasets/lemmatizer_test.csv
		--decoded-data-path ./data/decoded/lemmatizer_decoded.csv

The model test data path file(s) should contain either:

  • one source document per line, or
  • one source and target document per line, separated by a comma (example)
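For instance, a hypothetical lemmatizer_test.csv in the source-only format:

koiraa
koirana
koiraksi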

Extensions

  • To use tensorboard, run command python -m tensorflow.tensorboard --logdir=model_dir, where model_dir is the Seq2Seq model checkpoint folder.

  • The model was originally created for summarizing Finnish news, using news contents as the sources and news titles as the targets. This proved to be quite a difficult task due to the rich morphology of the Finnish language and a lack of computational resources. My first approach to tackling the morphology was to use the base form of each word, which is what the model in this package does by default. However, using this model to convert every word to its base form turned out to be too slow for feeding the second model in real time.

    In the end, I decided to try the Finnish SnowballStemmer from nltk in order to get the "base words", and started training the model with a 100k vocabulary. After 36 hours of training, with the loss decreasing very slowly, I decided to stop and keep this package as a character-level lemmatizer. However, model_wrappers.py contains a global variable DOC_HANDLER_FUNC, which makes it easy to change the preprocessing from characters to words by setting DOC_HANDLER_FUNC='WORD'. Try changing the variable, and/or write your own preprocessing function doc_to_tokens, if you'd like to experiment with a word-level model.
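    As a sketch of that word-level switch (the doc_to_tokens signature below is an assumption; see model_wrappers.py for the actual interface):

    # In model_wrappers.py -- switch preprocessing from characters to words
    DOC_HANDLER_FUNC = 'WORD'

    # Assumed signature: take a raw document string, return a list of word tokens
    def doc_to_tokens(doc):
        return doc.lower().split()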

Acknowledgements and references


Jesse Myrberg ([email protected])