Tools to extract text from Finnish Wikipedia and feed it to Voikko.

tuomassalo/wikipedia-voikko-analyzer

wikipedia-voikko-analyzer

Tools to extract text from Finnish Wikipedia and feed it to Voikko.

It uses WikiExtractor to convert MediaWiki markup to plain text. Only the actual text content is analyzed; all templates are ignored.

This repository includes a script, find-unknowns.py, that finds all words that Voikko cannot recognize.
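As a rough illustration of what such a script has to do (this is not the actual contents of find-unknowns.py), the sketch below counts the word forms that a spell-check predicate rejects. The function name `count_unknowns`, the tokenizing regex and the stand-in word set are all assumptions made for the example; inside the container one would presumably pass something like `Voikko("fi").spell` as the predicate instead.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, allowing inner hyphens/apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def count_unknowns(text, is_known):
    """Count each word form in `text` that the predicate `is_known` rejects."""
    counts = Counter()
    for word in WORD_RE.findall(text):
        if not is_known(word):
            counts[word] += 1
    return counts


# Stand-in for a real spell checker (assumption, not the real vocabulary).
known = {"kissa", "istuu", "matolla"}
unknowns = count_unknowns(
    "Kissa istuu matolla, kizza istuu.",
    lambda w: w.lower() in known,
)
for word, n in unknowns.items():
    print(n, word)
```

Passing the predicate in as an argument keeps the counting logic testable without Voikko installed.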

Prerequisites

  • Docker and docker-compose.

How to use

The process will write about 9000 files named work/output/unknowns-*.txt. Each file contains the unrecognized words of up to 100 Wikipedia articles. The file format:

<pageid> <occurrence_count> <original_word_form>

NB: To resolve a page from its pageid, use the URL https://fi.wikipedia.org/?curid=<pageid>, e.g. https://fi.wikipedia.org/?curid=42.
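A minimal sketch of reading one line of this format in Python and turning the pageid into a URL. It assumes the three fields are separated by whitespace; the names `Unknown`, `parse_line` and `page_url` are hypothetical, and the sample line is invented for illustration.

```python
from typing import NamedTuple


class Unknown(NamedTuple):
    pageid: int
    count: int
    word: str


def parse_line(line):
    """Parse one '<pageid> <occurrence_count> <original_word_form>' line."""
    # Assumption: fields are whitespace-separated and the word has no spaces.
    pageid, count, word = line.split(None, 2)
    return Unknown(int(pageid), int(count), word.rstrip("\n"))


def page_url(pageid):
    """Build the Finnish Wikipedia URL for a pageid."""
    return f"https://fi.wikipedia.org/?curid={pageid}"


# Invented sample line, not real output.
entry = parse_line("42 3 esimerkkisana\n")
print(entry.word, entry.count, page_url(entry.pageid))
```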

A quick (and dirty) script that finds recurring words from the output file:

perl -C -walne 'print $F[2] if length($F[2])>6' unknowns-*.txt | sort | perl -walne 'BEGIN{$p="X";} if(index($_, $p) == 0 and length() < length($p)+6) {push @o, $_} else { if(@o > 100) {print ""; print for sort @o}; $p=substr($_, 0, length($_)-3); @o=()}' | uniq -c

First, find all unknown words over 6 characters long. Then sort the output. For each word, strip the last three characters and check whether the following lines start with the same prefix. If they do, and if more than 100 occurrences were found, print all the occurrences.
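The same idea can be sketched in Python, which may be easier to adapt. This is a loose re-expression of the one-liner, not a drop-in replacement: the function name `recurring_groups` and its parameters are assumptions, Python's `sorted` may order words differently from `sort` under some locales, and unlike the one-liner this version also reports the last group when the input ends.

```python
def recurring_groups(words, min_len=7, strip=3, slack=6, threshold=100):
    """Group sorted words that share a prefix with the word opening the group.

    The prefix is the opening word minus its last `strip` characters. A word
    joins the group if it starts with that prefix and is shorter than the
    prefix plus `slack` characters. Groups with more than `threshold`
    members are returned.
    """
    groups = []
    prefix, group = None, []
    for word in sorted(w for w in words if len(w) >= min_len):
        if (prefix is not None and word.startswith(prefix)
                and len(word) < len(prefix) + slack):
            group.append(word)
        else:
            if len(group) > threshold:
                groups.append(group)
            # This word opens a new group but is not itself a member,
            # mirroring the one-liner.
            prefix, group = word[:-strip], []
    if len(group) > threshold:  # flush the final group (the one-liner does not)
        groups.append(group)
    return groups


# Tiny invented sample; the threshold is lowered so a group is reported.
sample = ["talossakin", "talossa", "talossani", "talossamme", "autolla"]
for group in recurring_groups(sample, threshold=1):
    print(group)
```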

How to do something else with this

  • Copy find-unknowns.py to e.g. find-compound-words.py

  • Modify to your needs

  • Run docker exec -ti wikipedia-voikko-analyzer_bulkvoikko_1 ./run.sh compound-words

How to analyze a single word

docker exec -ti wikipedia-voikko-analyzer_bulkvoikko_1 python3 -c 'import json; from libvoikko import Voikko; print(json.dumps(Voikko("fi").analyze("alusta"), indent=2, sort_keys=True))'

How to change the vocabulary

See GENLEX_OPTS in the Dockerfile. The possible values are listed in https://github.com/voikko/corevoikko/blob/master/tools/bin/voikko-build-dicts.
