Quick-and-dirty Python parser that extracts the articles of German nouns from the public Wiktionary dump.
- Download the latest dump of the German Wiktionary: curl -O http://dumps.wikimedia.org/dewiktionary/latest/dewiktionary-latest-pages-meta-current.xml.bz2
- Extract it to data/dewiktionary.xml
- Run python parse.py | tee data/articles.csv
Nouns whose article could not be extracted are logged to data/articles.log.
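
The extraction step can also be done from Python, streaming the archive so the decompressed dump is never held in memory all at once. This is a minimal sketch using only the standard library; the function name `decompress_dump` and the chunk size are illustrative assumptions and are not part of parse.py.

```python
import bz2
import shutil


def decompress_dump(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path in fixed-size chunks."""
    # bz2.open returns a file-like object that decompresses on read,
    # so copyfileobj only ever holds one chunk in memory.
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Example call, assuming the archive sits in the current directory:
# decompress_dump("dewiktionary-latest-pages-meta-current.xml.bz2",
#                 "data/dewiktionary.xml")
```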
- Find all entries whose article is none of "der", "die" or "das":
cat data/articles.csv | grep -v ",das" | grep -v ",die" | grep -v ",der"
- Review the log and implement proper handling of special cases (Alternative Schreibweisen [alternative spellings], Adjektivische Deklination [adjectival declension], ...)
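
The grep filter for entries without an article can be sketched in Python as well. It deliberately mirrors the plain substring matching of the grep pipeline and assumes nothing else about the column layout of articles.csv; the function name `lines_without_article` is hypothetical.

```python
ARTICLE_MARKERS = (",der", ",die", ",das")


def lines_without_article(lines):
    """Yield lines containing none of ',der', ',die' or ',das' (like grep -v)."""
    for line in lines:
        if not any(marker in line for marker in ARTICLE_MARKERS):
            # Strip only the trailing newline so the rest of the row is untouched.
            yield line.rstrip("\n")


# Example call, assuming the CSV is UTF-8 encoded:
# with open("data/articles.csv", encoding="utf-8") as f:
#     for row in lines_without_article(f):
#         print(row)
```

Like the grep version, this matches substrings anywhere in the line, so a row containing ",diesel" would also be filtered out.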