This repository contains code for a prototype semantic tagger for Danish language texts, using linguistic information taken from Den Danske Begrebsordbog.
The goal of this project is to be able to annotate Danish texts in a way that is compuationally light, conceptually simple, and linguistically intutive. This will allow researchers in Arts, Humanities, and Social Science subjects to extract semantic information from their corpora of Danish texts.
Possible research questions this tool can be used to address:
- How does the distribution of a certain concept change over time?
- How are different concepts represented in different corpora- newspapers vs parliamentary discourse, for example?
- What associations can be found between genre/text type and the kinds of semantic categories which they contain?
- Are there any significant associations between semantic categories?
- Etc.
Den Danske Begrebsordbog (DDB for short) groups words together into 888 different semantic fields, an overview of which can be found here. These semantic fields have a unique identifier, while some words belong to multiple categories. For example, consider the following entry:
semantik | 11045505 | sb. | 12.012 Betydning | 13.18 Humaniora
Here we see a word (semantik); an ID number unique to this word (11045505); POS (sb.). There are then two possible meanings for this word, taken from two different semantic fields - 12.012 Betydning or 13.18 Humaniora. The tagger has two levels of granularity: a top-level category; and a sub-category. For example, the sub-category 12.012 Betydning has the top-level category 12 Tegn, meddelelse, sprog.
Note that DDB structure includes Danish part-of-speech tags (e.g. sb. or substantive). As part of a preprocessing step, all of the POS tags in DDB were converted to a Universal POS tagset (link).
The tagger can either be used in combination with a the Danish spaCy framework or the DaCy framework.
Disambigutation of polysemous word senses is an open problem in NLP, particularly for low resource languages such as Danish. To try to address this, the tagger uses Jaccard Distance as a method of infering the most likely category. This distance is calculated using the set of all words in context window ±5 words, as well as their parts-of-speech. The smaller the Jaccard Distance between the set of context words and the individual sets of possible categories for a target word, the more likely it is to be the correct sense of that word.
The disambiguatiom algorithm has the following steps:
- For each word, take all the other words in a window of ±5 words, along with their POS tag.
- For the target word, find all possible categories from the DBO hierarchy.
- For each possible category, calculate the Jaccard Distance between the set of word_POS in the context and the set of all word_POS in the top level category for each tag.
- Return an ordered list of three most likely tags based on lowest Jaccard Distance.
The algorithm uses the top-level category for disambigutation, rather than the sub-category. This is based on a linguistic intuition regarding the distribution of semantic categories. For any given word, it is unlikely that there will be many context words which belong to the exact same sub-category. However, sub-categories which belong to the same top-level category share a degree of semantic similarity.
Take for example, 13.005 Geometri, figur and 13.004 Matematik. Both are clearly related and in fact both belong to the top-level category 13 Videnskab. It thus seems likely to assume that the presence of these tags would likely be accompanied by more words from 13 Videnskab, rather than from the specfic sub-categories. This allows the algorithm to infer tags based on a looser set of semantic relations.
For some words, the possible tags may have identical Jaccard Distance scores. Thus, an additional algorithm for disambiguating duplicates is implemented. The algorithm uses three different methods, in a hierarchical manner (if 1. could not disambiguate tags, try 2., ...):
-
High-Level Disambiguation: Check the surrounding context of the word to see how many of the context words were assigned a tag belonging to each of the top-level categories. For instance, if a word is assigned
3.1 Studium, universitet
and8.14 Vogn
with duplicate Jaccard Distance scores, the number of context words assigned a tag belonging to the high-level category3
or8
are counted. -
Low-Level Disambiguation: Check the surrounding context of the word to see how many of the context words were assigned a tag belonging to each of the low-level categories. For instance, if a word is assigned
3.1 Studium, universitet
and8.14 Vogn
with duplicate Jaccard Distance scores, the number of context words assigned the low-level category3.1 Studium, universitet
or8.14 Vogn
are counted. -
Category-Size Disambiguation: Order tags with duplicate scores based on size of the category. For instance, if a word is assigned
3.1 Studium, universitet
and8.14 Vogn
with duplicate Jaccard Distance scores, the tags are ordered based on how many words the category contains in the DDB. (REF?)
(NB: For copyright reasons, the data from DBO cannot be included in this repository. These instructions assume you have the data, in a folder called dict).
First, create a virtual environment to work in. The required packages are defined in requirements.txt
. To install these requirements and also load the necessary Danish language models, you can use the install_requirements.txt
script.
# Example of installing environment
python3 -m venv env
source ./env/bin/activate
bash install_requirements.sh
One way to use the tagger is by running the script from the terminal to tag all files of text in a given directory. For this, you can save your data as plain text (.txt
) files in an input folder (by default /in
). Then, you can run the script from the root directory:
# Example with default parameters
python3 src/DDB_tagger.py
# Example with parameters
python3 src/DDB_tagger.py --input_directory other_in/ --only_tagged_results
When running the script, you can define the following parameters:
-
--input_directory
: str, optional, default:in/
Input directory to files to tag, will tag all files in directory. Path should be from root directory. -
--dict
: str, optional, default:dict/dict.pkl
Path to semantic dictionary. -
--da_model
: str, optional, default:spacy
Danish language model to use,spacy
ordacy
. -
--only_top3_results
: argument, optional, default:True
Use argument if results should contain all possible tags, by default only contains the top3. -
--only_tagged_results
: argument, optional, default:False
Use argument if results should only contain tags, by default also contain scores. -
--output_directory
: str, optional, default:out/
"Directory to save output files. Path should be from root directory, and directory should be existing. "
If things are working correctly, you should get some feedback in the terminal about the progress of the script. And that's it! Your output files are saved in folder called out (unless you defined a different output path).
Another way to use the tagger is by importing it in a Jupyter notebook. Here, you can either tag a string of text, or you can also use it to tag all files in a directory (same as calling it from the terminal, described above). Examples of how to import and use the tagger from a notebook are provided in the DDB_demo.ipynb
file.
At the moment, this requires Python ≥ 3.6 and only works on macOS and Linux.
From a linguistic perspective, there needs to be some evaluation of the accuracy of the tagger. Further word sense disambiguation methods can be explored.
From a technical perspective, the tagger should be refactored in order to be used as an module that can be imported into other programs.
Lastly, version of this software is currently being developed using Rust. This will cross compile to other operating systems and will require no runtime dependencies. Watch this space.
Det Danske Sprog- og Litteraturselskab have kindly allowed me access to the data behind Den Danske Begrebsordbog for the purposes of developing this prototype.
However, due to copyright issues, this repo contains only the code used for the tagger, along with example input and output files. It does not contain any data related to the contents of DDB.
For more information, please contact the author directly.
Author: Ross Deans Kristensen-McLachlan,Nicole Dwenger
Date: Spring 2023
This project is licensed under the MIT License - see the LICENSE file for details
"Ordbogsdata fra Den Danske Begrebsordbog, © Det Danske Sprog- og Litteraturselskab 2020".