DERBI: DEutscher RegelBasierter Inflektor


DERBI (DEutscher RegelBasierter Inflektor) is a simple rule-based automatic inflection model for German based on spaCy.
Applicable regardless of POS!




How It Works

  1. DERBI gets an input text;
  2. The text is processed with the given spaCy model;
  3. For each word to be inflected in the text:
    • The features predicted by spaCy are overridden with the input features (where specified);
    • The word with the resulting features is passed through the rules and inflected (see the sketch after this list);
  4. The result is assembled into the output.
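
As a rough illustration of step 3, the feature override can be pictured like this. This is a minimal sketch using spaCy's public API; merge_tags is a hypothetical helper name for the illustration, not DERBI's internal implementation:

import spacy

nlp = spacy.load('de_core_news_sm')

def merge_tags(token, target_tags):
    # spaCy's predicted features (keys like 'VerbForm', 'Number', ...),
    # capitalized DERBI-style ('Verbform', ...) for this illustration
    features = {k.capitalize(): v for k, v in token.morph.to_dict().items()}
    # the explicitly requested features override the predicted ones
    features.update(target_tags)
    return features

doc = nlp('DERBI sein machen')
print(merge_tags(doc[1], {'Number': 'Sing', 'Person': '3', 'Verbform': 'Fin'}))
# The merged feature set is then passed through the inflection rules.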

For the arguments, see below.

Installation

Via pip

pip install DERBI

Via git clone

Clone DERBI:

git clone https://github.com/maxschmaltz/DERBI

or

# requires GitPython (pip install GitPython)
from git import Repo
Repo.clone_from('https://github.com/maxschmaltz/DERBI', 'DERBI')

Then install all necessary packages:

pip install -r requirements.txt

Simple Usage

Note that DERBI works with spaCy. Make sure you have installed one of the spaCy pipelines for German.

Example

# python -m spacy download de_core_news_md
import spacy
nlp = spacy.load('de_core_news_md')

from DERBI.derbi import DERBI
derbi = DERBI(nlp)

derbi(
    'DERBI sein machen, damit es all Entwickler ein Möglichkeit geben, jedes deutsche Wort automatisch zu beugen',
    [{'Number': 'Sing', 'Person': '3', 'Verbform': 'Fin'},     # sein -> ist
     {'Verbform': 'Part'},                                     # machen -> gemacht
     {'Case': 'Dat', 'Number': 'Plur'},                        # all -> allen
     {'Case': 'Dat', 'Number': 'Plur'},                        # Entwickler -> entwicklern
     {'Gender': 'Fem'},                                        # ein -> eine
     {'Number': 'Sing', 'Person': '3', 'Verbform': 'Fin'},     # geben -> gibt
     {'Case': 'Acc', 'Number': 'Plur'},                        # jedes -> jede
     {'Case': 'Acc', 'Declination': 'Weak', 'Number': 'Plur'}, # deutsche -> deutschen
     {'Case': 'Acc', 'Number': 'Plur'}],                       # wort -> wörter
    [1, 2, 6, 7, 8, 10, 12, 13, 14]
)

# Output:
'derbi ist gemacht , damit es allen entwicklern eine möglichkeit gibt , jede deutschen wörter automatisch zu beugen'

Arguments

__init__() Arguments

  • model: spacy.lang.de.German

Any of the spaCy pipelines for German. If model is not of type spacy.lang.de.German, an exception is raised.
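
For instance (a hypothetical illustration; en_core_web_sm is assumed to be installed):

import spacy
from DERBI.derbi import DERBI

nlp_en = spacy.load('en_core_web_sm')  # an English pipeline
DERBI(nlp_en)  # raises an exception: the model is not spacy.lang.de.German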

__call__() Arguments

  • text: str

Input text, containing the words to be inflected. It is strongly recommended to call DERBI with a text, not a single word, as spaCy predictions vary depending on the context.

  • target_tags: dict or list[dict]

Dicts of category-feature values for each word to be inflected. If None, no inflection is performed. Default is None.

NB! As the features are overridden on top of the ones predicted by spaCy, only the features that differ need to be specified in target_tags. Note, though, that spaCy predictions are not always correct, so for the DERBI output to be more precise, we recommend specifying the desired features in full. Note also that if no value for an obligatory category is provided (neither by spaCy nor in target_tags), DERBI falls back to the default; the default feature values are listed in ValidFeatures (the first element for every category).
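
For illustration, continuing from the example in Simple Usage (where derbi is defined), a single dict combined with the default indices=0 inflects only the first token. The expected result below is inferred from the example above and is not a verified run:

# A single dict of target tags; indices defaults to 0, so only the first
# token ('Entwickler') is inflected. Expected output inferred from the
# Simple Usage example ('Entwickler' -> 'entwicklern'), not a verified run.
derbi('Entwickler beugen Wörter', {'Case': 'Dat', 'Number': 'Plur'})
# expected: 'entwicklern beugen wörter'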

  • indices: int or list[int]

Indices of the words to be inflected. Default is 0.

NB! The order of the indices must correspond to the order of the target tags. Note also that the input text is tokenized with the given spaCy model's tokenizer, so the indices refer to positions in the resulting spacy.tokens.Doc instance.
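
To find the right indices, you can tokenize the text with the same spaCy model and inspect the token positions (continuing from the example in Simple Usage, where nlp is defined):

text = 'DERBI sein machen, damit es all Entwickler ein Möglichkeit geben, jedes deutsche Wort automatisch zu beugen'
doc = nlp(text)
for i, token in enumerate(doc):
    print(i, token.text)
# Punctuation counts as separate tokens, which is why, for example,
# 'Entwickler' is at index 7 and 'geben' at index 10 in the Simple Usage call.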

Output

Returns str: the input text, where the specified words are replaced with the inflection results. The output is normalized (note that it is lowercased, as in the example above).

Tags

DERBI uses Universal POS tags and Universal Features (as does spaCy), with some extensions to the features (not to the POS tags). See LabelScheme and ValidFeatures for more details.

The following category-feature values can be used in target_tags:

| Category (explanation) | Valid Features (explanation) | In Universal Features |
| --- | --- | --- |
| Case | Acc (Accusative), Dat (Dative), Gen (Genitive), Nom (Nominative) | Yes |
| Declination (applicable to words with adjective declination; in German such words are declined differently depending on the left context) | Mixed, Strong, Weak | No |
| Definite (definiteness) | Def (Definite), Ind (Indefinite) | Yes |
| Degree (degree of comparison) | Cmp (Comparative), Pos (Positive), Sup (Superlative) | Yes |
| Foreign (whether the word is foreign; applies to POS X) | Yes | Yes |
| Gender | Fem (Feminine), Masc (Masculine), Neut (Neutral) | Yes |
| Mood | Imp (Imperative), Ind (Indicative), Sub (Subjunctive). NB! Sub is for Konjunktiv I when Tense=Pres and for Konjunktiv II when Tense=Past | Yes |
| Number | Plur (Plural), Sing (Singular) | Yes |
| Person | 1, 2, 3 | Yes |
| Poss (whether the word is possessive; applies to pronouns and determiners) | Yes | Yes |
| Prontype (type of a pronoun, a determiner, a quantifier or a pronominal adverb) | Art (Article), Dem (Demonstrative), Ind (Indefinite), Int (Interrogative), Prs (Personal), Rel (Relative) | Yes |
| Reflex (whether the word is reflexive; applies to pronouns and determiners) | Yes | Yes |
| Tense | Past, Pres (Present) | Yes |
| Verbform (form of a verb) | Fin (Finite), Inf (Infinitive), Part (Participle). NB! Part is for Partizip I when Tense=Pres and for Partizip II when Tense=Past | Yes |

Note, though, that the categories Definite, Foreign, Poss, Prontype and Reflex cannot be altered by DERBI, so there is no need to specify them.

NB! DERBI expects category names with only the first letter capitalized. For example, use Prontype, not PronType.
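
For example (the call is illustrative; the point is the spelling of the keys):

# Correct: only the first letter of the category name is capitalized
derbi('DERBI beugt Wörter', {'Verbform': 'Inf'}, 1)
# Avoid Universal-Features camel case such as 'VerbForm' or 'PronType'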

Performance

Disclaimer

For evaluation we used the Universal Dependencies German treebanks. Unfortunately, their GitHub repositories only provide .conllu files, so we had to download some of the datasets in .txt format and add them to our repository. We do not distribute these datasets, though; it is your responsibility to determine whether you have permission to use them.

Evaluation

Evaluation was conducted on the dataset de_lit-ud-test.txt from the Universal Dependencies German LIT treebank (≈31k tokens). Accuracy:

| | de_core_news_md | de_core_news_sm | de_core_news_lg |
| --- | --- | --- | --- |
| Overall | 0.947 | 0.949 | 0.95 |
| ADJ | 0.81 | 0.847 | 0.841 |
| ADP | 0.998 | 0.998 | 0.998 |
| ADV | 0.969 | 0.972 | 0.968 |
| AUX | 0.915 | 0.921 | 0.912 |
| CCONJ | 1.0 | 1.0 | 1.0 |
| DET | 0.988 | 0.992 | 0.988 |
| INTJ | 1.0 | 1.0 | 1.0 |
| NOUN | 0.958 | 0.959 | 0.962 |
| NUM | 0.935 | 0.987 | 0.914 |
| PART | 1.0 | 1.0 | 1.0 |
| PRON | 0.921 | 0.929 | 0.928 |
| PROPN | 0.941 | 0.926 | 0.916 |
| SCONJ | 0.999 | 0.999 | 0.996 |
| VERB | 0.813 | 0.792 | 0.824 |
| X | 1.0 | 1.0 | 1.0 |

If you are interested in the way we obtained the results, please refer to test0.py.

Or you could check it with the following code:

from DERBI.test import test0
test0.main()

Notice that performance might vary depending on the dataset. Also remember that spaCy might make prediction mistakes (that is, in some cases the DERBI inflection is correct but does not correspond to spaCy's tags), which also affects the evaluation.

License

Copyright 2022 Max Schmaltz: @maxschmaltz

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.