Skip to content

danielsteinigen/nlp-legal-texts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Semantic Extraction of Key Figures and Their Properties From Tax Legal Texts using Neural Models

This repository contains the dataset and source code of the approaches and models from the paper "Semantic Extraction of Key Figures and Their Properties From Tax Legal Texts using Neural Models".

We provide our dataset, the best model and a demonstrator also at 🤗:

Dataset

In our paper we introduced KeyFiTax, a dataset with key figures based on paragraphs of German tax acts manually annotated by tax experts. The dataset is available under Datasets.


Two excerpts from paragraphs annotated according to our developed annotation scheme

The annotation of the dataset was performed according to the semantic model defined in this ontology (for more details refer to our paper):



Relations between annotation of the two excerpts from paragraphs according to our defined ontology:


The generated dataset includes 106 annotated paragraphs from 14 different German tax acts. The Dataset folder contains the following files:

  • KeyFiTax_Data.json: list of the paragraphs annotated according to the semantic model,
  • KeyFiTax_Fictitious.json: manually modified and fictitious paragraphs for additional evaluation of the approaches,
  • KeyFiTax_Overview.json: Overview of the extracted entities and relations overall and per tax act,
  • KeyFiTax_Model.rdf: Semantic Model for Key Figures and their properties in legal texts as decribed in our paper (created with Protégé),
  • KeyFiTax_KG.rdf: Knowledge Graph for Tax Key Figures populated based on the annotated paragraphs using the semantic model (can be opened with Protégé).

Following tables show the statistics of the generated dataset with the number of annotated instances per class.

Entity Type No. Samples
Key Figure (stated) 129
Key figure (declarative) 28
Expression (stated) 295
Expression (declarative) 32
Unit 284
Condition 492
Range 75
Factor 96

Relation Type No. Samples
hasParagraph 106
hasKeyFigure 157
hasExpression 319
hasUnit 278
hasCondition 399
hasRange 75
hasFactor 137
join 138

The list of tax acts of which paragraphs are included in the dataset can be found in the following table:

Tax Act German Name English Name
AO Abgabenordnung Fiscal Code
AStG Außensteuergesetz External Tax Relations Act
EStG Einkommensteuergesetz Income Tax Act
EStDV Einkommensteuer-Durchführungsverordnung Income Tax Implementing Regulation
EnergieStG Energiesteuergesetz Energy Tax Act
ErbStG Erbschaftsteuer- und Schenkungsteuergesetz Inheritance Tax Act
ErbStDV Erbschaftsteuer-Durchführungsverordnung Inheritance Tax Implementing Regulation
GewStG Gewerbesteuergesetz Trade Tax Act
GewStDV Gewerbesteuer-Durchführungsverordnung Trade Tax Implementing Regulation
KStG Körperschaftsteuergesetz Corporate Tax Act
KStDV Körperschaftsteuer-Durchführungsverordnung Corporate Tax Implementing Regulation
StromStG Stromsteuergesetz Electricity Tax Act
UStG Umsatzsteuergesetz Sales Tax Act
UStDV Umsatzsteuer-Durchführungsverordnung Value Added Tax Implementing Regulation

Semantic categories with some sample formulations and their English translations are listed below.

Categorie Examples in German English Translations
Key figure (stated) "Pauschbeträge für Werbungskosten", "Entfernungspauschale" "Lump sums for advertising expenses", "distance allowance"
Key figure (declarative) "Steuerabzug von den nach Abzug der Betriebsausgaben oder Werbungskosten verbleibenden Einnahmen" "Tax deduction from the income remaining after deduction of operating expenses or income-related expenses"
expression (stated) "0.35", "2 Millionen", "30 Prozent" "0.35", "2 millions", "30 percent"
expression (declarative) "10 Prozent der gesamten Einkünfte der ausländischen Gesellschaft" "10 percent of the total income of the foreign company"
Unit "Euro", "EUR", "€" "Euro", "EUR", "€"
Condition "Einnahmen aus nichtselbständiger Arbeit" "Income from non-employee work"
Range "insgesamt bis zu", "von mindestens" "in total up to", "of at least"
Factor "pro Kilometer", "monatlich" "per kilometer", "monthly"

The dataset was created using INCEpTION. Following is a screenshot of configured INCEpTION tool with some annotations.



Overview of the annotated paragraphs and key figures:



Approaches

To automatically extract the entities and relations from our dataset we employ different approaches, described in our paper. The source code of the models is available under models.

Transformers

We evaluated different Transformer architectures for entity extraction, which resulted in the performance scores presented below.

Model F1 in %
GBERT_BASE 53.97
GBERT_LARGE 52.59
GElectra_BASE 44.44
GElectra_LARGE 44.29
Longformer 38.88
XLM-RoBERTa_BASE 55.20
XLM-RoBERTa_LARGE 56.80
--- ---
spaCy-NER (cross-validated) 45.78
RASA-CRF (cross-validated) 44.10
XLM-RoBERTa_LARGE (cross-validated) 60.91
XLM-RoBERTa_LARGE-Triggers (cross-validated) 58.78

A detailed evaluation of the models performance for each classes is shown in the following charts (F1 score).


Cross Validation

For comparing the different approaches described in our paper, we use k-fold cross validation. We choose k= 5 and randomly divide the complete dataset into five equal-sized subsets anf report the averaged F1 score achieved over all five training runs. The results are shown in the chart below.


Combined Model

In our paper we propose an approach for combined entity and relation extraction, introducing so-called task triggers. For relation extraction the entities are marked using typed entity markers. An excerpt from a paragraph with marked/labeled entities and relations and trigger tokens for entity and relation extraction accordant our proposed approach is shown in the following figure.



The performance scores of the Trigger-Model for relation extraction on our dataset is presented below.

Relation F1 in %
hasCondition 62.99
hasExpression 72.54
hasUnit 97.37
hasFactor 76.60
hasRange 85.88
join 68.66
--- ---
Macro-averaged 77.34

Hyperparameters

The Hyperparameters of the Transformer-based models employed in the experiment are documented in the Transformers configuration file.
The Hyperparameters of the spaCy-NER model employed in the experiment are documented in the spaCy configuration file.
The Hyperparameters of the RASA-CRF model employed in the experiment are documented in the RASA configuration file.

Run the Scripts

The model training can be performed using the script run_training.py and set the required parameters. The following example will train the transfomer-based Model:

python run_training.py -f ../datasets/KeyFiTax/KeyFiTax_Data.json -m transformers

The model inference can be performed using the script run_inference.py and set the required parameters. The input data need to passed by file in the format of inference_sample.json. The following example will use a trained transfomer-based Model for inference:

python run_inference.py -f inference_sample.json -m transformers -p transformers_model/model/keyfitax

Modified paragraphs

We also createtd a test dataset of manually modified paragraphs by tax experts, as described in our paper. The following figures visulaize the extractions of our best performing model on a rephrased paragraph of EStG § 20 Abs. 9.



About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages