This repository contains the software to run the Translation Memory anonymization service developed within the CEF Data MarketPlace project. A service based on this tool is offered by the TAUS Data MarketPlace platform.
The goal of the tool is to protect private information that may be contained in translation memories (TMs) uploaded to the Data MarketPlace. It does this by detecting Personally Identifiable Information (PII) in both the source-language and target-language sides of a translation memory.
PII is any data that could potentially identify a specific individual. The PIIs identified by the tool are:
person names, emails, URLs, phone numbers, credit card numbers, driver’s license numbers, identity card numbers, passport numbers, social security numbers, license plate numbers.
The tool includes two different libraries to extract the required PIIs from the source and target language texts.
Person names and addresses are extracted using the DeepPavlov NER tool, a hybrid model based on Multilingual BERT and adapted for the named entity recognition task. Among all the entity types the model can recognize, our tool selects only the persons.
All the other PIIs are obtained by in-house software based on regular expressions and language-specific knowledge and patterns.
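To give an idea of what regular-expression-based extraction looks like, here is a minimal sketch. It is not the code or the patterns used by the service: the two patterns, the `EMAIL` label, and the `find_pii` function are hypothetical and deliberately simplistic, with none of the language-specific knowledge mentioned above.

```python
import re
from typing import Dict, List

# Illustrative patterns only (assumptions); the service's own rules are not
# shown in this README. "EMAIL" is a made-up label for this sketch.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Four groups of four digits separated by a hyphen or a space.
    "CREDITCARD": re.compile(r"\b\d{4}(?:[- ]\d{4}){3}\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return the matches of each pattern, omitting types with no match."""
    found: Dict[str, List[str]] = {}
    for pii_type, pattern in PATTERNS.items():
        values = pattern.findall(text)
        if values:
            found[pii_type] = values
    return found
```

A real extractor needs far more than this, for example validation of the matched numbers and country-specific formats for documents and license plates.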
The tool can extract PIIs in 24 languages: Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Norwegian (Bokmål), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Spanish, Slovak, Swedish, Turkish.
The tool is accessible through an API that allows a user to process one or more translation units (TUs) at a time. More details about the API specifications are available below.
The first step is to download the Docker image of the code, image.anonymization_service__v2.tar.gz (around 3 GB).
No specific hardware or software is required beyond a working "docker" installation. Only the optional "email" functionality requires an email sending service running on the host.
Once the Docker image has been downloaded, load it into your Docker environment:
$ docker load < image.anonymization_service__v2.tar.gz
To start the service, run the following command:
$ docker run --rm -it --net=host anonymization_service
The process prints various status messages. When it prints the message
web service ready at port 8080
it is ready to accept requests.
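In scripted setups it can be handy to detect readiness without watching the log. The following is a minimal sketch, assuming that an open TCP port 8080 is a good enough proxy for the readiness message. `wait_for_port` and its parameters are hypothetical names, not part of the service.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout_s: float = 120.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 8080)` would block until the port answers or two minutes have passed. An open port does not strictly guarantee that the models have finished loading, so treat a `True` result as a hint rather than a guarantee.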
Requests can be issued at the following URLs:
- http://localhost:8080/anonymization_service.php
- http://${PUBLIC-IP}:8080/anonymization_service.php
The request
curl -X POST -F units='id1|en|credit card 5592-1234-5678-9876 of mr. John Watson and of Jochen Mass domiciled in Pennsylvania Avenue 44|it|bla bla bla|id2|en|We recommend the sites bbc.co.uk and cnn.com|it|Paolo Rossi and Giuseppina Verdi propongono i siti agriturismo.it dolomiti.it solocane.net' http://localhost:8080/anonymization_service.php
produces the response:
{
"status": 0,
"payload": [
{
"id": "id1", "side": 0,
"annotations": [
{
"type": "PER",
"values": ["John Watson", "Jochen Mass"]
},
{
"type": "CREDITCARD",
"values": ["5592-1234-5678-9876"]
},
{
"type": "ADDRESS",
"values": ["Pennsylvania Avenue 44"]
}
]
},
{
"id": "id2", "side": 0,
"annotations": [
{
"type": "URL",
"values": ["bbc.co.uk", "cnn.com"]
}
]
},
{
"id": "id2", "side": 1,
"annotations": [
{
"type": "PER",
"values": ["Paolo Rossi", "Giuseppina Verdi"]
},
{
"type": "URL",
"values": ["agriturismo.it", "dolomiti.it", "solocane.net"]
}
]
}
]
}
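The request and response above can also be handled programmatically. The sketch below is based only on what the example shows, so several points are assumptions: that the `units` field is a flat pipe-separated sequence of id, source language, source text, target language, and target text; that `status` 0 means success; and that `side` 0 is the source and `side` 1 the target. The function names are hypothetical, and the API specs remain the authoritative reference.

```python
import json
from typing import Dict, List, Tuple

# One translation unit: (id, source language, source text, target language, target text).
TranslationUnit = Tuple[str, str, str, str, str]


def build_units_field(units: List[TranslationUnit]) -> str:
    """Join translation units into the pipe-separated layout seen in the curl example."""
    fields: List[str] = []
    for unit in units:
        for value in unit:
            # Assumption: how to escape a literal "|" is not documented here,
            # so this sketch rejects such input instead of guessing.
            if "|" in value:
                raise ValueError(f"field contains '|': {value!r}")
            fields.append(value)
    return "|".join(fields)


def collect_annotations(response_text: str) -> Dict[Tuple[str, int], Dict[str, List[str]]]:
    """Map (unit id, side) to {PII type: values} for every payload entry."""
    response = json.loads(response_text)
    if response.get("status") != 0:
        # Assumption: status 0 signals success, as in the sample response.
        raise RuntimeError(f"unexpected status: {response.get('status')!r}")
    result: Dict[Tuple[str, int], Dict[str, List[str]]] = {}
    for entry in response.get("payload", []):
        key = (entry["id"], entry["side"])
        result[key] = {a["type"]: a["values"] for a in entry["annotations"]}
    return result
```

With the sample response, looking up the key `("id2", 1)` in the result of `collect_annotations` would give the `PER` and `URL` values found in the Italian side of the second unit. Note that a unit side with no detected PII, such as the target side of `id1`, does not appear in the payload at all.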
The API specs of the Anonymization Service are available here
To test the tool, a graphical web interface is included in the Docker image. It consists of a simple web page where text can be entered; the tool processes the text and returns the list of identified PIIs. The GUI also allows the user to provide an email address to which the output of the tool is sent.
To build the Docker image, follow these steps:
- download the archive setup_for_docker_build_AS_image__v2.tar.gz
- extract the data from the archive
tar xvfz setup_for_docker_build_AS_image__v2.tar.gz
- run the script DO_build_AS_image.sh
bash DO_build_AS_image.sh
FBK and Translated developed the Anonymization Service:
- FBK for the web service, interface with MBERT/DeepPavlov (NE processing), web GUI and integration;
- Translated for the Translated Anonymizer component (PII processing).
For questions or further information, please email cattoni AT fbk DOT eu