How does the anonymizer work

Dimitris Katsiros edited this page Aug 26, 2019 · 9 revisions

Anonymizer as Package

The main function, which implements the service's full functionality, is find_entities().

You can import find_entities() by simply typing:

from anonymizer.anonymize import find_entities

The function takes the following arguments:

find_entities(ifile,
              ofile=None,
              method=['strict', "*", "True"],
              patterns_file='patterns.json',
              verbose=False,
              words_array=[],
              quick=False)

ifile

This is the path of the file given as input.

ofile

This is the path of the anonymized output file.

By default, the output path is the input file's name followed by '_anonymized' and the original file's extension (for example, testfile.odt becomes testfile_anonymized.odt).
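
The default naming rule above can be sketched as follows. This is a minimal illustration, not the package's own code: default_output_path is a hypothetical helper, and it assumes the extension is whatever os.path.splitext treats as the last suffix.

```python
import os


def default_output_path(ifile):
    """Hypothetical helper: build '<name>_anonymized<ext>' from an input path."""
    # Split 'testfile.odt' into ('testfile', '.odt').
    root, ext = os.path.splitext(ifile)
    return root + '_anonymized' + ext


print(default_output_path('testfile.odt'))  # testfile_anonymized.odt
```

The result matches the output name used in the module's default command line, but the real implementation may derive it differently.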

method

This is the method used for anonymization, given as an array of 3 items:

  1. The method name, 'strict'.

  2. The character that replaces each anonymized word.

  3. The replacement length: a number if the user needs a fixed number of symbols, or 'True' if the length should equal the original word's length.
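
To make the symbol and length items concrete, here is a small sketch of how a replacement string could be built from them. The mask function is hypothetical and only mirrors the description above; it is an assumption that the length may arrive either as a number or as the string 'True'.

```python
def mask(word, symbol='*', length='True'):
    """Hypothetical helper: return the replacement string for one word."""
    # 'True' (string or bool) means "as long as the original word".
    if str(length) == 'True':
        return symbol * len(word)
    # Otherwise treat the value as a fixed number of symbols.
    return symbol * int(length)


method = ['strict', '*', 'True']
_, symbol, length = method
print(mask('Athens', symbol, length))  # ******
print(mask('Athens', '#', 3))          # ###
```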

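patterns_file

This is the path of the patterns file. By default, patterns_file = 'patterns.json'.

verbose

If True, the service prints all identified entities to the terminal.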

words_array

An array of words to anonymize in the document. The user supplies these words manually, which lets the service cover entities it would not identify on its own.

quick

If True, the service searches only for the words in words_array.
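
The combination of words_array and quick can be pictured as a plain word-list replacement. The sketch below is illustrative only: anonymize_words is a hypothetical function, and whole-word, case-sensitive matching is an assumption rather than documented behaviour.

```python
import re


def anonymize_words(text, words_array, symbol='*'):
    """Hypothetical helper: mask each listed word with symbols of equal length."""
    for word in words_array:
        if not word:
            continue  # ignore empty entries
        # Match the word only when it is not part of a longer word.
        pattern = re.compile(r'(?<!\w)' + re.escape(word) + r'(?!\w)')
        text = pattern.sub(lambda m: symbol * len(m.group(0)), text)
    return text


print(anonymize_words('Maria met Nikos in Athens.', ['Maria', 'Athens']))
# ***** met Nikos in ******.
```

Only the listed words are touched here, which is the idea behind quick mode: no pattern-based entity detection runs at all.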

Anonymizer as Python Module

Syntax

python3 -m anonymizer

    -i <inputfile>
    
    -o <outputfile>
    
    -f <folder>
    
    -m <method_used(s,strict)/symbol/(length==length_of_word)>
    
    -p <patterns.json>
    
    -v <verbose_mode>
    
    -w <string of words separated by commas>

    -q <quick_mode>

Default

python3 -m anonymizer -i testfile.odt -o testfile_anonymized.odt -m s/*/True -p anonymizer/patterns.json

Explanation

  • i: Specify the input file's path.
  • o: Specify the output file's path.
  • f: Specify a folder's path. If set, the module anonymizes all .txt and .odt files in the folder.
  • m:
    • method: Strict method.
    • symbol: Specify the symbol that will replace sensitive information.
    • length: If True, the length is set to len(entity); if a number n is given, each entity is replaced with the symbol repeated n times, always respecting the original alignment/format of the text.
  • p: Specify the patterns file. A default patterns file is provided in anonymizer/patterns.json.
  • v: If given, verbose mode is on. In verbose mode, all identified entities are printed on the console.
  • w: If given, the parser expects a string of words separated by commas (,). Each word is anonymized in the text, so the user can anonymize words that were not identified by default.
  • q: If given, quick mode is on. In quick mode, the service searches only for the entities supplied through the -w input. This is useful when a text has already been parsed and the user wants to anonymize additional entities.
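
The options above can be sketched with argparse as shown below. This is not the module's actual parser: build_parser, parse_method and parse_words are hypothetical names, the defaults are taken from the default command line, and treating -v and -q as value-less switches is an assumption based on the "if typed" wording.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog='anonymizer')
    parser.add_argument('-i', dest='ifile', help='input file path')
    parser.add_argument('-o', dest='ofile', help='output file path')
    parser.add_argument('-f', dest='folder', help='folder with .txt/.odt files')
    parser.add_argument('-m', dest='method', default='s/*/True',
                        help='method/symbol/length')
    parser.add_argument('-p', dest='patterns_file',
                        default='anonymizer/patterns.json')
    parser.add_argument('-v', dest='verbose', action='store_true')
    parser.add_argument('-w', dest='words', default='',
                        help='words separated by commas')
    parser.add_argument('-q', dest='quick', action='store_true')
    return parser


def parse_method(spec):
    """Split 'method/symbol/length' into its three parts."""
    method, symbol, length = spec.split('/')
    return [method, symbol, length]


def parse_words(spec):
    """Split a comma-separated string into a list of trimmed words."""
    return [w.strip() for w in spec.split(',') if w.strip()]


args = build_parser().parse_args(
    ['-i', 'testfile.odt', '-m', 's/*/True', '-w', 'Maria, Athens', '-q'])
print(parse_method(args.method))  # ['s', '*', 'True']
print(parse_words(args.words))    # ['Maria', 'Athens']
print(args.quick, args.verbose)   # True False
```

In this sketch a '/' cannot be used as the replacement symbol, since it is the separator. When typing the command in a shell, quoting the -m value (for example 's/*/True') keeps the shell from expanding the * as a filename pattern.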