A pipeline to deidentify clinical texts/reports in German.
The pipeline was developed for the needs of the University Hospital Zurich, but it can be adapted fairly easily to other hospital contexts. It takes reports in JSON format and, in a first step, annotates identifying information such as names, locations, ages, dates, organisations and occupations. In a second step, the annotated text spans are substituted with replacement text, for which different strategies can be applied, and the reports are re-exported to JSON.
The pipeline that recognizes identifying information is rule-based: it relies on various lexica and deterministic rules. It is built on the GATE framework.
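To illustrate the two-step flow (annotate, then substitute), here is a minimal Python sketch. It is not the actual pipeline, which is built on GATE. The JSON field name `text`, the tiny name lexicon, the date pattern and the `[KIND]` placeholder strategy are all assumptions made for the example.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical lexicon of surnames; a real lexicon would be far larger.
NAME_LEXICON = {"Müller", "Meier"}

# Hypothetical deterministic rule: dates written as D.M.YYYY or DD.MM.YYYY.
DATE_RULE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")


@dataclass
class Annotation:
    start: int
    end: int
    kind: str  # e.g. "Name" or "Date"
    rule: str  # which lexicon or rule produced the annotation


def annotate(text):
    """Step 1: mark identifying spans using the lexicon and the date rule."""
    annotations = []
    for name in NAME_LEXICON:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            annotations.append(Annotation(m.start(), m.end(), "Name", "name_lexicon"))
    for m in DATE_RULE.finditer(text):
        annotations.append(Annotation(m.start(), m.end(), "Date", "date_rule"))
    return sorted(annotations, key=lambda a: a.start)


def substitute(text, annotations):
    """Step 2: replace each annotated span with a placeholder for its kind."""
    # Work right to left so the offsets of earlier spans stay valid.
    for a in sorted(annotations, key=lambda a: a.start, reverse=True):
        text = text[:a.start] + "[" + a.kind.upper() + "]" + text[a.end:]
    return text


def deidentify(report_json):
    """Read one JSON report, deidentify its (assumed) "text" field, re-export."""
    report = json.loads(report_json)
    report["text"] = substitute(report["text"], annotate(report["text"]))
    return json.dumps(report, ensure_ascii=False)
```

Placeholder replacement is only one possible substitution strategy; the sketch uses it because it is the simplest to show.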
Some important features:
- parallel execution to scale to large corpora of reports (>100'000 reports)
- test suite for the annotation pipeline, also usable by non-developers who tune rules and lexica
- large parts of pipeline tuning can be done without writing code
- annotations contain information to trace back which pipeline step or rule generated the annotation
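
Two of the features listed above, parallel execution and traceable annotations, can be sketched in a few lines of Python. This is an illustration only, not the GATE-based implementation: the rule names and patterns are hypothetical, and a thread pool stands in for whatever parallelisation the real pipeline uses (in Python, a process pool would be the more usual choice for CPU-bound work).

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rules; each key doubles as the provenance label.
RULES = {
    "date_rule": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    "age_rule": re.compile(r"\b\d{1,3}-jährig\w*"),
}


def annotate(text):
    """Return (start, end, rule_name) tuples so each hit can be traced to its rule."""
    return [
        (m.start(), m.end(), name)
        for name, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def annotate_corpus(texts, workers=4):
    """Annotate reports concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, texts))


def hits_per_rule(corpus_annotations):
    """Count how many annotations each rule produced across the corpus."""
    return Counter(rule for anns in corpus_annotations for (_, _, rule) in anns)
```

Keeping the rule name on every annotation makes it possible to see which rule is responsible for a wrong or missing hit when tuning.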
The tool is laid out so that it can be adapted relatively easily to another hospital or another context. Adaptation to another language is possible in principle, but is likely to require more work.
- More details about pipeline components and their configuration
- Description of the rules and how to tune them
- Tuning tutorial with a heavily simplified pipeline
For simplicity, this code base also includes some functionality that is not related to deidentification per se but is still useful for working with medical reports.