GitHub - ratschlab/medical-reports-deidentification

A pipeline to deidentify clinical texts/reports in German.

The pipeline code was developed for the needs of the University Hospital Zurich but can be adapted fairly easily to other hospitals' contexts. It takes reports in JSON format and, in a first step, annotates identifying information such as names, locations, ages, dates, organisations and occupations. In a second step, the annotated text parts are replaced with other text, for which different substitution strategies can be applied, and the result is re-exported to JSON.
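
The second step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code (which is built on GATE), and the JSON layout, the annotation dictionaries, and the two strategy names `tag` and `mask` are assumptions made purely for illustration.

```python
import json

# Hypothetical report layout; the real pipeline's JSON schema is not shown here.
report_json = '{"id": "r1", "text": "Herr Muster, 54 Jahre, wohnhaft in Zürich."}'

# Assumed output of the annotation step: character spans with a category.
annotations = [
    {"start": 5, "end": 11, "type": "Name"},
    {"start": 13, "end": 15, "type": "Age"},
    {"start": 35, "end": 41, "type": "Location"},
]


def substitute(text, annotations, strategy="tag"):
    """Replace every annotated span according to the chosen strategy."""
    # Work from the last span to the first so earlier offsets stay valid.
    for ann in sorted(annotations, key=lambda a: a["start"], reverse=True):
        original = text[ann["start"]:ann["end"]]
        if strategy == "tag":
            # Replace the span with its category label.
            replacement = "[" + ann["type"].upper() + "]"
        elif strategy == "mask":
            # Keep the length of the text, hide the characters.
            replacement = "X" * len(original)
        else:
            raise ValueError("unknown strategy: " + strategy)
        text = text[:ann["start"]] + replacement + text[ann["end"]:]
    return text


report = json.loads(report_json)
deidentified = dict(report, text=substitute(report["text"], annotations))
output_json = json.dumps(deidentified, ensure_ascii=False)
```

Replacing spans from right to left avoids recomputing offsets after each substitution, which matters when a replacement differs in length from the original text.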

The pipeline that recognizes identifying information is rule-based, that is, it uses various lexica and deterministic rules. It is built on the GATE framework.
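
To make the idea of lexica plus deterministic rules concrete, here is a small Python sketch. It does not use GATE or the project's real rules. The lexicon entries, rule names, and regular expressions are hypothetical, and recording the originating rule on each annotation is only one possible way to keep annotations traceable.

```python
import re

# Hypothetical mini-lexicon; real lexica would be far larger.
SURNAME_LEXICON = {"Muster", "Meier"}

# Hypothetical deterministic rules: (rule name, annotation type, pattern).
RULES = [
    ("date_dd_mm_yyyy", "Date", re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")),
    ("age_in_years", "Age", re.compile(r"\b\d{1,3}(?= Jahre\b)")),
]


def annotate(text):
    """Return annotations sorted by position, each naming the rule that fired."""
    annotations = []
    # Lexicon lookup: every word token found in the lexicon becomes a Name.
    for match in re.finditer(r"\w+", text):
        if match.group() in SURNAME_LEXICON:
            annotations.append({
                "start": match.start(), "end": match.end(),
                "type": "Name", "rule": "surname_lexicon",
            })
    # Pattern rules: each regular-expression match becomes one annotation.
    for rule_name, ann_type, pattern in RULES:
        for match in pattern.finditer(text):
            annotations.append({
                "start": match.start(), "end": match.end(),
                "type": ann_type, "rule": rule_name,
            })
    return sorted(annotations, key=lambda a: a["start"])


text = "Herr Muster, 54 Jahre, Eintritt am 03.02.2019."
found = annotate(text)
```

Because every annotation carries the name of the lexicon or rule that produced it, a false positive can be traced to a specific entry and fixed there without touching the rest of the rules.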

Some important features:

parallel execution to scale to large corpora of reports (>100'000 reports)
test suite to test annotation pipeline, also accessible to non-software developers tuning rules and lexica
large parts of pipeline tuning can be done without writing code
annotations contain information to trace back which pipeline step or rule generated the annotation
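
The parallel execution mentioned in the first feature relies on reports being independent of one another. The Python sketch below shows that general pattern only; how the actual pipeline parallelises its work is not described here. `deidentify_report` is a hypothetical stand-in for the per-report work, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def deidentify_report(report):
    # Hypothetical stand-in for the per-report work (annotation plus substitution).
    return {"id": report["id"], "text": report["text"].replace("Muster", "[NAME]")}


def deidentify_corpus(reports, max_workers=4):
    """Process independent reports concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(deidentify_report, reports))


corpus = [{"id": i, "text": f"Bericht {i}: Herr Muster"} for i in range(5)]
results = deidentify_corpus(corpus)
```

A thread pool keeps the sketch simple. For CPU-bound work in pure Python, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and would be the more likely choice.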

Installation

Installation and running instructions

Pipeline Details

Adapting the Pipeline

The tool is laid out so that it can be adapted relatively easily to another hospital or context. Adaptation to another language is possible in principle, though it may require more work.

More details about Pipeline components and their configuration
Description of the rules and how to tune them
Tuning tutorial with a heavily simplified pipeline

Other Functionalities

For simplicity, this code base also includes some functionality that is not related to deidentification per se but still concerns working with medical reports.

Diagnosis Extraction Pipeline

Diagnosis Extraction Pipeline
