This project contains python-modules to
- extract text from different formats (*.doc, *.docx, *.odt, *.pdf, *.rtf)
- removes header and footer
- seperate sentences
It contains setup-files for the server distribution of ubuntu and the python-version 3.4.3.
If you would like to install these files, you go into the folder install and type ./inst.sh
.
The seperator-module use the Natural Language Toolkit and is distributed under the terms of the Apache License Version 2.0.
We refer to the following book:
Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
The docx-module of the converter use docx2txt and is distributed under the terms of the GPLv3.
The following table shows you on which ubuntu versions the project is tested:
Version | Tested |
---|---|
16.04 Server | ✔️ |
18.04 Server | ✔️ |
20.04 Server | ✔️ |