Python script to parse pdf files into text files.
You can get the latest stable version by cloning the master
branch of our repository with :
git clone https://gitlab.com/inf1603/irisaparser.git
For other versions please see the release section on GitLab, and download the source archive corresponding to the desired version.
Make sure that Python and Pip are installed on your computer.
- Python
>3.10
- Pip
22.0
- Java
>7
This step is necessary only if you do not want to build the package yourself, if you do not wisth to install manually these dependencies, jump to the Build section.
Once all software requirements are satisfied, please install the following pip packages : To install PiPy packages make sure that pip is installed and use :
pip install <packagename>
The following dependencies are required :
- pdfplumber 0.6.0
- tika 1.24
- spacy 3.2.2
- colorama 0.4.4
You can also install every dependency in one go with :
pip install pdfplumber tika spacy colorama
/!\ At first usage, the module will download tika java package, this may take a lot of time. /!\
Note that you must have the dependencies of the Pip package section to use irisaparser this way, if it's not the case please see the Build section or consult the Pip package section.
Make sure you are placed above the irisaparser
directory, or for more convenience build the package.
To parse a pdf file use the command:
python3 -m irisaparser <args>
For example to parse a file named f1.pdf
located in /home/user/pdfs/
and to output results in /tmp
, use:
python3 -m irisaparser /home/user/pdfs/f1.pdf -o /tmp
To parse all the files located in /home/user/pdfs
, use:
python3 -m irisaparser -d /home/user/pdfs
Note that if no output directory is provided using -o
, text files will be placed in current directory.
For more information about how to use the module, type the following command:
python3 -m irisaparser --help
Make sure that you have setuptools and wheel on your system, you can install these dependencies with the command:
pip install setuptools wheel
To build the package make sure that you are above the irisaparser
directory inside the file hierarchy, and use the following command:
python3 setup.py bdist_wheel
for more information about this command consult the documentation
To install your package, use the following command with the .whl
file you just built:
pip install <some .whl file>
/!\ At first usage, the module will download tika java package, this may take a lot of time. /!\
Once you have installed the package, you can use it with the following command everywhere:
python3 -m irisaparser <args>
You can also import the package in a python script like this :
import irisaparser
# parse a file and put the output in the specified directory:
irisaparser.parseFile("filename","./out")
# parse a file and put the xml output in the current directory:
irisaparser.parseFile("filename",xml=True)
# parse multiple files provided in the list parameter files:
filesnames = ["file1","file2"]
irisaparser.parseFiles(filenames)
# do the equivalent of the command line usage, just pass the arguments as a list of strings:
args = ["-d","./Corpus_2021","-o","./out"]
irisaparser.parseArgs(args)
For more information about the different options and behaviors of the program consult our documentation page (or the DOC.md
file if you downloaded the source code).