Merge PR #836 into 14.0
Signed-off-by alexis-via
OCA-git-bot committed Oct 24, 2023
2 parents d9309db + 9de1261 commit b985b6e
Showing 7 changed files with 59 additions and 21 deletions.
8 changes: 6 additions & 2 deletions account_invoice_import_simple_pdf/__manifest__.py
@@ -12,9 +12,13 @@
     "maintainers": ["alexis-via"],
     "website": "https://github.com/OCA/edi",
     "depends": ["account_invoice_import"],
-    # "excludes": ["account_invoice_import_invoice2data"],
     "external_dependencies": {
-        "python": ["pdfplumber", "regex", "dateparser"],
+        "python": [
+            "pdfplumber",
+            "regex",
+            "dateparser",
+            "pypdf>=3.1.0",
+        ],
         "deb": ["libmupdf-dev", "mupdf", "mupdf-tools", "poppler-utils"],
     },
     "data": [
1 change: 1 addition & 0 deletions account_invoice_import_simple_pdf/readme/CONFIGURE.rst
@@ -9,6 +9,7 @@ If you want to force Odoo to use a specific text extraction method, go to the me
 #. pdftotext.lib
 #. pdftotext.cmd
 #. pdfplumber
+#. pypdf

 In this configuration, Odoo will only use the selected text extraction method and, if it fails, it will display an error message.
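
The difference between a forced method and the default behaviour (trying each tool in turn, as the wizard code in this commit does) can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's code: `extract_text`, the `extractors` mapping and the built-in exceptions are hypothetical names chosen for the example, whereas the module itself raises a `UserError`.

```python
from typing import Callable, Dict, Optional

# Hypothetical extractor signature: takes a file path and returns
# {"all": ..., "first": ...}, or None/False when extraction fails.
Extractor = Callable[[str], Optional[Dict[str, str]]]


def extract_text(
    path: str,
    extractors: Dict[str, Extractor],
    forced_method: Optional[str] = None,
) -> Dict[str, str]:
    """Return extracted text, honouring an optional forced method."""
    if forced_method:
        # Forced mode: use only the selected tool, no fallback.
        if forced_method not in extractors:
            raise ValueError("Unknown extraction method: %s" % forced_method)
        res = extractors[forced_method](path)
        if not res:
            raise RuntimeError("Extraction with %s failed" % forced_method)
        return res
    # Default mode: try each tool in order until one returns a result.
    for extractor in extractors.values():
        res = extractor(path)
        if res:
            return res
    raise RuntimeError("No extraction method succeeded")
```

With a forced method, a failing tool ends in an error even if another installed tool could have produced text; without it, the first tool that returns a result wins.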

28 changes: 16 additions & 12 deletions account_invoice_import_simple_pdf/readme/INSTALL.rst
@@ -1,33 +1,27 @@
 The most important technical component of this module is the tool that converts the PDF to text. Converting PDF to text is not an easy job. As outlined in this `blog post <https://dida.do/blog/how-to-extract-text-from-pdf>`_, different tools can give quite different results. The best results are usually achieved with tools based on a PDF viewer, which excludes pure-python tools. But pure-python tools are easier to install than tools based on a PDF viewer. It is important to understand that, if you change the PDF to text tool, you will certainly have a slightly different text output, which may oblige you to update the field extraction rule, which can be time-consuming if you have already configured many vendors.

-The module supports 4 different extraction methods:
+The module supports 5 different extraction methods:

 1. `PyMuPDF <https://github.com/pymupdf/PyMuPDF>`_ which is a Python binding for `MuPDF <https://mupdf.com/>`_, a lightweight PDF toolkit/viewer/renderer published under the AGPL licence by the company `Artifex Software <https://artifex.com/>`_.
 #. `pdftotext python library <https://pypi.org/project/pdftotext/>`_, which is a python binding for the pdftotext tool.
 #. `pdftotext command line tool <https://en.wikipedia.org/wiki/Pdftotext>`_, which is based on `poppler <https://poppler.freedesktop.org/>`_, a PDF rendering library used by `xpdf <https://www.xpdfreader.com/>`_ and `Evince <https://wiki.gnome.org/Apps/Evince/FrequentlyAskedQuestions>`_ (the PDF reader of `Gnome <https://www.gnome.org/>`_).
 #. `pdfplumber <https://pypi.org/project/pdfplumber/>`_, which is a python library built on top of the python library `pdfminer.six <https://pypi.org/project/pdfminer.six/>`_. pdfplumber is a pure-python solution, so it's very easy to install on all OSes.
+#. `pypdf <https://github.com/py-pdf/pypdf/>`_, which is one of the most common PDF libraries for Python. pypdf is a pure-python solution, so it's very easy to install on all OSes.

-PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber often gives lower-quality text output, but its advantage is that it's a pure-Python solution, so you will always be able to install it whatever your technical environment is.
+PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber and pypdf often give lower-quality text output, but their advantage is that they are pure-Python libraries, so you will always be able to install them whatever your technical environment is.

 You can choose one extraction method and only install the tools/libs for that method.
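
Since only one tool needs to be present, it can help to check which ones are actually available on a given server. The following is a rough sketch, not part of the module: `available_extraction_methods` is a hypothetical helper, the import names (`fitz`, `pdftotext`, `pdfplumber`, `pypdf`) are those used in the wizard code, and the `pymupdf` key is an assumed name for the PyMuPDF method.

```python
import shutil
from importlib.util import find_spec

# Assumed mapping: extraction method name -> Python import name.
# "fitz" is the import name of PyMuPDF.
PYTHON_LIBS = {
    "pymupdf": "fitz",
    "pdftotext.lib": "pdftotext",
    "pdfplumber": "pdfplumber",
    "pypdf": "pypdf",
}


def available_extraction_methods():
    """Return {method name: bool} telling which tools can be found."""
    result = {
        method: find_spec(module) is not None
        for method, module in PYTHON_LIBS.items()
    }
    # The command line variant relies on the pdftotext executable
    # (shipped in poppler-utils on Debian/Ubuntu).
    result["pdftotext.cmd"] = shutil.which("pdftotext") is not None
    return result


if __name__ == "__main__":
    for method, found in available_extraction_methods().items():
        print("%-15s %s" % (method, "available" if found else "missing"))
```

The check only tells whether a library or executable can be found; it does not guarantee that extraction succeeds on a particular PDF.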

 Install PyMuPDF
 ~~~~~~~~~~~~~~~

-To install **PyMuPDF**, if you use Debian (Bullseye aka v11 or higher) or Ubuntu (20.04 or higher), run the following command:
+Install it via pip:

 .. code::
-sudo apt install python3-fitz
+sudo pip3 install --upgrade pymupdf
-You can also install it via pip:
-
-.. code::
-sudo pip3 install --upgrade PyMuPDF
-but beware that *PyMuPDF* is just a binding on MuPDF, so it will require MuPDF and all the development libs required to compile the binding. That's why *PyMuPDF* is much easier to install via the packages of your Linux distribution (package name **python3-fitz** on Debian/Ubuntu, but the package name may be different in other distributions) than with pip.
+Beware that *PyMuPDF* is not a pure-python library: it uses MuPDF, which is written in C. If a python wheel for your OS, CPU architecture and Python version is available on pypi (check the `list of PyMuPDF wheels <https://pypi.org/project/PyMuPDF/#files>`_ on pypi), it will install smoothly. Otherwise, the installation via pip will require MuPDF and all its development libs to compile the binding.

Install pdftotext python lib
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -64,6 +58,16 @@ To install the **pdfplumber** python lib, run:
 sudo pip3 install --upgrade pdfplumber
+Install pypdf
+~~~~~~~~~~~~~
+
+To install the **pypdf** python lib, run:
+
+.. code::
+sudo pip3 install --upgrade pypdf
 Other requirements
 ~~~~~~~~~~~~~~~~~~

Binary file not shown.
@@ -555,17 +555,22 @@ def test_complete_import(self):
         self.assertEqual(float_compare(iline.price_unit, 1509, precision_digits=2), 0)
         inv.unlink()

-    def test_complete_import_pdfplumber(self):
+    def _complete_import_specific_method(self, method):
         icpo = self.env["ir.config_parameter"]
         key = "invoice_import_simple_pdf.pdf2txt"
-        method = "pdfplumber"
         configp = icpo.search([("key", "=", key)])
         if configp:
             configp.write({"value": method})
         else:
             icpo.create({"key": key, "value": method})
         self.test_complete_import()

+    def test_specific_python_methods(self):
+        # test only pure-Python methods
+        # because we are sure they work on the Github test environment
+        self._complete_import_specific_method("pdfplumber")
+        self._complete_import_specific_method("pypdf")
+
     def test_test_mode(self):
         self.partner_ak.write(
             {
33 changes: 28 additions & 5 deletions account_invoice_import_simple_pdf/wizard/account_invoice_import.py
@@ -28,6 +28,10 @@
     import pdftotext
 except ImportError:
     logger.debug("Cannot import pdftotext")
+try:
+    import pypdf
+except ImportError:
+    logger.debug("Cannot import pypdf")


 class AccountInvoiceImport(models.TransientModel):
@@ -50,13 +54,13 @@ def _simple_pdf_text_extraction_pymupdf(self, fileobj, test_info):
             pages = []
             doc = fitz.open(fileobj.name)
             for page in doc:
-                pages.append(page.getText("text"))
+                pages.append(page.get_text())
             res = {
                 "all": "\n\n".join(pages),
                 "first": pages and pages[0] or "",
             }
-            logger.info("Text extraction made with PyMuPDF")
-            test_info["text_extraction"] = "pymupdf"
+            logger.info("Text extraction made with PyMuPDF %s", fitz.__version__)
+            test_info["text_extraction"] = "pymupdf %s" % fitz.__version__
         except Exception as e:
             logger.warning("Text extraction with PyMuPDF failed. Error: %s", e)
         return res
@@ -76,8 +80,23 @@ def _simple_pdf_text_extraction_pdfplumber(self, fileobj, test_info):
             "all": "\n\n".join(pages),
             "first": pages and pages[0] or "",
         }
-        test_info["text_extraction"] = "pdfplumber"
-        logger.info("Text extraction made with pdfplumber")
+        test_info["text_extraction"] = "pdfplumber %s" % pdfplumber.__version__
+        logger.info("Text extraction made with pdfplumber %s", pdfplumber.__version__)
         return res

+    @api.model
+    def _simple_pdf_text_extraction_pypdf(self, fileobj, test_info):
+        res = False
+        reader = pypdf.PdfReader(fileobj.name)
+        pages = []
+        for pdf_page in reader.pages:
+            pages.append(pdf_page.extract_text())
+        res = {
+            "all": "\n\n".join(pages),
+            "first": pages and pages[0] or "",
+        }
+        test_info["text_extraction"] = "pypdf %s" % pypdf.__version__
+        logger.info("Text extraction made with pypdf %s", pypdf.__version__)
+        return res
+
     @api.model
@@ -147,6 +166,8 @@ def _simple_pdf_text_extraction_specific_tool(
             res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
         elif specific_tool == "pdfplumber":
             res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
+        elif specific_tool == "pypdf":
+            res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)
         else:
             raise UserError(
                 _(
@@ -195,6 +216,8 @@ def simple_pdf_text_extraction(self, file_data, test_info):
                 res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
             if not res:
                 res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
+            if not res:
+                res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)
             if not res:
                 raise UserError(
                     _(
1 change: 1 addition & 0 deletions requirements.txt
@@ -5,6 +5,7 @@ invoice2data
 ovh
 pdfplumber
 phonenumbers
+pypdf>=3.1.0
 pyyaml
 regex
 xmlschema
