Merge PR #836 into 14.0

Signed-off-by alexis-via
OCA · Oct 24, 2023 · b985b6e
2 parents d9309db + 9de1261
commit b985b6e

Showing 7 changed files with 59 additions and 21 deletions.
diff --git a/account_invoice_import_simple_pdf/__manifest__.py b/account_invoice_import_simple_pdf/__manifest__.py
@@ -12,9 +12,13 @@
     "maintainers": ["alexis-via"],
     "website": "https://github.com/OCA/edi",
     "depends": ["account_invoice_import"],
-    # "excludes": ["account_invoice_import_invoice2data"],
     "external_dependencies": {
-        "python": ["pdfplumber", "regex", "dateparser"],
+        "python": [
+            "pdfplumber",
+            "regex",
+            "dateparser",
+            "pypdf>=3.1.0",
+        ],
         "deb": ["libmupdf-dev", "mupdf", "mupdf-tools", "poppler-utils"],
     },
     "data": [

diff --git a/account_invoice_import_simple_pdf/readme/CONFIGURE.rst b/account_invoice_import_simple_pdf/readme/CONFIGURE.rst
@@ -9,6 +9,7 @@ If you want to force Odoo to use a specific text extraction method, go to the me
   #. pdftotext.lib
   #. pdftotext.cmd
   #. pdfplumber
+  #. pypdf
 
 In this configuration, Odoo will only use the selected text extraction method and, if it fails, it will display an error message.
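
As a rough illustration of the setting described above, the sketch below stores a forced extraction method in the system parameter `invoice_import_simple_pdf.pdf2txt` (the key used in this PR's test file). The helper name `force_extraction_method` and the `KNOWN_METHODS` tuple are hypothetical; the tuple only lists the values visible in this hunk, so the identifier for PyMuPDF is deliberately left out. It assumes an Odoo environment `env` whose `ir.config_parameter` model offers `set_param`.

```python
PDF2TXT_KEY = "invoice_import_simple_pdf.pdf2txt"

# Only the values visible in the CONFIGURE.rst hunk; the PyMuPDF identifier
# is not shown there, so it is not guessed here.
KNOWN_METHODS = ("pdftotext.lib", "pdftotext.cmd", "pdfplumber", "pypdf")


def force_extraction_method(env, method):
    """Store the forced text extraction method as a system parameter.

    `env` is assumed to be an Odoo environment (hypothetical helper).
    """
    if method not in KNOWN_METHODS:
        raise ValueError("Unknown text extraction method: %s" % method)
    env["ir.config_parameter"].sudo().set_param(PDF2TXT_KEY, method)
```

In an Odoo shell, this could be called as `force_extraction_method(env, "pypdf")`; the menu entry described in CONFIGURE.rst remains the documented way to change the setting.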
 

diff --git a/account_invoice_import_simple_pdf/readme/INSTALL.rst b/account_invoice_import_simple_pdf/readme/INSTALL.rst
@@ -1,33 +1,27 @@
 The most important technical component of this module is the tool that converts the PDF to text. Converting PDF to text is not an easy job. As outlined in this `blog post <https://dida.do/blog/how-to-extract-text-from-pdf>`_, different tools can give quite different results. The best results are usually achieved with tools based on a PDF viewer, which excludes pure-python tools. But pure-python tools are easier to install than tools based on a PDF viewer. It is important to understand that, if you change the PDF-to-text tool, you will certainly get a slightly different text output, which may oblige you to update the field extraction rules, which can be time-consuming if you have already configured many vendors.
 
-The module supports 4 different extraction methods:
+The module supports 5 different extraction methods:
 
 1. `PyMuPDF <https://github.com/pymupdf/PyMuPDF>`_ which is a Python binding for `MuPDF <https://mupdf.com/>`_, a lightweight PDF toolkit/viewer/renderer published under the AGPL licence by the company `Artifex Software <https://artifex.com/>`_.
 #. `pdftotext python library <https://pypi.org/project/pdftotext/>`_, which is a python binding for the pdftotext tool.
 #. `pdftotext command line tool <https://en.wikipedia.org/wiki/Pdftotext>`_, which is based on `poppler <https://poppler.freedesktop.org/>`_, a PDF rendering library used by `xpdf <https://www.xpdfreader.com/>`_ and `Evince <https://wiki.gnome.org/Apps/Evince/FrequentlyAskedQuestions>`_ (the PDF reader of `Gnome <https://www.gnome.org/>`_).
 #. `pdfplumber <https://pypi.org/project/pdfplumber/>`_, which is a python library built on top of the python library `pdfminer.six <https://pypi.org/project/pdfminer.six/>`_. pdfplumber is a pure-python solution, so it's very easy to install on all OSes.
+#. `pypdf <https://github.com/py-pdf/pypdf/>`_, which is one of the most common PDF libraries for Python. pypdf is a pure-python solution, so it's very easy to install on all OSes.
 
-PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber often gives lower-quality text output, but its advantage is that it's a pure-Python solution, so you will always be able to install it whatever your technical environnement is.
+PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber and pypdf often give lower-quality text output, but their advantage is that they are pure-Python libraries, so you will always be able to install them whatever your technical environment is.
 
 You can choose one extraction method and only install the tools/libs for that method.
 
 Install PyMuPDF
 ~~~~~~~~~~~~~~~
 
-To install **PyMuPDF**, if you use Debian (Bullseye aka v11 or higher) or Ubuntu (20.04 or higher), run the following command:
+Install it via pip:
 
 .. code::
 
-  sudo apt install python3-fitz
+  sudo pip3 install --upgrade pymupdf
 
-You can also install it via pip:
-
-.. code::
-
-  sudo pip3 install --upgrade PyMuPDF
-
-
-but beware that *PyMuPDF* is just a binding on MuPDF, so it will require MuPDF and all the development libs required to compile the binding. That's why *PyMuPDF* is much easier to install via the packages of your Linux distribution (package name **python3-fitz** on Debian/Ubuntu, but the package name may be different in other distributions) than with pip.
+Beware that *PyMuPDF* is not a pure-python library: it uses MuPDF, which is written in C. If a python wheel for your OS, CPU architecture and Python version is available on PyPI (check the `list of PyMuPDF wheels <https://pypi.org/project/PyMuPDF/#files>`_ on PyPI), it will install smoothly. Otherwise, the installation via pip will require MuPDF and all its development libs to compile the binding.
 
 Install pdftotext python lib
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -64,6 +58,16 @@ To install the **pdfplumber** python lib, run:
 
   sudo pip3 install --upgrade pdfplumber
 
+Install pypdf
+~~~~~~~~~~~~~
+
+To install the **pypdf** python lib, run:
+
+.. code::
+
+  sudo pip3 install --upgrade pypdf
+
+
 Other requirements
 ~~~~~~~~~~~~~~~~~~
 

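To compare the text output of pypdf with another tool before settling on an extraction method, a standalone sketch along the following lines may help. It mirrors the `PdfReader` / `extract_text()` calls used by this PR's wizard code; the function names `build_result` and `extract_text_with_pypdf` are hypothetical and not part of the module.

```python
def build_result(pages):
    """Combine per-page text into the same two keys the wizard code uses."""
    return {
        "all": "\n\n".join(pages),
        "first": pages[0] if pages else "",
    }


def extract_text_with_pypdf(pdf_path):
    """Return the text of every page of `pdf_path` using pypdf."""
    # Imported lazily so this file can be loaded without pypdf installed.
    import pypdf

    reader = pypdf.PdfReader(pdf_path)
    # Fall back to an empty string for pages that yield no text.
    pages = [page.extract_text() or "" for page in reader.pages]
    return build_result(pages)
```

Calling `extract_text_with_pypdf("invoice.pdf")["first"]` (with `invoice.pdf` standing in for any local file) shows the text that field extraction rules would have to match on the first page.
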
diff --git a/account_invoice_import_simple_pdf/tests/pdf/akretion_france-test.pdf b/account_invoice_import_simple_pdf/tests/pdf/akretion_france-test.pdf
diff --git a/account_invoice_import_simple_pdf/tests/test_invoice_import.py b/account_invoice_import_simple_pdf/tests/test_invoice_import.py
@@ -555,17 +555,22 @@ def test_complete_import(self):
         self.assertEqual(float_compare(iline.price_unit, 1509, precision_digits=2), 0)
         inv.unlink()
 
-    def test_complete_import_pdfplumber(self):
+    def _complete_import_specific_method(self, method):
         icpo = self.env["ir.config_parameter"]
         key = "invoice_import_simple_pdf.pdf2txt"
-        method = "pdfplumber"
         configp = icpo.search([("key", "=", key)])
         if configp:
             configp.write({"value": method})
         else:
             icpo.create({"key": key, "value": method})
         self.test_complete_import()
 
+    def test_specific_python_methods(self):
+        # test only pure-python methods
+        # because we are sure they work on the Github test environment
+        self._complete_import_specific_method("pdfplumber")
+        self._complete_import_specific_method("pypdf")
+
     def test_test_mode(self):
         self.partner_ak.write(
             {

diff --git a/account_invoice_import_simple_pdf/wizard/account_invoice_import.py b/account_invoice_import_simple_pdf/wizard/account_invoice_import.py
@@ -28,6 +28,10 @@
     import pdftotext
 except ImportError:
     logger.debug("Cannot import pdftotext")
+try:
+    import pypdf
+except ImportError:
+    logger.debug("Cannot import pypdf")
 
 
 class AccountInvoiceImport(models.TransientModel):
@@ -50,13 +54,13 @@ def _simple_pdf_text_extraction_pymupdf(self, fileobj, test_info):
             pages = []
             doc = fitz.open(fileobj.name)
             for page in doc:
-                pages.append(page.getText("text"))
+                pages.append(page.get_text())
             res = {
                 "all": "\n\n".join(pages),
                 "first": pages and pages[0] or "",
             }
-            logger.info("Text extraction made with PyMuPDF")
-            test_info["text_extraction"] = "pymupdf"
+            logger.info("Text extraction made with PyMuPDF %s", fitz.__version__)
+            test_info["text_extraction"] = "pymupdf %s" % fitz.__version__
         except Exception as e:
             logger.warning("Text extraction with PyMuPDF failed. Error: %s", e)
         return res
@@ -76,8 +80,23 @@ def _simple_pdf_text_extraction_pdfplumber(self, fileobj, test_info):
                 "all": "\n\n".join(pages),
                 "first": pages and pages[0] or "",
             }
-        test_info["text_extraction"] = "pdfplumber"
-        logger.info("Text extraction made with pdfplumber")
+        test_info["text_extraction"] = "pdfplumber %s" % pdfplumber.__version__
+        logger.info("Text extraction made with pdfplumber %s", pdfplumber.__version__)
+        return res
+
+    @api.model
+    def _simple_pdf_text_extraction_pypdf(self, fileobj, test_info):
+        res = False
+        reader = pypdf.PdfReader(fileobj.name)
+        pages = []
+        for pdf_page in reader.pages:
+            pages.append(pdf_page.extract_text())
+            res = {
+                "all": "\n\n".join(pages),
+                "first": pages and pages[0] or "",
+            }
+        test_info["text_extraction"] = "pypdf %s" % pypdf.__version__
+        logger.info("Text extraction made with pypdf %s", pypdf.__version__)
         return res
 
     @api.model
@@ -147,6 +166,8 @@ def _simple_pdf_text_extraction_specific_tool(
             res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
         elif specific_tool == "pdfplumber":
             res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
+        elif specific_tool == "pypdf":
+            res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)
         else:
             raise UserError(
                 _(
@@ -195,6 +216,8 @@ def simple_pdf_text_extraction(self, file_data, test_info):
                 res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
             if not res:
                 res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
+            if not res:
+                res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)
             if not res:
                 raise UserError(
                     _(
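
The fallback order in this hunk (try one tool, move to the next if the result is empty) can be sketched generically as below. This is not the module's code: `first_successful` is a hypothetical helper, `RuntimeError` stands in for Odoo's `UserError`, and catching exceptions per extractor is a choice of this sketch rather than something the hunk shows for every method.

```python
import logging

logger = logging.getLogger(__name__)


def first_successful(extractors, fileobj):
    """Try each (name, func) pair in order and return the first truthy result.

    Returns a (name, result) tuple; raises RuntimeError if every extractor
    fails or returns an empty result.
    """
    for name, extract in extractors:
        try:
            res = extract(fileobj)
        except Exception as e:
            logger.warning("Text extraction with %s failed. Error: %s", name, e)
            continue
        if res:
            return name, res
    raise RuntimeError("No text extraction method succeeded")
```

With a list such as `[("pdfplumber", f1), ("pypdf", f2)]`, where `f1` and `f2` are placeholder callables, pypdf is consulted only when pdfplumber returns nothing, which matches the position given to pypdf at the end of the chain in this PR.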

diff --git a/requirements.txt b/requirements.txt
@@ -5,6 +5,7 @@ invoice2data
 ovh
 pdfplumber
 phonenumbers
+pypdf>=3.1.0
 pyyaml
 regex
 xmlschema
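
Since the manifest and `requirements.txt` now pin `pypdf>=3.1.0`, a quick check of the installed version can be sketched as follows. The helper names are hypothetical, and the version parsing is deliberately simplified: it only looks at the leading numeric components, so a pre-release such as `3.1.0rc1` would be treated like `3.1.0`.

```python
import re
from importlib import metadata


def parse_release(version):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if not match:
        raise ValueError("Cannot parse version: %r" % version)
    return tuple(int(part) for part in match.group(0).split("."))


def _pad(release, width):
    """Right-pad a release tuple with zeros so tuples compare fairly."""
    return release + (0,) * (width - len(release))


def meets_minimum(version, minimum=(3, 1, 0)):
    """Tell whether `version` is at least `minimum` (simplified comparison)."""
    release = parse_release(version)
    width = max(len(release), len(minimum))
    return _pad(release, width) >= _pad(tuple(minimum), width)


def installed_pypdf_ok():
    """Return True if pypdf is installed and satisfies the 3.1.0 minimum."""
    try:
        return meets_minimum(metadata.version("pypdf"))
    except metadata.PackageNotFoundError:
        return False
```

For anything beyond a quick sanity check, a full version parser such as the one in the `packaging` library would be a more robust choice.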