FastAPI and PDFAct new implementation (#2)

* Replace Flask with FastAPI
* Add PDFAct driver

---------

Co-authored-by: AnnaMarika01 <[email protected]>
OneOffTech · Jun 4, 2024 · b18be23
1 parent 24103fa
commit b18be23

Showing 33 changed files with 485 additions and 235 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -23,4 +23,4 @@ jobs:
         fetch-depth: 1
 
     - name: Lint the Shell scripts
-      run: shellcheck ./gunicorn.sh
+      run: shellcheck ./uvicorn.sh
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,3 @@
 .idea
+logs
 .env
diff --git a/Dockerfile b/Dockerfile
@@ -33,13 +33,15 @@ WORKDIR /app
 COPY --from=build-image /opt/venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
 
-COPY parsing_service/ parsing_service/
-COPY root.py gunicorn.sh ./
 
-RUN chmod +x ./gunicorn.sh
+COPY text_extractor_api/ text_extractor_api/
+COPY text_extractor/ text_extractor/
+COPY root.py uvicorn.sh ./
+
+RUN chmod +x ./uvicorn.sh
 
 EXPOSE 5000/tcp
 
 ENTRYPOINT ["tini", "--"]
 
-CMD ["/app/gunicorn.sh"]
+CMD ["/app/uvicorn.sh"]
diff --git a/README.md b/README.md
@@ -1,8 +1,8 @@
 [![CI](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml) [![Build Docker Image](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml)
 
-# PDF Text extraction service for Data House
+# PDF Text Extraction Service
 
-Extract text from PDFs keeping page information.
+A FastAPI application to extract text from PDF documents.
 
 ## Getting started
 
@@ -18,18 +18,9 @@ A sample [`docker-compose.yaml` file](./docker-compose.yaml) is available within
 > Please refer to [Releases](https://github.com/data-house/pdf-text-extractor/releases) and [Packages](https://github.com/data-house/pdf-text-extractor/pkgs/container/pdf-text-extractor) for the available tags.
 
 
-**Available environment variables**
-
-| variable | default | description |
-|------|---------|-------------|
-| `GUNICORN_WORKERS` | 2 | The number of [Gunicorn](https://docs.gunicorn.org/en/latest/settings.html#worker-class) sync workers |
-| `GUNICORN_WORKERS_TIMEOUT` | 600 | The timeout, in seconds, of each worker |
-
-
-
 ## Usage
 
-The PDF Text Extract service expose a web application on port `5000`. The available API receive a PDF file via a URL and return the extracted text as a JSON response.
+The PDF Text Extract service exposes a web application. The available API receives a PDF file via a URL and returns the extracted text as a JSON response.
 
 The exposed service is unauthenticated therefore consider exposing it only within a trusted network. If you plan to make it available publicly consider adding a reverse proxy with authentication in front.
 
@@ -38,44 +29,44 @@ The exposed service is unauthenticated therefore consider exposing it only withi
 The service expose only one endpoint `/extract-text` that accepts a `POST` request
 with the following input as a `json` body:
 
-- `url` the URL of the PDF file to process
-- `mime_type` the mime type of the file (it is expected to be `application/pdf`)
+- `url`: the URL of the PDF file to process.
+- `mime_type`: the mime type of the file (it is expected to be `application/pdf`).
+- `driver`: the extraction backend to use. Two drivers are currently implemented: `pymupdf` and `pdfact`.
 
 > **warning** The processing is performed synchronously
 
 
-The response will be a JSON containing:
-
-- `status` the status of the operation. Usually `ok`.
-- `content` a list of objects describing the chunked content with the page reference. Each object contains a `text` property with the part of the PDF text and a `metadata` object with the `page_number` property representing the page of the PDF from which the `text` was extracted.
+The response is a JSON object with the extracted text split into chunks. The structure is as follows:
 
-The following code block shows a possible output:
+- `text`: The list of chunks, each composed of:
+    - `text`: The text extracted from the chunk.
+    - `metadata`: A json with additional information regarding the chunk.
+- `fonts`: The list of fonts used in the document. 
+Each font is represented by `name`, `id`, `is-bold`, `is-type3` and `is-italic`. 
+Available only using `pdfact` driver.
+- `colors`: The list of colors used in the document.
+Each color is represented by `r`, `g`, `b` and `id`.
+Available only using `pdfact` driver.
 
-```json
-{
-  "status": "ok",
-  "content": [
-    {
-      "text": "This is a test PDF to be used as input in unit tests",
-      "metadata": {
-        "page_number": 1
-      }
-    }
-  ]
-}
-```
+The `metadata` of each chunk contains the following information:
+- `page`: The page number from which the chunk has been extracted.
+- `role`: The role of the chunk in the document (e.g., _heading_, _body_, etc.)
+- `positions`: A list of bounding boxes containing the text.
+Each bounding box is identified by 4 coordinates: `minY`, `minX`, `maxY` and `maxX`.
+- `font`: The font of the chunk.
+- `color`: The color of the chunk.
 
 ### Error handling
 
 The service can return the following errors
 
-| code | message | description |
-|------|---------|-------------|
-| `422` | No url found in request | In case the `url` field in the request is missing |
-| `422` | No mime_type found in request | In case the `mime_type` field in the request is missing |
-| `422` | Unsupported file type | In case the file is not a PDF |
-| `500` | Error while saving file | In case it was not possible to download the file from the specified URL |
-| `500` | Error while parsing file | In case it was not possible to open the file after download |
+| code  | message                       | description                                                             |
+|-------|-------------------------------|-------------------------------------------------------------------------|
+| `422` | No url found in request       | In case the `url` field in the request is missing                       |
+| `422` | No mime_type found in request | In case the `mime_type` field in the request is missing                 |
+| `422` | Unsupported file type         | In case the file is not a PDF                                           |
+| `500` | Error while saving file       | In case it was not possible to download the file from the specified URL |
+| `500` | Error while parsing file      | In case it was not possible to open the file after download             |
 
 
 The body of the response can contain a JSON with the following fields:
@@ -94,7 +85,7 @@ The body of the response can contain a JSON with the following fields:
 
 ## Development
 
-The PDF text extract service is built using [Flask](https://flask.palletsprojects.com/) on Python 3.9.
+The PDF text extract service is built using [FastAPI](https://fastapi.tiangolo.com/) and Python 3.9.
 
 Given the selected stack the development requires:
 
@@ -111,7 +102,7 @@ pip install -r requirements.txt
 Run the local development application using:
 
 ```bash
-python -m flask --app parsing_service run
+fastapi dev text_extractor_api/main.py
 ```
 
 

diff --git a/docker-compose.yaml b/docker-compose.yaml
@@ -1,16 +1,21 @@
 version: '3'
 
 networks:
-  web:
+  internal:
     driver: bridge
 
 services:
   app:
-    image: "ghcr.io/data-house/pdf-text-extractor:main"
-    environment:
-      GUNICORN_WORKERS: 2
-      GUNICORN_WORKERS_TIMEOUT: 600
+    build:
+      context: .
     networks:
-      - web
+        - internal
+    env_file:
+      - .env
     ports:
-      - "5200:5000"
+      - "5002:5000"
+
+  pdfact:
+    image: "ghcr.io/data-house/pdfact:main"
+    networks:
+      - internal
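
With the compose file above publishing the service on host port `5002`, a client call to the `/extract-text` endpoint described in the README hunk could be sketched as follows. This is a minimal illustration using only the standard library, not code from the commit: the helper names `build_extract_request` and `extract_text`, the `localhost` base URL, and the 60-second timeout are assumptions made for the example.

```python
import json
import urllib.request

# Assumption: the service is reached through the compose mapping "5002:5000".
DEFAULT_BASE_URL = "http://localhost:5002"

# The two drivers named in the README.
KNOWN_DRIVERS = ("pymupdf", "pdfact")


def build_extract_request(pdf_url, driver="pymupdf", base_url=DEFAULT_BASE_URL):
    """Build (but do not send) a POST request for the /extract-text endpoint."""
    if driver not in KNOWN_DRIVERS:
        raise ValueError(f"unknown driver: {driver}")
    payload = {
        "url": pdf_url,
        "mime_type": "application/pdf",
        "driver": driver,
    }
    return urllib.request.Request(
        url=f"{base_url}/extract-text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(pdf_url, driver="pymupdf", base_url=DEFAULT_BASE_URL, timeout=60):
    """Send the request and return the decoded JSON response.

    The call blocks until extraction finishes, since the README states that
    processing is synchronous. The timeout value is an arbitrary choice.
    """
    request = build_extract_request(pdf_url, driver, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect without a running container. A `422` or `500` response from the service surfaces as `urllib.error.HTTPError` when `extract_text` is called.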
diff --git a/gunicorn.sh b/gunicorn.sh
diff --git a/parsing_service/__init__.py b/parsing_service/__init__.py
diff --git a/parsing_service/implementation/chunk.py b/parsing_service/implementation/chunk.py
diff --git a/parsing_service/implementation/parser_factory.py b/parsing_service/implementation/parser_factory.py
diff --git a/parsing_service/implementation/pdf_parser.py b/parsing_service/implementation/pdf_parser.py
diff --git a/parsing_service/models/parser.py b/parsing_service/models/parser.py
diff --git a/requirements.txt b/requirements.txt
@@ -1,6 +1,8 @@
-Flask==2.3.2
 pandas==2.0.2
 pymupdf==1.22.5
 numpy~=1.24.3
-requests==2.31.0
-gunicorn==20.1.0; platform_system != "Windows"
+requests==2.32.0
+fastapi~=0.111.0
+pydantic~=2.7.1
+pydantic_settings~=2.2.1
+uvicorn==0.22.0
diff --git a/text_extractor/__init__.py b/text_extractor/__init__.py
@@ -0,0 +1,5 @@
+import logging
+
+from text_extractor.logger import init_logger
+
+logger = logging.getLogger(__name__)
diff --git a/parsing_service/logger.py → text_extractor/logger.py b/parsing_service/logger.py → text_extractor/logger.py
diff --git a/text_extractor/models/__init__.py b/text_extractor/models/__init__.py
@@ -0,0 +1,5 @@
+from .color import Color
+from .document import Document
+from .font import Font
+from .paragraph import Paragraph, Metadata
+from .position import Position
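
The model modules imported here are not shown in this excerpt, so their actual fields are unknown. As a rough illustration of the response shape described in the README hunk, the sketch below mirrors those names with standard-library dataclasses instead of the pydantic models the repository depends on. The attribute names, the optional defaults, and the `parse_document` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Position:
    """One bounding box; the JSON keys are minX, minY, maxX and maxY."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


@dataclass
class Font:
    id: str
    name: str
    is_bold: bool
    is_italic: bool
    is_type3: bool


@dataclass
class Color:
    id: str
    r: int
    g: int
    b: int


@dataclass
class Metadata:
    page: Optional[int] = None
    role: Optional[str] = None
    positions: List[Position] = field(default_factory=list)
    font: Any = None   # representation not specified in the README
    color: Any = None  # representation not specified in the README


@dataclass
class Paragraph:
    text: str
    metadata: Metadata


@dataclass
class Document:
    text: List[Paragraph]
    fonts: List[Font] = field(default_factory=list)    # pdfact driver only
    colors: List[Color] = field(default_factory=list)  # pdfact driver only


def parse_document(payload: dict) -> Document:
    """Convert a decoded JSON response into the dataclasses above."""
    paragraphs = []
    for chunk in payload.get("text") or []:
        meta = chunk.get("metadata") or {}
        positions = [
            Position(p["minX"], p["minY"], p["maxX"], p["maxY"])
            for p in meta.get("positions") or []
        ]
        paragraphs.append(Paragraph(
            text=chunk["text"],
            metadata=Metadata(meta.get("page"), meta.get("role"), positions,
                              meta.get("font"), meta.get("color")),
        ))
    fonts = [
        Font(f["id"], f["name"], f["is-bold"], f["is-italic"], f["is-type3"])
        for f in payload.get("fonts") or []
    ]
    colors = [Color(c["id"], c["r"], c["g"], c["b"])
              for c in payload.get("colors") or []]
    return Document(text=paragraphs, fonts=fonts, colors=colors)
```

The hyphenated JSON keys such as `is-bold` are mapped to underscore attributes by hand here; a pydantic model would more likely express the same mapping with field aliases.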