FastAPI and PDFAct new implementation (#2)
* Replace Flask with FastAPI

* Add PDFAct driver

---------

Co-authored-by: AnnaMarika01 <[email protected]>
andreaponti5 and AnnaMarika01 authored Jun 4, 2024
1 parent 24103fa commit b18be23
Showing 33 changed files with 485 additions and 235 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -23,4 +23,4 @@ jobs:
fetch-depth: 1

- name: Lint the Shell scripts
run: shellcheck ./gunicorn.sh
run: shellcheck ./uvicorn.sh
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
.idea
logs
.env
10 changes: 6 additions & 4 deletions Dockerfile
@@ -33,13 +33,15 @@ WORKDIR /app
COPY --from=build-image /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY parsing_service/ parsing_service/
COPY root.py gunicorn.sh ./

RUN chmod +x ./gunicorn.sh
COPY text_extractor_api/ text_extractor_api/
COPY text_extractor/ text_extractor/
COPY root.py uvicorn.sh ./

RUN chmod +x ./uvicorn.sh

EXPOSE 5000/tcp

ENTRYPOINT ["tini", "--"]

CMD ["/app/gunicorn.sh"]
CMD ["/app/uvicorn.sh"]
73 changes: 32 additions & 41 deletions README.md
@@ -1,8 +1,8 @@
[![CI](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml) [![Build Docker Image](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml)

# PDF Text extraction service for Data House
# PDF Text Extraction Service

Extract text from PDFs keeping page information.
A FastAPI application to extract text from PDF documents.

## Getting started

@@ -18,18 +18,9 @@ A sample [`docker-compose.yaml` file](./docker-compose.yaml) is available within
> Please refer to [Releases](https://github.com/data-house/pdf-text-extractor/releases) and [Packages](https://github.com/data-house/pdf-text-extractor/pkgs/container/pdf-text-extractor) for the available tags.

**Available environment variables**

| variable | default | description |
|------|---------|-------------|
| `GUNICORN_WORKERS` | 2 | The number of [Gunicorn](https://docs.gunicorn.org/en/latest/settings.html#worker-class) sync workers |
| `GUNICORN_WORKERS_TIMEOUT` | 600 | The timeout, in seconds, of each worker |



## Usage

The PDF Text Extract service exposes a web application on port `5000`. The available API receives a PDF file via a URL and returns the extracted text as a JSON response.
The PDF Text Extract service exposes a web application. The available API receives a PDF file via a URL and returns the extracted text as a JSON response.

The exposed service is unauthenticated, so consider exposing it only within a trusted network. If you plan to make it publicly available, consider adding a reverse proxy with authentication in front of it.

@@ -38,44 +29,44 @@ The exposed service is unauthenticated therefore consider exposing it only withi
The service exposes only one endpoint, `/extract-text`, which accepts a `POST` request
with the following input as a JSON body:

- `url` the URL of the PDF file to process
- `mime_type` the mime type of the file (it is expected to be `application/pdf`)
- `url`: the URL of the PDF file to process.
- `mime_type`: the mime type of the file (it is expected to be `application/pdf`).
- `driver`: the extraction backend to use. Two drivers are currently implemented: `pymupdf` and `pdfact`.

> **warning** The processing is performed synchronously
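
As an illustration of the request described above, here is a minimal client sketch using only the Python standard library. The base URL is an assumption (the `docker-compose.yaml` in this commit appears to map host port `5002` to container port `5000`), and the helper names `build_extract_request` and `extract_text` as well as the default timeout are hypothetical, not part of the service.

```python
import json
import urllib.request

# Assumption: the service is reachable on the host port mapped in docker-compose.yaml.
BASE_URL = "http://localhost:5002"

# The two drivers named in the README.
SUPPORTED_DRIVERS = ("pymupdf", "pdfact")


def build_extract_request(pdf_url: str, driver: str = "pymupdf") -> urllib.request.Request:
    """Build (but do not send) the POST request for the /extract-text endpoint."""
    if driver not in SUPPORTED_DRIVERS:
        raise ValueError(f"Unknown driver {driver!r}, expected one of {SUPPORTED_DRIVERS}")
    payload = {
        "url": pdf_url,
        "mime_type": "application/pdf",
        "driver": driver,
    }
    return urllib.request.Request(
        f"{BASE_URL}/extract-text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(pdf_url: str, driver: str = "pymupdf", timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON response.

    Processing is synchronous, so the (hypothetical) default timeout is generous.
    """
    request = build_extract_request(pdf_url, driver)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running service.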

The response will be a JSON containing:

- `status` the status of the operation. Usually `ok`.
- `content` a list of objects describing the chunked content with the page reference. Each object contains a `text` property with the part of the PDF text and a `metadata` object with the `page_number` property representing the page of the PDF from which the `text` was extracted.
The response is a JSON object with the extracted text split into chunks. The structure is as follows:

The following code block shows a possible output:
- `text`: The list of chunks, each composed of:
  - `text`: The text extracted from the chunk.
  - `metadata`: A JSON object with additional information regarding the chunk.
- `fonts`: The list of fonts used in the document.
  Each font is represented by `name`, `id`, `is-bold`, `is-type3` and `is-italic`.
  Available only when using the `pdfact` driver.
- `colors`: The list of colors used in the document.
  Each color is represented by `r`, `g`, `b` and `id`.
  Available only when using the `pdfact` driver.

```json
{
"status": "ok",
"content": [
{
"text": "This is a test PDF to be used as input in unit tests",
"metadata": {
"page_number": 1
}
}
]
}
```
The `metadata` of each chunk contains the following information:
- `page`: The page number from which the chunk has been extracted.
- `role`: The role of the chunk in the document (e.g., _heading_, _body_, etc.).
- `positions`: A list of bounding boxes containing the text.
  Each bounding box is identified by 4 coordinates: `minY`, `minX`, `maxY` and `maxX`.
- `font`: The font of the chunk.
- `color`: The color of the chunk.
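
To make the response structure above concrete, the following sketch groups chunk text by page. It assumes only the field names listed above (`text`, `metadata`, `page`); the helper name `text_by_page` is hypothetical, and the sample response in the usage note is made up for illustration, not real service output.

```python
from collections import defaultdict


def text_by_page(response: dict) -> dict:
    """Join the text of all chunks that share the same `metadata.page` value."""
    pages = defaultdict(list)
    for chunk in response.get("text", []):
        # Chunks without a page end up under the key None.
        page = chunk.get("metadata", {}).get("page")
        pages[page].append(chunk.get("text", ""))
    return {page: "\n".join(parts) for page, parts in pages.items()}
```

For example, with a hand-written response containing a heading and a body chunk on one page plus a body chunk on the next, `text_by_page` returns one joined string per page, in the order the chunks were listed.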

### Error handling

The service can return the following errors

| code | message | description |
|------|---------|-------------|
| `422` | No url found in request | In case the `url` field in the request is missing |
| `422` | No mime_type found in request | In case the `mime_type` field in the request is missing |
| `422` | Unsupported file type | In case the file is not a PDF |
| `500` | Error while saving file | In case it was not possible to download the file from the specified URL |
| `500` | Error while parsing file | In case it was not possible to open the file after download |
| code | message | description |
|-------|-------------------------------|-------------------------------------------------------------------------|
| `422` | No url found in request | In case the `url` field in the request is missing |
| `422` | No mime_type found in request | In case the `mime_type` field in the request is missing |
| `422` | Unsupported file type | In case the file is not a PDF |
| `500` | Error while saving file | In case it was not possible to download the file from the specified URL |
| `500` | Error while parsing file | In case it was not possible to open the file after download |
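
A client may want to treat the two status codes in the table differently: `422` points at a problem with the request itself, while `500` points at a failure while downloading or parsing the file. The sketch below shows one possible mapping; the function name and the category labels are hypothetical and not defined by the service.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code from /extract-text to a coarse client-side category."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 422:
        # Missing `url` or `mime_type`, or an unsupported file type: fix the request.
        return "invalid_request"
    if status_code == 500:
        # The file could not be downloaded or parsed by the service.
        return "processing_failed"
    return "unexpected"
```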


The body of the response can contain a JSON with the following fields:
@@ -94,7 +85,7 @@ The body of the response can contain a JSON with the following fields:

## Development

The PDF text extract service is built using [Flask](https://flask.palletsprojects.com/) on Python 3.9.
The PDF text extract service is built using [FastAPI](https://fastapi.tiangolo.com/) and Python 3.9.

Given the selected stack the development requires:

@@ -111,7 +102,7 @@ pip install -r requirements.txt
Run the local development application using:

```bash
python -m flask --app parsing_service run
fastapi dev text_extractor_api/main.py
```


19 changes: 12 additions & 7 deletions docker-compose.yaml
@@ -1,16 +1,21 @@
version: '3'

networks:
web:
internal:
driver: bridge

services:
app:
image: "ghcr.io/data-house/pdf-text-extractor:main"
environment:
GUNICORN_WORKERS: 2
GUNICORN_WORKERS_TIMEOUT: 600
build:
context: .
networks:
- web
- internal
env_file:
- .env
ports:
- "5200:5000"
- "5002:5000"

pdfact:
image: "ghcr.io/data-house/pdfact:main"
networks:
- internal
2 changes: 0 additions & 2 deletions gunicorn.sh

This file was deleted.

80 changes: 0 additions & 80 deletions parsing_service/__init__.py

This file was deleted.

10 changes: 0 additions & 10 deletions parsing_service/implementation/chunk.py

This file was deleted.

19 changes: 0 additions & 19 deletions parsing_service/implementation/parser_factory.py

This file was deleted.

41 changes: 0 additions & 41 deletions parsing_service/implementation/pdf_parser.py

This file was deleted.

20 changes: 0 additions & 20 deletions parsing_service/models/parser.py

This file was deleted.

8 changes: 5 additions & 3 deletions requirements.txt
@@ -1,6 +1,8 @@
Flask==2.3.2
pandas==2.0.2
pymupdf==1.22.5
numpy~=1.24.3
requests==2.31.0
gunicorn==20.1.0; platform_system != "Windows"
requests==2.32.0
fastapi~=0.111.0
pydantic~=2.7.1
pydantic_settings~=2.2.1
uvicorn==0.22.0
5 changes: 5 additions & 0 deletions text_extractor/__init__.py
@@ -0,0 +1,5 @@
import logging

from text_extractor.logger import init_logger

logger = logging.getLogger(__name__)
File renamed without changes.
5 changes: 5 additions & 0 deletions text_extractor/models/__init__.py
@@ -0,0 +1,5 @@
from .color import Color
from .document import Document
from .font import Font
from .paragraph import Paragraph, Metadata
from .position import Position