Merge branch 'develop' into stable
royjohal committed Sep 16, 2019
2 parents 9040be4 + cf865aa commit a56ca58
Showing 196 changed files with 19,314 additions and 1,142 deletions.
105 changes: 5 additions & 100 deletions .drone.yml
@@ -6,17 +6,14 @@ platform:
os: linux
arch: amd64

node:
memory: high

steps:
- name: Change file ownership
image: alpine:latest
commands:
- chown -R 1001:0 /drone/src

- name: Build project
image: axarev/documentparser
image: axarev/parsr
environment:
LD_LIBRARY_PATH: /opt/rh/rh-nodejs8/root/usr/lib64
NODE_ENV: development
@@ -25,7 +22,7 @@ steps:
- npm install

- name: Run formatter
image: axarev/documentparser
image: axarev/parsr
environment:
LD_LIBRARY_PATH: /opt/rh/rh-nodejs8/root/usr/lib64
commands:
@@ -38,36 +35,17 @@ steps:
- npm run lint

- name: Run tests
image: axarev/documentparser
image: axarev/parsr
environment:
LD_LIBRARY_PATH: /opt/rh/rh-nodejs8/root/usr/lib64
commands:
- export PATH=/opt/rh/rh-nodejs8/root/usr/bin:$PATH
- npm run test

- name: Code-analysis
image: aosapps/drone-sonar-plugin:1.0
settings:
sonar_host:
from_secret: sonar_host
sonar_token:
from_secret: sonar_token
when:
branch:
- master

- name: Tag with demo
image: busybox
commands:
- echo demo > .tags
when:
branch:
- demo

- name: Build Docker image
image: plugins/docker
settings:
repo: axarev/documentparser
repo: axarev/parsr
context: .
dockerfile: docker/parsr/Dockerfile
username:
@@ -76,84 +54,11 @@ steps:
from_secret: registry_password
build_args:
DEV_MODE: 'true'
# auto_tag: true
when:
branch:
- develop
- demo
event:
exclude:
- pull_request

- name: Deploy dev
image: docker
environment:
DOCKER_HOST:
from_secret: docker_host
CA:
from_secret: docker_ca
CLIENT_CERT:
from_secret: docker_cert
CLIENT_KEY:
from_secret: docker_key
DOCKER_CERT_PATH: /cert
DOCKER_TLS_VERIFY: 1
DOCKER_IMAGE: axarev/documentparser:latest
DOCKER_SERVICE: documentparser_documentparser-dev
REGISTRY_USER:
from_secret: registry_user
REGISTRY_PASSWORD:
from_secret: registry_password
commands:
- mkdir -p "$DOCKER_CERT_PATH"
- echo "$CA" > $DOCKER_CERT_PATH/ca.pem
- echo "$CLIENT_CERT" > $DOCKER_CERT_PATH/cert.pem
- echo "$CLIENT_KEY" > $DOCKER_CERT_PATH/key.pem
- docker login -u "$REGISTRY_USER" -p"$REGISTRY_PASSWORD"
- docker service update --with-registry-auth --image $DOCKER_IMAGE $DOCKER_SERVICE
- rm -rf $DOCKER_CERT_PATH
when:
branch:
- develop
- drone-ci
- feature/drone*
event:
exclude:
- pull_request

- name: Deploy demo
image: docker
environment:
DOCKER_HOST:
from_secret: docker_host
CA:
from_secret: docker_ca
CLIENT_CERT:
from_secret: docker_cert
CLIENT_KEY:
from_secret: docker_key
DOCKER_CERT_PATH: /cert
DOCKER_TLS_VERIFY: 1
DOCKER_IMAGE: axarev/documentparser:demo
DOCKER_SERVICE: documentparser_parsr-demo
REGISTRY_USER:
from_secret: registry_user
REGISTRY_PASSWORD:
from_secret: registry_password
commands:
- mkdir -p "$DOCKER_CERT_PATH"
- echo "$CA" > $DOCKER_CERT_PATH/ca.pem
- echo "$CLIENT_CERT" > $DOCKER_CERT_PATH/cert.pem
- echo "$CLIENT_KEY" > $DOCKER_CERT_PATH/key.pem
- docker login -u "$REGISTRY_USER" -p"$REGISTRY_PASSWORD"
- docker service update --with-registry-auth --image $DOCKER_IMAGE $DOCKER_SERVICE
- rm -rf $DOCKER_CERT_PATH
when:
branch:
- demo
event:
exclude:
- pull_request


image_pull_secrets:
- dockerconfigjson
3 changes: 2 additions & 1 deletion .vscode/launch.json
@@ -20,7 +20,8 @@
"-p"
],
"env": {
"NODE_DEBUG": "pipeline"
"NODE_DEBUG": "pipeline",
"GOOGLE_APPLICATION_CREDENTIALS": "${workspaceRoot}/***.json"
},
"outputCapture": "std"
}
42 changes: 34 additions & 8 deletions README.md
@@ -2,6 +2,8 @@

# Parsr: Turn your documents into data!

[中文](README_zh-cn.md)

**Parsr** is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data for data scientists and developers.

It provides users with a clean, structured and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival, and many others.
@@ -58,13 +60,15 @@ Under a **Debian** based distribution:
```sh
sudo add-apt-repository ppa:ubuntuhandbook1/apps
sudo apt-get update
sudo apt-get install nodejs npm qpdf imagemagick pdf2json tesseract-ocr libtesseract-dev
sudo apt-get install nodejs npm qpdf imagemagick pdf2json python-pdfminer tesseract-ocr libtesseract-dev python3-tk ghostscript python3-pip
pip install camelot-py
```

Under **Arch** Linux :

```sh
pacman -S nodejs npm qpdf imagemagick pdf2json tesseract
pacman -S nodejs npm qpdf imagemagick pdf2json pdfminer tesseract python-pip
pip install camelot-py
```

#### 1.2.2. Installing Dependencies under MacOS
@@ -82,6 +86,21 @@ Next, install the required dependencies:
brew install node qpdf imagemagick pdf2json tesseract tesseract-lang
```

To install the Python-based dependencies (pdfminer and camelot), first install `pip`:

```sh
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
```

and then the dependencies:

```sh
pip install pdfminer.six
pip install python-tk ghostscript camelot-py

```
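
As a quick check after the `pip install` steps above, the following minimal sketch (not part of this commit) reports which Python modules are still missing. The helper name `missing_modules` is hypothetical; `pdfminer` and `camelot` are the import names provided by the `pdfminer.six` and `camelot-py` packages. The check only confirms that a module can be located, not that it works correctly.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed from the packages installed above.
    missing = missing_modules(["pdfminer", "camelot"])
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    else:
        print("pdfminer and camelot can be located.")
```

Run it with the same interpreter that `pip` installed into; otherwise the result may not reflect the environment Parsr will use.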

#### 1.2.3. Installing Dependencies under Windows

1. We recommend using [Chocolatey](https://chocolatey.org) as the package manager for installing dependencies under Windows. To install Chocolatey, [follow these instructions](https://chocolatey.org/install#installing-chocolatey).
@@ -93,6 +112,8 @@ brew install node qpdf imagemagick pdf2json tesseract tesseract-lang
```

3. [Download and install **`node.js`**](https://nodejs.org/en/download)
4. For table detection, install [**camelot**](https://camelot-py.readthedocs.io/en/master/user/install-deps.html#for-windows).
5. For the **pdfminer** extractor for pdfs, [follow these steps](https://github.com/pdfminer/pdfminer.six#how-to-install).

##### 1.2.3.1. pdf2json

@@ -112,6 +133,10 @@ You can download Tesseract 4.0 64-bit for Windows or check out other available f
Then, you need to add tesseract.exe to your PATH:
If you have installed it in `C:\Program Files (x86)\Tesseract-OCR`, you can either add it [using the user interface](https://docs.alfresco.com/4.2/tasks/fot-addpath.html) or execute the following command in PowerShell (Run as Administrator):

```sh
setx PATH "\$env:PATH;C:\Program Files (x86)\Tesseract-OCR" -m
```
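
To confirm that the PATH change took effect, a minimal sketch along these lines (not part of this commit) can be run from a newly opened terminal. The helper name `missing_tools` is hypothetical, and the executable name `tesseract` is an assumption based on the `tesseract.exe` mentioned above.

```python
import shutil


def missing_tools(names):
    """Return the executable names from `names` that are not found on PATH."""
    # shutil.which returns None when no matching executable is found;
    # on Windows it also tries the extensions listed in PATHEXT (e.g. .exe).
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["tesseract"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract was found at:", shutil.which("tesseract"))
```

The same helper can be given other command names from the dependency lists to check them in one pass.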

### 1.3. Optional Dependencies

The following dependencies are **completely optional**, and their exclusion does not hinder the proper functioning of the Parsr pipeline.
@@ -216,10 +241,10 @@ The tool contains a pipeline of modules that process the document step by step a
To start the web viewer demo, simply run:
```sh
npm run start:web
npm run start:web:vue
```
Then, open [localhost:3000](http://localhost:3000) with your favorite browser.
Then, open [localhost:8080](http://localhost:8080) with your favorite browser.
#### 2.2.3. Command Line Usage
@@ -321,10 +346,11 @@ Third Party Libraries licenses :
1. **QPDF**: Apache [http://qpdf.sourceforge.net](http://qpdf.sourceforge.net/)
2. **ImageMagick**: Apache 2.0 [https://imagemagick.org/script/license.php](https://imagemagick.org/script/license.php)
3. **Pdf2json**: Apache 2.0 [https://github.com/modesty/pdf2json/blob/scratch/quadf-forms/license.txt](https://github.com/modesty/pdf2json/blob/scratch/quadf-forms/license.txt)
4. **Tesseract**: Apache 2.0 [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
5. **Camelot**: MIT [https://github.com/camelot-dev/camelot](https://github.com/camelot-dev/camelot)
6. **MuPDF** (Optional dependency): AGPL [https://mupdf.com/license.html](https://mupdf.com/license.html)
7. **Pandoc** (Optional dependency): GPL [https://github.com/jgm/pandoc](https://github.com/jgm/pandoc)
4. **Pdfminer.six**: MIT [https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE](https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE)
5. **Tesseract**: Apache 2.0 [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
6. **Camelot**: MIT [https://github.com/camelot-dev/camelot](https://github.com/camelot-dev/camelot)
7. **MuPDF** (Optional dependency): AGPL [https://mupdf.com/license.html](https://mupdf.com/license.html)
8. **Pandoc** (Optional dependency): GPL [https://github.com/jgm/pandoc](https://github.com/jgm/pandoc)
## 7. License