RESTful HTTP API for Pandoc

Pandoc is an amazing universal document converter. Unfortunately, it only offers a command-line interface. This project makes Pandoc usable via a RESTful HTTP API, provides a mapping of Pandoc type identifiers to common media types, and wraps everything in a Docker container so that it can be easily used and deployed.

API

The API only allows POST requests. The data to be converted must be passed in the request body. The header field Content-Type specifies the input type and the header field Accept specifies the output type.
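The request shape described above can be wrapped in a small shell function. This is a minimal sketch, not part of the project: the function name `pandoc_http` is hypothetical, and the default URL assumes the API listens on port 8080 of localhost.

```shell
# Hypothetical helper (not part of this project): convert stdin via the HTTP API.
#   $1 = input type, sent as Content-Type
#   $2 = output type, sent as Accept
#   $3 = base URL (assumed default: http://localhost:8080/)
pandoc_http() {
  # --data-binary makes curl send a POST and keeps the body byte-for-byte;
  # "@-" tells curl to read the request body from stdin.
  curl -s \
    -H "Content-Type: $1" \
    -H "Accept: $2" \
    --data-binary @- \
    "${3:-http://localhost:8080/}"
}
```

With the API running, it could be used like this: `echo '<h1>My Headline</h1>' | pandoc_http text/html text/markdown`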

Media Type Mapping

Since Pandoc uses its own type identifiers for input and output format, we created a mapping between the Pandoc identifiers and the corresponding media types. For instance, the Pandoc identifier html maps to the media type text/html.

The mapping is incomplete because not every format supported by Pandoc has a media type. Therefore, you can also use the Pandoc identifiers directly in Content-Type and Accept, but this is not compliant with the HTTP specification. To stay compliant, you can put the prefix application/x. in front of a Pandoc identifier (for instance, application/x.docx). The x. prefix marks the official media type tree for unregistered types.
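The lookup rule can be sketched as a shell function that returns a known media type where one exists and falls back to the application/x. prefix otherwise. This is an illustration only: the function name `to_media_type` is hypothetical, the three entries are the ones that appear in this README, and the project's actual mapping table is larger.

```shell
# Illustrative sketch only, not the project's real mapping table.
# Prints a media type for a Pandoc identifier passed as $1.
to_media_type() {
  case "$1" in
    html)     printf '%s\n' 'text/html' ;;
    markdown) printf '%s\n' 'text/markdown' ;;
    docx)     printf '%s\n' 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' ;;
    # No known media type: fall back to the unregistered "x." tree.
    *)        printf 'application/x.%s\n' "$1" ;;
  esac
}
```

For example, `to_media_type html` prints `text/html`, while `to_media_type org` prints `application/x.org`.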

Installation with Docker

To simplify the usage of this project, we wrapped everything into a docker container that can easily be deployed on any machine.

Pandoc uses LaTeX to create PDFs. Since the LaTeX dependencies add roughly 2 GB to the Docker image, we decided to create two images:

  • dwolters/pandoc-http:latest does not include LaTeX and is therefore unable to create PDFs (uncompressed ~700 MB, compressed ~280 MB). The :latest tag is added by default if no tag is specified.
  • dwolters/pandoc-http:latex includes LaTeX and can be used to create PDFs (uncompressed ~2.7 GB, compressed ~2 GB). It takes a while to build or pull this image.

You can build the image yourself:

docker build -t dwolters/pandoc-http .

Or pull it from Docker Hub:

docker pull dwolters/pandoc-http

Afterwards, you can start the container:

docker run -d -p 8080:80 --name my-pandoc-http dwolters/pandoc-http

Within the container the HTTP API is reachable on port 80. In the command above the HTTP API is bound to port 8080 of the docker host.
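To pick between the two image tags and choose the host port in one step, the run command can be wrapped in a function. This is a rough sketch under stated assumptions: the function name `start_pandoc_http` is hypothetical, and it reuses the container name from the command above, so only one such container can exist at a time.

```shell
# Hypothetical helper (not part of this project): start the container.
#   $1 = image tag, "latest" or "latex" (default: latest)
#   $2 = host port to bind to container port 80 (default: 8080)
start_pandoc_http() {
  docker run -d \
    -p "${2:-8080}:80" \
    --name my-pandoc-http \
    "dwolters/pandoc-http:${1:-latest}"
}
```

For instance, `start_pandoc_http latex 8081` would start the LaTeX-enabled image with the API bound to port 8081 of the Docker host.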

You can stop and remove the container if it is not needed anymore:

docker stop my-pandoc-http
docker rm my-pandoc-http

Installation without Docker

To use this project without the Docker container, you must first install Pandoc and add it to your PATH. Alternatively, you can set the PANDOC environment variable to the location of your pandoc executable.

Afterwards, clone the repository and switch to the proper directory:

git clone https://github.com/dwolters/pandoc-http
cd pandoc-http

Install the dependencies:

npm install

And finally, you can start the HTTP API for Pandoc:

node server.js

The API can run on a different port by setting the PORT environment variable, e.g., on port 8080:

PORT=8080 node server.js
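Before starting the server, it can help to check which pandoc executable is available. The sketch below is a pre-flight check only: the function name `find_pandoc` is hypothetical, and it assumes that PANDOC, when set, takes precedence over the PATH lookup, which this README does not state explicitly.

```shell
# Hypothetical pre-flight check (not part of this project).
# Prints the pandoc location: $PANDOC if set, otherwise the first match on PATH.
# Returns non-zero if PANDOC is unset and no pandoc is found on PATH.
find_pandoc() {
  if [ -n "${PANDOC:-}" ]; then
    printf '%s\n' "$PANDOC"
  else
    command -v pandoc
  fi
}
```

A possible use is `find_pandoc && PORT=8080 node server.js`, which starts the API only if an executable was found.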

Example API Call

Assuming the API listens on port 8080, you can test it with curl. The following commands convert HTML to Markdown, convert HTML to docx and save the result as file.docx, and convert that docx file back to Markdown:

curl -s -H "Content-Type: text/html" -H "Accept: text/markdown" --data "<h1>My Headline</h1>"  http://localhost:8080/
curl -s -H "Content-Type: text/html" -H "Accept: docx" --data "<h1>My Headline</h1>"  http://localhost:8080/ > file.docx
curl -s -H "Content-Type: docx" -H "Accept: text/markdown" --data-binary "@file.docx"  http://localhost:8080/

Please note that in this example the pandoc identifier for docx files is used. The correct media type would be application/vnd.openxmlformats-officedocument.wordprocessingml.document.
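For an HTTP-compliant request, the docx conversion can use the full media type instead of the Pandoc identifier. This is a minimal sketch, assuming the service's mapping accepts the media type named in the note above; the function name `docx_to_markdown` and the variable `DOCX_TYPE` are hypothetical.

```shell
# Media type for docx files, as stated in the note above.
DOCX_TYPE='application/vnd.openxmlformats-officedocument.wordprocessingml.document'

# Hypothetical helper (not part of this project): convert a docx file to Markdown.
#   $1 = path to the docx file
#   $2 = base URL (assumed default: http://localhost:8080/)
docx_to_markdown() {
  # "@$1" makes curl read the request body from the given file, unmodified.
  curl -s \
    -H "Content-Type: $DOCX_TYPE" \
    -H "Accept: text/markdown" \
    --data-binary "@$1" \
    "${2:-http://localhost:8080/}"
}
```

With the API running, `docx_to_markdown file.docx` would send the same conversion as the third command above, using the media type rather than the identifier.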

Swagger Description

The script generate-swagger-spec.js automatically generates the Swagger description for this service based on the supported input and output formats (as listed by pandoc --list-input-formats and pandoc --list-output-formats). The Swagger description can be generated in either YAML or JSON format. The npm scripts generate-swagger-json and generate-swagger-yaml write the generated description to a file with a fixed filename (pandoc.swagger.json and pandoc.swagger.yaml, respectively). To save the description to a file with a custom filename, run

node generate-swagger-spec.js [--json|--yaml] > your-filename-here.ext

Acknowledgements

The Dockerfile is partially based on the Dockerfile of vpetersson's pandoc container.