Pandoc is an amazing universal document converter. Unfortunately, it only has a command-line interface. This project makes Pandoc usable via a RESTful HTTP API, provides a mapping of Pandoc type identifiers to common media types, and wraps everything in a docker container so that it can be easily used and deployed.
The API only allows POST requests. The data to be converted must be passed in the request body. The header field Content-Type specifies the input type and the header field Accept specifies the output type.
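As a rough illustration of the request shape described above, the following sketch builds such a request for the fetch function that ships with recent Node.js versions. The helper names buildConversionRequest and convert are hypothetical and not part of this project, and the assumption that a failed conversion yields a non-2xx status is ours, not something this README states.

```javascript
// Hypothetical helper (not part of this project): builds the options
// for a fetch() call against the API.
function buildConversionRequest(input, inputType, outputType) {
  return {
    method: 'POST', // the API only allows POST requests
    headers: {
      'Content-Type': inputType, // format of the data in the request body
      'Accept': outputType, // format the response should have
    },
    body: input,
  };
}

// Sends the request and returns the converted document as text.
// Assumption: the API signals a failed conversion with a non-2xx status.
async function convert(baseUrl, input, inputType, outputType) {
  const response = await fetch(
    baseUrl,
    buildConversionRequest(input, inputType, outputType)
  );
  if (!response.ok) {
    throw new Error(`Conversion failed with HTTP status ${response.status}`);
  }
  return response.text();
}
```

With the API listening on port 8080, a call could look like convert('http://localhost:8080/', '<h1>My Headline</h1>', 'text/html', 'text/markdown'). For binary output formats, response.arrayBuffer() would be the better fit than response.text().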
Since Pandoc uses its own type identifiers for input and output formats, we created a mapping between the Pandoc identifiers and the corresponding media types. For instance, the Pandoc identifier html maps to the media type text/html.
The mapping is incomplete because not every format supported by Pandoc has a media type. Therefore, you can also use the Pandoc identifiers directly in Content-Type and Accept, but this is not compliant with the HTTP specification. To be compliant, we support application/x. as a prefix in front of a Pandoc identifier, for example application/x.docx. This prefix is the official media type tree for unregistered types.
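The three ways of naming a format (registered media type, application/x. prefix, bare Pandoc identifier) could be resolved roughly as follows. This is only a sketch of the idea, not the project's actual code: the function name toPandocFormat is hypothetical, and the table is a small excerpt limited to the pairs mentioned in this README.

```javascript
// Illustrative excerpt only; the project's real mapping may differ.
const MEDIA_TYPE_TO_PANDOC = new Map([
  ['text/html', 'html'],
  ['text/markdown', 'markdown'],
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'docx'],
]);

// Prefix of the media type tree for unregistered types.
const UNREGISTERED_PREFIX = 'application/x.';

// Hypothetical helper: turns a Content-Type or Accept value into a
// Pandoc identifier.
function toPandocFormat(headerValue) {
  // Drop parameters such as "; charset=utf-8" and normalize the case.
  const type = headerValue.split(';')[0].trim().toLowerCase();
  if (MEDIA_TYPE_TO_PANDOC.has(type)) {
    return MEDIA_TYPE_TO_PANDOC.get(type); // registered media type
  }
  if (type.startsWith(UNREGISTERED_PREFIX)) {
    return type.slice(UNREGISTERED_PREFIX.length); // application/x.<identifier>
  }
  return type; // assume a bare Pandoc identifier
}
```

The sketch deliberately ignores Accept headers that list several types or quality values; it only handles a single type per header.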
To simplify the usage of this project, we wrapped everything into a docker container that can easily be deployed on any machine.
Pandoc uses LaTeX to create PDFs. Since the LaTeX dependencies add roughly 2 GB to the docker image, we decided to create two images:
dwolters/pandoc-http:latest does not include LaTeX and is therefore unable to create PDFs (uncompressed ~700 MB, compressed ~280 MB). The :latest tag is added by default if no tag is specified.
dwolters/pandoc-http:latex includes LaTeX and can be used to create PDFs (uncompressed ~2.7 GB, compressed ~2 GB). It takes a while to build or pull this image.
You can build the image yourself:
docker build -t dwolters/pandoc-http .
Or install it via docker hub:
docker pull dwolters/pandoc-http
Afterwards, you can start the container:
docker run -d -p 8080:80 --name my-pandoc-http dwolters/pandoc-http
Within the container the HTTP API is reachable on port 80. In the command above the HTTP API is bound to port 8080 of the docker host.
You can stop and remove the container if it is not needed anymore:
docker stop my-pandoc-http
docker rm my-pandoc-http
In order to use this project without the docker container, you first must install Pandoc and add it to your PATH. Alternatively, you can set the PANDOC environment variable to define the location of your pandoc executable.
Afterwards, clone the repository and switch to the proper directory:
git clone https://github.com/dwolters/pandoc-http
cd pandoc-http
Install the dependencies:
npm install
And finally, you can start the HTTP API for Pandoc:
node server.js
The API can run on a different port by setting the PORT environment variable, e.g., on port 8080:
PORT=8080 node server.js
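For readers wondering how the PANDOC and PORT environment variables might be picked up, here is a minimal sketch. It is not taken from server.js: the helper readSettings is hypothetical, and the fallback port is passed in explicitly because this README does not state which port is used when PORT is unset.

```javascript
// Hypothetical helper (not taken from server.js): derives the settings
// from an environment object such as process.env.
function readSettings(env, defaultPort) {
  const port = Number.parseInt(env.PORT, 10);
  return {
    // Without PANDOC, rely on "pandoc" being found via the PATH.
    pandocPath: env.PANDOC || 'pandoc',
    // Use PORT only if it parses as an integer.
    port: Number.isInteger(port) ? port : defaultPort,
  };
}

// The fallback of 80 is an assumption made for this example.
const settings = readSettings(process.env, 80);
console.log(`pandoc executable: ${settings.pandocPath}, port: ${settings.port}`);
```

Taking the environment as a parameter instead of reading process.env inside the function keeps the helper easy to test.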
Assuming the API listens on port 8080, you can test it by using curl. The following commands show how to convert HTML into Markdown, HTML into a docx file, and that docx file back into Markdown using our HTTP API for Pandoc:
curl -s -H "Content-Type: text/html" -H "Accept: text/markdown" --data "<h1>My Headline</h1>" http://localhost:8080/
curl -s -H "Content-Type: text/html" -H "Accept: docx" --data "<h1>My Headline</h1>" http://localhost:8080/ > file.docx
curl -s -H "Content-Type: docx" -H "Accept: text/markdown" --data-binary "@file.docx" http://localhost:8080/
Please note that the last two commands use the Pandoc identifier for docx files. The correct media type would be application/vnd.openxmlformats-officedocument.wordprocessingml.document.
The script generate-swagger-spec.js automatically generates the Swagger description for this service based on the supported input and output formats (listed by pandoc --list-input-formats and pandoc --list-output-formats, respectively). The Swagger description can be generated in both YAML and JSON format. The npm scripts generate-swagger-json and generate-swagger-yaml can be used to output the generated description into a file with a fixed filename (pandoc.swagger.json or pandoc.swagger.yaml, respectively). To save the description into a file with a custom filename, run
node generate-swagger-spec.js [--json|--yaml] > your-filename-here.ext
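The format lists that feed the Swagger description are plain text with one identifier per line, so turning them into an array is straightforward. The following sketch shows one way to do it. The function parseFormatList is hypothetical and not taken from generate-swagger-spec.js, and the sample string is a shortened stand-in for real pandoc output.

```javascript
// Hypothetical helper: splits the line-oriented output of
// `pandoc --list-input-formats` (or `--list-output-formats`)
// into an array of Pandoc identifiers.
function parseFormatList(stdout) {
  return stdout
    .split(/\r?\n/) // one identifier per line
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // drop the trailing empty line
}

// Shortened, illustrative sample; real output lists many more formats.
const sampleOutput = 'docx\nhtml\nmarkdown\n';
const formats = parseFormatList(sampleOutput);
console.log(formats);
```

A generator could then translate each identifier into a media type and list the results as the accepted and produced types of the service, though how the actual script does this is not described here.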
The Dockerfile is partially based on the Dockerfile of vpetersson's pandoc container.