Skip to content

xyz65535/reverse_adoc

 
 

Repository files navigation

AsciiDoc from HTML and Microsoft Word: reverse_adoc

Purpose

Transforms HTML and Microsoft Word into AsciiDoc.

Installation

Install the gem:

[sudo] gem install reverse_adoc

or add it to your Gemfile:

gem 'reverse_adoc'

Command-line usage

HTML to AsciiDoc: reverse_adoc

Convert HTML files to AsciiDoc:

$ reverse_adoc file.html > file.adoc
$ cat file.html | reverse_adoc > file.adoc

Microsoft Word to AsciiDoc: w2a

Convert Word .doc or .docx files to AsciiDoc:

$ w2a file.docx > file.adoc
$ w2a input.docx -o output.adoc

Help:

$ w2a -h
Usage: w2a [options] <file>
    -a, --mathml2asciimath           Convert MathML to AsciiMath
    -o, --output=FILENAME            Output file to write to
    -e, --external-images            Export images if data URI
    -v, --version                    Version information
    -h, --help                       Prints this help
Note
w2a requires LibreOffice to be installed. It uses LibreOffice’s export to XHTML. LibreOffice’s export of XHTML is superior to the native Microsoft Word export to HTML: it exports lists (which Word keeps as paragraphs), and it exports OOMML into MathML. On the other hand, the LibreOffice export relies on default styling being used in the document, and it may not cope with ordered lists or headings with customised appearance. For best results, reset the styles in the document you’re converting to those in the default Normal.dot template.
Note
w2a requires the command-line version of LibreOffice, soffice. As it turns out, LibreOffice v6 appears to render formulae in HTML as images instead of MathML expressions; use LibreOffice v5. If you have both LibreOffice v5 and LibreOffice v6 installed, make sure that your OS path searches for the LibreOffice v5 version of soffice first; e.g. on Mac, include something like /Applications/LibreOffice5.4.7.2.app/Contents/MacOS in your PATH environment.
Note
Some information in OOMML is not preserved in the export to MathML from LibreOffice; in particular, font shifts such as double-struck fonts. The LibreOffice exporter does seem to drop some text (possibly associated with MathML); use with caution.
Note
Adapted from w2m of Ben Balter’s word-to-markdown

Common options

MathML to AsciiMath conversion

If you wish to convert the MathML in the document to AsciiMath, run the script with the --mathml2asciimath option:

$ w2a --mathml2asciimath document.docx > document.adoc

Extracting images

Images referred by the HTML can be extracted into the destination output folder by using:

$ reverse_adoc input.docx -o output/file.adoc -e
$ reverse_adoc input.docx --output output/file.adoc --external-images

Word embedded images can be extracted into the destination output folder by using:

$ w2a input.docx -o output/file.adoc -e
$ w2a input.docx --output output/file.adoc --external-images

Handling unknown HTML tags

The --unknown_tags option allows you to specify how to handle unknown tags (default pass_through).

Valid options are:

  • pass_through - Include the unknown tag completely into the result

  • drop - Drop the unknown tag and its content

  • bypass - Ignore the unknown tag but try to convert its content

  • raise - Raise an error to let you know

Tagging of borders

Specify how to handle tag borders with the option --tag_border (default ' ').

Valid options are:

  • ' ' - Add whitespace if there is none at tag borders.

  • '' - Do not not add whitespace.

Features

General

reverse_adoc shares features as a port of reverse_markdown:

  • Module based — if you miss a tag, just add it

  • Can deal with nested lists

  • Inline and block code is supported

  • Supports blockquote

It supports the following HTML tags (these are supported by reverse_markdown):

  • a

  • blockquote

  • br

  • code, tt (added: kbd, samp, var)

  • div, article

  • em, i (added: cite)

  • h1, h2, h3, h4, h5, h6, hr

  • img

  • li, ol, ul (added: dir)

  • p, pre

  • strong, b

  • table, td, th, tr

Note
  • reverse_adoc does not support del or strike, because Asciidoctor does not out of the box.

  • As with reverse_markdown, pre is only treated as sourcecode if it is contained in a div@class = highlight- element, or has a @brush attribute naming the language (Confluence).

  • The gem does not support p@align, because Asciidoctor doesn’t

In addition, it supports:

  • aside

  • audio, video (with @src attributes)

  • figure, figcaption

  • mark

  • q

  • sub, sup

  • @id anchors

  • blockquote@cite

  • img/@width, img/@height

  • ol/@style, ol/@start, ol/@reversed, ul/@type

  • td/@colspan, td/@rowspan, td@/align, td@/valign

  • table/caption, table/@width, table/@frame (partial), table/@rules (partial)

  • Lists and paragraphs within cells

    • Not tables within cells: Asciidoctor cannot deal with nested tables

The gem does not support:

  • col, colgroup

  • source, picture

  • bdi, bdo, ruby, rt, rp, wbr

  • frame, frameset, iframe, noframes, noscript, script, input, output, progress

  • map, canvas, dialog, embed, object, param, svg, track

  • fieldset, button, datalist, form, label, legend, menu, menulist, optgroup, option, select, textarea

  • big, dfn, font, s, small, span, strike, u

  • center

  • data, meter

  • del, ins

  • footer, header, main, nav, details, section, summary, template

MathML support

If you are using this gem in the context of Metanorma, Metanorma AsciiDoc accepts MathML as a native mathematical format. So you do not need to convert the MathML to AsciiMath.

The gem will optionally invoke the https://github.com/metanorma/mathml2asciimath gem, to convert MathML to AsciiMath. The conversion is not perfect, and will need to be post-edited; but it’s a lot better than nothing.

Note
Asciidoctor does not support MathML input. HTML uses MathML. The gem will recognize MathML expressions in HTML, and will wrap them in Asciidoctor \$ \$ macros. The result of this gem is not actually legal Asciidoctor for stem: Asciidoctor will presumably think this is AsciiMath in the \$ \$ macro, try to pass it into MathJax as AsciiMath, and fail. But of course, MathJax has no problem with MathML, and some postprocessing on the Asciidoctor output can ensure that the MathML is treated by MathJax (or whatever else uses the output) as such; so this is still much better than nothing for stem processing.

Word cleanup

This gem is routinely used in the Metanorma project to export Word documents to AsciiDoc. The HTML export from Word that the gem uses, from LibreOffice, is much cleaner than the native HTML 4 export from Word; but it has some infelicities which this gem cleans up:

  • The HTML export has trouble with subscripts, and routinely exports them as headings; the w2a script tries to clean them up.

  • The w2a cleans up spaces, but it does not strip them.

  • Spaces are removed from anchors and cross-references.

  • Double underscores are removed from anchors and cross-references.

  • Cross-references to _GoBack and to _Toc followed by numbers (used to construct tables of contents) are ignored.

Ruby library usage

General

Simple to use.

result = ReverseAdoc.convert input
result.inspect # " *feelings* "

Configure with options

Just pass your chosen configuration options in after the input. The given options will last for this operation only.

ReverseAdoc.convert(input, unknown_tags: :raise, mathml2asciimath: true)

Preconfigure using an initializer

Or configure it block style on a initializer level. These configurations will last for all conversions until they are set to something different.

ReverseAdoc.config do |config|
  config.unknown_tags      = :bypass
  config.mathml2asciimath  = true
  config.tag_border  = ''
end

About

Convert HTML/Word DOCX into AsciiDoc syntax

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Ruby 83.4%
  • HTML 16.6%