This library is a Lua binding for lol-html, a low output latency streaming HTML parser/rewriter with a CSS selector-based API.
It can be used to either extract data from HTML documents or rewrite them on-the-fly.
You need a functional setup of Rust and Cargo to be able to build this module. Please refer to the Rust website or install it with your distribution's package manager.
You can install this module with Luarocks:
luarocks install https://raw.githubusercontent.com/jdesgats/lua-lolhtml/master/rockspecs/lolhtml-dev-1.rockspec
First, be sure to clone this repository with its submodules. Then the provided Makefile should be able to build the module.
git clone --recursive https://github.com/jdesgats/lua-lolhtml.git
make
Running the tests requires my fork of Telescope:
luarocks install https://raw.githubusercontent.com/jdesgats/telescope/master/rockspecs/telescope-scm-1.rockspec
tsc spec/lolhtml.lua
The workflow is usually:
- Create a rewriter builder object:
local lolhtml = require "lolhtml"
local my_builder = lolhtml.new_rewriter_builder()
- Attach callbacks to it with the logic to transform your documents:
my_builder:add_element_content_handlers {
  selector = lolhtml.new_selector("h1"),
  element_handler = function(el)
    el:set_attribute("class", "title")
  end
}
- Use the previous builder to create rewriter objects,
one for each HTML page you want to work on:
local my_rewriter = lolhtml.new_rewriter {
  builder = my_builder,
  sink = function(s) print(s) end,
}
- Feed the rewriter with the actual HTML stream:
for l in io.stdin:lines() do
  my_rewriter:write(l)
end
my_rewriter:close()
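The four steps above can be combined into one short script. The following is a minimal sketch that uses only the calls shown in this section; the input string and the `chunks` and `output` names are illustrative, and the output is collected in a table instead of being printed chunk by chunk.

```lua
local lolhtml = require "lolhtml"

-- Builder: holds the rewriting rules and can be reused for many documents.
local builder = lolhtml.new_rewriter_builder()
builder:add_element_content_handlers {
  selector = lolhtml.new_selector("h1"),
  element_handler = function(el)
    el:set_attribute("class", "title")
  end,
}

-- Sink: called with each piece of rewritten output; collect the pieces.
local chunks = {}
local rewriter = lolhtml.new_rewriter {
  builder = builder,
  sink = function(s) chunks[#chunks + 1] = s end,
}

-- The input is a stream, so it may be split at arbitrary points.
rewriter:write("<h1>Hel")
rewriter:write("lo</h1>")
rewriter:close()

local output = table.concat(chunks)
print(output)
```

Collecting into a table and calling `table.concat` once avoids repeated string concatenation when the sink is called many times.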
The examples
directory contains a port of the original Rust examples from
lol-html. You can run them by feeding an HTML page as input:
curl -NL https://git.io/JeOSZ | lua examples/defer_scripts.lua
ALPHA VERSION
This binding is not finished yet. Although the test coverage is quite good, the tests pass, and Valgrind does not complain, bugs might still be present.
Also, the API is not frozen and might change. Here is a non-exhaustive list of things that I am still considering:
- API naming: stay close to the original names, or choose shorter ones
- Selectors: should they be exposed at all? or compiled and cached transparently
- Some data could be exposed as attributes rather than methods, is it better?
- Tables vs. lots of arguments for some functions
- Error handling: when to raise errors, when to return nil, err
This library tries to stay close to the original API, while being more Lua-ish when appropriate. In particular, it should not panic (as in triggering SIGABRT); any such case would be considered a bug.
Object constructors:
- lolhtml.new_selector: see Selector
- lolhtml.new_rewriter_builder: see RewriterBuilder
- lolhtml.new_rewriter: see Rewriter
Constants:
- lolhtml.CONTINUE
- lolhtml.STOP
A Selector object represents a parsed CSS selector that can be used to build rewriter builders.
Selector objects don't have any methods or attributes. They are exposed only for garbage collection purposes (and also as an optimization if you need to reuse the same selector in multiple builders).
Builds a new Selector object out of the given string.
Returns nil, err in case of a syntax error.
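Since the constructor returns nil, err rather than raising, the result should be checked before it is passed to a builder. The following is a minimal sketch; `compile_selector` is a hypothetical helper, and the assumption that `"div >"` (a dangling combinator) is reported as a syntax error comes from general CSS selector rules, not from this document.

```lua
local lolhtml = require "lolhtml"

-- Hypothetical helper: compile a selector or explain why it is invalid.
local function compile_selector(css)
  local selector, err = lolhtml.new_selector(css)
  if not selector then
    return nil, ("invalid selector %q: %s"):format(css, tostring(err))
  end
  return selector
end

local good = compile_selector("div.content > a[href]")
-- Assumed to be rejected: the combinator has no right-hand side.
local bad, msg = compile_selector("div >")
print(good ~= nil, bad, msg)
```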
A RewriterBuilder encapsulates the logic to make rewrites. Builders are usually created at program startup and used to instantiate many Rewriter objects.
All callback functions are called with a single argument whose type depends on the type of callback. This argument must not outlive the callback: any attempt to keep a reference to it and use it later will result in an error.
These functions can return:
- lolhtml.CONTINUE: instructs the parser to continue processing the HTML stream
- lolhtml.STOP: causes the parser to stop immediately; the write() or close() methods of the rewriter will return an error code
- nothing: same as lolhtml.CONTINUE
If a callback raises an error, it will also cause the rewriter to stop immediately. The error object or message will be returned as the error by the write() or close() methods of the rewriter.
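As an illustration of the return values, here is a minimal sketch of a handler that aborts the rewrite when it meets a `<script>` element. The sample input is made up, and because the point at which the stop surfaces (during the write or at close) depends on how the input is chunked, the sketch checks both calls.

```lua
local lolhtml = require "lolhtml"

local builder = lolhtml.new_rewriter_builder()
builder:add_element_content_handlers {
  selector = lolhtml.new_selector("script"),
  element_handler = function(el)
    -- Abort the whole rewrite as soon as a <script> element is seen.
    return lolhtml.STOP
  end,
}

local rewriter = lolhtml.new_rewriter {
  builder = builder,
  sink = function(s) end, -- discard the output
}

-- The stop is reported by write() or, at the latest, by close().
local ok, err = rewriter:write("<p>hello</p><script>alert(1)</script>")
if ok then
  ok, err = rewriter:close()
end
if not ok then
  print("rewriting stopped:", err)
end
```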
Creates a new RewriterBuilder object.
Adds new document-level content handlers. This function might be called multiple times to add multiple handlers.
The callback parameter must be a table with callbacks for different types of events. The possible fields are:
- doctype_handler: called after parsing the Document Type declaration, with a Doctype object.
- comment_handler: called whenever a comment is parsed, with a Comment object.
- text_handler: called when text nodes are parsed, with a TextChunk object.
- doc_end_handler: called at the end of the document, with a DocumentEnd object.
All of the fields are optional. Calling a callback has a cost so leave out any callback you don't need.
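A small sketch of document-level handlers follows. Note the assumption: the method name `add_document_content_handlers` does not appear in the text above and is inferred by analogy with `add_element_content_handlers`. The handlers only count events, so no methods of the Comment or DocumentEnd objects are used.

```lua
local lolhtml = require "lolhtml"

local comments = 0
local builder = lolhtml.new_rewriter_builder()

-- ASSUMPTION: the method is named add_document_content_handlers
-- (by analogy with add_element_content_handlers).
builder:add_document_content_handlers {
  comment_handler = function(comment)
    comments = comments + 1
  end,
  doc_end_handler = function(doc_end)
    print("comments seen:", comments)
  end,
  -- doctype_handler and text_handler are left out: unused callbacks cost time.
}

local rewriter = lolhtml.new_rewriter {
  builder = builder,
  sink = function(s) end, -- discard the output
}
rewriter:write("<!-- one --><p>text</p><!-- two -->")
rewriter:close()
```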
Adds new element content handlers associated with a selector. This function might be called multiple times to add multiple handlers for different selectors.
The callback parameter must be a table with the selector and the callbacks for different types of events. The possible fields are:
- selector: the CSS selector to call the callbacks on (required)
- comment_handler: called whenever a comment is parsed, with a Comment object.
- text_handler: called when text nodes are parsed, with a TextChunk object.
- element_handler: called when an element is parsed, with an Element object.
All of the fields are optional (except selector). Calling a callback has a cost, so leave out any callback you don't need.
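To show the function being called more than once, here is a minimal sketch that registers two handler sets with different selectors on the same builder. The selectors, counters and sample markup are illustrative; only `set_attribute` is called on the Element object, as in the earlier example.

```lua
local lolhtml = require "lolhtml"

local links, text_chunks = 0, 0
local builder = lolhtml.new_rewriter_builder()

-- First handler set: tag every link that has an href and count them.
builder:add_element_content_handlers {
  selector = lolhtml.new_selector("a[href]"),
  element_handler = function(el)
    links = links + 1
    el:set_attribute("rel", "noopener")
  end,
}

-- Second handler set, different selector: only a text handler.
builder:add_element_content_handlers {
  selector = lolhtml.new_selector("p"),
  text_handler = function(chunk)
    text_chunks = text_chunks + 1
  end,
}

local out = {}
local rewriter = lolhtml.new_rewriter {
  builder = builder,
  sink = function(s) out[#out + 1] = s end,
}
rewriter:write('<p>See <a href="/docs">the docs</a>.</p>')
rewriter:close()
print(links, text_chunks, table.concat(out))
```

A text node may be delivered in several chunks, so `text_chunks` counts handler calls, not text nodes.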
A Rewriter object processes a single HTML document and is instantiated with a RewriterBuilder object.
Each rewriter has an associated sink, which is a function called to output the rewritten HTML.
Creates a new rewriter object. The options argument must be a table; the following fields are allowed:
- builder: a RewriterBuilder object (required)
- encoding: the text encoding for the HTML stream. Can be a label for any of the web-compatible encodings, except UTF-16LE, UTF-16BE, ISO-2022-JP and replacement (these non-ASCII-compatible encodings are not supported). (optional, default is "utf-8")
- preallocated_parsing_buffer_size: specifies the number of bytes that should be preallocated on HtmlRewriter instantiation for the internal parsing buffer. See lol-html documentation for details. (optional, default is 1024)
- max_allowed_memory_usage: sets a hard limit in bytes on memory consumption of a Rewriter instance. See lol-html documentation for details. (optional, default is SIZE_MAX)
- strict: boolean; if set to true, the rewriter bails out if it encounters markup that drives the HTML parser into an ambiguous state. See lol-html documentation for details. (optional, default is false)
Returns the new Rewriter on success, or nil and an error message on failure.
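As a sketch of how these options fit together, the following creates a rewriter with every optional field set explicitly and checks the nil, err return. The values are arbitrary examples, not recommendations, and the `sink` field is the one used in the workflow example earlier.

```lua
local lolhtml = require "lolhtml"

local builder = lolhtml.new_rewriter_builder()

local rewriter, err = lolhtml.new_rewriter {
  builder = builder,
  sink = function(s) io.stdout:write(s) end,
  encoding = "windows-1252",               -- an ASCII-compatible encoding label
  preallocated_parsing_buffer_size = 4096, -- bytes
  max_allowed_memory_usage = 1024 * 1024,  -- hard limit of 1 MiB
  strict = true,                           -- bail out on ambiguous markup
}
if not rewriter then
  error("could not create rewriter: " .. tostring(err))
end
```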
Writes an HTML chunk to the rewriter. Returns the rewriter itself on success, or nil and an error message on failure. Failure happens if (incomplete list):
- A callback or a sink raises an error
- A previous invocation returned an error
- Called after close
Finalizes the rewriting process. Should be called once the last chunk of the input is written. Returns the rewriter itself on success, or nil and an error message on failure. Failure happens if (incomplete list):
- A callback or a sink raises an error
- A previous invocation returned an error
- Called more than once
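Because both methods report failures through their return values, a streaming loop can stop at the first error. The following is a minimal sketch; `pump` is a hypothetical helper, and a temporary file stands in for a real input stream.

```lua
local lolhtml = require "lolhtml"

-- Hypothetical helper: feed an input handle to a rewriter chunk by chunk,
-- stopping at the first error, then finalize the rewriter.
local function pump(rewriter, input, chunk_size)
  while true do
    local chunk = input:read(chunk_size or 4096)
    if not chunk then break end
    local ok, err = rewriter:write(chunk)
    if not ok then return nil, err end
  end
  return rewriter:close()
end

-- A temporary file stands in for a real input stream.
local input = assert(io.tmpfile())
input:write("<p>hello</p>")
input:seek("set", 0)

local chunks = {}
local rewriter = assert(lolhtml.new_rewriter {
  builder = lolhtml.new_rewriter_builder(),
  sink = function(s) chunks[#chunks + 1] = s end,
})

local ok, err = pump(rewriter, input, 4)
input:close()
print(ok ~= nil, table.concat(chunks))

-- Writing after close is one of the documented failure cases.
local late, late_err = rewriter:write("<p>late</p>")
print(late, late_err)
```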
Returns a Lua iterator triplet so the following construction is valid:
for attr_name, value in element:attribute() do
...
end